Open sciencecw opened 2 months ago
It seems that null is returned for each cell of the table. The cause is either an odd underlying text object, an encoding issue, or simply that there is no text data.
Is there any way to switch to OCR for parsing?
Sorry, gmft doesn't currently have built-in support for OCR. You can export the table to an image via table.image(). I'm also aware of this huggingface space, but your doc may be transmitted over the internet.
Edit: you could also try a method of making the text highlightable, e.g. OCRmyPDF or this pymupdf discussion.
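For the OCRmyPDF route, a minimal sketch of adding a text layer before handing the PDF to gmft might look like the following. It assumes the `ocrmypdf` command-line tool is installed and on PATH. The helper names `build_ocrmypdf_command` and `add_text_layer` are hypothetical, as are the file names. `--force-ocr` rasterizes every page and replaces any existing text, which may suit a PDF whose text objects are odd or badly encoded.

```python
import shutil
import subprocess


def build_ocrmypdf_command(src: str, dst: str, force: bool = True) -> list[str]:
    """Build the ocrmypdf command line (hypothetical helper).

    With force=True, --force-ocr re-OCRs every page, discarding any
    existing (possibly broken) text layer.
    """
    cmd = ["ocrmypdf"]
    if force:
        cmd.append("--force-ocr")
    cmd += [src, dst]
    return cmd


def add_text_layer(src: str, dst: str) -> None:
    """Run ocrmypdf locally; raises if the tool is missing or fails."""
    if shutil.which("ocrmypdf") is None:
        raise RuntimeError("ocrmypdf is not installed or not on PATH")
    subprocess.run(build_ocrmypdf_command(src, dst), check=True)


if __name__ == "__main__":
    # File names are placeholders; nothing is executed here.
    print(" ".join(build_ocrmypdf_command("input.pdf", "input_ocr.pdf")))
```

Since this runs locally, the document is not sent over the internet. The resulting file can then be opened the same way as the original PDF.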
The table is a simple one spanning the whole page. So far all the bounding boxes look alright, but _df is an empty dataframe. Unfortunately I cannot share the document. Do you have any suggestions on how to go about debugging, or what parameters to tweak?
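One way to narrow this down without sharing the document is to check whether each page has a usable text layer at all. The sketch below is only a rough heuristic, not part of gmft. It assumes PyMuPDF is installed (imported as `fitz`). The function names and the 30% threshold are hypothetical choices for illustration.

```python
def classify_page_text(text: str, garbled_ratio: float = 0.3) -> str:
    """Classify extracted page text as 'no_text', 'garbled', or 'ok'.

    'garbled' means a large share of replacement or control characters,
    which often points to an encoding problem in the PDF's fonts.
    """
    stripped = text.strip()
    if not stripped:
        return "no_text"
    bad = sum(
        1
        for ch in stripped
        if ch == "\ufffd" or (ord(ch) < 32 and ch not in "\n\t\r")
    )
    if bad / len(stripped) > garbled_ratio:
        return "garbled"
    return "ok"


def diagnose_pdf(path: str) -> list[str]:
    """Return one classification per page, using PyMuPDF to extract text."""
    import fitz  # PyMuPDF; imported lazily so the helper above works without it

    with fitz.open(path) as doc:
        return [classify_page_text(page.get_text()) for page in doc]
```

If every page comes back as `no_text`, the PDF is probably a scanned image and needs OCR. If pages come back as `garbled`, the text layer exists but is likely badly encoded.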