conjuncts / gmft

Lightweight, performant, deep table extraction
MIT License

bounding boxes for columns and rows detected, but empty dataframe is returned #11

Open sciencecw opened 2 months ago

sciencecw commented 2 months ago
from gmft import TATRFormatConfig, TATRTableFormatter

# `tables` comes from an earlier detection step (not shown)
config = TATRFormatConfig()
config.total_overlap_reject_threshold = 0.5
formatter = TATRTableFormatter(config=config)
ft = formatter.extract(tables[0])
ft.visualize()  # all rows and columns are detected

The table is a simple one spanning the whole page. All the bounding boxes look right, but _df is an empty dataframe.

Unfortunately, I cannot share the document. Do you have any suggestions on how to debug this, or which parameters to tweak?

sciencecw commented 2 months ago

It seems that null is returned for every cell of the table. The cause is either an odd underlying text object, an encoding issue, or simply that the PDF contains no text data.

Is there any way to switch to OCR for parsing?
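
One way to test the hypothesis above is to read the embedded text of each page directly and see whether it is missing or garbled. The following is a minimal sketch, not part of gmft: it assumes PyMuPDF is installed as an independent reader, and the helper names, the 0.2 threshold, and the file name are all hypothetical.

```python
def summarize_text_layer(page_texts):
    """Classify each page's extracted text as 'empty', 'suspicious', or 'ok'."""
    report = []
    for text in page_texts:
        stripped = text.strip()
        if not stripped:
            # No embedded text at all: likely a scanned or image-only page.
            report.append("empty")
            continue
        # U+FFFD and stray control characters often indicate a broken font encoding.
        bad = sum(
            1
            for ch in stripped
            if ch == "\ufffd" or (ord(ch) < 32 and ch not in "\n\r\t")
        )
        # The 0.2 ratio is an arbitrary illustrative threshold.
        report.append("suspicious" if bad / len(stripped) > 0.2 else "ok")
    return report


def extract_page_texts(pdf_path):
    """Read the embedded text of every page with PyMuPDF (assumed installed)."""
    import fitz  # PyMuPDF; imported lazily so the helper above has no dependencies

    with fitz.open(pdf_path) as doc:
        return [page.get_text() for page in doc]


# Example usage (hypothetical file name):
# print(summarize_text_layer(extract_page_texts("document.pdf")))
```

If every page comes back as "empty" or "suspicious", the empty dataframe is probably caused by the PDF's text layer rather than by the formatter settings.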

conjuncts commented 2 months ago

Sorry, gmft doesn't currently have built-in support for OCR. You can export the table to an image via table.image(). I'm also aware of this Hugging Face space, but note that your document may be transmitted over the internet if you use it.

Edit: you could also try adding a selectable (highlightable) text layer to the PDF first, e.g. with OCRmyPDF or the approach in this pymupdf discussion.
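
As a rough sketch of the OCRmyPDF route, the snippet below wraps its command-line tool so that a PDF gets a text layer before it is passed to gmft. It assumes the `ocrmypdf` executable is installed and on the PATH. The helper names and file names are hypothetical. `--skip-text` and `--force-ocr` are OCRmyPDF options: the first leaves pages that already contain text untouched, the second rasterizes every page and replaces any existing text.

```python
import subprocess


def build_ocrmypdf_command(src, dst, force=False):
    """Build the ocrmypdf command line for adding a text layer to a PDF."""
    # Use --force-ocr when the existing text layer is present but unusable;
    # otherwise --skip-text only OCRs pages that have no text.
    mode = "--force-ocr" if force else "--skip-text"
    return ["ocrmypdf", mode, src, dst]


def add_text_layer(src, dst, force=False):
    """Run ocrmypdf; raises CalledProcessError if the tool reports a failure."""
    subprocess.run(build_ocrmypdf_command(src, dst, force), check=True)


# Example usage (hypothetical file names):
# add_text_layer("scanned.pdf", "scanned_ocr.pdf", force=True)
```

The resulting file can then be run through the same detection and formatting steps as before. Because the OCR runs locally, the document is not sent to an external service.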