conjuncts / gmft

Lightweight, performant, deep table extraction
MIT License

Possibility to indent cells based on original document indentation #17

Open vivekrathiave opened 1 month ago

vivekrathiave commented 1 month ago

This is working great. You have accounted for a lot of scenarios. Thank you.

Quick question: is it possible to indent values in the output if the first column has indentations to depict hierarchy? (image)

conjuncts commented 1 month ago

Sorry, not at the moment. The only apparatus right now is detecting what TATR calls "projected row headers" (for example, here they would be Indication, Method, Level of 1st operator).

Its location has moved around in the past, but is currently available at FormattedTable._projecting_indices.

This is definitely a valuable feature to have, so I'll label it as an enhancement.

vivekrathiave commented 1 month ago

Thank you for the reply. I shall look at _projecting_indices to see if that can be used to infer the hierarchy.

vivekrathiave commented 3 weeks ago

Thanks, I did implement this and it came out well. Relying on projected rows alone wasn't enough, though, as it was hit or miss, so I used the separation between bounding boxes as another measure to detect indentation.
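For anyone landing here later, the bounding-box idea can be sketched roughly as follows. This is not the code used in this thread and it does not call gmft; it assumes you already have, for each row, the left edge (x0, in PDF units) of the first-column text. The names indent_levels and indent_labels, the tolerance value, and the sample labels and coordinates are all made up for illustration.

```python
from typing import List, Sequence


def indent_levels(x0s: Sequence[float], tolerance: float = 2.0) -> List[int]:
    """Map each row's left edge (x0) to an indentation level (0 = leftmost).

    Left edges that differ by at most `tolerance` are treated as one level.
    """
    # Build the list of distinct left edges ("anchors"), smallest first.
    anchors: List[float] = []
    for x in sorted(x0s):
        if not anchors or x - anchors[-1] > tolerance:
            anchors.append(x)

    # Assign every row to its nearest anchor; the anchor's rank is the level.
    levels: List[int] = []
    for x in x0s:
        nearest = min(range(len(anchors)), key=lambda i: abs(x - anchors[i]))
        levels.append(nearest)
    return levels


def indent_labels(
    labels: Sequence[str],
    x0s: Sequence[float],
    pad: str = "  ",
    tolerance: float = 2.0,
) -> List[str]:
    """Prefix each first-column label with padding matching its level."""
    levels = indent_levels(x0s, tolerance)
    return [pad * level + label for level, label in zip(levels, labels)]


if __name__ == "__main__":
    # Hypothetical first-column labels and left edges, for illustration only.
    labels = ["Indication", "row a", "row b", "Method"]
    x0s = [72.0, 84.1, 83.9, 72.3]
    for line in indent_labels(labels, x0s):
        print(line)
```

The tolerance absorbs small jitter in the extracted coordinates; a value that is too large would merge genuinely different indentation levels, so it likely needs tuning per document.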

On a side note, what is the character encoding of the output text? Some special characters, like +-, are not emitted well in the output.

conjuncts commented 3 weeks ago

Yeah sadly _projecting_indices is not always reliable, so it's good to hear that you could implement a workaround.

Regarding the character encoding: it should be any encoding supported by the PDF library (pypdfium2). I have successfully put through PDFs with the ± character and gotten tables with ±. But often the PDF itself will say that the "±" character is something else, like "6" or "8". This error is pretty much unavoidable, since some PDFs will literally say there is a "6" at the bbox of the "±" (one way to check is to open the PDF, copy and paste the ±, and see what you get). It's not pypdfium2's fault either, because it's innate to the PDF. To address this, one would have to turn to OCR. To speed that up, I have been OCRing only certain crucial characters.

vivekrathiave commented 2 weeks ago

Thanks. How can I implement OCR only on certain crucial characters?

conjuncts commented 6 days ago

This is what I can provide, but there is a lot missing. You will have to decide for yourself how to do the OCR part (for example, using pymupdf and pytesseract), then get x0, y0, x1 and y1 and put them into the dataframe. (They should be in PDF units; no conversion factor.) I had to use a custom ResNet.
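As a rough illustration of "OCR only certain crucial characters", here is a minimal sketch. It is not the maintainer's code, and the OCR step itself is left as a callback you supply (it could, for instance, be backed by pymupdf and pytesseract, as suggested above). The word dictionaries, the find_suspects heuristic (a lone "6" or "8" between two numbers, which is how a misencoded ± tends to show up according to this thread), and all function names are assumptions made for this example.

```python
import re
from typing import Callable, Dict, List, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in PDF units
Word = Dict[str, object]  # keys: "text", "x0", "y0", "x1", "y1"

# Assumed heuristic: tokens that look like plain numbers or percentages.
NUMERIC = re.compile(r"^[+-]?\d+(\.\d+)?%?$")
# Assumed set of texts that a misencoded "±" may appear as.
SUSPECT_TEXT = {"6", "8"}


def find_suspects(words: List[Word]) -> List[int]:
    """Return indices of words worth re-reading with OCR.

    `words` is assumed to hold one row's words in reading order.
    """
    suspects: List[int] = []
    for i in range(1, len(words) - 1):
        if (
            words[i]["text"] in SUSPECT_TEXT
            and NUMERIC.match(str(words[i - 1]["text"]))
            and NUMERIC.match(str(words[i + 1]["text"]))
        ):
            suspects.append(i)
    return suspects


def repair_with_ocr(words: List[Word], ocr: Callable[[BBox], str]) -> List[Word]:
    """Copy `words`, replacing only suspect entries with the OCR result.

    `ocr` is a user-supplied function that renders the given bbox from the
    page and returns the recognized text; it is called once per suspect.
    """
    repaired = [dict(w) for w in words]
    for i in find_suspects(words):
        w = repaired[i]
        bbox = (w["x0"], w["y0"], w["x1"], w["y1"])
        text = ocr(bbox).strip()
        if text:  # keep the original text if OCR returns nothing
            w["text"] = text
    return repaired
```

Because only the flagged bounding boxes are passed to the callback, the expensive OCR call runs a handful of times per table rather than once per word. The repaired words keep their x0, y0, x1 and y1, so they can be placed back into a dataframe unchanged.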

vivekrathiave commented 3 days ago

Thanks, this is very helpful. I will certainly look at this. However, I have tried another parser, and it seems to preload lots of common PDF fonts and glyphs during parsing, which seems to resolve characters a lot better (though not 100% perfectly). I will switch back to gmft and try this approach.

conjuncts commented 3 days ago

Maybe it's a pdf parser issue? Have you tried using pymupdf?