conjuncts / gmft

Lightweight, performant, deep table extraction

Multi-line row support in config? #26

Open LtSalt opened 3 days ago

LtSalt commented 3 days ago

If a PDF contains tables with multi-line rows, the table formatter does not recognize them as single rows and outputs each line as a separate row. I could not find a config option to deal with this. Am I overlooking something?

conjuncts commented 2 days ago

My best guess is that large_table_assumption is being triggered, which overrides the deep structure analysis and splits a single row into multiple rows. Try increasing large_table_row_overlap_threshold. However, I could diagnose the problem better with an example PDF.
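For readers who want to try this, here is a minimal sketch of applying such overrides in a way that catches misspelled option names. The option names come from this thread; the `apply_overrides` helper, the `SimpleNamespace` stand-in for the formatter config, and all numeric values are hypothetical placeholders, not gmft's API or defaults.

```python
from types import SimpleNamespace


def apply_overrides(config, overrides):
    """Set each override on config, rejecting names the config does not define."""
    for name, value in overrides.items():
        if not hasattr(config, name):
            raise AttributeError(f"unknown config option: {name}")
        setattr(config, name, value)
    return config


# Stand-in for the formatter config object (assumption: the real config
# exposes these options as attributes). Values are placeholders only.
config = SimpleNamespace(
    force_large_table_assumption=None,
    large_table_row_overlap_threshold=0.5,
)

apply_overrides(config, {
    "force_large_table_assumption": False,      # disable the large-table shortcut
    "large_table_row_overlap_threshold": 0.8,   # hypothetical "increased" value
})
```

Rejecting unknown names matters here because a typo in an option name would otherwise be silently ignored, and the output would look as if the setting had no effect.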

LtSalt commented 2 days ago

Thank you. Setting the config override to force_large_table_assumption=False indeed changes the output - but now data is being assigned to the wrong row.

See the example pdf.

Input (table on first page):

[image]

Output without any config overrides:

[image]

Output with force_large_table_assumption=False:

[image]

Something similar happens with a second example PDF from the same type of report. This time some rows have centered content. It looks like the confidence in recognizing these rows is too low, so their content gets merged into the next row. I have tried changing total_overlap_reject_threshold, but so far that has not helped.

EDIT: With the second example PDF, I was able to prevent row merging by slightly increasing _nms_overlap_threshold.

EDIT again: Might be good enough! See results here.
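The trial and error over thresholds described above can be made systematic by enumerating candidate override combinations and checking each output by hand. This is only a sketch: the two option names are the ones mentioned in this thread, while the `override_grid` helper and the candidate values are hypothetical and would need to be adapted to the documents at hand.

```python
from itertools import product


def override_grid(candidates):
    """Yield one override dict per combination of candidate values."""
    names = list(candidates)
    for values in product(*(candidates[name] for name in names)):
        yield dict(zip(names, values))


# Hypothetical candidate values; these are not recommended settings.
candidates = {
    "total_overlap_reject_threshold": [0.2, 0.3],
    "_nms_overlap_threshold": [0.1, 0.15, 0.2],
}

grid = list(override_grid(candidates))
for overrides in grid:
    # Each dict would be passed to the formatter as config overrides,
    # and the resulting table compared against the expected rows.
    print(overrides)
```

Keeping the grid small is deliberate, since each combination still has to be judged against the source table.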

conjuncts commented 3 hours ago

It's good to hear that you found a set of configuration settings that worked!

Let me know if there are any more issues.

LtSalt commented 3 hours ago

Unfortunately, there is one. In one test case the detection algo fails to separate two columns that are close together:

[image]

Downstream, the model is unable to extract the columns correctly. I've tried enabling multi header, but it didn't make a difference.

Is this something that might be remedied by tweaking the config? Or is the detection algo simply not good enough?