conjuncts / gmft

Lightweight, performant, deep table extraction
MIT License
274 stars 18 forks

A way to specify column positions/coordinates? #25

Open Sumbul-Ze opened 5 days ago

Sumbul-Ze commented 5 days ago

I am using gmft to extract multipage tables that do not have explicit lines between columns. The header appears only on the first page, and from it gmft produces the correct table. On subsequent pages, some columns are entirely empty, which causes the model to skip columns, or to merge columns if I set the overlap reject threshold too high. To solve this, I would like to extract the column positions from the first page and pass those coordinates to the model for the subsequent pages. Is there a way to do this in gmft?

conjuncts commented 1 day ago

That is a good question. Currently, it is possible, but it is not super streamlined. I suggest iterating through TATRFormattedTable.fctn_results and removing the columns (the entries with label == 1). Then add the columns with known bboxes back in as necessary. After that, calling df() should reflect the changes.
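As a rough sketch of those steps: the helper below is hypothetical (replace_columns is not part of gmft), and it assumes fctn_results behaves like a dict with parallel "scores", "labels" and "boxes" lists, where each box is [x0, y0, x1, y1]. It also assumes the known column bboxes are already expressed in the same coordinate space as the existing boxes. Please check both assumptions against your gmft version.

```python
from typing import Any, Dict, List, Sequence

COLUMN_LABEL = 1  # label id for columns, as mentioned above


def replace_columns(
    results: Dict[str, List[Any]],
    column_boxes: Sequence[Sequence[float]],
    column_label: int = COLUMN_LABEL,
    score: float = 1.0,
) -> Dict[str, List[Any]]:
    """Return a copy of `results` whose column entries are replaced.

    Assumed layout (not verified here): `results` has parallel lists
    "scores", "labels" and "boxes". The input dict is not modified.
    """
    # Keep every detection that is not a column (rows, headers, ...).
    kept = [
        (s, l, list(b))
        for s, l, b in zip(results["scores"], results["labels"], results["boxes"])
        if l != column_label
    ]
    # Add the known columns back in, with a fixed score chosen by the caller.
    added = [(score, column_label, list(b)) for b in column_boxes]
    merged = kept + added
    return {
        "scores": [s for s, _, _ in merged],
        "labels": [l for _, l, _ in merged],
        "boxes": [b for _, _, b in merged],
    }


# Tiny made-up example: two detected columns and one non-column entry.
example = {
    "scores": [0.9, 0.8, 0.7],
    "labels": [1, 2, 1],
    "boxes": [[0, 0, 10, 100], [0, 0, 50, 10], [10, 0, 20, 100]],
}
updated = replace_columns(example, [[0, 0, 25, 100], [25, 0, 50, 100]])
```

With a real table, the idea would be to assign the returned dict back to the formatted table's fctn_results and then call df(), as described above. The score of 1.0 for the re-added columns is an arbitrary choice for illustration.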

I wrote some brand new docs for this. They are now at https://gmft.readthedocs.io/en/latest/advanced.html.