conjuncts / gmft

Lightweight, performant, deep table extraction
MIT License
347 stars, 23 forks

problems facing in gmft #18

Open dev-choudhary-gokloud opened 2 months ago

dev-choudhary-gokloud commented 2 months ago

1) Does gmft have a function like `set_cropbox` in PyMuPDF?
2) Does gmft have functions that can read a PDF and separate non-tabular data from tabular data, as PyMuPDF does?
3) How can we get a table's context while detecting the table and converting it to CSV?
4) How can I fix the extraction problems when converting complex tables from PDF to CSV? Screenshots are attached below.
5) How can we merge a table that continues onto a second page into a single CSV, and create a separate CSV when a new table is found?

[Screenshot 2024-08-30 at 1.42.25 PM] [Screenshot 2024-08-28 at 7.24.29 PM]

conjuncts commented 2 months ago
  1. #9 might be relevant.
  2. If you only need tabular data, the usual workflow should work; refer to the quickstart notebooks.
  3. If you need tabular and non-tabular data formatted together, that is a longstanding enhancement request; see #12.
  4. I will take a look, but unfortunately complex merged cells aren't supported at the moment.
  5. The tables are provided as separate dataframes, so you'll need to write your own way to merge them. Since tables can vary a lot in header contents, I don't anticipate writing a default function; a customized approach will be needed.

Since the tables appear to have explicit (solid black) cell boundaries, camelot/img2table might be worth a shot.
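
For point 5, a minimal sketch of one possible merging rule, using only pandas. It assumes the extracted tables are already available as a list of DataFrames in page order, and that a table continuing onto the next page repeats exactly the same column labels. The function names `merge_continued_tables` and `write_csvs` are hypothetical and not part of gmft; real documents will likely need a looser header comparison.

```python
import pandas as pd


def merge_continued_tables(tables):
    """Concatenate consecutive DataFrames whose column labels are identical.

    Assumption: a table that continues on the next page repeats the same
    header, so matching column labels are treated as a continuation.
    """
    merged = []
    for df in tables:
        if merged and list(merged[-1].columns) == list(df.columns):
            # Same header as the previous table: treat as a continuation.
            merged[-1] = pd.concat([merged[-1], df], ignore_index=True)
        else:
            # Different header: start a new table.
            merged.append(df.reset_index(drop=True))
    return merged


def write_csvs(tables, prefix="table"):
    """Write each merged table to its own CSV file (prefix_0.csv, ...)."""
    for i, df in enumerate(tables):
        df.to_csv(f"{prefix}_{i}.csv", index=False)


# Stand-in data in place of real extracted tables.
page1 = pd.DataFrame({"item": ["a", "b"], "qty": [1, 2]})
page2 = pd.DataFrame({"item": ["c"], "qty": [3]})       # continuation of page1
other = pd.DataFrame({"name": ["x"], "price": [9.5]})   # a new table

result = merge_continued_tables([page1, page2, other])
```

Calling `write_csvs(result)` would then produce one CSV per merged table. If a continued table does not repeat its header on the second page, the comparison would have to fall back on something else, such as column count or position on the page.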