Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0
7.4k stars, 574 forks

feat/merge_tables_on_different pages #3198

Closed tanzeel291994 closed 13 hours ago

tanzeel291994 commented 2 weeks ago

Is your feature request related to a problem? Please describe.
When a continuous table spans multiple pages, partition_pdf identifies it as two separate tables and loses the relationship between the parts on different pages.

Describe the solution you'd like
partition_pdf should return a single table element even when the table is split across pages in a PDF.

Describe alternatives you've considered
In supplement_element_with_table_extraction (ocr.py), we could add continuation-detection logic and merge the tables extracted by "microsoft/table-transformer-structure-recognition".

Additional context
This feature is also not available on the Unstructured Free API.

MthwRobinson commented 2 weeks ago

Hi @tanzeel291994 - Thanks for the suggestion. Table detection is a page-level operation, and as of now we don't plan to add support for merging tables across pages. However, if you'd like to contribute a utility function to help with that, we'd be happy to review it.
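
For anyone picking this up, a post-processing utility along those lines might look roughly like the sketch below. It is not part of unstructured: `TableFragment`, `is_continuation`, and `merge_tables_across_pages` are hypothetical names, and the sketch assumes each detected table has already been reduced to a page number and a list of parsed rows. The continuation heuristic (next page, same column count, optional repeated header) is also an assumption and would likely need tuning on real documents.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TableFragment:
    """Hypothetical stand-in for one detected table: page span plus parsed rows."""
    first_page: int
    last_page: int
    rows: List[List[str]]


def is_continuation(prev: TableFragment, nxt: TableFragment) -> bool:
    """Assumed heuristic: nxt starts on the page right after prev ends
    and its first row has as many columns as prev's last row."""
    if not prev.rows or not nxt.rows:
        return False
    on_next_page = nxt.first_page == prev.last_page + 1
    same_width = len(nxt.rows[0]) == len(prev.rows[-1])
    return on_next_page and same_width


def merge_tables_across_pages(fragments: List[TableFragment]) -> List[TableFragment]:
    """Merge fragments that look like one table split across consecutive pages."""
    merged: List[TableFragment] = []
    # sorted() is stable, so tables on the same page keep their input order.
    for frag in sorted(fragments, key=lambda f: f.first_page):
        if merged and is_continuation(merged[-1], frag):
            prev = merged[-1]
            rows = frag.rows
            # Drop a header row that is repeated at the top of the next page.
            if rows[0] == prev.rows[0]:
                rows = rows[1:]
            merged[-1] = TableFragment(prev.first_page, frag.last_page, prev.rows + rows)
        else:
            merged.append(TableFragment(frag.first_page, frag.last_page, list(frag.rows)))
    return merged


fragments = [
    TableFragment(1, 1, [["Name", "Qty"], ["A", "1"]]),
    TableFragment(2, 2, [["Name", "Qty"], ["B", "2"]]),
    TableFragment(4, 4, [["X", "Y", "Z"]]),
]
result = merge_tables_across_pages(fragments)
```

Here the tables on pages 1 and 2 are joined into one fragment with the repeated header dropped, while the table on page 4 stays separate because it is not on the adjacent page. A column-count check alone can produce false merges when two unrelated tables of equal width sit on consecutive pages, so a real implementation would probably also compare position on the page or column headers.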