Enhanced Table Extraction for Complex Formats

AdBaWa commented 5 days ago

Requested feature

Improve table extraction for complex table formats. Docling currently identifies the cell values correctly, but the output formatting is sometimes misaligned or unclear, especially in tables with multi-line headers, merged cells, or specific symbols. This hurts the readability and usability of the output, particularly for scientific or technical tables with detailed data.

Examples:

In more complex tables (see Image), readability is affected due to missing alignment and merged values, making it hard to interpret the extracted content.


Alternatives

Manually post-processing the extracted tables to correct alignment and formatting, which is time-consuming and defeats the purpose of automated extraction.
Using other OCR tools, which would add another layer to the workflow and reduce efficiency.
Exploring other machine learning models for table recognition and extraction. GPT-4 Vision might offer a more advanced, integrated solution, potentially correcting column alignment without requiring extensive model training.

maxmnemonic commented 5 days ago

@AdBaWa, can you please try converting your tables with the TableFormerMode.ACCURATE option, as described here: control-pdf-table-extraction-options

This uses the version of our TableFormer that has more layers and parameters, so it might catch the nuances.
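
For readers following along, here is a minimal sketch of how that option might be set. It assumes the Docling v2 Python API (`PdfPipelineOptions`, `PdfFormatOption`, `DocumentConverter`) as shown in the project documentation; the function name `build_accurate_converter` and the file name in the usage note are made up for illustration, and module paths may differ between Docling versions.

```python
def build_accurate_converter():
    """Return a Docling converter configured for TableFormerMode.ACCURATE.

    Assumes the Docling v2 API; the imports live inside the function so the
    snippet can be loaded even where docling is not installed.
    """
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import (
        PdfPipelineOptions,
        TableFormerMode,
    )
    from docling.document_converter import DocumentConverter, PdfFormatOption

    # Enable table structure recognition and select the larger model variant.
    pipeline_options = PdfPipelineOptions(do_table_structure=True)
    pipeline_options.table_structure_options.mode = TableFormerMode.ACCURATE

    # Apply these options to PDF inputs only.
    return DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
        }
    )
```

A call such as `build_accurate_converter().convert("tables.pdf")` (with a hypothetical file name) would then run the conversion with the accurate table model.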

AdBaWa commented 5 days ago

@maxmnemonic I tried it out, but it didn't catch the nuances. Result:

maxmnemonic commented 5 days ago

I see: the header is misaligned with the content of the table (the header text of one column sits above the content of another column). Thanks for the input. We need to consider whether we can introduce distortions like these into the synthetic training data to make the model more robust in the future.

PeterStaar-IBM commented 2 days ago

We first need to leverage the word-level bounding boxes together with the accurate TableFormer.

depends on #285
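
To illustrate the general idea of using word-level bounding boxes for alignment, here is a rough sketch that assigns each word to the column it overlaps most horizontally. This is not Docling's implementation: the `Box` and `Word` types, the `assign_to_columns` function, and the sample coordinates are all hypothetical, and the column extents are assumed to come from a table structure model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Horizontal extent of a word or column in page coordinates (x0 < x1)."""
    x0: float
    x1: float


@dataclass(frozen=True)
class Word:
    text: str
    box: Box


def horizontal_overlap(a: Box, b: Box) -> float:
    """Length of the horizontal intersection of two boxes, or 0.0 if disjoint."""
    return max(0.0, min(a.x1, b.x1) - max(a.x0, b.x0))


def assign_to_columns(words, columns):
    """Place each word in the column with the largest horizontal overlap.

    Returns (assigned, unassigned): one list of word texts per column, plus
    the texts of words that overlap no column at all.
    """
    assigned = [[] for _ in columns]
    unassigned = []
    for word in words:
        overlaps = [horizontal_overlap(word.box, col) for col in columns]
        best = max(range(len(columns)), key=overlaps.__getitem__, default=None)
        if best is None or overlaps[best] == 0.0:
            unassigned.append(word.text)
        else:
            assigned[best].append(word.text)
    return assigned, unassigned


# Hypothetical header row: two columns and four words.
columns = [Box(0, 50), Box(50, 120)]
words = [
    Word("Temp", Box(5, 30)),
    Word("(C)", Box(32, 48)),
    Word("Pressure", Box(45, 110)),  # straddles the boundary, mostly in column 1
    Word("x", Box(200, 210)),        # outside every column
]
assigned, unassigned = assign_to_columns(words, columns)
```

Because header words and body words are assigned by the same rule, a header that is visually shifted by a few units still lands in the column it mostly covers.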

DS4SD / docling

Enhanced Table Extraction for Complex Formats #280