DS4SD / docling

Get your documents ready for gen AI
https://ds4sd.github.io/docling
MIT License
8.76k stars 417 forks

Enhanced Table Extraction for Complex Formats #280

Open AdBaWa opened 5 days ago

AdBaWa commented 5 days ago

Requested feature

Enhanced table extraction for complex table formats. Currently, Docling identifies the cell values correctly, but the formatting of the output is sometimes misaligned or unclear, especially for tables with multi-line headers, merged cells, or specific symbols. This hurts the readability and usability of the output, particularly for scientific or technical tables with detailed data.

Examples:

[image: b611855b-27e1-44d2-a73b-01d804c9f798]

[image: 2db878b1-09d2-47f6-9bd7-61ef220605db]

Alternatives

  1. Manual post-processing of extracted tables to correct alignment and formatting, which is time-consuming and counterproductive.
  2. Using other OCR tools; however, this would mean adding another layer to the workflow and reducing efficiency.
  3. Exploring other machine learning models for table recognition and extraction. GPT-4 Vision, however, might offer a more advanced, integrated solution that could focus on correcting column alignment without requiring extensive model training.

maxmnemonic commented 5 days ago

@AdBaWa, can you please try converting your tables with this option: TableFormerMode.ACCURATE as described here: control-pdf-table-extraction-options

This uses the version of our TableFormer that has more layers / parameters, and it might catch the nuances.
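
For readers following along, the option mentioned above could be set roughly as in the sketch below. It follows the pattern in the Docling documentation on controlling PDF table extraction options as of the v2 releases; the import paths may differ in other versions, and the wrapper function `build_accurate_converter` is a hypothetical helper introduced here only for illustration.

```python
def build_accurate_converter():
    """Return a DocumentConverter whose PDF pipeline uses TableFormerMode.ACCURATE."""
    # Imported lazily so this snippet can be loaded even where Docling is not installed.
    # Import paths are assumed from the Docling v2 docs and may change between versions.
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import PdfPipelineOptions, TableFormerMode
    from docling.document_converter import DocumentConverter, PdfFormatOption

    # Enable table structure recognition and switch to the larger TableFormer model.
    pipeline_options = PdfPipelineOptions(do_table_structure=True)
    pipeline_options.table_structure_options.mode = TableFormerMode.ACCURATE

    return DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
        }
    )
```

Calling `build_accurate_converter().convert("tables.pdf")` would then run the conversion; the file name here is a placeholder, not one from this thread.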

AdBaWa commented 5 days ago

@maxmnemonic I tried it out, but it didn't catch the nuances. Result: [image]

maxmnemonic commented 5 days ago

I see, the header is misaligned with the content of the table (the header text of one column sits above the content of another column). Thanks for the input. We have to think about whether we can introduce distortions like these into the synthetic training data to increase model robustness in the future.
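
To make the idea of such a distortion concrete, here is a minimal sketch of one possible augmentation: shifting an entire header row horizontally so that header text ends up above a neighbouring column. This is purely illustrative and is not how Docling or TableFormer generate their synthetic data; the `Box` type and the `shift_header_row` function are hypothetical names.

```python
import random
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box of a table cell (hypothetical type)."""
    x0: float
    y0: float
    x1: float
    y1: float


def shift_header_row(headers: list[Box], max_shift: float, rng: random.Random) -> list[Box]:
    """Shift every header box by the same random horizontal offset.

    The offset is drawn uniformly from [-max_shift, max_shift]. Body cells
    are left untouched, so the header no longer lines up with its columns.
    """
    dx = rng.uniform(-max_shift, max_shift)
    return [replace(box, x0=box.x0 + dx, x1=box.x1 + dx) for box in headers]


# Example: two header cells of width 40, shifted by at most 15 units.
headers = [Box(0, 0, 40, 10), Box(40, 0, 80, 10)]
shifted = shift_header_row(headers, max_shift=15.0, rng=random.Random(0))
```

Passing a seeded `random.Random` keeps the distortion reproducible, which is convenient when generating a fixed training set.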

PeterStaar-IBM commented 2 days ago

We first need to leverage the word-level bounding boxes together with the accurate TableFormer.
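
As a rough illustration of what using word-level bounding boxes could look like, the sketch below assigns each word to the predicted cell it overlaps most. It is an assumption about the general technique, not Docling's actual matching code; `BBox`, `overlap_area`, and `assign_words_to_cells` are hypothetical names.

```python
from typing import NamedTuple


class BBox(NamedTuple):
    """Axis-aligned bounding box (hypothetical type)."""
    x0: float
    y0: float
    x1: float
    y1: float


def overlap_area(a: BBox, b: BBox) -> float:
    """Area of the intersection of two boxes, or 0.0 if they do not overlap."""
    width = min(a.x1, b.x1) - max(a.x0, b.x0)
    height = min(a.y1, b.y1) - max(a.y0, b.y0)
    return max(width, 0.0) * max(height, 0.0)


def assign_words_to_cells(words: list[tuple[str, BBox]], cells: list[BBox]) -> list[list[str]]:
    """Put each word into the cell with the largest overlap.

    Words that overlap no cell are skipped. Returns one list of words per cell,
    in the order the words were given.
    """
    result: list[list[str]] = [[] for _ in cells]
    for text, box in words:
        areas = [overlap_area(box, cell) for cell in cells]
        if not areas:
            continue
        best = max(range(len(cells)), key=lambda i: areas[i])
        if areas[best] > 0.0:
            result[best].append(text)
    return result


# Example: a word straddling two cells goes to the one it overlaps more.
cells = [BBox(0, 0, 50, 10), BBox(50, 0, 100, 10)]
words = [
    ("Temp", BBox(5, 2, 30, 8)),
    ("(°C)", BBox(32, 2, 55, 8)),  # mostly inside the first cell
    ("Rate", BBox(60, 2, 90, 8)),
]
print(assign_words_to_cells(words, cells))  # [['Temp', '(°C)'], ['Rate']]
```

Working at word level rather than line level lets a header fragment that crosses a predicted column boundary still land in a single cell.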

depends on #285