DS4SD / docling

Get your documents ready for gen AI
https://ds4sd.github.io/docling
MIT License

Table representation misaligned between PDF and DOCX #382

Open vagenas opened 2 days ago

vagenas commented 2 days ago

Bug

The table representation appears to be inconsistent between PDF and DOCX. Depending on which of the two is taken as the reference, other formats may be affected too.

Steps to reproduce

The snippet below uses the attached minimal example documents table.pdf and table.docx. For the PDF, the table is exported to a dataframe with explicit column headers, whereas for the DOCX the column headers end up in the first regular data row.

If the table representation within TableItem were the same for both formats, the output of export_to_dataframe() would be the same too.

from docling.document_converter import DocumentConverter

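def check_table(file_path):
    # Convert the file and take the resulting document
    converter = DocumentConverter()
    doc = converter.convert(file_path).document
    # iterate_items() yields tuples; [0] is the item itself (here: the table)
    table_item = next(doc.iterate_items())[0]
    print(table_item.export_to_dataframe())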

check_table("table.pdf")
# >    Year Revenue Income Employees
# > 0  2014    92.7   12.0   379,592
# > 1  2015    81.7   13.1   377,757
# > 2  2016    79.9   11.8   380,300

check_table("table.docx")
# >       0        1       2          3
# > 0  Year  Revenue  Income  Employees
# > 1  2014     92.7    12.0    379,592
# > 2  2015     81.7    13.1    377,757
# > 3  2016     79.9    11.8    380,300

Docling version

Docling version: 2.6.0
Docling Core version: 2.4.0
Docling IBM Models version: 2.0.4
Docling Parse version: 2.0.4

Python version

Python 3.12.7

PeterStaar-IBM commented 1 day ago

@vagenas I think this could be because of header identification (not 💯 sure, but this would be my first guess). I think that the DOCX conversion does not do any header identification, while the PDF conversion does.

vagenas commented 1 day ago

Indeed, at the moment col_header is explicitly set to False: https://github.com/DS4SD/docling/blob/eb64f6d368c5a13179b527ef0d755682c63b9b21/docling/backend/msword_backend.py#L481
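
Until the header handling is aligned, one possible user-side workaround is to promote the first row of the DOCX dataframe to column labels with plain pandas. This is only a sketch: the helper name promote_first_row_to_header is hypothetical (not part of docling), the two frames are built by hand to mirror the outputs shown in the report, and it assumes the first row really is the header, which is not true for every table.

```python
import pandas as pd


def promote_first_row_to_header(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: use the first row as column labels and drop it."""
    if df.empty:
        return df.copy()
    # Everything after the first row becomes the body, re-indexed from 0
    promoted = df.iloc[1:].reset_index(drop=True)
    # The first row supplies the column labels
    promoted.columns = [str(value) for value in df.iloc[0]]
    return promoted


# Hand-built stand-in for the DOCX output shown in the report
docx_like = pd.DataFrame(
    [
        ["Year", "Revenue", "Income", "Employees"],
        ["2014", "92.7", "12.0", "379,592"],
        ["2015", "81.7", "13.1", "377,757"],
        ["2016", "79.9", "11.8", "380,300"],
    ]
)

# Hand-built stand-in for the PDF output shown in the report
pdf_like = pd.DataFrame(
    [
        ["2014", "92.7", "12.0", "379,592"],
        ["2015", "81.7", "13.1", "377,757"],
        ["2016", "79.9", "11.8", "380,300"],
    ],
    columns=["Year", "Revenue", "Income", "Employees"],
)

fixed = promote_first_row_to_header(docx_like)
print(fixed)
print("Same as PDF-style frame:", fixed.equals(pdf_like))
```

In practice the helper would be applied to the result of table_item.export_to_dataframe() for DOCX inputs only. It papers over the symptom in the dataframe and does not change the underlying TableItem representation.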