Open vagenas opened 2 days ago
@vagenas I think this could be because of header identification (not 💯 sure, but this would be my first guess). I think that the DOCX backend does not do any header identification, while the PDF one does.
Indeed, at the moment col_header is explicitly set to False:
https://github.com/DS4SD/docling/blob/eb64f6d368c5a13179b527ef0d755682c63b9b21/docling/backend/msword_backend.py#L481
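To illustrate why a per-cell header flag could produce this difference, here is a minimal, self-contained sketch in plain Python. It is not Docling's actual implementation: Cell, make_grid and to_header_and_rows are hypothetical names invented for this example, and the only assumption is that the export treats leading rows whose cells are all flagged as column headers as the header, with everything else becoming a body row.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    """Hypothetical stand-in for a table cell carrying a header flag."""
    text: str
    column_header: bool = False


def make_grid(raw_rows, header_rows):
    """Build a grid of cells, flagging the first `header_rows` rows as headers."""
    return [
        [Cell(text, column_header=(i < header_rows)) for text in row]
        for i, row in enumerate(raw_rows)
    ]


def to_header_and_rows(grid):
    """Split a grid into (columns, body).

    Leading rows in which every cell is flagged as a column header become
    the column names; if there are none, columns fall back to 0..n-1 and
    every row, including the visual header, stays in the body.
    """
    n_header = 0
    for row in grid:
        if not row or not all(cell.column_header for cell in row):
            break
        n_header += 1

    width = len(grid[0]) if grid else 0
    if n_header:
        # Join stacked header rows column by column.
        columns = [
            ".".join(row[i].text for row in grid[:n_header])
            for i in range(width)
        ]
    else:
        columns = list(range(width))

    body = [[cell.text for cell in row] for row in grid[n_header:]]
    return columns, body


raw = [["Name", "Age"], ["Alice", "30"]]

# Header flag set on the first row (the behaviour described for PDF).
pdf_like = to_header_and_rows(make_grid(raw, header_rows=1))

# Header flag always False (the behaviour described for DOCX).
docx_like = to_header_and_rows(make_grid(raw, header_rows=0))

print(pdf_like)
print(docx_like)
```

With the flag set, the first row becomes the column names; with the flag always False, the same row stays in the body and the columns get default integer names, which matches the mismatch reported here.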
Bug
The table representation appears misaligned between PDF and DOCX (depending on which one needs alignment, perhaps further formats are affected too).
Steps to reproduce
The snippet below uses the attached minimal example docs table.pdf and table.docx. The PDF is exported to a dataframe with explicit column headers, while for the DOCX the column headers are in the first normal row.
If the table representation within TableItem were the same, the output of export_to_dataframe() would be the same too.
Docling version
Docling version: 2.6.0
Docling Core version: 2.4.0
Docling IBM Models version: 2.0.4
Docling Parse version: 2.0.4
Python version
Python 3.12.7