DS4SD / docling

Get your documents ready for gen AI
https://ds4sd.github.io/docling
MIT License
9.83k stars, 468 forks

Missing Text Inside Tables When Converting from DOCX to Markdown #291

Closed: VitoFe closed this issue 6 days ago

VitoFe commented 1 week ago

Bug

I've encountered an issue with Docling version 2.4.2 while converting a DOCX file to Markdown. It seems that the conversion process is missing text inside tables.

Command Used:

docling example.docx --from docx --to md --table-mode accurate

Steps to reproduce

  1. Use the attached example DOCX file.
  2. Run the above command to convert the DOCX file to Markdown.
  3. Open the resulting Markdown file and check the tables.

Expected Behavior: The text inside the tables should be accurately converted and included in the Markdown output.

Actual Behavior: The text inside the tables is missing in the Markdown output.

Attachments:

I appreciate your assistance in resolving this issue. Thank you!

Docling version

Docling version: 2.4.2
Docling Core version: 2.3.1
Docling IBM Models version: 2.0.3
Docling Parse version: 2.0.3
...

Python version

Python 3.12.7 ...

maxmnemonic commented 1 week ago

@VitoFe I checked the issue. The good news is that the DOCX backend parses these tables correctly and populates the Docling document with all of them, so if you export to JSON or YAML, the tables and their text are there. The bug is in the export to Markdown. I'm working on the fix.
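
One way to confirm the diagnosis above is to compare the two exports: collect the table cell text from the JSON export and see which strings never appear in the Markdown output. The sketch below is only an illustration. The helper `missing_table_text` is hypothetical, and it assumes the JSON export stores each table's cells under `tables[i].data.table_cells[j].text`; check that layout against a real export before relying on it. The sample data is hand-written, not actual Docling output.

```python
from typing import Any


def missing_table_text(doc: dict[str, Any], markdown: str) -> list[str]:
    """Return table cell texts found in the exported document but not in the Markdown."""
    missing: list[str] = []
    for table in doc.get("tables", []):
        # Assumed layout: each table keeps its cells under data.table_cells.
        for cell in table.get("data", {}).get("table_cells", []):
            text = cell.get("text", "").strip()
            # Skip empty cells and avoid reporting the same string twice.
            if text and text not in markdown and text not in missing:
                missing.append(text)
    return missing


# Hand-written stand-in for a JSON export; not real Docling output.
sample_doc = {
    "tables": [
        {"data": {"table_cells": [{"text": "Name"}, {"text": "Alice"}]}}
    ]
}
# Markdown in which the header survived but the body cell text was dropped.
sample_markdown = "| Name |   |\n|------|---|\n"

print(missing_table_text(sample_doc, sample_markdown))
```

To try this on the real files, the document dictionary could be loaded with `json.load` from a JSON export (for example, by running the original command with `--to json` instead of `--to md`) and the Markdown read as plain text. A substring check is deliberately crude: it can miss a dropped cell whose text also appears elsewhere in the document, so treat an empty result as a hint rather than proof.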

maxmnemonic commented 6 days ago

PR that addresses the issue: https://github.com/DS4SD/docling/pull/314

maxmnemonic commented 6 days ago

Fixed!