Issue with Extracting Tables with Merged Rows

MahmoudAtef999 commented 3 weeks ago

Hello,

I’m encountering an issue when extracting tables containing merged rows. Specifically, when a cell spans multiple rows, the expected behavior is to assign it a row_span value greater than 1. However, in many cases, the extraction process fails to identify the correct row_span value, often assigning a lower value than the actual span. This results in blank cells appearing in the subsequent rows rather than merging as intended.
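To make the symptom concrete, here is a minimal, library-independent sketch of how an under-detected row_span leaves blank cells. The `Cell` type and `render_grid` helper are hypothetical names invented for illustration and are not part of Docling.

```python
from typing import List, NamedTuple


class Cell(NamedTuple):
    """Hypothetical cell record; not a Docling type."""
    row: int       # index of the first row the cell occupies
    col: int       # column index
    row_span: int  # number of rows the cell covers
    text: str


def render_grid(cells: List[Cell], n_rows: int, n_cols: int) -> List[List[str]]:
    """Expand cells into a dense grid, repeating text across spanned rows."""
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for cell in cells:
        last_row = min(cell.row + cell.row_span, n_rows)
        for r in range(cell.row, last_row):
            grid[r][cell.col] = cell.text
    return grid


# A merged cell that should cover three rows.
expected = [Cell(0, 0, 3, "Group A"), Cell(0, 1, 1, "x"),
            Cell(1, 1, 1, "y"), Cell(2, 1, 1, "z")]
# The same table with row_span under-detected as 1.
detected = [Cell(0, 0, 1, "Group A"), Cell(0, 1, 1, "x"),
            Cell(1, 1, 1, "y"), Cell(2, 1, 1, "z")]

expected_grid = render_grid(expected, n_rows=3, n_cols=2)
detected_grid = render_grid(detected, n_rows=3, n_cols=2)
```

With the correct span, "Group A" fills all three rows of the first column. With the under-detected span, rows 1 and 2 of that column stay empty, which matches the blank cells described above.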

To address this, I tested with both do_cell_matching=True and do_cell_matching=False, and with both the DoclingParseDocumentBackend and the DoclingParseV2DocumentBackend. Unfortunately, none of these combinations yielded the correct row_span values or resolved the merging issue.

Attached are the following files for reference:

Sample PDF document with merged rows
Extracted output demonstrating the issue
Expected output showing the correct row_span values and row merges that Docling was unable to achieve

Attachments: sample.pdf, extraction_output.csv, expected_output.csv

Thank you very much for your efforts on this project.

DucHungGithub commented 2 weeks ago

me too

cau-git commented 2 weeks ago

@MahmoudAtef999 thanks, I can reproduce this issue and will investigate further. The expectation should be that row spans are detected correctly here.

As a side note, the source of truth is the DoclingDocument representation (or its JSON form), which you can obtain with the export_to_dict() method.
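One way to check the row spans directly in that dictionary, rather than in a CSV or HTML export, might look like the sketch below. It assumes the dictionary has a top-level "tables" list whose entries hold cells under data.table_cells, each with "row_span", "text", and "start_row_offset_idx" keys. That layout is an assumption here and should be verified against your Docling version. `list_row_spans` is a hypothetical helper, and `sample` is a hand-written stand-in for real output.

```python
from typing import Any, Dict, List


def list_row_spans(doc_dict: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Collect every table cell whose row_span is greater than 1.

    Assumed layout: doc_dict["tables"][i]["data"]["table_cells"][j].
    """
    spans = []
    for t_idx, table in enumerate(doc_dict.get("tables", [])):
        for cell in table.get("data", {}).get("table_cells", []):
            row_span = cell.get("row_span", 1)
            if row_span > 1:
                spans.append({
                    "table": t_idx,
                    "text": cell.get("text", ""),
                    "row_span": row_span,
                    "start_row": cell.get("start_row_offset_idx"),
                })
    return spans


# Hand-written stand-in for the dictionary; not real Docling output.
sample = {
    "tables": [
        {
            "data": {
                "table_cells": [
                    {"text": "Group A", "row_span": 3, "start_row_offset_idx": 0},
                    {"text": "x", "row_span": 1, "start_row_offset_idx": 0},
                    {"text": "y", "row_span": 1, "start_row_offset_idx": 1},
                ]
            }
        }
    ]
}

merged_cells = list_row_spans(sample)
```

With real output, `doc_dict` would be the result of calling export_to_dict() on the converted document. An empty list for a table that visibly contains merged rows would indicate that the spans were lost at detection time and not during CSV or HTML export.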

MahmoudAtef4499 commented 1 week ago

@cau-git Thanks for your response. I've used DoclingDocument to extract the tables and converted them to both CSV and HTML formats. I also tried converting the entire file to JSON. However, the issue persists in all of these cases. I would appreciate any further guidance, or pointers to troubleshooting steps I may have missed.

DS4SD / docling

Issue with Extracting Tables with Merged Rows #207