Open MahmoudAtef999 opened 3 weeks ago
me too
@MahmoudAtef999 thanks, I can reproduce this issue and will investigate further. The expectation should be that row spans are detected correctly here.
On a side note, the source of truth is the representation in DoclingDocument (or JSON), which you receive with the export_to_dict() method.
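For anyone following along, here is a minimal sketch of how the row spans could be read from the exported dictionary. It assumes each table's cells sit under tables, then data, then table_cells, and that each cell carries row_span, start_row_offset_idx, start_col_offset_idx and text keys; please check those names against your own export. The example dictionary is hand-written for illustration and is not real Docling output, and list_row_spans is a hypothetical helper name.

```python
from typing import Any


def list_row_spans(doc_dict: dict[str, Any]) -> list[tuple[int, int, int, int, str]]:
    """Return (table_index, start_row, start_col, row_span, text) for every table cell.

    Assumes the layout tables -> data -> table_cells in the exported dict.
    """
    found = []
    for table_index, table in enumerate(doc_dict.get("tables", [])):
        for cell in table.get("data", {}).get("table_cells", []):
            found.append(
                (
                    table_index,
                    cell.get("start_row_offset_idx"),
                    cell.get("start_col_offset_idx"),
                    cell.get("row_span", 1),  # treat a missing span as a single row
                    cell.get("text", ""),
                )
            )
    return found


# Hand-written stand-in for the result of export_to_dict() (assumed shape).
example = {
    "tables": [
        {
            "data": {
                "table_cells": [
                    {"text": "Group A", "row_span": 2, "col_span": 1,
                     "start_row_offset_idx": 0, "start_col_offset_idx": 0},
                    {"text": "x", "row_span": 1, "col_span": 1,
                     "start_row_offset_idx": 0, "start_col_offset_idx": 1},
                    {"text": "y", "row_span": 1, "col_span": 1,
                     "start_row_offset_idx": 1, "start_col_offset_idx": 1},
                ]
            }
        }
    ]
}

# Keep only the cells that the export reports as spanning more than one row.
merged = [entry for entry in list_row_spans(example) if entry[3] > 1]
```

Comparing the merged list against the spans visible in the PDF should show quickly which cells were assigned too small a row_span.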
@cau-git Thanks for your response. I've used DoclingDocument to extract the tables and converted them to both CSV and HTML formats. I also tried converting the entire file to JSON. However, the issue persists in both cases. I would appreciate any further guidance or steps I may have missed in troubleshooting.
Hello,
I’m encountering an issue when extracting tables containing merged rows. Specifically, when a cell spans multiple rows, the expected behavior is to assign it a row_span value greater than 1. However, in many cases, the extraction process fails to identify the correct row_span value, often assigning a lower value than the actual span. This results in blank cells appearing in the subsequent rows rather than merging as intended.

To address this, I tested with both do_cell_matching=True and do_cell_matching=False settings, and tried using both the DoclingParseDocumentBackend and DoclingParseV2DocumentBackend options. Unfortunately, neither approach yielded the correct row_span values or resolved the merging issue.

Attached are the following files for reference:
row_span values and row merges that Docling was unable to achieve

Attachments:
sample.pdf
extraction_output.csv
expected_output.csv
Thank you very much for your efforts on this project.
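For reference, the four configurations described in the report could be set up roughly as follows. This is a sketch, not the reporter's actual script: build_converter and COMBINATIONS are hypothetical names, and the module paths are assumed from the Docling releases that shipped both DoclingParseDocumentBackend and DoclingParseV2DocumentBackend, so they may differ in other versions.

```python
def build_converter(do_cell_matching: bool, use_v2_backend: bool):
    """Build a DocumentConverter for one combination of the settings in the report."""
    # Imports live inside the function so the sketch can be loaded without
    # docling installed; the module paths are assumptions (see the note above).
    from docling.backend.docling_parse_backend import DoclingParseDocumentBackend
    from docling.backend.docling_parse_v2_backend import DoclingParseV2DocumentBackend
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import PdfPipelineOptions
    from docling.document_converter import DocumentConverter, PdfFormatOption

    options = PdfPipelineOptions()
    options.do_table_structure = True
    options.table_structure_options.do_cell_matching = do_cell_matching

    backend = DoclingParseV2DocumentBackend if use_v2_backend else DoclingParseDocumentBackend
    return DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(pipeline_options=options, backend=backend)
        }
    )


# (do_cell_matching, use_v2_backend) for the four combinations that were tried.
COMBINATIONS = [(matching, v2) for matching in (True, False) for v2 in (False, True)]
```

Each converter can then be run on the attached file, for example build_converter(True, True).convert("sample.pdf").document.export_to_dict(), and the resulting row_span values compared across the four runs.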