camelot-dev / camelot

A Python library to extract tabular data from PDFs
https://camelot-py.readthedocs.io
MIT License

Bugfix - Stream._group_rows #375

Closed ollynowell closed 1 year ago

ollynowell commented 1 year ago

PDFMiner text objects should be sorted before the row-grouping algorithm runs; otherwise, items that belong on the same row will not be grouped together correctly.

In _generate_columns_and_rows this is done correctly the first time (line 328):

t_bbox["horizontal"].sort(key=lambda x: (-x.y0, x.x0))
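To make the sort key concrete, here is a minimal sketch of what `(-y0, x0)` does. It assumes the usual PDF convention that the origin is at the bottom-left of the page, so a larger `y0` means higher up. `Box` is a hypothetical stand-in for a PDFMiner text object, not a Camelot or PDFMiner type.

```python
from collections import namedtuple

# Hypothetical stand-in for a PDFMiner text object: only the fields the key uses.
Box = namedtuple("Box", ["text", "x0", "y0"])

boxes = [
    Box("r2c1", 10, 80),
    Box("r1c2", 50, 100),
    Box("r2c2", 50, 80),
    Box("r1c1", 10, 100),
]

# Negating y0 puts the highest boxes first (top to bottom);
# x0 then orders boxes within a row from left to right.
boxes.sort(key=lambda b: (-b.y0, b.x0))
reading_order = [b.text for b in boxes]
```

After the sort, boxes that share a `y0` are adjacent in the list, which is the property a single-pass row grouping relies on.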

But inner_text is never sorted after being extended with outer_text, which means that the _group_rows algorithm does not always work correctly. A motivating example follows.

Motivating Example

My table looks like this: [image]

Initial column detection identified just three columns, because those columns are longer, which pushed the mode to three: [image]

inner_text was then populated by this column: [image]

and subsequently extended with the outer_text from here: [image]

Because inner_text isn't sorted, _group_rows first finds rows of length 1 from the single inner_text column, and only then finds longer rows from the outer_text columns. The inner_text items are never grouped with the rest of their rows, so the inner_text column isn't added.
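The failure can be reproduced with a simplified single-pass row grouping. This is a sketch written for illustration, not Camelot's actual _group_rows: `TextBox`, `group_rows`, and the coordinates are all hypothetical, and the only assumption is that a new row starts whenever `y0` moves by more than a tolerance.

```python
from collections import namedtuple

# Hypothetical stand-in for a PDFMiner text object.
TextBox = namedtuple("TextBox", ["text", "x0", "y0"])


def group_rows(boxes, row_tol=2):
    """Start a new row whenever y0 differs from the current row's y0 by more than row_tol."""
    rows, current, row_y = [], [], None
    for box in boxes:
        if row_y is None or abs(box.y0 - row_y) > row_tol:
            if current:
                rows.append(sorted(current, key=lambda b: b.x0))
            current, row_y = [], box.y0
        current.append(box)
    if current:
        rows.append(sorted(current, key=lambda b: b.x0))
    return rows


# One column of inner text, extended with two columns of outer text.
inner_text = [TextBox("a1", 10, 100), TextBox("a2", 10, 80)]
outer_text = [
    TextBox("b1", 50, 100), TextBox("c1", 90, 100),
    TextBox("b2", 50, 80), TextBox("c2", 90, 80),
]
combined = inner_text + outer_text

# Without sorting, the inner column is split off into rows of length 1.
unsorted_lengths = [len(r) for r in group_rows(combined)]

# Sorting top-to-bottom, then left-to-right, makes same-row items adjacent.
sorted_rows = group_rows(sorted(combined, key=lambda b: (-b.y0, b.x0)))
sorted_texts = [[b.text for b in r] for r in sorted_rows]
```

In this sketch the unsorted input yields row lengths of 1, 1, 2, 2, while the sorted input yields two complete rows of three items each.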

The sort in this PR fixes the problem, and this seems to me a reasonable place to apply it, but I am not familiar with this codebase and have only debugged this one case.