PDFMiner text objects should be sorted before the row grouping algorithm runs; otherwise, items that belong on the same row will not be grouped together correctly.
In _generate_columns_and_rows this is done correctly the first time:
Line 328: t_bbox["horizontal"].sort(key=lambda x: (-x.y0, x.x0))
But inner_text is not sorted at any point after being extended with outer_text, which means that the _group_rows algorithm does not always work correctly. A motivating example follows.
Motivating Example
My table looks like this:
Initial column detection identified just three columns (because those columns are longer, they pushed the mode of the row lengths to three).
inner_text was then populated by this column:
and subsequently extended with the outer_text from here:
Because inner_text isn't sorted, _group_rows first finds rows of length 1 from the single inner_text column, and then finds longer rows from the outer_text columns.
As a result the inner_text column isn't added.
The sort in this PR fixes the problem, and this seems to me a reasonable place to apply it, but I am not familiar with this codebase and have only debugged this one case.
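To illustrate the failure mode outside the codebase, here is a minimal sketch. It is not the library's implementation: Box and group_rows are hypothetical stand-ins, and the sketch assumes only that row grouping walks the text objects in order and starts a new row whenever y0 changes by more than a tolerance. The sort key is the same one used on line 328.

```python
import math
from collections import namedtuple

# Hypothetical stand-in for a PDFMiner text object: only the fields used here.
Box = namedtuple("Box", ["text", "x0", "y0"])


def group_rows(boxes, row_tol=2):
    """Simplified, assumed model of row grouping: consecutive boxes whose y0
    is within row_tol of the row's first box are placed in the same row."""
    rows, current, row_y = [], [], None
    for box in boxes:
        if row_y is None or not math.isclose(row_y, box.y0, abs_tol=row_tol):
            # y0 changed: close the current row and start a new one.
            if current:
                rows.append(sorted(current, key=lambda b: b.x0))
            current = []
            row_y = box.y0
        current.append(box)
    if current:
        rows.append(sorted(current, key=lambda b: b.x0))
    return rows


# One column found first (inner_text), then two more columns appended (outer_text).
inner_text = [Box("a1", 10, 100), Box("a2", 10, 80)]
outer_text = [Box("b1", 50, 100), Box("c1", 90, 100),
              Box("b2", 50, 80), Box("c2", 90, 80)]
inner_text.extend(outer_text)

# Without sorting, the first column is split off into rows of length 1.
unsorted_rows = group_rows(inner_text)

# Sorting top-to-bottom, then left-to-right, restores the full rows.
inner_text.sort(key=lambda b: (-b.y0, b.x0))
sorted_rows = group_rows(inner_text)

print([len(r) for r in unsorted_rows])
print([len(r) for r in sorted_rows])
```

In this toy input the unsorted list yields two single-item rows followed by two two-item rows, while the sorted list yields two complete three-item rows, which mirrors the behaviour described above.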