Bugfix - Stream._group_rows

PDFMiner text objects should be sorted before the row grouping algorithm, otherwise items that belong on the same row will not be correctly grouped together.

In _generate_columns_and_rows this is done correctly the first time (line 328):

    t_bbox["horizontal"].sort(key=lambda x: (-x.y0, x.x0))

But inner_text is not sorted at any point after being extended with outer_text, which means that the _group_rows algorithm does not always work correctly. A motivating example follows.

Motivating Example

My table looks like this:

Initial column detection identified just three columns (those columns are longer than the others, which pushed the mode to three).

inner_text was then populated by this column:

and subsequently extended with the outer_text from here:

Because inner_text isn't sorted, _group_rows first finds rows of length 1 from the single inner_text column, and then finds longer rows from the outer_text column. As a result the inner_text column isn't added.
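The failure can be reproduced outside camelot with a minimal sketch. Everything here is an assumption made for illustration: `Text` is a hypothetical stand-in for a PDFMiner text object, and `group_rows` is a simplified grouping routine that starts a new row whenever y0 changes by more than a tolerance. It is not camelot's actual _group_rows, only something with the same sensitivity to input order.

```python
from collections import namedtuple

# Hypothetical stand-in for a PDFMiner text object: only the fields used here.
Text = namedtuple("Text", ["text", "x0", "y0"])


def group_rows(items, row_tol=2):
    """Simplified grouping (an assumption, not camelot's code): consecutive
    items whose y0 lies within row_tol of the row's first item share a row."""
    rows, current, row_y = [], [], None
    for item in items:
        if row_y is None or abs(item.y0 - row_y) > row_tol:
            if current:
                rows.append(sorted(current, key=lambda t: t.x0))
            current, row_y = [], item.y0
        current.append(item)
    if current:
        rows.append(sorted(current, key=lambda t: t.x0))
    return rows


# One "inner" column followed by three "outer" columns, as after an extend().
inner = [Text("a1", 50, 100), Text("a2", 50, 80)]
outer = [
    Text("b1", 10, 100), Text("c1", 90, 100), Text("d1", 130, 100),
    Text("b2", 10, 80), Text("c2", 90, 80), Text("d2", 130, 80),
]
combined = inner + outer

# Unsorted: the inner column yields rows of length 1, then the outer rows follow.
unsorted_lengths = [len(r) for r in group_rows(combined)]

# Sorted top-to-bottom, then left-to-right, using the same key as line 328.
by_position = sorted(combined, key=lambda t: (-t.y0, t.x0))
sorted_lengths = [len(r) for r in group_rows(by_position)]

print(unsorted_lengths)  # [1, 1, 3, 3]
print(sorted_lengths)    # [4, 4]
```

With the unsorted input the inner column never joins the rows it belongs to; after sorting, every row contains all four items.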

The sort in this PR fixes the problem, and this seems to me a reasonable place to apply it, but I am not familiar with this codebase. I've only debugged this one case.

camelot-dev / camelot

Bugfix - Stream._group_rows #375