Bugfix for Stream._group_rows

ollynowell commented 1 year ago

PDFMiner text objects should be sorted before the row grouping algorithm runs; otherwise, items that belong on the same row will not be grouped together correctly.

In _generate_columns_and_rows this is done correctly the first time, on line 328: t_bbox["horizontal"].sort(key=lambda x: (-x.y0, x.x0))

However, inner_text is not sorted at any point after being extended with outer_text, so the _group_rows algorithm does not always work correctly. A motivating example follows.
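To make the ordering issue concrete, here is a minimal sketch of what extending and then sorting does to the list. `Box` is a hypothetical stand-in for a PDFMiner text object (only `x0` and `y0` are modelled), and the coordinates are made up for illustration; the sort key is the one quoted from line 328.

```python
from collections import namedtuple

# Hypothetical stand-in for a PDFMiner text object; real objects carry more fields.
Box = namedtuple("Box", ["text", "x0", "y0"])

# Two text items per list, sharing the same two y positions (made-up coordinates).
inner_text = [Box("a1", 10, 700), Box("a2", 10, 680)]
outer_text = [Box("b1", 60, 700), Box("b2", 60, 680)]

# extend() simply appends, so items on the same row end up far apart in the list.
inner_text.extend(outer_text)
unsorted_order = [b.text for b in inner_text]

# Same key as the existing sort: top-to-bottom (descending y0), then left-to-right.
inner_text.sort(key=lambda b: (-b.y0, b.x0))
sorted_order = [b.text for b in inner_text]

print(unsorted_order)
print(sorted_order)
```

After the sort, items that share a y position sit next to each other, which is what a row grouping pass that walks the list in order would need.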

Motivating Example

My table looks like this:

Initial column detection identified just three columns (those columns are longer, which pushed the mode to three).

inner_text was then populated by this column:

and subsequently extended with the outer_text from here:

Because inner_text isn't sorted, _group_rows first finds rows of length 1 from the single inner_text column, and then finds longer rows from the outer_text columns. As a result, the inner_text column isn't added.
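The failure mode described above can be reproduced with a simplified grouping function. This is only a sketch under stated assumptions, not Camelot's actual `_group_rows`: it assumes the grouping walks the list in order and starts a new row whenever `y0` moves away from the previous item by more than a tolerance. `Box`, `group_rows`, the `row_tol` default, and the coordinates are all illustrative.

```python
from collections import namedtuple

# Hypothetical stand-in for a PDFMiner text object.
Box = namedtuple("Box", ["text", "x0", "y0"])


def group_rows(boxes, row_tol=2):
    """Simplified sketch: start a new row when y0 differs from the previous item's."""
    rows, current, prev_y = [], [], None
    for b in boxes:
        if prev_y is not None and abs(b.y0 - prev_y) > row_tol:
            rows.append(current)
            current = []
        current.append(b.text)
        prev_y = b.y0
    if current:
        rows.append(current)
    return rows


# One "inner" column, then two "outer" columns appended afterwards.
inner = [Box("a1", 10, 700), Box("a2", 10, 680)]
outer = [Box("b1", 60, 700), Box("c1", 110, 700),
         Box("b2", 60, 680), Box("c2", 110, 680)]
combined = inner + outer

unsorted_rows = group_rows(combined)
sorted_rows = group_rows(sorted(combined, key=lambda b: (-b.y0, b.x0)))

print(unsorted_rows)  # length-1 rows first, then the longer rows
print(sorted_rows)    # each row contains all three columns
```

In this sketch the unsorted input yields four rows, with the inner column split off into rows of length 1, while the sorted input yields two rows of three items each.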

The sort in this PR fixes the problem, and this seems to me a reasonable place to apply it, but I am not familiar with this codebase - I've only debugged this one case.

foarsitter commented 1 year ago

@ollynowell Thanks for your clear explanation. Can you rebase onto master?

MartinThoma commented 7 months ago

Hey!

As camelot is dead, we are trying to build a maintained fork at pypdf_table_extraction.

Do you want to open the PR against that branch so that we can merge your improvement?

ollynowell commented 6 months ago

Yes definitely - keen to finally get this fix!

Are there any steps I need to take to become a contributor to that project? I'm getting a permission denied error trying to push a branch to it.

camelot-dev / camelot

Bugfix for Stream._group_rows #374