Closed ollynowell closed 3 months ago
@ollynowell Thanks for your clear explanation. Can you rebase onto master?
Hey!
As camelot is dead, we are trying to build a maintained fork at pypdf_table_extraction.
Do you want to open the PR against that branch so that we can merge your improvement?
Yes definitely - keen to finally get this fix!
Are there any steps I need to take to become a contributor to that project? I'm getting a permission denied error trying to push a branch to it.
PDFMiner text objects should be sorted before the row grouping algorithm runs; otherwise items that belong on the same row will not be correctly grouped together.
In _generate_columns_and_rows this is done correctly the first time (line 328):
t_bbox["horizontal"].sort(key=lambda x: (-x.y0, x.x0))
But inner_text is not sorted at any point after being extended with outer_text, which means that the _group_rows algorithm does not always work correctly. A motivating example follows.
Motivating Example
My table looks like this:
Initial column detection identified just three columns (because those columns being longer pushed the mode to three).
inner_text was then populated by this column:
and subsequently extended with the outer_text from here:
Because inner_text isn't sorted, _group_rows first finds rows of length 1 from the one inner_text column, and then finds longer rows from the outer_text column. As a result, the inner_text column isn't added.
The sort in this PR fixes the problem, and this seems to me a reasonable place to apply it, but I am not familiar with this codebase: I've only debugged this one case.
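To make the failure mode concrete, here is a minimal sketch of what I believe is happening. It does not use camelot's code: Text is a hypothetical stand-in for a PDFMiner text object (only x0, y0 and the text), group_rows is a simplified sequential grouper that I am assuming behaves like _group_rows in the relevant respect (it starts a new row whenever y0 changes by more than a tolerance), and the coordinates are made up.

```python
from collections import namedtuple

# Hypothetical stand-in for a PDFMiner text object; not the real class.
Text = namedtuple("Text", ["x0", "y0", "text"])


def group_rows(items, row_tol=2):
    """Simplified sequential grouping (an assumption, not camelot's code):
    walk the items in order and start a new row whenever y0 jumps."""
    rows = []
    current = []
    row_y = None
    for t in items:
        if row_y is None or abs(t.y0 - row_y) > row_tol:
            if current:
                rows.append(sorted(current, key=lambda o: o.x0))
            current = []
            row_y = t.y0
        current.append(t)
    if current:
        rows.append(sorted(current, key=lambda o: o.x0))
    return rows


# One column found first (inner_text), then two more appended (outer_text).
inner_text = [Text(10, 100, "a1"), Text(10, 80, "a2")]
outer_text = [
    Text(50, 100, "b1"), Text(90, 100, "c1"),
    Text(50, 80, "b2"), Text(90, 80, "c2"),
]

combined = inner_text + outer_text

# Without sorting: the inner_text items form rows of length 1,
# and the outer_text items form separate, longer rows.
unsorted_lengths = [len(r) for r in group_rows(combined)]

# With the same sort key used at line 328: top-to-bottom, then left-to-right.
combined_sorted = sorted(combined, key=lambda t: (-t.y0, t.x0))
sorted_lengths = [len(r) for r in group_rows(combined_sorted)]

print(unsorted_lengths)  # [1, 1, 2, 2]
print(sorted_lengths)    # [3, 3]
```

In this toy case the unsorted input splits each real row in two, while sorting by (-y0, x0) first lets items from both lists land in the same row, which matches what I saw with my table.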