camelot-dev / camelot

A Python library to extract tabular data from PDFs
https://camelot-py.readthedocs.io
MIT License
2.95k stars, 465 forks

Difficulties with Multi-line headers. Rows shifted down. #470

Open poetaster opened 9 months ago

poetaster commented 9 months ago

Describe the bug

This PDF, https://poetaster.de/misc/118.pdf (which I'm not uploading here since it may be a copyright issue), is read well overall, but camelot shifts the rows under the multi-line "controllability" header down.

Steps to reproduce the bug

Load the above file and try both stream and lattice reading. I tried a lot of variations:

stream with different row tolerances: dfs = camelot.read_pdf('118.pdf', flavor='stream', row_tol=20, flag_size=True)

and lattice with many scale and shift variations.

dfs = camelot.read_pdf('118.pdf', flavor='lattice', shift_text=['r','t', 'r', 't'], line_scale=20)
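One way to make the "many variations" reproducible is to generate the parameter combinations up front and run them in a loop. This is only a sketch: the helper names `lattice_param_grid` and `sweep` and the particular values tried are assumptions for illustration, not something from the camelot docs. Only `camelot.read_pdf` and its `flavor`, `line_scale` and `shift_text` arguments come from the calls above.

```python
from itertools import product


def lattice_param_grid(
    line_scales=(15, 20, 40),
    shift_texts=(("l", "t"), ("r", "t"), ("l", "b")),
):
    """Build one keyword-argument dict for camelot.read_pdf per combination.

    The default values are arbitrary examples, not recommended settings.
    """
    return [
        {"flavor": "lattice", "line_scale": scale, "shift_text": list(shift)}
        for scale, shift in product(line_scales, shift_texts)
    ]


def sweep(pdf_path, grid):
    """Run camelot once per parameter set and keep the first table's DataFrame."""
    import camelot  # imported lazily so the grid helper works without camelot

    results = []
    for kwargs in grid:
        tables = camelot.read_pdf(pdf_path, **kwargs)
        if len(tables) > 0:
            results.append((kwargs, tables[0].df))
    return results
```

Calling `sweep('118.pdf', lattice_param_grid())` would then give a list of `(kwargs, DataFrame)` pairs that can be compared side by side.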

[screenshot: dataframe]

Lattice appears to get it right:

camelot.plot(dfs[0], kind='grid').show()

[screenshot: lattice]

Which seems correct. But it always shifts the rows in the controllability part.
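To put a number on the shift rather than eyeballing it, one option is to check, per column, where the first non-blank cell appears below the header. The sketch below works on plain lists of rows (for a camelot table, `tables[0].df.values.tolist()` should give that). The function names and the `header_rows` parameter are made up for this example, and it assumes blank cells come back as empty strings.

```python
def first_filled_row(rows, col):
    """Return the index of the first row whose cell in `col` is non-blank, or None."""
    for i, row in enumerate(rows):
        if str(row[col]).strip():
            return i
    return None


def column_offsets(rows, header_rows=0):
    """Map each column index to its first non-blank row below the header.

    Columns that start later than their neighbours are candidates for
    having been shifted down.
    """
    body = rows[header_rows:]
    n_cols = len(body[0]) if body else 0
    return {col: first_filled_row(body, col) for col in range(n_cols)}
```

If the "controllability" columns report an offset of 2 while the others report 0, that would match the shift described here.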

Expected behavior

Rows should not be shifted.

Code

Began with:

import camelot
dfs = camelot.read_pdf('118.pdf') 

And tried many variations, the most recent lattice one being: dfs = camelot.read_pdf('118.pdf', flavor='lattice', shift_text=['r','t', 'r', 't'], line_scale=20)

PDF See above.

Screenshots See above.

Environment

Additional context

wolfassi123 commented 4 months ago

@poetaster Did you manage to find a way to fix the issues with multi-row headers?

paulobarrera14 commented 2 months ago

@poetaster Pinging this again, were you able to find a fix for the multi-row headers?

poetaster commented 2 months ago

I haven't worked on this since then (I ended up reading Excel files directly for that project). I've looked at it again now, but thought I should probably update camelot first. What version would be best to test with?

poetaster commented 2 months ago

I wasn't sure if I had done the original on my PC or on my JupyterLab server. On this PC, camelot is at 0.9.0 and the results are the same.

[screenshot: camelot]

[screenshot: table]

poetaster commented 2 months ago

OK, updated to 0.11.0 and the result is the same. I'm not sure if it's just that I haven't understood how shift_text works, but even without it, camelot gets the grid correct, yet shifts the content of the "controllability" columns down by 2 rows.
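Until the cause is found, a post-processing workaround could be to move the affected columns back up after extraction. This is a minimal sketch, not a fix in camelot itself: `shift_columns_up` is a hypothetical helper, and the column indices, the shift of 2 and the number of header rows have to be read off the extracted table by hand.

```python
def shift_columns_up(rows, cols, n, start=0, fill=""):
    """Return a copy of `rows` with the cells in `cols` moved up by `n` rows.

    Only rows from index `start` onward are touched, so header rows can be
    left alone. Cells vacated at the bottom are set to `fill`.
    """
    out = [list(row) for row in rows]  # copy so the input is not modified
    body = out[start:]
    for col in cols:
        values = [row[col] for row in body]
        shifted = values[n:] + [fill] * min(n, len(values))
        for row, value in zip(body, shifted):
            row[col] = value
    return out
```

With a DataFrame the same idea is `df[col] = df[col].shift(-n)` in pandas, which fills the vacated cells with NaN instead of an empty string.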