atlanhq / camelot

Camelot: PDF Table Extraction for Humans
https://camelot-py.readthedocs.io

Some columns get merged #276

Closed alcacer0 closed 5 years ago

alcacer0 commented 5 years ago

Hello, very nice package.

I'm currently having an issue with some columns: their values get merged. The source PDF is this one

After running pages = camelot.read_pdf(OUTPUT_PDF_FILE, flavor='stream', pages='all'), the first row of the resulting table contains the following values (merged values: ENL. M-40 + L.P. BURGOS-ÁLAVA + 247,49):

['A-1', '12+00690', '336+01030', 'ENL. M-40\nL.P. BURGOS-ÁLAVA\n 247,49', '', '247,49', '', ''],

If I plot the detection results, I get the following (image: limits).

I think the problem is due to some long values in the "Inicio" and "Final" columns (as shown in the plot), which cause these columns to be merged in the first row (and in the others).

The workaround I'm using is splitting on the \n character, but this doesn't work on page 4, where I get the data in a different order for some reason:

['ML-204', '0+00000', 'INT. ML-105\nINT. ML-300\n1+00190\n 1,19', '', '', '', '1,19']

(1+00190 is the value of the second column, but now appears after INT. ML-300)
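
One way to make the \n workaround independent of the order of the pieces is to classify each piece by its shape rather than by its position. This is only a minimal sketch: the two patterns are assumptions inferred from the two sample rows above (digits+digits for a kilometre point, digits,digits for a length), the function name split_merged_cell is hypothetical, and other rows in the PDF may need different rules.

```python
import re

# Assumed patterns, inferred only from the two sample rows in this thread.
KM_POINT = re.compile(r"^\d+\+\d+$")  # e.g. 1+00190
LENGTH = re.compile(r"^\d+,\d+$")     # e.g. 247,49 or 1,19


def split_merged_cell(cell):
    """Split a merged cell on newlines and group the pieces by kind."""
    parts = {"km_points": [], "labels": [], "lengths": []}
    for piece in cell.split("\n"):
        piece = piece.strip()
        if not piece:
            continue
        if KM_POINT.match(piece):
            parts["km_points"].append(piece)
        elif LENGTH.match(piece):
            parts["lengths"].append(piece)
        else:
            # Anything that is neither a kilometre point nor a length
            # is treated as a free-text label.
            parts["labels"].append(piece)
    return parts


page1 = split_merged_cell("ENL. M-40\nL.P. BURGOS-ÁLAVA\n 247,49")
page4 = split_merged_cell("INT. ML-105\nINT. ML-300\n1+00190\n 1,19")
print(page1)
print(page4)
```

With this approach the misplaced 1+00190 on page 4 lands in km_points regardless of where it appears in the merged cell, so it can be written back to the second column.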

Any suggestions?

Thank you!

vinayak-mehta commented 5 years ago

Have you tried adding separators and splitting text along them? https://camelot-py.readthedocs.io/en/master/user/advanced.html#split-text-along-separators
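
For reference, a minimal sketch of what that suggestion could look like, using the columns and split_text options described in the linked section of the docs. The helper names columns_arg and read_with_separators are hypothetical, and the x-coordinates below are placeholders, not values measured from the PDF in this issue; the real ones would have to be read off a plot of the page's text.

```python
def columns_arg(x_coords):
    """Format separator x-coordinates as the list of one comma-separated
    string that camelot's `columns` option takes for a single table."""
    return [",".join(str(x) for x in sorted(x_coords))]


def read_with_separators(pdf_path, x_coords, pages="all"):
    """Run the stream parser with explicit column separators and split
    any text that spans a separator."""
    # Imported here so columns_arg can be used without camelot installed.
    import camelot

    return camelot.read_pdf(
        pdf_path,
        flavor="stream",
        pages=pages,
        columns=columns_arg(x_coords),
        split_text=True,
    )


# Placeholder coordinates, NOT measured from the PDF in this issue.
EXAMPLE_SEPARATORS = [72, 150, 230, 330, 430, 480, 520]
print(columns_arg(EXAMPLE_SEPARATORS))
```

If the column positions shift between pages, one set of separators may not fit every page, in which case calling read_with_separators once per page with its own coordinates is probably safer than pages="all".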

vinayak-mehta commented 5 years ago

Closed due to inactivity.

Satya23111985 commented 1 year ago

Hi, I have a similar issue. I tried the separator and split-text options, but they didn't work. Any help?