atlanhq / camelot

Camelot: PDF Table Extraction for Humans
https://camelot-py.readthedocs.io

Double line Break - Camelot switches characters around #445

Open shakirshakeelzargar opened 3 years ago

shakirshakeelzargar commented 3 years ago

I'm trying to parse tables in a PDF using Camelot. The cells have multiple lines of text in them, and some have an empty line separating portions of the text:

First line
Second line

Third line

I would expect this to be parsed as First line\nSecond line\n\nThird line (notice the double line breaks), but I get this instead: T\nFirst line\nSecond line\nhird line. The first character after a double-line-break moves to the beginning of the text, and I only get a single line-break instead.

I also tried using tabula, but that one messes up the entire table (the data frame, actually) when there is an empty row in the table, and for some words it also puts a space between the characters.

ashir3097 commented 3 years ago

For those looking for a solution, I have found a workaround that works very well. I have posted my solution here: https://stackoverflow.com/questions/64317363/camelot-switches-characters-around/64946264#64946264

@vinayak-mehta Any updates on this issue???

mssnglnk commented 2 years ago

Maybe you can solve this issue by reducing the char_margin parameter of LAParams (default 2.0).

You can set the parameters yourself with

camelot.read_pdf(DOCUMENT, pages="all", layout_kwargs={"char_margin": 0.5})

for example. Other parameters may need to be changed as well, but char_margin is the first one that comes to mind.
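
Since the right value probably depends on the document, here is a minimal sketch of trying several char_margin values in turn and keeping the first one that no longer shows the symptom from this issue. The helper names looks_swapped and find_char_margin, the candidate values, and the "lone character on the first line" check are assumptions made for illustration; they are not part of Camelot, and the check will misfire on cells that legitimately start with a one-character line.

```python
def looks_swapped(cell_text):
    """Heuristic (an assumption, not a Camelot feature): the symptom in this
    issue shows up as a single stray character on the first line of a cell."""
    lines = cell_text.split("\n")
    return len(lines) > 1 and len(lines[0].strip()) == 1


def find_char_margin(path, candidates=(2.0, 1.0, 0.5, 0.25), pages="all"):
    """Try each candidate char_margin and return (margin, tables) for the
    first one whose cells do not show the symptom, or (None, None)."""
    import camelot  # imported here so looks_swapped works without Camelot

    for margin in candidates:
        tables = camelot.read_pdf(
            path, pages=pages, layout_kwargs={"char_margin": margin}
        )
        # Flatten every cell of every extracted table into one stream of strings.
        cells = (str(value) for table in tables for value in table.df.to_numpy().ravel())
        if not any(looks_swapped(cell) for cell in cells):
            return margin, tables
    return None, None
```

It could be called as margin, tables = find_char_margin("document.pdf"), where the file name is a placeholder. If no candidate passes, other layout parameters may need adjusting as well.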