atlanhq / camelot

Camelot: PDF Table Extraction for Humans
https://camelot-py.readthedocs.io

Camelot python PDF table extraction - Bold text parsing issue #401

Open snehashimpi opened 4 years ago

snehashimpi commented 4 years ago

Hi, when extracting PDF tables with Camelot in Python, any bold text in a table appears multiple times in the JSON output. I can't figure out why this happens. Is there a parameter we can set so that the table is extracted without any text formatting?

anakin87 commented 4 years ago

Please post an example...

snehashimpi commented 4 years ago

For security reasons, I can't post an example. The problem is that bold text in a PDF table gets repeated. For example, if the text is 'Project test', after parsing it comes out as 'Project test test' or 'Project test\rtest', etc.

alessandra3265 commented 3 years ago

Same problem

thusithaC commented 3 years ago

Came across the same issue.

PDF: image

Extracted DF:

image

Code: tables = camelot.read_pdf('foo.pdf', pages='5', flavor='stream')

Environment:
Linux-5.4.0-47-generic-x86_64-with-glibc2.29
Python 3.8.2 (default, Jul 16 2020, 14:00:26) [GCC 9.3.0]
NumPy 1.18.3
OpenCV 4.4.0
Camelot 0.8.2

Please post an update if there is a workaround. Thanks.
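
No fix is confirmed in this thread. As a stopgap, the extracted cell text could be cleaned up after parsing. The sketch below assumes the duplicates always show up as an identical, adjacent, whitespace-separated token, as in the 'Project test test' and 'Project test\rtest' examples reported above. One plausible cause (an assumption, not verified here) is that some PDF producers simulate bold by drawing the same glyphs twice with a small offset. `collapse_repeats` is a hypothetical helper name, not part of Camelot.

```python
import re


def collapse_repeats(text: str) -> str:
    """Drop any token that is identical to the token immediately before it.

    Heuristic only: it also removes legitimate repeats such as "very very",
    and it normalizes every whitespace run (spaces, \r, \n) to one space.
    """
    # Split on any whitespace so 'test\rtest' is treated like 'test test'.
    tokens = re.split(r"\s+", text.strip())
    kept = []
    for token in tokens:
        if kept and token == kept[-1]:
            continue  # adjacent duplicate, assumed to be the bold-text artifact
        kept.append(token)
    return " ".join(kept)


if __name__ == "__main__":
    for raw in ("Project test test", "Project test\rtest", "Project test"):
        print(repr(raw), "->", repr(collapse_repeats(raw)))
```

Since each Camelot table exposes its cells as a pandas DataFrame via `.df`, the helper could be applied cell by cell, for example with `tables[0].df.applymap(collapse_repeats)` (or `DataFrame.map` on newer pandas versions). This only masks the symptom and will not help if the duplicated text is interleaved rather than adjacent.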