atlanhq / camelot

Camelot: PDF Table Extraction for Humans
https://camelot-py.readthedocs.io

Extracted tables give encoded text. #450

Open cesarPano opened 3 years ago

cesarPano commented 3 years ago

First of all, thanks for your lib. It helps me a lot in my everyday work.

I have a problem with a daily PDF report. Some days Camelot works properly and returns readable text. Other days it returns a correct table structure, but the text inside the cells is somehow 'encoded'.
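To tell the good days from the bad ones automatically, one option is a small heuristic run over the extracted cell text. The sketch below is only an assumption about what 'encoded' looks like: it flags text that consists largely of `(cid:N)` placeholders (the form pdfminer.six emits for glyphs it cannot map to Unicode) or of non-printable characters. The function name `looks_garbled` and the `threshold` value are hypothetical, and the check will miss garbling that produces printable but wrong characters.

```python
import re

# pdfminer.six writes "(cid:N)" when a glyph has no Unicode mapping.
CID_TOKEN = re.compile(r"\(cid:\d+\)")


def looks_garbled(text, threshold=0.3):
    """Return True if a large share of `text` is (cid:N) placeholders
    or non-printable characters. The threshold is an arbitrary guess."""
    stripped = text.strip()
    if not stripped:
        return False
    # Characters taken up by (cid:N) placeholders.
    cid_chars = sum(len(token) for token in CID_TOKEN.findall(stripped))
    # Non-printable, non-whitespace characters in what remains.
    remainder = CID_TOKEN.sub("", stripped)
    odd_chars = sum(
        1 for ch in remainder if not ch.isprintable() and not ch.isspace()
    )
    return (cid_chars + odd_chars) / len(stripped) >= threshold
```

With Camelot, the text to check could be built from each table's DataFrame, for example `" ".join(table.df.astype(str).values.ravel())` for every `table` returned by `camelot.read_pdf(path)`.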

I have noticed that the metadata of the files is very similar, but the result depends on the producer line:

         pdf:Producer='GPL Ghostscript 9.05'/>         <--- Good extraction!
         pdf:Producer='GPL Ghostscript 9.07'/>         <--- Bad extraction!


I have checked this out with about 30 or 40 different PDFs.
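For anyone who wants to repeat this check on their own files, here is a rough sketch that reads the producer string from each PDF and groups the files by it. It is an assumption-heavy shortcut, not a real parser: it scans the raw bytes for an XMP `pdf:Producer` entry or a `/Producer (...)` entry in the Info dictionary, so it will fail on compressed or encrypted metadata and on strings containing escaped parentheses. The names `find_producer` and `group_by_producer` are made up for this example; a proper PDF library would be more robust.

```python
import re
from collections import defaultdict

# Patterns tried in order; each captures the producer string as group "v".
PRODUCER_PATTERNS = [
    # XMP element form: <pdf:Producer>...</pdf:Producer>
    re.compile(rb"<pdf:Producer>(?P<v>.*?)</pdf:Producer>", re.S),
    # XMP attribute form: pdf:Producer='...' or pdf:Producer="..."
    re.compile(rb"pdf:Producer\s*=\s*(?P<q>['\"])(?P<v>.*?)(?P=q)", re.S),
    # Info dictionary literal string: /Producer (...)
    re.compile(rb"/Producer\s*\((?P<v>.*?)\)", re.S),
]


def find_producer(pdf_bytes):
    """Return the first producer string found in the raw bytes, or None."""
    for pattern in PRODUCER_PATTERNS:
        match = pattern.search(pdf_bytes)
        if match:
            return match.group("v").decode("latin-1").strip()
    return None


def group_by_producer(paths):
    """Map each producer string (or None) to the list of files that carry it."""
    groups = defaultdict(list)
    for path in paths:
        with open(path, "rb") as fh:
            groups[find_producer(fh.read())].append(path)
    return dict(groups)
```

Calling `group_by_producer` on a folder of reports should show at a glance whether the badly extracted files all share one producer version.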

Do you have any idea how to solve the problem?

I use Python 3.8.3 and:

         beautifulsoup4==4.9.1
         boto3==1.14.60
         botocore==1.17.60
         bs4==0.0.1
         camelot-py==0.8.2
         certifi==2020.6.20
         cffi==1.14.2
         chardet==3.0.4
         click==7.1.2
         colorama==0.4.3
         configparser==5.0.0
         crayons==0.4.0
         cryptography==3.1
         distro==1.5.0
         docutils==0.15.2
         et-xmlfile==1.0.1
         ghostscript==0.6
         idna==2.10
         jdcal==1.4.1
         jmespath==0.10.0
         lxml==4.5.2
         numpy==1.19.2
         opencv-python==4.4.0.42
         openpyxl==3.0.5
         pandas==1.1.2
         pdfminer.six==20200726
         psycopg2==2.8.6
         pycparser==2.20
         PyPDF2==1.26.0
         python-dateutil==2.8.1
         pytz==2020.1
         requests==2.24.0
         s3transfer==0.3.3
         selenium==3.141.0
         six==1.15.0
         sortedcontainers==2.2.2
         soupsieve==2.0.1
         urllib3==1.25.10
         webdriver-manager==3.2.2
         xlrd==1.2.0
         XlsxWriter==1.3.3

jasantos1976 commented 2 years ago

Did you find a solution? I have the same problem and haven't been able to fix it.