camelot-dev / camelot

A Python library to extract tabular data from PDFs
https://camelot-py.readthedocs.io
MIT License

Cannot find table in pdf #144

Open ayushnarsaria opened 4 years ago

ayushnarsaria commented 4 years ago

Hello,

I am using the Python library camelot, but it is unable to find tables in any of my PDFs. It always reports that no tables were found, even though the PDFs have tables embedded in them.

Please find attached an example PDF. I hope this problem can be resolved soon.

Thank you

Best,
Ayush

jcc.26188.pdf

abhilashabhardwaj commented 4 years ago

Hi, have you tried using the stream flavor?

page_4 = camelot.read_pdf('jcc.26188.pdf', flavor='stream', pages='4')

page_4 gives two tables; the one you want to extract is the first.

array([['T A B L E 1', '', '', '', '', '', ''],
       ['', 'Statistics and error analysis of TDDFT functionals compared to experimental λmax values for the lowest dipole-allowed vertical', '', '', '', '', ''],
       ['excitation energy (Evert-abso(DCM),', 'in eV)', 'in dichloromethane calculated using the COSMO solvation modela', '', '', '', ''],
       ['', 'GGA', '', 'GH', '', 'RSH', ''],
       ['Statistical parameters', 'OLYP', 'BLYP', 'B3LYP', 'PBE0', 'LCY-BLYP', 'CAMY-B3LYP'],
       ['R2', '0.86', '0.86', '0.90', '0.92', '0.98', '0.96'],
       ['MD', '−0.39', '−0.42', '−0.09', '0.00', '0.52', '0.16'],
       ['MAD', '0.39', '0.42', '0.09', '0.07', '0.52', '0.16'],
       ['MAX(+)b (eV)', '-', '-', '0.03', '0.13', '0.61', '0.24'],
       ['MAX(−)b (eV)', '−0.72', '−0.74', '−0.36', '−0.25', '-', '−0.01']],
      dtype=object)
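The array above still has the caption fused into its first rows, and the negative values use the Unicode minus sign (U+2212), which `float()` rejects. Here is a rough sketch of one way to tidy it using only the standard library. The helper names `to_number` and `tidy` are made up for illustration and are not part of camelot, and only a few rows from the array are repeated to keep it short.

```python
# A few rows copied from the stream output above.
rows = [
    ['T A B L E 1', '', '', '', '', '', ''],
    ['', 'GGA', '', 'GH', '', 'RSH', ''],
    ['Statistical parameters', 'OLYP', 'BLYP', 'B3LYP', 'PBE0', 'LCY-BLYP', 'CAMY-B3LYP'],
    ['R2', '0.86', '0.86', '0.90', '0.92', '0.98', '0.96'],
    ['MD', '−0.39', '−0.42', '−0.09', '0.00', '0.52', '0.16'],
    ['MAX(+)b (eV)', '-', '-', '0.03', '0.13', '0.61', '0.24'],
]


def to_number(cell):
    """Convert a cell to float; return None for placeholders such as '-'."""
    # Replace the Unicode minus sign (U+2212) with an ASCII hyphen-minus.
    text = cell.replace('\u2212', '-').strip()
    try:
        return float(text)
    except ValueError:
        return None


def tidy(rows, header_label='Statistical parameters'):
    """Drop the rows above the header and key each data row by its first cell."""
    start = next(i for i, row in enumerate(rows) if row[0] == header_label)
    header = rows[start][1:]
    return {
        row[0]: dict(zip(header, (to_number(cell) for cell in row[1:])))
        for row in rows[start + 1:]
    }


table = tidy(rows)
print(table['MD']['OLYP'])
```

Matching on the 'Statistical parameters' label is an assumption that fits this particular table; another PDF would need a different way to locate the header row.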

I understand this could be improved by using config parameters. I hope you get time to try that; I'll update this thread when I get time to look into it further.
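In case it helps, here is a minimal sketch of trying both flavors from code. It assumes the original call used the default lattice flavor, which relies on ruling lines and may find nothing in a table like this one. The wrapper `read_tables` and its `read_pdf` argument are hypothetical (the argument only exists so the function can be exercised without a PDF). Any extra keyword arguments are passed straight to the stream call, which is where config parameters would go. I believe recent camelot releases accept `row_tol` and `strip_text` there, but please check the documentation for your version.

```python
def read_tables(path, pages='1', read_pdf=None, **stream_kwargs):
    """Try the lattice flavor first; fall back to stream if it finds no tables.

    read_pdf is a hypothetical injection point for testing. When it is None,
    camelot.read_pdf is used.
    """
    if read_pdf is None:
        import camelot  # imported lazily so the sketch loads without camelot installed
        read_pdf = camelot.read_pdf

    tables = read_pdf(path, flavor='lattice', pages=pages)
    if len(tables) > 0:
        return 'lattice', tables

    # No ruled tables detected: retry with whitespace-based detection.
    return 'stream', read_pdf(path, flavor='stream', pages=pages, **stream_kwargs)


# Example call (needs camelot and the attached PDF):
# flavor, tables = read_tables('jcc.26188.pdf', pages='4')
# print(flavor, len(tables))
```

Returning the flavor alongside the tables makes it easy to see which detection method actually produced the result.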

Here is the parsing report: {'accuracy': 96.53, 'whitespace': 28.57, 'order': 1, 'page': 4}
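A report like this can also be used as a quick sanity check before trusting a table. The sketch below is only an illustration: the function name `looks_reliable` and both thresholds are arbitrary assumptions, not camelot defaults. As far as I understand, `accuracy` reflects how well text was assigned to cells and `whitespace` is the percentage of empty cells.

```python
def looks_reliable(report, min_accuracy=90.0, max_whitespace=50.0):
    """Return True when a parsing report passes two rough quality thresholds.

    Both thresholds are arbitrary illustrative values, not camelot defaults.
    """
    return (report['accuracy'] >= min_accuracy
            and report['whitespace'] <= max_whitespace)


# The report quoted above.
report = {'accuracy': 96.53, 'whitespace': 28.57, 'order': 1, 'page': 4}
print(looks_reliable(report))
```

With camelot installed, the same dictionary should be available as `parsing_report` on each table in the returned list.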