atlanhq / camelot

Camelot: PDF Table Extraction for Humans
https://camelot-py.readthedocs.io

Can't extract all the tables on each page #349

Closed cqluohong closed 5 years ago

cqluohong commented 5 years ago

I have some PDFs with two tables on one page, but I cannot extract the small one. Can the extraction accuracy be adjusted so that small tables are not discarded?

anakin87 commented 5 years ago

Please attach the PDF...

cqluohong commented 5 years ago

430027-北科光大-2017年年度报告.pdf: the tables are on page 55 of this PDF.

cqluohong commented 5 years ago

I have another question: how should I handle pdfminer.psparser.PSSyntaxError? I looked at #161, which says the PDF needs to be repaired with mutool. But Camelot uses pdfminer, the same as pdfplumber, and pdfplumber worked on this file.

anakin87 commented 5 years ago

To extract the tables from the file you provided, you have to set the parameter line_scale=80.

See https://camelot-py.readthedocs.io/en/master/user/advanced.html#detect-short-lines
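
A minimal sketch of this in Python, assuming Camelot is installed and the Lattice flavor is used (line_scale only applies to Lattice). Per the docs section linked above, the smallest line that gets detected is roughly the page dimension divided by line_scale (default 15), so a larger value lets shorter lines, and with them smaller tables, get picked up. The helper names smallest_detected_line and extract_tables are hypothetical, and the exact rounding Camelot uses internally is an assumption here.

```python
def smallest_detected_line(page_dimension, line_scale=15):
    """Rough length of the shortest line that survives detection.

    Assumption: Camelot divides the page dimension by line_scale;
    integer division is used here only for illustration.
    """
    return page_dimension // line_scale


def extract_tables(pdf_path, page, line_scale=80):
    """Run Lattice extraction on one page with a custom line_scale."""
    # Imported lazily so the helper above can be used without Camelot.
    import camelot

    return camelot.read_pdf(
        pdf_path,
        pages=str(page),      # Camelot expects pages as a string
        flavor="lattice",     # line_scale is a Lattice-only option
        line_scale=line_scale,
    )


# Example usage (requires Camelot and the PDF attached to this issue):
# tables = extract_tables("430027-北科光大-2017年年度报告.pdf", page=55)
# print(tables.n)        # number of tables found
# print(tables[0].df)    # first table as a pandas DataFrame
```

Very large line_scale values can cause text to be detected as lines, so it is usually better to raise it only as far as needed.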

cqluohong commented 5 years ago

Thank you, you helped me a lot.

vinayak-mehta commented 5 years ago

Looks like that solved the issue, closing this. Thanks @anakin87.

nachiket8188 commented 4 years ago

Hi @anakin87, I faced a similar issue too, and your solution helped. Thanks. However, it would be great if you could help me understand the behavior of the line_scale parameter. As far as I can tell, the border lines of the different tables in the source PDF all have the same thickness. So why is Camelot able to identify certain tables but not others (especially those with fewer than 4 rows)? Thanks in advance. :)