atlanhq / camelot

Camelot: PDF Table Extraction for Humans
https://camelot-py.readthedocs.io

Camelot unable to detect last column #268

Closed asmiy closed 5 years ago

asmiy commented 5 years ago

For this PDF, 329.pdf, I'm trying to extract the table using flavor='stream' and specifying both the table area and the column coordinates, but Camelot doesn't detect the whole table. I tested it with Excalibur and it worked, so why can't Camelot detect it?

Could you help?

vinayak-mehta commented 5 years ago

@asmiy Can you also post the code that you used where you specified table areas and columns?

asmiy commented 5 years ago

@vinayak-mehta

camelot.read_pdf("329.pdf", flavor='stream', table_areas=['71,716,552,376'], columns=['222.5,293,373.7,427.7,483.1'])

anakin87 commented 5 years ago

Try with: tables=camelot.read_pdf("329.pdf",flavor='stream',table_areas=['71,730,580,400'])

asmiy commented 5 years ago

@anakin87 It's working! Question: how do you get the coordinates? In my case, I have the top-left point and the size of the table (taking the top-left point of the PDF as (0,0), with units in points), which I convert to get the coordinates required by Camelot.

vinayak-mehta commented 5 years ago

@asmiy You can get the required top-left and bottom-right table area coordinates by plotting the text.

Camelot and pdfminer treat the bottom-left point as origin.
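Since the box described above is measured from the top-left corner, the y values need to be flipped before they are passed to table_areas. A minimal sketch of that conversion, assuming the page height in points is known; the helper name to_camelot_area and the example numbers (a 792 pt page, a box at left=71, top=62, 509 wide, 330 tall) are hypothetical and chosen only so the result matches the area suggested earlier in this thread, not measured from 329.pdf:

```python
def to_camelot_area(left, top, width, height, page_height):
    """Convert a top-left-origin box (in points) to an 'x1,y1,x2,y2' string.

    (x1, y1) is the top-left corner and (x2, y2) the bottom-right corner,
    both measured from the bottom-left corner of the page.
    """
    x1 = left
    y1 = page_height - top               # top edge, measured from the bottom
    x2 = left + width
    y2 = page_height - (top + height)    # bottom edge, measured from the bottom
    # ':g' drops trailing zeros; it also rounds to 6 significant digits
    return f"{x1:g},{y1:g},{x2:g},{y2:g}"


# Hypothetical values: page height and box are assumptions for illustration
area = to_camelot_area(left=71, top=62, width=509, height=330, page_height=792)
print(area)
```

The resulting string can then be passed as table_areas=[area] to camelot.read_pdf. If the last column is still cut off, the likely cause is that x2 falls to the left of that column's text, which is what widening the area from 552 to 580 addressed here.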

Thanks for the fix @anakin87! @asmiy If that solved your problem, please close this issue.

asmiy commented 5 years ago

Thanks!