atlanhq / camelot

Camelot: PDF Table Extraction for Humans
https://camelot-py.readthedocs.io

camelot with flavor stream detects a lot of text as tables. #259

Closed skywalker087 closed 5 years ago

skywalker087 commented 5 years ago

I was using Camelot 0.7.1 with Python 3.6 on the PDF attached to this issue. The result contained a lot of tables, but the majority of them consisted only of plain text from the PDF. I hope I am not doing anything wrong; if I am, please correct me.

file.pdf

vinayak-mehta commented 5 years ago

@skywalker087 I don't understand your issue. Can you point to the page numbers you're trying to extract tables from and also post the code that you used to do that? Please refer to the contributor's guide for best practices around filing issues. [1]

[1] https://camelot-py.readthedocs.io/en/master/dev/contributing.html#filing-issues

skywalker087 commented 5 years ago

The table I am worried about comes from page 2, and I used the following code:

import platform; print(platform.platform())
import sys; print('Python', sys.version)
import numpy; print('NumPy', numpy.__version__)
import cv2; print('OpenCV', cv2.__version__)
import camelot; print('Camelot', camelot.__version__)

# Run the stream parser over every page of the attached PDF
tables = camelot.read_pdf('./file.pdf', pages='all', flavor='stream', split_text=True)
# Inspect the second detected table as a DataFrame
tables[1].df

Output:

output

skywalker087 commented 5 years ago

I hope I haven't missed any information here. Do you need any more information on this issue?

vinayak-mehta commented 5 years ago

Stream will try to find tables from whatever text is used as input. Since page 2 doesn't contain any tables, you can do pages='1,3' instead of pages='all'.
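A minimal sketch of that suggestion, assuming the attached PDF has three pages and only page 2 is free of tables. The helpers `pages_arg` and `read_tables` are hypothetical names made up for illustration and are not part of Camelot; only `camelot.read_pdf` and its `pages`, `flavor` and `split_text` arguments come from the code posted above.

```python
def pages_arg(total_pages, skip):
    """Build a Camelot-style pages string such as '1,3', leaving out skipped pages."""
    return ",".join(
        str(page) for page in range(1, total_pages + 1) if page not in skip
    )


def read_tables(path, pages):
    """Run the stream parser only on the requested pages (hypothetical wrapper)."""
    # Imported here so pages_arg can be used without Camelot installed.
    import camelot

    return camelot.read_pdf(path, pages=pages, flavor="stream", split_text=True)


# Assumed usage for the PDF in this issue (3 pages, no table on page 2):
# tables = read_tables("./file.pdf", pages_arg(3, skip={2}))
# tables[0].df
```

Restricting `pages` this way keeps Stream from ever seeing the text-only page, so it has nothing there to mistake for a table.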