atlanhq / camelot

Camelot: PDF Table Extraction for Humans
https://camelot-py.readthedocs.io

Unable to detect first table in attached pdf #308

Closed sharvi8 closed 5 years ago

sharvi8 commented 5 years ago

IR-A-C-24004.pdf

I have tried the following options: process_background=True, line_space (with several different values), and flavor='stream'.

I'm not sure what else to do.

csrinivascreator commented 5 years ago

tables_stream = camelot.read_pdf(pdf, flavor='stream', pages='all', table_areas=['0,842,595,0'])

I used this approach as an alternative: passing the whole page as the table area makes stream read the complete page and return every table it finds. The output looks good for an A4-size PDF (595 x 842 points), so it can be used as a workaround.

Please let me know what you think.
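For pages that are not A4, the hard-coded '0,842,595,0' would need to change. A minimal sketch of building that string from a page size follows; `full_page_area` and `read_whole_page` are hypothetical helper names, not part of camelot, and the page dimensions are assumed to be known in PDF points.

```python
def full_page_area(width_pt: float = 595, height_pt: float = 842) -> str:
    """Return an 'x1,y1,x2,y2' string covering the whole page.

    table_areas uses PDF coordinates: (x1, y1) is the top-left corner and
    (x2, y2) the bottom-right, with the origin at the bottom-left of the page.
    The defaults are the A4 size in points used in the comment above.
    """
    return f"0,{height_pt:g},{width_pt:g},0"


def read_whole_page(pdf_path: str, pages: str = "all", **page_size):
    """Hypothetical wrapper: run stream over the full page area."""
    import camelot  # third-party; imported here so the helper above has no dependencies

    area = full_page_area(**page_size)
    return camelot.read_pdf(pdf_path, flavor="stream", pages=pages, table_areas=[area])
```

For example, `full_page_area(612, 792)` gives the equivalent area for a US Letter page.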

sharvi8 commented 5 years ago

Thanks, this does help to a certain extent. But I have 600 PDFs, and I have written code that extracts the first two tables of each PDF and keeps appending them until I have two large tables covering all 600 PDFs.

If I only had to parse this one PDF, your approach would help, but I don't think it will work in an automated loop. Do you know if there is a way for lattice to recognize this table? The only reason it is not recognized in some of the PDFs is that it has just one row. In other PDFs the same table is recognized because it has many rows. So I am assuming the detection fails because of the table's size. Shouldn't the line space option be able to help?
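The batch loop described above could look roughly like the sketch below. It is only an illustration of the idea, not the poster's actual code: `collect_first_tables` is a hypothetical name, and the reader is passed in as an argument (normally `camelot.read_pdf`) so that files yielding fewer tables than expected are set aside instead of silently shifting the second table into the first slot.

```python
def collect_first_tables(pdf_paths, read_pdf, n=2, **read_kwargs):
    """Gather the first n tables of every PDF into n separate lists.

    read_pdf is expected to behave like camelot.read_pdf: it returns a
    sequence of tables, each exposing a pandas DataFrame as `.df`.
    Files with fewer than n detected tables are reported, not appended.
    """
    buckets = [[] for _ in range(n)]
    incomplete = []
    for path in pdf_paths:
        tables = read_pdf(str(path), **read_kwargs)
        if len(tables) < n:
            incomplete.append(str(path))
            continue
        for i in range(n):
            buckets[i].append(tables[i].df)
    return buckets, incomplete


# Possible usage (assumes camelot and pandas are installed):
#   import camelot, pandas as pd
#   from pathlib import Path
#   buckets, incomplete = collect_first_tables(
#       sorted(Path("pdfs").glob("*.pdf")), camelot.read_pdf, flavor="lattice")
#   big_tables = [pd.concat(frames, ignore_index=True) for frames in buckets]
```

The `incomplete` list then shows exactly which PDFs need different settings, such as the one-row table discussed here.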

vinayak-mehta commented 5 years ago

@sharvi8 I believe you're looking for line_scale and not line_space. There is no keyword argument called line_space inside the library. You can check out this section of the advanced docs for more information.
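As a rough illustration of the line_scale suggestion: the default is 15, and larger values let lattice pick up shorter lines, which may be what a one-row table needs. The sketch below tries a few values in turn; `find_line_scale` and the candidate values are assumptions for illustration, and the reader is passed in (normally `camelot.read_pdf`). The docs caution that very large values can cause text to be detected as lines, so the candidates are kept modest.

```python
def find_line_scale(pdf_path, read_pdf, expected_tables, scales=(15, 40, 60, 80), **read_kwargs):
    """Return the first line_scale at which lattice finds enough tables.

    read_pdf should behave like camelot.read_pdf. Returns (scale, tables),
    or (None, None) if no candidate value reaches expected_tables.
    """
    for scale in scales:
        tables = read_pdf(pdf_path, flavor="lattice", line_scale=scale, **read_kwargs)
        if len(tables) >= expected_tables:
            return scale, tables
    return None, None


# Possible usage (assumes camelot is installed):
#   import camelot
#   scale, tables = find_line_scale("IR-A-C-24004.pdf", camelot.read_pdf,
#                                   expected_tables=2, pages="1")
```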

vinayak-mehta commented 5 years ago

Sorry for the late reply here, I'm closing this. Please reopen if you still face the same issue.

bettychou1993 commented 5 years ago

Hi, I'm facing the same issue. I tried tables_stream = camelot.read_pdf(pdf, flavor='stream', pages='all', table_areas=['0,842,595,0']), but camelot then extracts tables from every page, not just the first one. Is there a way to detect tables page by page? Also, sorry if I missed something here, but how do I find the table areas? Thanks.

vinayak-mehta commented 5 years ago

@bettychou1993 You can do some visual debugging to find the table area and column separator coordinates.
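A minimal sketch of what that could look like, restricted to one page at a time. The pages argument takes a string such as '1' or '1,3', so passing pages='1' instead of 'all' limits extraction to the first page. `save_debug_plot` and the output filename are hypothetical; `camelot.plot` needs matplotlib installed, and the list of plot kinds may differ between camelot versions.

```python
# Plot kinds assumed to be accepted by camelot.plot (may vary by version).
PLOT_KINDS = ("text", "grid", "contour", "line", "joint", "textedge")


def save_debug_plot(pdf_path, page="1", kind="text", out_path="debug.png"):
    """Save a debugging plot for the first table found on a single page.

    With kind='text', the plot shows the text on the page with PDF
    coordinates on the axes, which can be read off to build table_areas
    ('x1,y1,x2,y2') and columns values.
    """
    if kind not in PLOT_KINDS:
        raise ValueError(f"unknown plot kind: {kind!r}")

    import camelot  # third-party; imported late so validation works without it

    tables = camelot.read_pdf(pdf_path, flavor="stream", pages=page)
    if len(tables) == 0:
        raise RuntimeError(f"no tables detected on page {page}")

    fig = camelot.plot(tables[0], kind=kind)  # returns a matplotlib figure
    fig.savefig(out_path)
    return out_path
```

Once the coordinates are known, they can be passed back as table_areas for that page only, instead of using the whole-page area.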