Closed: sharvi8 closed this issue 5 years ago
tables_stream = camelot.read_pdf(pdf, flavor='stream', pages='all', table_areas=['0,842,595,0'])
I used this approach, and it worked as an alternative: it makes stream read the complete page and return all the tables as output. The output looks good for an A4-size PDF, so it can be used as a workaround.
Please let me know what you suggest.
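A minimal sketch of the workaround above, assuming the page size is known in PDF points (A4 is roughly 595 x 842). The names full_page_area and read_full_page are hypothetical helpers, not part of camelot. The string follows camelot's table_areas convention of "x1,y1,x2,y2", with the top-left corner first and the bottom-right corner second.

```python
def full_page_area(width, height):
    """Build a table_areas entry that covers the whole page.

    camelot expects "x1,y1,x2,y2": top-left corner first, then bottom-right,
    in PDF points with the origin at the bottom-left of the page.
    """
    return f"0,{height},{width},0"


def read_full_page(pdf_path, width=595, height=842, pages="all"):
    """Run the stream parser over the entire page area (hypothetical helper)."""
    import camelot  # imported lazily so full_page_area works without camelot

    return camelot.read_pdf(
        pdf_path,
        flavor="stream",
        pages=pages,
        table_areas=[full_page_area(width, height)],
    )
```

For pages that are not A4, the width and height would have to be supplied by the caller; this sketch does not read them from the PDF.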
Thanks, this does help to a certain extent. But I have 600 PDFs, and I have written code to extract the first two tables of each PDF and keep appending them until I have two large tables covering all 600 PDFs.
If I just had to parse this one PDF, your approach would help. But if I want to write an automated loop, I suppose it will not. Do you know if there is a way within lattice to recognize this table? The only reason it is not being recognized in some of the PDFs is that it has only one row. In other PDFs this table is recognized because the number of rows is large. So I am assuming the detection fails because of the table's size. Shouldn't the line space option be able to help?
@sharvi8 I believe you're looking for line_scale and not line_space. There is no keyword argument called line_space inside the library. You can check out this section of the advanced docs for more information.
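As a rough illustration of the line_scale suggestion, the sketch below retries the lattice parser with increasing line_scale values until at least one table is found. The helper name read_with_line_scale, the candidate values, and the injectable reader argument are assumptions made for this example; only line_scale itself (default 15) and flavor='lattice' come from camelot. Larger values let lattice pick up shorter lines, but very large values may cause text to be detected as lines, so the results should be checked.

```python
def read_with_line_scale(pdf_path, scales=(15, 40, 60), pages="1", reader=None):
    """Try lattice with growing line_scale values (hypothetical helper).

    Returns (scale, tables) for the first scale that yields a table,
    or (None, []) if none of the candidate scales finds anything.
    """
    if reader is None:
        import camelot  # imported lazily so the helper can be tested without it

        reader = camelot.read_pdf

    for scale in scales:
        tables = reader(pdf_path, flavor="lattice", pages=pages, line_scale=scale)
        if len(tables) > 0:
            return scale, tables
    return None, []
```

In a batch loop over many PDFs, a helper like this could be called once per file, so a single-row table that is missed at the default scale still gets a second attempt.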
Sorry for the late reply here, I'm closing this. Please reopen if you still face the same issue.
Hi, I'm facing the same issue. I tried tables_stream = camelot.read_pdf(pdf, flavor='stream', pages='all', table_areas=['0,842,595,0']), but it turns out that camelot extracts tables from more than just the first page. Is there any way to detect tables page by page? Also, I'm sorry if I missed anything here, but how do I find the table areas? Thanks
@bettychou1993 You can do some visual debugging to find out table area and column separator coordinates.
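A small sketch of what that visual debugging could look like, restricted to one page at a time. The helper debug_page and its reader/plotter arguments are hypothetical; it relies on camelot.read_pdf accepting a single page number through pages, and on camelot.plot with kind='text', which draws the text found on the page so that table area coordinates can be read off the axes.

```python
def debug_page(pdf_path, page=1, kind="text", reader=None, plotter=None):
    """Plot one page's first stream table for inspection (hypothetical helper).

    Returns whatever the plotter returns (a matplotlib figure for camelot.plot),
    or None when no table was detected on the page.
    """
    if reader is None or plotter is None:
        import camelot  # imported lazily; requires camelot's plotting extras

        reader = reader or camelot.read_pdf
        plotter = plotter or camelot.plot

    # pages takes a string, so a single page number limits parsing to that page
    tables = reader(pdf_path, flavor="stream", pages=str(page))
    if len(tables) == 0:
        return None
    return plotter(tables[0], kind=kind)
```

Calling debug_page("file.pdf", page=1).show() would then open the plot for the first page, assuming a table was found there; the coordinates read from it can be passed back in through table_areas.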
IR-A-C-24004.pdf
I have used the following options: process_background = True; line_space (played around with several numbers); flavour='stream'.
Not sure what else to do.