atlanhq / camelot

Camelot: PDF Table Extraction for Humans
https://camelot-py.readthedocs.io

how to get the parameter value of table_areas? #358

Closed (lycanthropes closed this 5 years ago)

lycanthropes commented 5 years ago

I think passing the table_areas param to the read_pdf function would help extract tables from a PDF file more accurately, but where can I get the value for table_areas? I tried using pdfplumber.open("x.pdf") to list all the items on one page of the PDF (each item has its PDF coordinates plus width and height), and I used the coordinates of the four corner items as the table_areas value. However, it didn't work; the result was even worse than calling read_pdf without the table_areas param. Can camelot detect the PDF coordinates of a table's area, or of the items in the table?
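One possible cause, offered as an assumption rather than a diagnosis: pdfplumber reports `top` and `bottom` as distances from the top edge of the page, while Camelot's table_areas expects "x1,y1,x2,y2" in PDF coordinate space, where the origin is the bottom-left corner, (x1, y1) is the top-left corner of the table and (x2, y2) is its bottom-right corner. If the pdfplumber values were passed unchanged, the vertical axis would be flipped. The sketch below shows the conversion; both function names are hypothetical, and it assumes an unrotated page whose pdfplumber height matches the page Camelot sees.

```python
def pdfplumber_bbox_to_table_area(x0, top, x1, bottom, page_height):
    """Convert a top-left-origin box to a Camelot table_areas string.

    x0, top, x1, bottom follow pdfplumber's convention: top and bottom are
    distances from the top edge of the page. The result is "x1,y1,x2,y2"
    with (x1, y1) the top-left and (x2, y2) the bottom-right corner,
    measured from the bottom-left corner of the page.
    """
    y_top = page_height - top        # flip the vertical axis
    y_bottom = page_height - bottom
    return f"{x0:.2f},{y_top:.2f},{x1:.2f},{y_bottom:.2f}"


def extract_with_pdfplumber_bbox(pdf_path, bbox, page_number=1):
    """Hypothetical helper: read one page with Camelot, limited to bbox.

    bbox is (x0, top, x1, bottom) in pdfplumber's convention.
    """
    # Imported here so the converter above works without these packages.
    import camelot
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        # pdfplumber pages are 0-indexed; Camelot page numbers start at 1.
        page_height = float(pdf.pages[page_number - 1].height)
    area = pdfplumber_bbox_to_table_area(*bbox, page_height)
    return camelot.read_pdf(
        pdf_path, pages=str(page_number), flavor="stream", table_areas=[area]
    )


# Example: a box 100 to 400 points below the top of a 792-point-tall page.
print(pdfplumber_bbox_to_table_area(50, 100, 500, 400, 792))
# prints 50.00,692.00,500.00,392.00
```

The `flavor="stream"` argument is only an illustration; table_areas can also be passed with the default lattice flavor.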

anakin87 commented 5 years ago

Read this: https://camelot-py.readthedocs.io/en/master/user/advanced.html#visual-debugging

You can do some visual debugging.
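A minimal sketch of that visual debugging step, assuming Camelot is installed with its plotting extra (matplotlib) and that matplotlib is running with an interactive backend. In an interactive window the cursor position is shown in PDF coordinate space; with a non-interactive backend (for example inline notebook rendering) the picture is static. The helper names are hypothetical, and using matplotlib's `Figure.ginput` to capture two clicks is my own addition, not something from the Camelot docs.

```python
def corners_to_table_area(p, q):
    """Turn two opposite corner points (x, y) into "x1,y1,x2,y2".

    The points may be given in any order; the result always has the
    top-left corner first and the bottom-right corner second, in PDF
    coordinate space (origin at the bottom-left of the page).
    """
    xs = (p[0], q[0])
    ys = (p[1], q[1])
    return f"{min(xs):.1f},{max(ys):.1f},{max(xs):.1f},{min(ys):.1f}"


def pick_table_area(pdf_path, page="1", flavor="stream"):
    """Hypothetical helper: plot the page text and click two table corners."""
    # Imported here so the helper above works without Camelot installed.
    import camelot

    tables = camelot.read_pdf(pdf_path, pages=page, flavor=flavor)
    if len(tables) == 0:
        raise ValueError("no table detected, so there is nothing to plot")

    # kind="text" draws the text found on the page in PDF coordinates.
    fig = camelot.plot(tables[0], kind="text")
    # Wait (with no timeout) for two left clicks on opposite corners.
    clicks = fig.ginput(2, timeout=0)
    if len(clicks) != 2:
        raise ValueError("expected two clicks on opposite table corners")
    return corners_to_table_area(*clicks)


# The clicks can come in any order:
print(corners_to_table_area((566, 338), (316, 499)))
# prints 316.0,499.0,566.0,338.0
```

The returned string can then be passed as `table_areas=[...]` to read_pdf.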

lycanthropes commented 5 years ago

Thank you. But how do I get the dynamic picture? After following the visual debugging steps, the picture I got was static, so I can't get the precise coordinates of the table.

anakin87 commented 5 years ago

Don't know.

But if you don't have the precise coordinates, you can specify a broader area, using table_regions (https://camelot-py.readthedocs.io/en/master/user/advanced.html#specify-table-regions).
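A rough sketch of the table_regions approach, assuming a Camelot version that supports the table_regions argument (0.7.0 or later, as far as I know). Unlike table_areas, which marks the table's boundary, table_regions only restricts where Camelot looks for tables, so an approximate box is enough. Both helper names and the margin value are hypothetical; the region string follows the same "x1,y1,x2,y2" convention (top-left, then bottom-right, in PDF coordinate space).

```python
def expand_area(area, margin):
    """Widen an "x1,y1,x2,y2" box by margin points on every side.

    (x1, y1) is the top-left corner and (x2, y2) the bottom-right corner,
    with the origin at the bottom-left of the page, so the top edge moves
    up (+margin) and the bottom edge moves down (-margin).
    """
    x1, y1, x2, y2 = (float(v) for v in area.split(","))
    return (
        f"{x1 - margin:.1f},{y1 + margin:.1f},"
        f"{x2 + margin:.1f},{y2 - margin:.1f}"
    )


def read_with_region(pdf_path, region, page="1"):
    """Hypothetical helper: let Camelot search for tables inside region."""
    # Imported here so expand_area works without Camelot installed.
    import camelot

    return camelot.read_pdf(pdf_path, pages=page, table_regions=[region])


# Example: loosen a rough guess by 10 points before handing it to Camelot.
print(expand_area("170,370,560,270", 10))
# prints 160.0,380.0,570.0,260.0
```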

lycanthropes commented 5 years ago

Ok, thank you so much. I think that is a good solution.