atlanhq / camelot

Camelot: PDF Table Extraction for Humans
https://camelot-py.readthedocs.io

Make camelot search for tables in certain page regions #209

Closed anakin87 closed 5 years ago

anakin87 commented 5 years ago

I'm trying to automatically detect and extract tables encapsulated in other tables.

I want Camelot to search within a certain area: not the table area itself, but the area in which the table resides (see the attached image).

How can I make Camelot work this way? Ideas for development are welcome.

vinayak-mehta commented 5 years ago

Hi @anakin87! You can specify table areas in read_pdf using the table_areas kwarg. For more information on usage, check out the docs. Please comment if you face any problems.
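
For readers following along, here is a minimal sketch of the `table_areas` usage mentioned above. It assumes Camelot's documented convention that each area is a string `"x1,y1,x2,y2"` in PDF coordinate space, with (x1, y1) the top-left and (x2, y2) the bottom-right corner of the table. The helper names `area_string` and `extract_from_area` are hypothetical, and the choice of page and flavor is only illustrative.

```python
def area_string(x1, y1, x2, y2):
    """Format a box as the "x1,y1,x2,y2" string passed to table_areas.

    Assumes PDF coordinate space (origin at the bottom-left of the page),
    so the top edge y1 must be larger than the bottom edge y2.
    """
    if x2 <= x1 or y1 <= y2:
        raise ValueError("expected top-left (x1, y1) and bottom-right (x2, y2)")
    return f"{x1},{y1},{x2},{y2}"


def extract_from_area(pdf_path, area, page="1"):
    """Hypothetical wrapper: read one page, treating `area` as the table boundary."""
    import camelot  # imported lazily; requires the camelot-py package

    return camelot.read_pdf(
        pdf_path,
        pages=page,
        flavor="lattice",
        table_areas=[area],
    )


# Illustrative coordinates only; they do not come from the PDF in this thread.
area = area_string(50, 700, 550, 400)
```

Something like `extract_from_area("document.pdf", area)` would then return the tables found on that page, with the given box taken as the exact table boundary.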

anakin87 commented 5 years ago

If I provide table_areas, Camelot interprets them as specific table coordinates.

My problem is that I want to search for tables in a specific area of the page, but I don't know the exact table coordinates. How can I handle this?

vinayak-mehta commented 5 years ago

I get the issue now. Camelot treats the passed table areas as actual boundaries of the table. This can be an enhancement where the user can pass a table_region so that camelot only processes the text and lines inside the region to form a table. Reopening this.

vinayak-mehta commented 5 years ago

@anakin87 Can you post a link to that PDF?

anakin87 commented 5 years ago

PIR_Prospetto dOfferta.pdf

I want to search for tables in a certain region of the page, in order to extract only true tables and not tables that are merely layout elements.

vinayak-mehta commented 5 years ago

@anakin87 Thanks for reporting this issue. The current table_areas kwarg for Lattice hardcodes the coordinates of the table boundary. This pulls unwanted text into the extracted table and forces the user to note the exact coordinates while debugging visually, which should not be the case. table_areas should just guide Camelot to analyze only that part of the page when finding tables with Lattice and Stream.

This is a behavioral bug, I'll push a fix today.

anakin87 commented 5 years ago

I think both options are useful:

vinayak-mehta commented 5 years ago

Hmm, I guess keeping them separate makes sense since a table region could contain two or more table areas too.
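
A rough sketch of the distinction discussed here: a region is a search window that may hold several tables, while an area is the boundary of one table. The sketch assumes the new keyword is named `table_regions` (the thread itself only says `table_region`) and takes the same `"x1,y1,x2,y2"` strings as `table_areas`. The helpers `contains` and `extract_from_region` and all coordinates are hypothetical.

```python
def contains(region, area):
    """Return True if `area` lies fully inside `region`.

    Both boxes are (x1, y1, x2, y2) tuples in PDF coordinate space:
    (x1, y1) is the top-left corner, (x2, y2) the bottom-right one,
    so y1 > y2 because the origin is at the bottom-left of the page.
    """
    rx1, ry1, rx2, ry2 = region
    ax1, ay1, ax2, ay2 = area
    return rx1 <= ax1 and ax2 <= rx2 and ay1 <= ry1 and ry2 <= ay2


def extract_from_region(pdf_path, region, page="1"):
    """Hypothetical wrapper: let Camelot look for tables only inside `region`."""
    import camelot  # imported lazily; requires the camelot-py package

    region_str = ",".join(str(v) for v in region)
    # Assumption: the keyword is table_regions, mirroring table_areas.
    return camelot.read_pdf(
        pdf_path, pages=page, flavor="lattice", table_regions=[region_str]
    )


# One region can hold more than one table area (illustrative numbers).
region = (50, 750, 550, 100)
upper_table = (60, 700, 300, 500)
lower_table = (60, 450, 540, 150)
both_inside = contains(region, upper_table) and contains(region, lower_table)
```

With a region, Camelot would still detect the table boundaries itself; with an area, the caller supplies them.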

vinayak-mehta commented 5 years ago

@anakin87 Check out the docs for usage details.

anakin87 commented 5 years ago

Great!!!

gyan7611 commented 4 years ago

How do you get the coordinates to pass as the argument to table_areas?
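
One possible approach, offered as a sketch and not as an answer from the maintainers: Camelot's documentation describes a visual debugging plot of the page from which coordinates can be read, and `table_areas` expects them in PDF coordinate space, where the origin is the bottom-left corner of the page. If the box was instead measured in a viewer or image whose origin is the top-left corner, the y values need flipping. The helper `to_pdf_area` below is hypothetical, and it assumes the measurements are already in PDF points and that the page height is known.

```python
def to_pdf_area(left, top, right, bottom, page_height):
    """Convert a box measured from the top-left corner of the page
    into the "x1,y1,x2,y2" string used by table_areas.

    Assumes all values are in PDF points and page_height is the full
    height of the page in the same unit.
    """
    if right <= left or bottom <= top:
        raise ValueError("expected left < right and top < bottom")
    y1 = page_height - top      # top edge, measured from the page bottom
    y2 = page_height - bottom   # bottom edge, measured from the page bottom
    return f"{left},{y1},{right},{y2}"


# Illustrative numbers: a box on a page that is 842 points tall.
area = to_pdf_area(left=50, top=100, right=550, bottom=400, page_height=842)
```

The resulting string could then be passed as `table_areas=[area]` to `read_pdf`.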