atlanhq / camelot

Camelot: PDF Table Extraction for Humans
https://camelot-py.readthedocs.io

use table extractor with table_areas #367

Closed CartierPierre closed 4 years ago

CartierPierre commented 5 years ago

Hi, I'm having an issue with table areas. When I use the web server to select an area to extract, I get the table. But that area does not match the one I get when I select the same region of the PDF in other software (Adobe, a PDF viewer, or pdf2img).

So when I pass the table_areas parameter with the true values (which I assume are the ones from Adobe and the other tools 😄), camelot extracts the wrong areas. Is it possible to unify this?
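For readers hitting the same mismatch: one likely cause is that image-based tools measure from the top-left corner in pixels, while PDF coordinate space has its origin at the bottom-left and is measured in points (1/72 inch). Below is a minimal sketch of the conversion, assuming the box comes from an image rendered at a known DPI. The function name `image_box_to_pdf_area` and its parameters are hypothetical and not part of camelot.

```python
def image_box_to_pdf_area(box, page_height_pt, dpi=72.0):
    """Convert a pixel box with a top-left origin to PDF points with a bottom-left origin.

    box is (left, top, right, bottom) in image pixels.
    page_height_pt is the page height in PDF points.
    dpi is the resolution the page image was rendered at (assumed known).
    """
    scale = 72.0 / dpi  # pixels -> points
    left, top, right, bottom = box
    x1 = left * scale
    y1 = page_height_pt - top * scale      # flip the y axis
    x2 = right * scale
    y2 = page_height_pt - bottom * scale   # flip the y axis
    # (x1, y1) is the top-left corner and (x2, y2) the bottom-right corner,
    # both expressed in PDF coordinate space.
    return x1, y1, x2, y2


# Example: a box selected on a US Letter page (792 pt tall) rendered at 144 DPI.
area = image_box_to_pdf_area((200, 100, 600, 400), page_height_pt=792, dpi=144)
```

The result would still need to be joined into an "x1,y1,x2,y2" string before being passed as a table_areas entry.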

In addition, why use a list of strings ["area1", "area2"] instead of a list of lists [[area1], [area2]]? A list of lists is lighter on memory and doesn't require splitting a string to extract x1, y1, x2, y2.
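Until the API accepts nested lists, a small wrapper can bridge the two representations. This is only a sketch: `to_table_areas` and `parse_table_areas` are hypothetical helper names, not camelot functions, and the file name in the commented-out call is a placeholder.

```python
def to_table_areas(areas):
    """Turn [[x1, y1, x2, y2], ...] into ["x1,y1,x2,y2", ...]."""
    return [",".join(str(v) for v in area) for area in areas]


def parse_table_areas(table_areas):
    """Inverse helper: turn ["x1,y1,x2,y2", ...] into lists of floats."""
    return [[float(v) for v in area.split(",")] for area in table_areas]


boxes = [[100, 742, 300, 592], [50, 400, 500, 120]]
as_strings = to_table_areas(boxes)

# The strings can then be passed to camelot, for example:
# tables = camelot.read_pdf("example.pdf", flavor="stream", table_areas=as_strings)
```

Keeping the numeric form in your own code and converting only at the call site avoids string handling everywhere else.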

Can you take a look?

PS: I sent you an email about a new method that combines Lattice and Stream.

CartierPierre commented 5 years ago

https://github.com/atlanhq/camelot/blob/0efb3ca1b0ad382c2ed2f5c503c16901b3251421/camelot/utils.py#L193

Is it possible to bypass this step if the area is already scaled to image coordinates?

vinayak-mehta commented 5 years ago

Closed in favor of https://github.com/camelot-dev/camelot/issues/40.