atlanhq / camelot

Camelot: PDF Table Extraction for Humans
https://camelot-py.readthedocs.io

Stream algorithm has possible bbox error #281

Closed arthur-b-renaud closed 5 years ago

arthur-b-renaud commented 5 years ago

From the stream.py file, starting line 281

```python
def _generate_table_bbox(self):
    self.textedges = []
    if self.table_areas is None:
        hor_text = self.horizontal_text
        if self.table_regions is not None:
            # filter horizontal text
            hor_text = []
            for region in self.table_regions:
                x1, y1, x2, y2 = region
                region_text = text_in_bbox((x1, y2, x2, y1), self.horizontal_text)
                hor_text.extend(region_text)
```

The lines should state `x1, y1, x2, y2 = region` followed by `region_text = text_in_bbox((x1, y1, x2, y2), self.horizontal_text)`.

vinayak-mehta commented 5 years ago

Hi @arthur-b-renaud, the usage is correct. The text_in_bbox function takes bottom-left and top-right coordinates, whereas the CLI/library input expects top-left and bottom-right coordinates (just as you would click and drag to select a table on an interface). That is why the y values are swapped in the call. I understand how it can be confusing; someone mentioned it in an earlier issue too. We can change the internal function calls to fix this in the future.
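To make the two conventions concrete, here is a minimal, self-contained sketch. It assumes PDF coordinate space (origin at the bottom-left, y increasing upward). The names `TextBox`, `to_lb_rt`, and `boxes_inside` are hypothetical and are not part of Camelot; `boxes_inside` is a simplified stand-in for the filtering step, not the library's actual `text_in_bbox` implementation.

```python
from typing import List, NamedTuple, Tuple

BBox = Tuple[float, float, float, float]


class TextBox(NamedTuple):
    """Hypothetical text object with a bottom-left (x0, y0) and top-right (x1, y1) corner."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def to_lb_rt(region: BBox) -> BBox:
    """Convert a user-style region (top-left, bottom-right) to (bottom-left, top-right)."""
    x1, y1, x2, y2 = region  # (x1, y1) is top-left, (x2, y2) is bottom-right
    return (x1, y2, x2, y1)  # keep the x values, swap the y values


def boxes_inside(bbox: BBox, boxes: List[TextBox]) -> List[TextBox]:
    """Keep boxes fully contained in a (bottom-left, top-right) bbox (simplified assumption)."""
    lb_x, lb_y, rt_x, rt_y = bbox
    return [
        b for b in boxes
        if b.x0 >= lb_x and b.y0 >= lb_y and b.x1 <= rt_x and b.y1 <= rt_y
    ]


# A region as a user would give it: top-left (50, 700), bottom-right (300, 500).
region = (50.0, 700.0, 300.0, 500.0)
boxes = [
    TextBox("inside", 100.0, 600.0, 150.0, 610.0),
    TextBox("outside", 100.0, 100.0, 150.0, 110.0),
]

converted = to_lb_rt(region)
selected = boxes_inside(converted, boxes)    # only the "inside" box is kept
unconverted = boxes_inside(region, boxes)    # nothing is kept: the lower y bound exceeds the upper one
```

In this sketch, passing the region without the swap selects nothing, because the lower y bound ends up larger than the upper one. That is the situation the `(x1, y2, x2, y1)` argument order avoids.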

Please close this issue.

arthur-b-renaud commented 5 years ago

Great, thanks for your answer! Yes, that was a bit confusing...