atlanhq / camelot

Camelot: PDF Table Extraction for Humans
https://camelot-py.readthedocs.io

fix input to text_in_bbox in stream.py #399

Closed edvardak closed 4 years ago

edvardak commented 4 years ago

According to its docstring, the text_in_bbox function expects the bounding box as a tuple in (x1, y1, x2, y2) order, where (x1, y1) is the left-bottom corner and (x2, y2) is the right-top corner:

def text_in_bbox(bbox, text):
    """Returns all text objects present inside a bounding box.

    Parameters
    ----------
    bbox : tuple
        Tuple (x1, y1, x2, y2) representing a bounding box where
        (x1, y1) -> lb and (x2, y2) -> rt in the PDF coordinate
        space.
    text : List of PDFMiner text objects.

    Returns
    -------
    t_bbox : list
        List of PDFMiner text objects that lie inside table.

    """
    lb = (bbox[0], bbox[1])
    rt = (bbox[2], bbox[3])
    t_bbox = [
        t
        for t in text
        if lb[0] - 2 <= (t.x0 + t.x1) / 2.0 <= rt[0] + 2
        and lb[1] - 2 <= (t.y0 + t.y1) / 2.0 <= rt[1] + 2
    ]
    return t_bbox

However, in the call to this function on line 305 in the stream.py module, this order is mixed up:

region_text = text_in_bbox((x1, y2, x2, y1), self.horizontal_text)

This commit reorders them. The same issue may also be present on line 317.
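To make the effect of argument order concrete, here is a minimal sketch that reuses the filtering logic of text_in_bbox quoted above. The `TextBox` namedtuple is a hypothetical stand-in for PDFMiner text objects, and the coordinates are made up for illustration. Whether the call in stream.py is actually wrong depends on which corners `x1, y1, x2, y2` denote at the call site, which this thread does not show: if `(x1, y1)` is the left-top corner there, then `(x1, y2, x2, y1)` is already in left-bottom, right-top order.

```python
from collections import namedtuple

# Hypothetical stand-in for a PDFMiner text object; only the four
# coordinates that text_in_bbox reads are modelled.
TextBox = namedtuple("TextBox", ["x0", "y0", "x1", "y1"])


def text_in_bbox(bbox, text):
    # Same filtering logic as the function quoted above:
    # bbox is (x1, y1, x2, y2) with (x1, y1) = left-bottom, (x2, y2) = right-top.
    lb = (bbox[0], bbox[1])
    rt = (bbox[2], bbox[3])
    return [
        t
        for t in text
        if lb[0] - 2 <= (t.x0 + t.x1) / 2.0 <= rt[0] + 2
        and lb[1] - 2 <= (t.y0 + t.y1) / 2.0 <= rt[1] + 2
    ]


# Made-up text boxes: two inside the region, one far outside it.
words = [
    TextBox(10, 10, 20, 20),
    TextBox(50, 50, 60, 60),
    TextBox(200, 200, 210, 210),
]

left, bottom, right, top = 0, 0, 100, 100

# Corners passed as (left-bottom, right-top), as the docstring requires.
inside = text_in_bbox((left, bottom, right, top), words)

# Same region with the two y values swapped: the y-range becomes
# 98 <= centre_y <= 2, which no text box can satisfy.
swapped = text_in_bbox((left, top, right, bottom), words)

print(len(inside), len(swapped))
```

With the y values swapped, the function raises no error and simply returns an empty list, so a mix-up of this kind would show up as missing text rather than as a crash.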

Apologies for any faux-pas, this is my first ever contribution to a project!

amine-aboufirass commented 3 years ago

I've pip installed version 0.9.0 of the code and this is still a problem... Please see my issue logged here:

https://github.com/atlanhq/camelot/issues/462

Has this not been merged? If not, why not?