Filimoa / open-parse

Improved file parsing for LLM’s
https://filimoa.github.io/open-parse/
MIT License
2.34k stars, 89 forks

ValueError: Coordinate 'right' is less than 'left' #11

Status: Open. atgreen opened this issue 5 months ago

atgreen commented 5 months ago

Given this code:

import openparse

basic_doc_path = "mydoc.pdf"
parser = openparse.DocumentParser(
    table_args={
        "parsing_algorithm": "unitable",
        "min_table_confidence": 0.8,
    }
)

parsed_basic_doc = parser.parse(basic_doc_path)

for node in parsed_basic_doc.nodes:
    print(node.json())

I'm getting the following error:

  File "/home/green/git/cl-langtools/test.py", line 11, in <module>
    parsed_basic_doc = parser.parse(basic_doc_path)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/green/git/cl-langtools/tools/open-parse/lib64/python3.12/site-packages/openparse/doc_parser.py", line 106, in parse
    table_elems = tables.ingest(doc, table_args_obj, verbose=self._verbose)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/green/git/cl-langtools/tools/open-parse/lib64/python3.12/site-packages/openparse/tables/parse.py", line 223, in ingest
    return _ingest_with_unitable(doc, parsing_args, verbose)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/green/git/cl-langtools/tools/open-parse/lib64/python3.12/site-packages/openparse/tables/parse.py", line 189, in _ingest_with_unitable
    table_str = table_img_to_html(table_img)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/green/git/cl-langtools/tools/open-parse/lib64/python3.12/site-packages/openparse/tables/unitable/core.py", line 192, in table_img_to_html
    pred_cell_lst = predict_cells(image_tensor, pred_bbox, table_image)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/green/git/cl-langtools/tools/open-parse/lib64/python3.12/site-packages/openparse/tables/unitable/core.py", line 160, in predict_cells
    _image_to_tensor(image.crop(bbox), size=(112, 448)) for bbox in pred_bboxes
                     ^^^^^^^^^^^^^^^^
  File "/home/green/git/cl-langtools/tools/open-parse/lib64/python3.12/site-packages/PIL/Image.py", line 1237, in crop
    raise ValueError(msg)
ValueError: Coordinate 'right' is less than 'left'

If it helps, my input document is this one: https://www.rbc.com/investor-relations/_assets-custom/pdf/ar_2023_e.pdf
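
For anyone hitting the same traceback: the error is raised by Pillow's `Image.crop`, which rejects a box whose `right` value is smaller than its `left` value. That suggests at least one predicted cell box reaches `crop` with inverted x-coordinates. Below is a minimal workaround sketch, not open-parse's actual code. The helper name `normalize_bbox` is hypothetical, and it assumes boxes are `(left, upper, right, lower)` tuples in pixel units. Whether an inverted box should be reordered or simply dropped is a judgment call, since it may indicate a bad model prediction.

```python
from typing import Optional, Sequence, Tuple

IntBox = Tuple[int, int, int, int]  # (left, upper, right, lower)


def normalize_bbox(
    bbox: Sequence[float], width: int, height: int
) -> Optional[IntBox]:
    """Reorder and clamp a box so it is safe to pass to PIL's Image.crop.

    Hypothetical helper (not part of open-parse). Returns None when the
    box has no area after clamping, so the caller can skip it.
    """
    left, upper, right, lower = bbox

    # Put each coordinate pair in ascending order (fixes right < left).
    left, right = sorted((left, right))
    upper, lower = sorted((upper, lower))

    # Truncate to whole pixels and clamp to the image bounds.
    left = max(0, min(int(left), width))
    right = max(0, min(int(right), width))
    upper = max(0, min(int(upper), height))
    lower = max(0, min(int(lower), height))

    # A zero-width or zero-height box cannot be cropped meaningfully.
    if right <= left or lower <= upper:
        return None
    return (left, upper, right, lower)
```

One way to use it would be to run every predicted box through `normalize_bbox(bbox, *image.size)` before calling `image.crop(...)` and to skip the boxes that come back as `None`. This only hides the symptom; the underlying question of why the model emits inverted boxes for this document remains open.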

giovannibonetti commented 5 months ago

Thanks for this great library.

I'm also getting ValueError: Coordinate 'right' is less than 'left' with this PDF and almost the same code:

import openparse

basic_doc_path = "sample.pdf"
parser = openparse.DocumentParser(
    table_args={
        "parsing_algorithm": "unitable",
        "min_table_confidence": 0.8
    },
)

parsed_doc = parser.parse(basic_doc_path)