Filimoa / open-parse

Improved file parsing for LLM’s
https://filimoa.github.io/open-parse/
MIT License
2.34k stars 89 forks

NoneType error occurred in pymupdf.output_to_markdown function #28

Closed mashihua closed 4 months ago

mashihua commented 4 months ago

Description

I encountered the following error when processing a PDF file that contains multiple tables.

Traceback (most recent call last):
  File "/Users/mashihua/work/parser-test.py", line 29, in <module>
    parsed = parser.parse(basic_doc_path)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/mashihua/anaconda3/lib/python3.11/site-packages/openparse/doc_parser.py", line 106, in parse
    table_elems = tables.ingest(doc, table_args_obj, verbose=self._verbose)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/mashihua/anaconda3/lib/python3.11/site-packages/openparse/tables/parse.py", line 221, in ingest
    return _ingest_with_pymupdf(doc, parsing_args, verbose)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/mashihua/anaconda3/lib/python3.11/site-packages/openparse/tables/parse.py", line 59, in _ingest_with_pymupdf
    text = pymupdf.output_to_markdown(headers, lines)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/mashihua/anaconda3/lib/python3.11/site-packages/openparse/tables/pymupdf/parse.py", line 25, in output_to_markdown
    markdown_output = "| " + " | ".join(headers) + " |\n"
TypeError: sequence item 2: expected str instance, NoneType found

Filimoa commented 4 months ago

Fixed with #32