Filimoa / open-parse

Improved file parsing for LLMs
https://filimoa.github.io/open-parse/
MIT License
2.44k stars, 95 forks

fix: Fix sequence item 2: expected str instance, NoneType found exception when table output is set to markdown. #27

Closed: ic-xu closed 6 months ago

ic-xu commented 6 months ago

behavior:

I get an exception as follows:

/python3.10/site-packages/openparse/tables/pymupdf/parse.py", line 25, in output_to_markdown
 markdown_output = "| " + " | ".join(headers) + " |\n"
TypeError: sequence item 2: expected str instance, NoneType found

This happens when parsing PDF tables with the table output format set to markdown:

table_args={
    "parsing_algorithm": "pymupdf",
    "table_output_format": "markdown"
}

After some analysis, I believe the cause is the following. When the table headers contain None entries, for example:

headers = ['(See Note 11)', '', None, None]

and the following code runs on them:

 markdown_output = "| " + " | ".join(headers) + " |\n"
 markdown_output += "|---" * len(headers) + "|\n"

str.join() only accepts strings, so the None at index 2 raises the following error:

/python3.10/site-packages/openparse/tables/pymupdf/parse.py", line 25, in output_to_markdown
 markdown_output = "| " + " | ".join(headers) + " |\n"
TypeError: sequence item 2: expected str instance, NoneType found
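
The failure can be reproduced without a PDF. This is a minimal sketch that assumes only the header list shown above and calls str.join() directly; it does not call openparse itself.

```python
# Header row as described above: two cells were extracted as None.
headers = ['(See Note 11)', '', None, None]

try:
    # Same join expression as in output_to_markdown.
    markdown_output = "| " + " | ".join(headers) + " |\n"
except TypeError as exc:
    # str.join() rejects any non-str item and reports its index.
    message = str(exc)
    print(message)
```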

My solution is to replace each None in the headers with ' ' before joining, which avoids the exception.
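
A rough sketch of that idea follows. The helper name headers_to_markdown is hypothetical and only mirrors the two lines quoted above; it is not the actual function in openparse and may differ from the merged change.

```python
from typing import List, Optional


def headers_to_markdown(headers: List[Optional[str]]) -> str:
    """Hypothetical helper: build a markdown header row, tolerating None cells."""
    # Replace every None with a single space so str.join() only sees strings.
    cleaned = [" " if cell is None else cell for cell in headers]
    markdown_output = "| " + " | ".join(cleaned) + " |\n"
    # One separator cell per header column.
    markdown_output += "|---" * len(cleaned) + "|\n"
    return markdown_output


result = headers_to_markdown(['(See Note 11)', '', None, None])
print(result)
```

Empty strings are left untouched here, since they already join without error; only None values are substituted.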