DS4SD / docling

Get your documents ready for gen AI

https://ds4sd.github.io/docling

MIT License

PDF convert error #434

Open FengCeUp opened 4 days ago

FengCeUp commented 4 days ago

Question

I just ran a test with my PDF document and found that an error was reported and the output document was empty. Here's a snippet of my code:

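import time

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

# Raw string so the backslashes in the Windows path are not read as escapes.
input_doc_path = r"C:\Users\lenovo\Downloads\HuaShan\Computracev1.2.pdf"

# Disable OCR; keep table structure recognition with cell matching.
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = False
pipeline_options.do_table_structure = True
pipeline_options.table_structure_options.do_cell_matching = True

doc_converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)

start_time = time.time()

# raises_on_error=False returns a result object instead of raising on failure.
conv_result = doc_converter.convert(input_doc_path, raises_on_error=False)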
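
Because raises_on_error=False suppresses the exception, a failure is only visible on the returned result object. The following is a minimal sketch of how one might inspect it, assuming the result exposes `status` and `errors` attributes and that each error carries an `error_message` field. `describe_result` is a hypothetical helper name, not part of docling, and the stand-in object exists only so the sketch runs without docling installed.

```python
from types import SimpleNamespace


def describe_result(result) -> str:
    """Summarize a conversion result without raising.

    Assumes `result` has a `status` attribute (an enum or a string) and an
    `errors` list. `describe_result` is a hypothetical helper.
    """
    # Enum members expose .name; fall back to str() for plain strings.
    status = getattr(result.status, "name", str(result.status))
    errors = list(getattr(result, "errors", []) or [])
    if not errors:
        return f"status={status}, no errors recorded"
    details = "; ".join(str(getattr(e, "error_message", e)) for e in errors)
    return f"status={status}, {len(errors)} error(s): {details}"


# Stand-in object (an assumption for illustration, not real docling output).
fake = SimpleNamespace(
    status="FAILURE",
    errors=[SimpleNamespace(error_message="font_name [/F1] is not known")],
)
print(describe_result(fake))
```

With the real converter, one would pass `conv_result` instead of the stand-in object.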

FengCeUp commented 4 days ago

By the way, when I change the backend to PyPdfiumDocumentBackend, the conversion works correctly.
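
For readers who want to try the same switch, here is a rough sketch of selecting the pypdfium2 backend through the `backend` argument of `PdfFormatOption`. The pipeline options are copied from the snippet above; the function name `build_pypdfium_converter` is made up for illustration, and the imports are deferred inside the function so the module still loads when docling is not installed.

```python
def build_pypdfium_converter():
    """Build a DocumentConverter that parses PDFs with pypdfium2.

    `build_pypdfium_converter` is a hypothetical helper name. The docling
    imports are deferred so importing this module does not require docling.
    """
    from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import PdfPipelineOptions
    from docling.document_converter import DocumentConverter, PdfFormatOption

    # Same pipeline settings as in the original report.
    pipeline_options = PdfPipelineOptions()
    pipeline_options.do_ocr = False
    pipeline_options.do_table_structure = True
    pipeline_options.table_structure_options.do_cell_matching = True

    # Only the backend differs from the failing configuration.
    return DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(
                pipeline_options=pipeline_options,
                backend=PyPdfiumDocumentBackend,
            )
        }
    )
```

The returned converter would then be used exactly like `doc_converter` in the original snippet.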

dolfim-ibm commented 4 days ago

@FengCeUp can you please share the file?

FengCeUp commented 4 days ago

TestDoc3.pdf @dolfim-ibm This is my test file.

dolfim-ibm commented 4 days ago

I confirm this can be reproduced with both parser v1 and v2.

# parser v1
docling --pdf-backend dlparse_v1 TestDoc3.pdf

# parser v2
docling --pdf-backend dlparse_v2 TestDoc3.pdf

The error message is:

RuntimeError: font_name [/F1] is not known:

FengCeUp commented 4 days ago

@dolfim-ibm Yes, when I use the CLI to convert this document, parser v2 gives the error message "RuntimeError: font_name [/F1] is not known: "

pitta-bread commented 1 day ago

I am also experiencing this exact same issue. I cannot identify anything the triggering PDFs have in common, but it happens consistently with a subset of my PDF files. As a workaround for now, I rerun the conversion with PyPdfiumDocumentBackend whenever the output is blank.
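
The rerun-on-blank workaround described here can be sketched generically. This is only an illustration under stated assumptions: `convert_with_fallback` and the three stand-in parsers are hypothetical names, and each converter is assumed to be a callable that takes a path and returns the exported text.

```python
from typing import Callable, Iterable, Optional, Tuple

Converter = Callable[[str], str]


def convert_with_fallback(
    path: str, converters: Iterable[Tuple[str, Converter]]
) -> Tuple[Optional[str], str]:
    """Try each (name, converter) in order; return the first non-blank output.

    Returns (backend_name, text), or (None, "") if every attempt raised
    or produced blank output.
    """
    for name, convert in converters:
        try:
            text = convert(path)
        except Exception as exc:  # e.g. a RuntimeError raised by the parser
            print(f"{name} failed: {exc}")
            continue
        if text and text.strip():
            return name, text
        print(f"{name} produced blank output, trying next backend")
    return None, ""


# Stand-in converters (hypothetical) so the sketch runs without docling.
def broken_parser(path: str) -> str:
    raise RuntimeError("font_name [/F1] is not known")


def blank_parser(path: str) -> str:
    return ""


def working_parser(path: str) -> str:
    return "# Parsed document"


backend, text = convert_with_fallback(
    "TestDoc3.pdf",
    [
        ("primary", broken_parser),
        ("blank", blank_parser),
        ("fallback", working_parser),
    ],
)
print(backend, text)
```

In practice, each converter would presumably wrap a `DocumentConverter` configured with a different backend and return something like the document's Markdown export; treating blank output the same as an exception keeps the retry logic in one place.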

PeterStaar-IBM commented 1 day ago

This will be resolved in the next PR (https://github.com/DS4SD/docling-parse/pull/57). With that change, the parser now produces the following output:

TestDoc3.pdf.json

Visualization of page 4 (original versus parsed): [image: Screenshot 2024-11-29 at 13 42 40]