Open FengCeUp opened 4 days ago
By the way, when I change the backend to PyPdfiumDocumentBackend, it works correctly.
@FengCeUp can you please share the file?
TestDoc3.pdf @dolfim-ibm This is my test file.
I confirm this can be reproduced with both parser v1 and v2.
# parser v1
docling --pdf-backend dlparse_v1 TestDoc3.pdf
# parser v2
docling --pdf-backend dlparse_v2 TestDoc3.pdf
The error message is:
RuntimeError: font_name [/F1] is not known:
@dolfim-ibm Yes, when I use the CLI to convert this doc, parser v2 gives the error message "RuntimeError: font_name [/F1] is not known: "
I am also experiencing this exact issue. I cannot identify anything consistent across the PDFs that trigger it, but it always happens with the same subset of my PDF files. For now I am implementing a workaround: rerun with PyPdfiumDocumentBackend whenever the output is blank.
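A minimal sketch of that kind of fallback, in case it helps others. `first_non_blank` and `make_docling_converter` are hypothetical helper names, not part of docling. The sketch assumes that a failed conversion with `raises_on_error=False` leaves an empty document whose Markdown export is blank, which may differ between docling versions.

```python
from typing import Callable, Iterable, Optional


def first_non_blank(path: str, converters: Iterable[Callable[[str], str]]) -> Optional[str]:
    """Try each converter in order; return the first non-blank text, else None."""
    for convert in converters:
        try:
            text = convert(path)
        except Exception:
            # A backend that raises is treated like one that returned nothing.
            continue
        if text and text.strip():
            return text
    return None


def make_docling_converter(backend_cls=None) -> Callable[[str], str]:
    """Build a path -> Markdown callable around docling (imported lazily).

    Hypothetical helper: backend_cls=None keeps docling's default PDF backend.
    """
    from docling.datamodel.base_models import InputFormat
    from docling.document_converter import DocumentConverter, PdfFormatOption

    option = PdfFormatOption() if backend_cls is None else PdfFormatOption(backend=backend_cls)
    converter = DocumentConverter(format_options={InputFormat.PDF: option})

    def convert(path: str) -> str:
        result = converter.convert(path, raises_on_error=False)
        # Assumption: a failed conversion yields an empty document here.
        return result.document.export_to_markdown()

    return convert


# Usage (requires docling to be installed):
# from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend
# text = first_non_blank(
#     "TestDoc3.pdf",
#     [make_docling_converter(), make_docling_converter(PyPdfiumDocumentBackend)],
# )
```

Keeping the retry logic separate from the docling wiring means the fallback order can be changed, or a third backend added, without touching the loop.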
This will be resolved in the next PR (https://github.com/DS4SD/docling-parse/pull/57). The parser now produces the following:
[Image: visualization of page 4 (original versus parsed)]
Question
I just ran a test with my PDF document and found that an error was reported and the output document was empty. Here's a snippet of my code:
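import time

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

# Raw string so the backslashes in the Windows path are not read as escapes
input_doc_path = r"C:\Users\lenovo\Downloads\HuaShan\Computracev1.2.pdf"

pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = False
pipeline_options.do_table_structure = True
pipeline_options.table_structure_options.do_cell_matching = True

doc_converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)

start_time = time.time()
conv_result = doc_converter.convert(input_doc_path, raises_on_error=False)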
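Because `raises_on_error=False` suppresses the exception, the failure has to be read from the result object instead. A rough sketch of how that could look: `describe_result` is a hypothetical helper, and it assumes the result exposes a `status` enum and an `errors` list whose items carry an `error_message`, so check this against your docling version. The stand-in object and its error text are for illustration only.

```python
from types import SimpleNamespace


def describe_result(conv_result) -> str:
    """Summarize a conversion result as 'STATUS' or 'STATUS: msg1; msg2'."""
    status = getattr(conv_result, "status", None)
    # Enum members expose .name; fall back to str() for anything else.
    status_name = getattr(status, "name", str(status))
    errors = getattr(conv_result, "errors", None) or []
    messages = [str(getattr(err, "error_message", err)) for err in errors]
    if messages:
        return f"{status_name}: " + "; ".join(messages)
    return status_name


# Stand-in object for illustration only; with docling, pass conv_result instead.
fake = SimpleNamespace(
    status=SimpleNamespace(name="FAILURE"),
    errors=[SimpleNamespace(error_message="font_name [/F1] is not known")],
)
print(describe_result(fake))
```

Logging this summary right after `convert(...)` makes an empty output easier to tell apart from a genuinely empty PDF.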