Open Arslan-Mehmood1 opened 5 days ago
@Arslan-Mehmood1 Some PDFs simply have garbled text layers like these, with no rescue. Some strategies that could help:
- Check what you get when using our docling-parse-v2 or our pypdfium PDF backends
@cau-git Thanks man. I'll test and report back here.
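For anyone who wants to try the backend comparison suggested above, here is a minimal sketch. It assumes docling 2.x, where `DoclingParseV2DocumentBackend` and `PyPdfiumDocumentBackend` live under `docling.backend`; the short backend labels and the helper names `make_converter` and `compare_backends` are made up for this example and are not part of docling.

```python
import importlib

# Label -> (module path, class name). The labels on the left are hypothetical
# names for this sketch; the module paths are assumed from docling 2.x.
PDF_BACKENDS = {
    "docling-parse-v2": (
        "docling.backend.docling_parse_v2_backend",
        "DoclingParseV2DocumentBackend",
    ),
    "pypdfium": (
        "docling.backend.pypdfium2_backend",
        "PyPdfiumDocumentBackend",
    ),
}


def make_converter(backend_name):
    """Build a DocumentConverter that parses PDFs with the chosen backend."""
    # Imported inside the function because loading docling is slow.
    from docling.datamodel.base_models import InputFormat
    from docling.document_converter import DocumentConverter, PdfFormatOption

    module_path, class_name = PDF_BACKENDS[backend_name]
    backend_cls = getattr(importlib.import_module(module_path), class_name)
    return DocumentConverter(
        format_options={InputFormat.PDF: PdfFormatOption(backend=backend_cls)}
    )


def compare_backends(pdf_path):
    """Return {backend label: Markdown export} so the outputs can be compared."""
    return {
        name: make_converter(name).convert(pdf_path).document.export_to_markdown()
        for name in PDF_BACKENDS
    }
```

Calling `compare_backends("your.pdf")` and reading the two Markdown strings side by side should show quickly whether either backend extracts a usable text layer from the file.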
@cau-git Is there a general recommendation on which of the two backends performs better in most cases? Is there any documentation that discusses the differences and tradeoffs between them?
In case anyone needs it, here is the link to the documentation covering the different inference methods for docling: https://ds4sd.github.io/docling/examples/full_page_ocr/
@cau-git Thanks for the help. I used the following config for docling inference and the issue was resolved.
# Import path as of docling 2.x
from docling.datamodel.pipeline_options import (
    EasyOcrOptions,
    PdfPipelineOptions,
    TableFormerMode,
)

# Set up the pipeline options for PDF conversion
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.do_table_structure = True
# Map the predicted table structure back to the page's text cells
# (False would use the text cells predicted by the table structure model)
pipeline_options.table_structure_options.do_cell_matching = True
pipeline_options.table_structure_options.mode = TableFormerMode.ACCURATE
# Any of the OCR options can be used: EasyOcrOptions, TesseractOcrOptions, TesseractCliOcrOptions, OcrMacOptions (Mac only), RapidOcrOptions
# (import whichever one you pick from the same module)
ocr_options = EasyOcrOptions(force_full_page_ocr=True)
# ocr_options = TesseractOcrOptions(force_full_page_ocr=True)
# ocr_options = OcrMacOptions(force_full_page_ocr=True)
# ocr_options = RapidOcrOptions(force_full_page_ocr=True)
# ocr_options = TesseractCliOcrOptions(force_full_page_ocr=True)
pipeline_options.ocr_options = ocr_options
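The config above stops before the conversion itself. As a rough sketch of the remaining step, the options can be handed to a `DocumentConverter` through `PdfFormatOption`, following the pattern in the full-page OCR example linked earlier. The function name `convert_with_full_page_ocr` is hypothetical, and the sketch assumes docling 2.x with EasyOCR installed.

```python
def convert_with_full_page_ocr(pdf_path):
    """Convert one PDF to Markdown, running OCR on every page in full
    instead of relying on the PDF's embedded text layer."""
    # Imported inside the function because loading docling is slow.
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import (
        EasyOcrOptions,
        PdfPipelineOptions,
    )
    from docling.document_converter import DocumentConverter, PdfFormatOption

    pipeline_options = PdfPipelineOptions()
    pipeline_options.do_ocr = True
    # force_full_page_ocr=True makes the OCR engine read whole pages.
    pipeline_options.ocr_options = EasyOcrOptions(force_full_page_ocr=True)

    # Attach the pipeline options to the PDF input format.
    converter = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
        }
    )
    return converter.convert(pdf_path).document.export_to_markdown()
```

Full-page OCR is noticeably slower than reading the text layer, so it is probably best reserved for files whose text layer is known to be garbled.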
Bug
PDF - font
Markdown Results
...
Docling version
2.8.0 ...
Python version
... 3.10.12