DS4SD / docling

Get your documents ready for gen AI
https://ds4sd.github.io/docling
MIT License
12.27k stars, 616 forks

Docling having issue processing this font in pdf #460

Open Arslan-Mehmood1 opened 5 days ago

Arslan-Mehmood1 commented 5 days ago

Bug

PDF - font

[image]

Markdown Results

[image]

...

Docling version

2.8.0 ...

Python version

... 3.10.12

cau-git commented 4 days ago

@Arslan-Mehmood1 Some PDFs simply have garbled text layers like this one, and the text cannot be rescued from them directly. Some strategies that could help:

  1. Check what you get when using our docling-parse-v2 or our pypdfium PDF backends
  2. Enable forced full-page OCR, so that the whole document is processed with OCR instead of relying on the PDF backend output
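
For the first strategy, a minimal sketch of switching the PDF backend might look like the following. The short names in `BACKENDS` and the helpers `load_backend` and `convert_with_backend` are hypothetical and only for illustration. The module paths are assumed to match docling 2.8.0 and may differ in other versions.

```python
import importlib

# Short names (hypothetical) mapped to (module, class) for the two backends
# mentioned above. Module paths are assumed from docling 2.8.0.
BACKENDS = {
    "pypdfium": ("docling.backend.pypdfium2_backend", "PyPdfiumDocumentBackend"),
    "docling-parse-v2": (
        "docling.backend.docling_parse_v2_backend",
        "DoclingParseV2DocumentBackend",
    ),
}


def load_backend(name):
    """Return the backend class registered under `name`."""
    if name not in BACKENDS:
        raise ValueError(f"unknown backend {name!r}; choose from {sorted(BACKENDS)}")
    module_name, class_name = BACKENDS[name]
    return getattr(importlib.import_module(module_name), class_name)


def convert_with_backend(pdf_path, backend_name):
    """Convert one PDF with the chosen backend and return Markdown."""
    backend_cls = load_backend(backend_name)
    # Imported here so that this module loads even where docling is missing.
    from docling.datamodel.base_models import InputFormat
    from docling.document_converter import DocumentConverter, PdfFormatOption

    converter = DocumentConverter(
        format_options={InputFormat.PDF: PdfFormatOption(backend=backend_cls)}
    )
    return converter.convert(pdf_path).document.export_to_markdown()
```

Calling `convert_with_backend(path, "pypdfium")` and `convert_with_backend(path, "docling-parse-v2")` on the same file gives two outputs that can be compared side by side.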

Arslan-Mehmood1 commented 4 days ago

@cau-git Thanks man. I'll test and report back here.

simonschoe commented 3 days ago

> 1. Check what you get when using our docling-parse-v2 or our pypdfium PDF backends

@cau-git Is there a general recommendation as to which of the two backends performs better in most cases? Is there any documentation that discusses the differences and tradeoffs between the two backends?
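
Up to this point the thread gives no general recommendation, so one option is to compare the backends on your own documents. The sketch below scores extracted text by the share of suspicious characters (replacement characters, private-use, control, surrogate, and unassigned code points). This is only a rough heuristic and an assumption, not something docling provides: `garbled_ratio` and `pick_cleanest` are hypothetical helpers, and a garbled layer made of ordinary but wrong letters would not be caught.

```python
import unicodedata

# Unicode general categories that rarely appear in clean extracted text:
# control (Cc), private use (Co), unassigned (Cn), surrogate (Cs).
SUSPECT_CATEGORIES = {"Cc", "Co", "Cn", "Cs"}


def garbled_ratio(text):
    """Return the fraction of non-whitespace characters that look suspicious."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    suspect = sum(
        1
        for c in chars
        if c == "\ufffd" or unicodedata.category(c) in SUSPECT_CATEGORIES
    )
    return suspect / len(chars)


def pick_cleanest(outputs):
    """Given {backend_name: extracted_text}, return the name with the lowest ratio."""
    return min(outputs, key=lambda name: garbled_ratio(outputs[name]))
```

Feeding the Markdown produced by each backend into `pick_cleanest` gives a quick first signal; reading a few pages by eye is still the more reliable check.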

Arslan-Mehmood1 commented 2 days ago

In case anyone needs the link to the documentation covering the different inference methods for docling: https://ds4sd.github.io/docling/examples/full_page_ocr/

Arslan-Mehmood1 commented 2 days ago

@cau-git Thanks for the help. I used the following config for docling inference, and the issue was resolved.

# Imports used below (the commented-out OCR option classes live in the same module)
from docling.datamodel.pipeline_options import (
    EasyOcrOptions,
    PdfPipelineOptions,
    TableFormerMode,
)

# Set up the pipeline options for PDF conversion
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.do_table_structure = True
pipeline_options.table_structure_options.do_cell_matching = True  # match the predicted table structure back to the page's text cells
pipeline_options.table_structure_options.mode = TableFormerMode.ACCURATE

# Any of the OCR options can be used: EasyOcrOptions, TesseractOcrOptions, TesseractCliOcrOptions, OcrMacOptions (Mac only), RapidOcrOptions
# force_full_page_ocr=True runs OCR on the whole page instead of relying on the PDF text layer
ocr_options = EasyOcrOptions(force_full_page_ocr=True)
# ocr_options = TesseractOcrOptions(force_full_page_ocr=True)
# ocr_options = OcrMacOptions(force_full_page_ocr=True)
# ocr_options = RapidOcrOptions(force_full_page_ocr=True)
# ocr_options = TesseractCliOcrOptions(force_full_page_ocr=True)
pipeline_options.ocr_options = ocr_options
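
The config above stops before the conversion itself. A minimal sketch of that last step, modeled on the full-page OCR example linked earlier, could look like this. The helper name `convert_pdf_to_markdown` is hypothetical, and it takes a `pipeline_options` object built as shown above.

```python
from pathlib import Path


def convert_pdf_to_markdown(pdf_path, pipeline_options):
    """Convert one PDF with the given pipeline options and return Markdown."""
    # Imported here so that this module loads even where docling is missing.
    from docling.datamodel.base_models import InputFormat
    from docling.document_converter import DocumentConverter, PdfFormatOption

    # Attach the options to the PDF format so they apply to every PDF input.
    converter = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
        }
    )
    result = converter.convert(Path(pdf_path))
    return result.document.export_to_markdown()
```

Usage would be along the lines of `markdown = convert_pdf_to_markdown("my_document.pdf", pipeline_options)`, where the file name is a placeholder.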