DS4SD / docling

Get your documents ready for gen AI
https://ds4sd.github.io/docling
MIT License
11.69k stars, 582 forks

Docling Produces Unreadable Text Output for PDF with non-standard Font Encoding, OCR Appears Not to be Applied #185

Closed imene-swaan closed 2 weeks ago

imene-swaan commented 1 month ago

Description:

I'm using Docling to parse a PDF that contains text. The PDF appears to use a non-standard font or encoding, as copying text directly from it also yields garbled characters. Despite setting do_ocr=True and specifying Tesseract as the OCR engine, Docling's output remains unreadable. Testing with Docling v1 produces a different, but similarly unreadable, output containing placeholder glyphs.

Here’s an example of the output generated by the current Docling version:

'()* +,- .+..  /01 02034567638469:; 4<8:=> -                 '()* +,- .+..  /01 02034567638469:; 4<8:=> 4-                 '()* +,- .+..

When using Docling v1, the output looks like this instead:

GLYPH<38> GLYPH<39> GLYPH<40> GLYPH<41> GLYPH<i255> GLYPH<43> GLYPH<44> GLYPH<45> GLYPH<i255> GLYPH<46> GLYPH<43> GLYPH<46> GLYPH<46>
## GLYPH<47> GLYPH<48> GLYPH<49>GLYPH<i255> GLYPH<48> GLYPH<51> GLYPH<48> GLYPH<52> ...

Steps to Reproduce:

Expected Behavior:

Docling should apply OCR, yielding readable output.

Observed Behavior:

The output consists of unreadable characters or placeholder glyphs, suggesting that Docling is not applying OCR despite do_ocr=True.
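Both failure modes shown above (runs of symbols and `GLYPH<...>` placeholders) can be flagged programmatically before the text is passed downstream. The following is only a rough sketch: the function name `looks_garbled` and the 0.5 threshold are assumptions made for illustration, not part of Docling, and a ratio test like this will miss garbled text that happens to contain many letters.

```python
import re

# Placeholder tokens such as GLYPH<38> or GLYPH<i255>, as seen in the Docling v1 output.
GLYPH_PLACEHOLDER = re.compile(r"GLYPH<[^>]*>")


def looks_garbled(text: str, min_alpha_ratio: float = 0.5) -> bool:
    """Heuristically report whether extracted text is probably unreadable.

    Hypothetical helper: returns True if the text contains GLYPH<...>
    placeholders, or if fewer than `min_alpha_ratio` of its non-whitespace
    characters are letters.
    """
    if GLYPH_PLACEHOLDER.search(text):
        return True
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return False  # nothing was extracted, so there is nothing to judge
    letters = sum(ch.isalpha() for ch in visible)
    return letters / len(visible) < min_alpha_ratio
```

A check like this could be used to decide whether a page should be re-processed with OCR.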

Environment:

Troubleshooting Steps Taken:

Additional Information:

When copying text directly from the PDF, it appears garbled, as follows:

WX?6469Y>ÿZ28:>
[ELAÿ'(OU-ÿPBAMÿAM*ÿ\]^ÿ_`aÿbbÿQEDcEC*-ÿAM*ÿV

When examining the PDF's font properties, I found that it uses Type3 fonts. The code for inspecting the font:

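import fitz  # PyMuPDF

pdf_path = "document.pdf"  # placeholder: path to the PDF being inspected
doc = fitz.open(pdf_path)
page = doc[0]  # first page
# each entry: (xref, ext, type, basefont, name, encoding, referencer)
fonts = page.get_fonts(full=True)
print(fonts)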

The output:

[(821, 'n/a', 'Type3', 'T1', 'T1', '', 0)]

The fact that Tesseract works independently implies that Docling might not be applying OCR correctly, even though do_ocr=True and Tesseract is specified as the engine. The differing outputs between Docling v1 and the current version may also indicate a change in how Docling handles such PDFs. Any insights or solutions for handling PDFs with embedded fonts would be greatly appreciated.
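The font listing above can be turned into a simple check for Type3 fonts. This is a minimal sketch, assuming the tuple layout that `page.get_fonts(full=True)` printed in this report, where the font type is the third field; `has_type3_font` is a hypothetical helper name, not a PyMuPDF or Docling function.

```python
from typing import Iterable, Sequence


def has_type3_font(fonts: Iterable[Sequence]) -> bool:
    """Return True if any font entry reports the type 'Type3'.

    Each entry is expected to look like the tuples printed by
    page.get_fonts(full=True): (xref, ext, type, basefont, name, encoding, ...).
    """
    return any(len(font) > 2 and font[2] == "Type3" for font in fonts)


# The entry reported for the problematic PDF in this issue:
print(has_type3_font([(821, "n/a", "Type3", "T1", "T1", "", 0)]))
```

A Type3 font does not by itself prove that the text layer is unusable, so this is best treated as a hint that the page may need OCR rather than as a definitive test.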

PeterStaar-IBM commented 1 month ago

@imene-swaan Thanks for bringing this up! Could you attach the pdf (even a single page is enough)? We will look into it asap!

cau-git commented 1 month ago

@imene-swaan thanks for the detailed report!

To clarify, OCR cannot help you in this case, because Docling does not run OCR unless an actual bitmap resource is detected in the PDF. Hence, OCR will never trigger on programmatic text, even if the font is unknown.

As a temporary workaround, you can choose a different PDF backend for this case, e.g. PyPdfiumDocumentBackend, to see if it helps. (Note that this workaround may come with other issues, such as merged table rows.)

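from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend
from docling.datamodel.base_models import InputFormat
from docling.document_converter import DocumentConverter, PdfFormatOption

# Use the pypdfium2-based backend instead of the default PDF backend
doc_converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(backend=PyPdfiumDocumentBackend)
    }
)

imene-swaan commented 1 month ago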

@cau-git It could be beneficial to save the pdf pages as images and then trigger the OCR in such cases.

Also, I tried PyPdfiumDocumentBackend and the results are the same.
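The suggestion of saving pages as images and running OCR on them can be tried outside Docling. The sketch below assumes PyMuPDF, Pillow, and pytesseract are installed along with a Tesseract binary; `ocr_pdf_pages` is a hypothetical helper name, and the 300 dpi default is an arbitrary choice rather than a recommendation from this thread.

```python
from typing import List


def ocr_pdf_pages(pdf_path: str, dpi: int = 300, lang: str = "eng") -> List[str]:
    """Render every page of a PDF to a bitmap and OCR it with Tesseract.

    Third-party imports are done inside the function so that merely
    defining it does not require the optional dependencies.
    """
    import io

    import fitz  # PyMuPDF
    import pytesseract
    from PIL import Image

    texts: List[str] = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            # Rasterize the page; the embedded (unusable) text layer is ignored.
            pixmap = page.get_pixmap(dpi=dpi)
            image = Image.open(io.BytesIO(pixmap.tobytes("png")))
            texts.append(pytesseract.image_to_string(image, lang=lang))
    return texts
```

Because the page is rasterized first, the result depends only on how the page looks, not on how its fonts are encoded.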

PeterStaar-IBM commented 1 month ago

@imene-swaan I think it is a particular font we do not (yet) support. If you can provide us a simple pdf sample, I might fix it for you.

imene-swaan commented 4 weeks ago

@PeterStaar-IBM here's an example: https://content.influencemap.org//site/data/000/982/Enel_corporate_website_energy_mix_June_2022_June_2022.pdf

PeterStaar-IBM commented 3 weeks ago

@imene-swaan I tracked it down. Bottom line: some web browsers have very poor PDF printers, meaning that they don't encode the text. You can test this yourself by copying the text and pasting it into a text file. What you see is mangled characters, because these printers only care about how the page prints.

This gives you two options:

  1. Try using OCR: We have several OCR options (easyOCR and tesserocr).
  2. Leverage our native HTML support: I think this is the preferred option. If you are printing a webpage anyway, it might be much faster to parse the HTML directly.

I tested it directly with this command,

poetry run docling --from html --to md "https://www.enel.com/company/stories/articles/2022/06/projects-innovative-electrification-renewables" --output ./scratch/

I found some issues and have fixed them in this PR (https://github.com/DS4SD/docling/pull/240). The output is pretty good (from the PR),

[Screenshot from the PR, 2024-11-05 at 06:53:27]

I will review the PR with my colleagues and make sure it gets in asap!

Thanks for pointing this issue out!

imene-swaan commented 3 weeks ago

@PeterStaar-IBM As I mentioned in my issue description, the main issue seems to be that OCR is not applied even if I specify do_ocr=True. @cau-git mentioned that OCR is not triggered unless there is an actual bitmap resource element.

An ideal solution would be to force trigger OCR if the font is unknown and do_ocr=True.

PeterStaar-IBM commented 3 weeks ago

@imene-swaan Yes, we are indeed adding the forced OCR feature!

nikos-livathinos commented 2 weeks ago

@PeterStaar-IBM I have opened PR #290, which introduces the parameter OcrOptions.force_full_page_ocr.

I have tried with the provided sample PDF document and it seems to work well.

Please check this example that demonstrates how to force OCR: https://github.com/DS4SD/docling/blob/force_ocr/docs/examples/full_page_ocr.py
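For readers who want a quick idea of what using `force_full_page_ocr` might look like, here is a minimal sketch. Only the parameter name comes from this thread; the class names `PdfPipelineOptions` and `TesseractCliOcrOptions` and their import paths are assumptions based on the Docling v2 package layout, and `build_forced_ocr_converter` is a hypothetical helper. The linked example in the repository remains the authoritative reference.

```python
def build_forced_ocr_converter():
    """Build a DocumentConverter that OCRs every PDF page in full.

    Assumes a Docling release that includes OcrOptions.force_full_page_ocr.
    Imports are done inside the function so that defining it does not
    require Docling to be installed.
    """
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import (
        PdfPipelineOptions,
        TesseractCliOcrOptions,
    )
    from docling.document_converter import DocumentConverter, PdfFormatOption

    pipeline_options = PdfPipelineOptions()
    pipeline_options.do_ocr = True
    # Ignore the embedded text layer and OCR the whole page instead.
    pipeline_options.ocr_options = TesseractCliOcrOptions(force_full_page_ocr=True)

    return DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
        }
    )
```

With such a converter, something like `build_forced_ocr_converter().convert("sample.pdf").document.export_to_markdown()` would be expected to return the OCR-based text, where `sample.pdf` is a placeholder path.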