Closed imene-swaan closed 2 weeks ago
@imene-swaan Thanks for bringing this up! Could you attach the pdf (even a single page is enough)? We will look into it asap!
@imene-swaan thanks for the detailed report!
To clarify, OCR cannot help you in this case, because docling does not run OCR unless an actual bitmap resource is detected in the PDF. Hence, OCR will never trigger on programmatic text, even if the font is unknown.
As a temporary workaround, you can choose a different PDF backend for this case, e.g. PyPdfiumDocumentBackend, to see if this helps. (Note that this workaround may come with other issues, such as merged table rows.)
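from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend
from docling.datamodel.base_models import InputFormat
from docling.document_converter import DocumentConverter, PdfFormatOption

# Use the pypdfium2-based backend for PDF inputs instead of the default one
doc_converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(backend=PyPdfiumDocumentBackend)
    }
)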
@cau-git It could be beneficial to save the pdf pages as images and then trigger the OCR in such cases.
Also, I tried PyPdfiumDocumentBackend and the results are the same.
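The idea of saving pages as images and running OCR on them can be sketched outside docling as follows. This is only an illustration under assumptions: it uses pypdfium2 to rasterize and pytesseract (plus a local Tesseract install) to recognize text, neither of which is how docling itself wires OCR, and the helper name `ocr_pdf_pages` is hypothetical.

```python
def ocr_pdf_pages(pdf_path, scale=2.0):
    """Rasterize every page of a PDF and OCR the resulting images.

    Hypothetical helper, not part of docling. Imports are deferred so the
    function can be defined even where the OCR dependencies are missing.
    """
    import pypdfium2 as pdfium
    import pytesseract

    pdf = pdfium.PdfDocument(pdf_path)
    texts = []
    try:
        for index in range(len(pdf)):
            page = pdf[index]
            # Render to a bitmap first, so the broken text layer is ignored entirely
            image = page.render(scale=scale).to_pil()
            texts.append(pytesseract.image_to_string(image))
    finally:
        pdf.close()
    return texts
```

A larger `scale` generally gives Tesseract more pixels per glyph at the cost of slower rendering; the value 2.0 here is an arbitrary starting point.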
@imene-swaan I think it is a particular font we do not (yet) support. If you can provide us a simple pdf sample, I might fix it for you.
@PeterStaar-IBM here's an example: https://content.influencemap.org//site/data/000/982/Enel_corporate_website_energy_mix_June_2022_June_2022.pdf
@imene-swaan I tracked it down. Bottom line: some web browsers have very bad PDF printers (meaning that they don't encode the text). You can test it yourself by copying the text and pasting it into a text file. What you see is mangled characters, because these printers only care about how the page looks when printed.
This gives you two options:
I tested it directly with this command,
poetry run docling --from html --to md "https://www.enel.com/company/stories/articles/2022/06/projects-innovative-electrification-renewables" --output ./scratch/
I found some issues and have fixed them in this PR (https://github.com/DS4SD/docling/pull/240). The output is pretty good (from the PR),
I will review the PR with my colleagues and make sure it gets in asap!
Thanks for pointing this issue out!
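The copy-and-paste check described above can be roughly automated. The sketch below is a heuristic, not something docling does: it flags text dominated by private-use, control, unassigned, or replacement characters. The function names and the 0.3 threshold are assumptions, and it will not catch PDFs whose glyphs map to valid but wrong letters.

```python
import unicodedata


def garbled_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that look like extraction debris."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    suspicious = 0
    for c in chars:
        category = unicodedata.category(c)
        # Co = private use, Cc = control, Cn = unassigned;
        # U+FFFD is the Unicode replacement character
        if category in {"Co", "Cc", "Cn"} or c == "\ufffd":
            suspicious += 1
    return suspicious / len(chars)


def looks_garbled(text: str, threshold: float = 0.3) -> bool:
    """Heuristic: True when more than `threshold` of the characters are suspicious."""
    return garbled_ratio(text) > threshold
```

Running this on the text extracted from a page could serve as a cue to fall back to OCR for that page.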
@PeterStaar-IBM As I mentioned in my issue description, the main issue seems to be that OCR is not applied even if I specify do_ocr=True. @cau-git mentioned that OCR is not triggered unless there is an actual bitmap resource element.
An ideal solution would be to force trigger OCR if the font is unknown and do_ocr=True.
@imene-swaan Yes, we are indeed adding the forced OCR feature!
@PeterStaar-IBM I have opened PR #290, which introduces the parameter OcrOptions.force_full_page_ocr.
I have tried with the provided sample PDF document and it seems to work well.
Please check this example that demonstrates how to force OCR: https://github.com/DS4SD/docling/blob/force_ocr/docs/examples/full_page_ocr.py
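For quick reference, a minimal sketch of how the new parameter might be used is shown here. It assumes the `force_full_page_ocr` option from the PR is available on `TesseractCliOcrOptions` in the installed docling version. The helper names `build_forced_ocr_converter` and `convert_with_forced_ocr` are hypothetical, and the linked example remains the authoritative version.

```python
def build_forced_ocr_converter():
    """Build a DocumentConverter that OCRs every PDF page in full.

    Hypothetical helper; imports are deferred so the function can be
    defined even where docling is not installed.
    """
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import (
        PdfPipelineOptions,
        TesseractCliOcrOptions,
    )
    from docling.document_converter import DocumentConverter, PdfFormatOption

    pipeline_options = PdfPipelineOptions()
    pipeline_options.do_ocr = True
    # Assumption: this flag makes OCR run on the whole page instead of
    # only on detected bitmap regions, as described in the PR
    pipeline_options.ocr_options = TesseractCliOcrOptions(force_full_page_ocr=True)

    return DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
        }
    )


def convert_with_forced_ocr(pdf_path):
    """Convert one PDF with forced OCR and return the result as Markdown."""
    result = build_forced_ocr_converter().convert(pdf_path)
    return result.document.export_to_markdown()
```

Forcing OCR on every page is slower than relying on the embedded text layer, so it is probably best reserved for documents whose text layer is known to be unusable.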
Description:
I'm using Docling to parse a PDF that contains text. The PDF appears to use a non-standard font or encoding, as copying text directly from it also yields garbled characters. Despite setting do_ocr=True and specifying Tesseract as the OCR engine, Docling's output remains unreadable. Testing with Docling v1 produces a different, but similarly unreadable, output containing placeholder glyphs.
Here’s an example of the output generated by the current Docling version:
When using Docling v1, the output looks like this instead:
Steps to Reproduce:
Configure PdfPipelineOptions with do_ocr=True to enable OCR.
Set ocr_options to use TesseractCliOcrOptions.
Expected Behavior:
Docling should apply OCR, yielding readable output.
Observed Behavior:
The output consists of unreadable characters or placeholder glyphs, suggesting that Docling is not applying OCR despite do_ocr=True.
Environment:
Troubleshooting Steps Taken:
Additional Information:
When copying text directly from the PDF, it appears garbled, as follows:
When examining the PDF's font properties, I found that it uses Type3 fonts. The code for inspecting the font:
The output:
The fact that Tesseract works independently implies that Docling might not be applying OCR correctly, even though do_ocr=True and Tesseract is specified as the engine. The differing outputs between Docling v1 and the current version may also indicate a change in how Docling handles such PDFs. Any insights or solutions for handling PDFs with embedded fonts would be greatly appreciated.