docling identified my entire page as a picture

DS4SD / docling

Get your documents ready for gen AI

https://ds4sd.github.io/docling

MIT License

docling identified my entire page as a picture #357

Open aodingpeng opened 4 days ago

aodingpeng commented 4 days ago

Bug

I need to extract the content of this page, but Docling seems to have classified the entire page as a picture.

File: ISO IEC 23090-5DUP.pdf

Is there any way to solve this problem?

mllife commented 4 days ago

This is a scanned document. You should use the OCR argument to parse it.
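For readers using the Python API rather than the command line, enabling OCR might look like the sketch below. It assumes the Docling v2 Python API (`DocumentConverter`, `PdfFormatOption`, `PdfPipelineOptions` with its `do_ocr` flag); `convert_with_ocr` is a hypothetical helper name, not part of Docling, and nothing here is confirmed by the thread itself.

```python
def convert_with_ocr(pdf_path: str) -> str:
    """Convert a PDF with OCR enabled and return the result as Markdown.

    Hypothetical helper. The Docling imports sit inside the function so
    that defining it does not require Docling to be installed.
    """
    # Assumed Docling v2 module layout.
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import PdfPipelineOptions
    from docling.document_converter import DocumentConverter, PdfFormatOption

    # Turn OCR on explicitly for the PDF pipeline.
    pipeline_options = PdfPipelineOptions()
    pipeline_options.do_ocr = True

    converter = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
        }
    )
    result = converter.convert(pdf_path)
    return result.document.export_to_markdown()
```

Calling `convert_with_ocr("ISO IEC 23090-5DUP.pdf")` would then return the Markdown export of the file, provided Docling and an OCR backend are installed.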

aodingpeng commented 4 days ago

> This is a scanned document. You should use OCR argument to parse it.

I added the OCR option to my command, but the layout analysis still treats the page as a picture.

aodingpeng commented 3 days ago

> This is a scanned document. You should use OCR argument to parse it.

I am new to this. How do I run Docling from the source code?

cau-git commented 3 days ago

@aodingpeng I will investigate this issue. My suspicion is that the layout of this page is wrongly detected as a full-page picture, hence all content inside the detected picture is lost (so far Docling ignores in-picture text). OCR alone won't solve this.
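The suspected failure mode can be illustrated with a small toy model. This is only a sketch under stated assumptions, not Docling's actual implementation: `Box`, `Region`, `TextCell`, `surviving_text`, and `is_full_page_picture` are hypothetical names, and the 0.9 area threshold is an arbitrary choice. It shows why text cells vanish when a single picture region covers the page, no matter how the cells were obtained (embedded text or OCR).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    left: float
    top: float
    right: float
    bottom: float

    def area(self) -> float:
        width = max(0.0, self.right - self.left)
        height = max(0.0, self.bottom - self.top)
        return width * height

    def contains(self, other: "Box") -> bool:
        return (
            self.left <= other.left
            and self.top <= other.top
            and self.right >= other.right
            and self.bottom >= other.bottom
        )


@dataclass(frozen=True)
class Region:
    label: str  # e.g. "picture" or "text" (hypothetical labels)
    box: Box


@dataclass(frozen=True)
class TextCell:
    text: str
    box: Box


def surviving_text(regions, cells):
    """Return the text of cells that are not inside any picture region."""
    pictures = [r.box for r in regions if r.label == "picture"]
    return [c.text for c in cells if not any(p.contains(c.box) for p in pictures)]


def is_full_page_picture(region, page, threshold=0.9):
    """Flag a picture region that covers at least `threshold` of the page area."""
    return region.label == "picture" and region.box.area() >= threshold * page.area()


# A US-letter-sized page in points, with two placeholder text cells.
page = Box(0, 0, 612, 792)
cells = [
    TextCell("line one", Box(72, 90, 200, 104)),
    TextCell("line two", Box(72, 120, 540, 134)),
]

# Wrong layout: one picture region spanning almost the whole page.
bad_layout = [Region("picture", Box(5, 5, 607, 787))]
# Plausible layout: a text region around the two lines.
good_layout = [Region("text", Box(60, 80, 550, 150))]

lost = surviving_text(bad_layout, cells)   # every cell is inside the picture
kept = surviving_text(good_layout, cells)  # no picture region, nothing dropped
suspicious = [r for r in bad_layout if is_full_page_picture(r, page)]
```

In this model the OCR step could produce perfect text cells and the output would still be empty for `bad_layout`, which matches the point that OCR alone won't solve the problem.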