DS4SD / docling

Get your documents ready for gen AI
https://ds4sd.github.io/docling
MIT License
10.48k stars, 507 forks

Support for HOCR? #366

Closed 4F2E4A2E closed 3 days ago

4F2E4A2E commented 3 days ago

Question

Hello and thank you for this repo. Is there a plan for hOCR [1] support?

[1] https://en.wikipedia.org/wiki/HOCR

cau-git commented 3 days ago

@4F2E4A2E Hello. As I understand it, hOCR is a format specification for the output of OCR software, so it is the kind of output you would expect from a plain OCR tool. Docling, however, is not OCR software itself; rather, it integrates OCR engines (EasyOCR, Tesseract, ...) to retrieve content from the scanned parts of PDFs. This information, together with the programmatic content of the PDF, is then processed in a specialized pipeline of AI models to recover layout, table structure, and other data. The result of this pipeline is represented as a DoclingDocument data type, which you can export to various formats, such as JSON, Markdown, or Doctags. DoclingDocument is a document-centric, structured format that does not preserve the raw text cells and coordinates that OCR software produces.
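
To make the distinction concrete, here is a minimal sketch of what "raw text cells and coordinates" look like in hOCR. It uses only the Python standard library and is not part of Docling. The `SAMPLE_HOCR` fragment is hand-written for illustration (not real engine output), and the names `parse_bbox`, `HocrWordParser`, and `extract_words` are hypothetical helpers. It assumes the common hOCR convention of `ocrx_word` spans whose `title` attribute carries a `bbox x0 y0 x1 y1` property.

```python
from html.parser import HTMLParser

# Hand-written hOCR fragment for illustration only (an assumption, not real OCR output).
SAMPLE_HOCR = """
<div class='ocr_page' title='bbox 0 0 600 800'>
  <span class='ocr_line' title='bbox 36 92 210 116'>
    <span class='ocrx_word' title='bbox 36 92 96 116; x_wconf 95'>Hello</span>
    <span class='ocrx_word' title='bbox 110 92 210 116; x_wconf 91'>world</span>
  </span>
</div>
"""


def parse_bbox(title):
    """Return (x0, y0, x1, y1) from an hOCR title attribute, or None if absent."""
    # Properties in the title are separated by ';', e.g. "bbox 36 92 96 116; x_wconf 95".
    for prop in title.split(";"):
        parts = prop.split()
        if len(parts) == 5 and parts[0] == "bbox":
            return tuple(int(p) for p in parts[1:])
    return None


class HocrWordParser(HTMLParser):
    """Collects (text, bbox) pairs for every span with class 'ocrx_word'."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._in_word = False
        self._bbox = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "span" and attr_map.get("class") == "ocrx_word":
            self._in_word = True
            self._bbox = parse_bbox(attr_map.get("title") or "")
            self._text = []

    def handle_data(self, data):
        # Text may arrive in several chunks, so accumulate it until the span closes.
        if self._in_word:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._in_word:
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._in_word = False


def extract_words(hocr):
    """Parse an hOCR string and return a list of (word, bbox) tuples."""
    parser = HocrWordParser()
    parser.feed(hocr)
    parser.close()
    return parser.words


if __name__ == "__main__":
    for word, bbox in extract_words(SAMPLE_HOCR):
        print(word, bbox)
```

Each entry is a single word with its pixel coordinates, which is the cell-level detail that a document-centric export such as Markdown does not carry.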

4F2E4A2E commented 3 days ago

I was not aware that Docling uses Tesseract under the hood. Thanks for clarifying.