Closed 4F2E4A2E closed 3 days ago
@4F2E4A2E Hello. As I understand it, hOCR is a format specification for the output of OCR software, so it is the kind of output you would expect from a plain OCR tool. Docling, however, is not OCR software itself; it integrates OCR engines (EasyOCR, Tesseract, ...) to retrieve content from the scanned parts of PDFs. This information, together with the programmatic content of the PDF, is then processed in a specialized pipeline of AI models to recover layout, table structure, and other data. The result of this pipeline is represented as a DoclingDocument data type, which you can export to various formats, such as JSON, Markdown, or DocTags. It is a document-centric, structured format that does not preserve the raw text cells and coordinates that OCR software produces.
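To make the contrast concrete, here is a rough sketch of the word-level data an hOCR file carries: each recognized word sits in a span whose `title` attribute holds a `bbox` in pixel coordinates. The sample markup below is hand-written for illustration, not actual Tesseract or Docling output, and `WordCollector` and `extract_words` are hypothetical helper names built only on the Python standard library.

```python
import re
from html.parser import HTMLParser

# Hand-written hOCR fragment for illustration only; not real engine output.
SAMPLE_HOCR = """
<div class='ocr_page' title='bbox 0 0 600 800'>
  <span class='ocr_line' title='bbox 36 92 210 116'>
    <span class='ocrx_word' title='bbox 36 92 96 116; x_wconf 95'>Hello</span>
    <span class='ocrx_word' title='bbox 110 92 210 116; x_wconf 91'>world</span>
  </span>
</div>
"""

# hOCR stores coordinates as "bbox x0 y0 x1 y1" inside the title attribute.
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class WordCollector(HTMLParser):
    """Collects (text, bbox) pairs from spans with class 'ocrx_word'."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._pending_bbox = None  # bbox of the word span currently open

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span" and attrs.get("class") == "ocrx_word":
            match = BBOX_RE.search(attrs.get("title") or "")
            if match:
                self._pending_bbox = tuple(int(v) for v in match.groups())

    def handle_data(self, data):
        # Pair the first non-blank text after a word span with its bbox.
        if self._pending_bbox is not None and data.strip():
            self.words.append((data.strip(), self._pending_bbox))
            self._pending_bbox = None


def extract_words(hocr):
    """Return a list of (word, (x0, y0, x1, y1)) tuples from hOCR markup."""
    parser = WordCollector()
    parser.feed(hocr)
    parser.close()
    return parser.words


if __name__ == "__main__":
    for word, bbox in extract_words(SAMPLE_HOCR):
        print(word, bbox)
```

This per-word text and coordinate detail is the raw information that, as described above, the document-centric DoclingDocument format does not preserve.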
I was not aware that Docling uses Tesseract under the hood. Thanks for clarifying.
Question
Hello and thank you for this repo. Is there a plan for hOCR [1] support?
1: https://en.wikipedia.org/wiki/HOCR