Command line tool to convert page layout files to the latest PAGE XML format. It supports all previous versions of the PAGE format as well as ALTO XML, FineReader XML, and hOCR.
Apache License 2.0
hOCR doc not properly converted when lacking certain typesettings #18
Hi all,
The HTML code below is the beginning of an hOCR file. It has been validated with hocr-spec.
In this state, however, prima-page-converter fails to render any line below the metadata.
The problem can be solved by adding a global ocr_par nested in a global ocr_carea, itself nested in the ocr_page area, like so:

In that case, conversion to PAGE XML works fine. Is this expected behaviour?
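
For anyone hitting the same problem, the workaround described above could probably be scripted rather than applied by hand. The following is only a rough sketch, assuming the hOCR file is well-formed XHTML that Python's standard xml.etree.ElementTree can parse. The function name wrap_page_children is made up for illustration and is not part of prima-page-converter or hocr-spec. It moves everything directly inside each ocr_page into a new ocr_carea > ocr_par pair and sets no bbox or id on the new elements, so whether the converter accepts the result would still need to be checked.

```python
import xml.etree.ElementTree as ET


def has_class(elem, name):
    """Return True if `name` is one of the element's space-separated classes."""
    return name in elem.get("class", "").split()


def wrap_page_children(root):
    """Hypothetical helper: wrap the content of each ocr_page in ocr_carea > ocr_par.

    Pages that already have an ocr_carea child are left untouched.
    Returns the number of pages that were modified.
    """
    # Collect the pages first so the tree is not modified while iterating.
    pages = [e for e in root.iter() if has_class(e, "ocr_page")]
    wrapped = 0
    for page in pages:
        if any(has_class(child, "ocr_carea") for child in page):
            continue
        # Reuse the page's XML namespace (if any) for the new elements.
        ns = page.tag[: page.tag.index("}") + 1] if page.tag.startswith("{") else ""
        carea = ET.Element(ns + "div", {"class": "ocr_carea"})
        par = ET.SubElement(carea, ns + "p", {"class": "ocr_par"})
        # Move every existing child of the page into the new paragraph.
        for child in list(page):
            page.remove(child)
            par.append(child)
        page.append(carea)
        wrapped += 1
    return wrapped
```

Usage would be along the lines of parsing the file with ET.parse, calling wrap_page_children on tree.getroot(), and writing the tree back out with tree.write. If the file declares the XHTML namespace, calling ET.register_namespace("", "http://www.w3.org/1999/xhtml") beforehand should keep the output free of generated prefixes.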