PRImA-Research-Lab / prima-page-converter

Command-line tool to convert page layout files to the latest PAGE XML format. It supports all previous versions of the PAGE format as well as ALTO XML, FineReader XML, and hOCR.
Apache License 2.0

Page Converter producing messy Unicode blocks #14

Closed by novacellus 4 years ago

novacellus commented 4 years ago

When converting Tesseract hOCR files, PageConverter sometimes produces messy blocks. Instead of representing the entire TextRegion in separate XML elements, the content of the region is inserted as a single Unicode element inside one of the elements, with each word on a separate line and blank lines inserted seemingly at random. I attach the hOCR and PAGE XML files for comparison: you can see the problem starting at par_7.

(tested with JPC 1.5 and 1.3)

LiegnUB--142.hocr.txt LiegnUB--142.xml.txt

novacellus commented 4 years ago

On closer analysis it seems that the issue occurs when JPageConverter stumbles upon <span class='ocr_textfloat'> instead of <span class='ocr_line'>.

chris1010010 commented 4 years ago

Is that the correct nesting: ocr_carea > ocr_par > ocr_textfloat > ocrx_word? It seems a bit odd.

novacellus commented 4 years ago

That's what Tesseract 4 produces, I'm afraid. Is there something wrong with the ocr_textfloat being embedded in an ocr_par, at least standard-wise? In that case it would be the same problem as in https://github.com/tesseract-ocr/tesseract/issues/2596.

chris1010010 commented 4 years ago

I'm not sure; the spec doc for hOCR is not very clear. It's also strange that there's no text line in the float. Are the floats one line only? Have you tried ALTO output?

novacellus commented 4 years ago

I ended up changing ocr_textfloat to ocr_line: ugly, but it worked for my purposes. In fact, ocr_textfloat may contain text words.
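For anyone who needs the same workaround, the relabelling can be sketched roughly as follows. This is a minimal sketch, not what PageConverter does internally: the function name `textfloat_to_line` is made up for illustration, and it assumes the class attribute holds exactly `ocr_textfloat` and nothing else, as in the files discussed here.

```python
import re

# Matches a class attribute whose value is exactly ocr_textfloat,
# quoted with either single or double quotes (assumption: no extra classes).
_TEXTFLOAT_CLASS = re.compile(r"""class=(['"])ocr_textfloat\1""")


def textfloat_to_line(hocr: str) -> str:
    """Relabel every ocr_textfloat element as ocr_line (hypothetical helper)."""
    return _TEXTFLOAT_CLASS.sub(r"class=\g<1>ocr_line\g<1>", hocr)


sample = (
    "<span class='ocr_textfloat' id='line_1_7'>"
    "<span class='ocrx_word'>Foo</span></span>"
)
print(textfloat_to_line(sample))
```

The id and all other attributes are left untouched, so only the element type seen by the converter changes. Run it on a copy of the hOCR file rather than the original.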

chris1010010 commented 4 years ago

I do the same in the code now. As Tesseract outputs it, it is virtually a text line (even the ID says line). The change will trickle through in the next release.

novacellus commented 4 years ago

Thanks. I just found another custom block that is processed incorrectly: ocr_header (which seems to be perfectly compliant with the hOCR specification: http://kba.cloud/hocr-spec/1.2/).

chris1010010 commented 4 years ago

The problem is I can't find in the specification what the allowed child elements are. Does Tesseract output headers as text lines like the floats?

novacellus commented 4 years ago

It seems that there are four possibilities: ocr_header, ocr_textfloat, ocr_caption, ocr_line. Cf. https://github.com/tesseract-ocr/tesseract/blob/master/src/api/hocrrenderer.cpp (lines 211-225).
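To check which of these four classes a given hOCR file actually contains, something like the following could be used. It is only a sketch: `LineClassCounter` and `count_line_classes` are illustrative names, and the set of classes is simply the four listed above, not an authoritative list from the hOCR specification.

```python
from collections import Counter
from html.parser import HTMLParser

# The four line-level classes named above (assumption: the list is complete).
LINE_LEVEL_CLASSES = {"ocr_header", "ocr_textfloat", "ocr_caption", "ocr_line"}


class LineClassCounter(HTMLParser):
    """Counts start tags that carry one of the line-level hOCR classes."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # A class attribute may hold several space-separated names.
        classes = (dict(attrs).get("class") or "").split()
        for name in classes:
            if name in LINE_LEVEL_CLASSES:
                self.counts[name] += 1


def count_line_classes(hocr: str) -> Counter:
    """Return how often each line-level class occurs in an hOCR string."""
    parser = LineClassCounter()
    parser.feed(hocr)
    parser.close()
    return parser.counts


sample = (
    "<p class='ocr_par'>"
    "<span class='ocr_header'><span class='ocrx_word'>Title</span></span>"
    "<span class='ocr_line'><span class='ocrx_word'>Body</span></span>"
    "</p>"
)
print(count_line_classes(sample))
```

Any class with a non-zero count other than ocr_line is a candidate for the same line-like handling as ocr_textfloat.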