Closed by novacellus 4 years ago
On closer analysis, it seems that the issue occurs when JPageConverter stumbles upon <span class='ocr_textfloat'> instead of <span class='ocr_line'>.
Is that the correct nesting: ocr_carea - ocr_par - ocr_textfloat - ocrx_word? Seems a bit odd.
That's what Tesseract 4 produces, I'm afraid. Is there something wrong with ocr_textfloat being embedded in an ocr_par, at least standard-wise? In that case it would be the same problem as in https://github.com/tesseract-ocr/tesseract/issues/2596.
I'm not sure; the spec doc for hOCR is not very clear. It's also strange that there's no text line in the float. Are they one line only? Have you tried ALTO output?
I ended up changing ocr_textfloat to ocr_line: ugly but worked for my purposes. In fact, ocr_textfloat may contain text words.
I do the same in the code now. As Tesseract outputs it, it virtually is a text line (even the ID says line). The change will trickle through in the next release.
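For anyone who needs the workaround before that release, the renaming described above can be sketched as a small pre-processing step on the hOCR file. This is only an illustration, not the converter's actual code: the function name `textfloat_to_line` is made up here, and the pattern assumes the class attribute holds exactly one class name, as in the `<span class='ocr_textfloat'>` output quoted in this thread.

```python
import re

# Assumption: the attribute value is exactly "ocr_textfloat" (a single class),
# quoted with either ' or ", as in the Tesseract output quoted in this thread.
_TEXTFLOAT_CLASS = re.compile(r"""(class\s*=\s*)(['"])ocr_textfloat\2""")


def textfloat_to_line(hocr: str) -> str:
    """Return the hOCR text with ocr_textfloat elements relabelled as ocr_line.

    Only the class name changes; ids, title attributes and the words inside
    the element are left untouched.
    """
    return _TEXTFLOAT_CLASS.sub(r"\g<1>\g<2>ocr_line\g<2>", hocr)


if __name__ == "__main__":
    sample = "<span class='ocr_textfloat' id='line_1_3'>"
    print(textfloat_to_line(sample))
```

The rewritten text can then be saved to a new file and passed to the converter in place of the original hOCR.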
Thanks. Just found another custom block that is processed incorrectly: ocr_header (seems to be perfectly compliant with the hOCR specification: http://kba.cloud/hocr-spec/1.2/).
The problem is I can't find in the specification what the allowed child elements are. Does Tesseract output headers as text lines like the floats?
It seems that there are four possibilities: ocr_header, ocr_textfloat, ocr_caption, ocr_line. Cf. https://github.com/tesseract-ocr/tesseract/blob/master/src/api/hocrrenderer.cpp (lines 211-225).
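To check which of these four classes actually occur in a given hOCR file, a quick count can help. The following is a rough sketch using only the Python standard library; the names `LINE_LIKE`, `LineClassCounter` and `count_line_classes` are invented for this example, and the set of classes is taken from the list above rather than from the hOCR specification.

```python
from collections import Counter
from html.parser import HTMLParser

# The four line-level classes named above (an assumption based on this thread).
LINE_LIKE = {"ocr_line", "ocr_header", "ocr_textfloat", "ocr_caption"}


class LineClassCounter(HTMLParser):
    """Counts start tags whose class attribute contains a line-level class."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        classes = (dict(attrs).get("class") or "").split()
        for name in classes:
            if name in LINE_LIKE:
                self.counts[name] += 1


def count_line_classes(hocr: str) -> Counter:
    """Return how often each line-level class appears in the hOCR text."""
    parser = LineClassCounter()
    parser.feed(hocr)
    parser.close()
    return parser.counts


if __name__ == "__main__":
    sample = (
        "<p class='ocr_par'>"
        "<span class='ocr_header' id='line_1_1'><span class='ocrx_word'>Title</span></span>"
        "<span class='ocr_line' id='line_1_2'><span class='ocrx_word'>Text</span></span>"
        "</p>"
    )
    print(count_line_classes(sample))
```

A non-zero count for anything other than ocr_line points to elements that may need the same treatment as ocr_textfloat.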
When converting Tesseract hOCR files, PageConverter sometimes produces messy blocks. Instead of representing the entire TextRegion as a separate XML element, the content of the region is inserted as a Unicode element in one of the elements, with each word on a separate line and blank lines inserted at random. I attach hOCR and PAGE XML files for comparison: you can see the problem starting at par_7.
(tested with JPC 1.5 and 1.3)
LiegnUB--142.hocr.txt LiegnUB--142.xml.txt