zuphilip closed this issue 4 years ago
Apparently ALTO uses a <HYP> tag, which is currently not handled by the transformation. I did not find a similar tag in the hOCR specification.
http://digi.bib.uni-mannheim.de/~stweil/ocr-praxis/Testseiten/hocr/417576986_0012.hocr is the result of an extended style sheet where I used a quick hack to produce something useful.
Hm, I see an extra node in your result, which produces an additional space. Our goal is to get
<span class="ocrx_word" id="word_d1e73" title="bbox 596 132 675 158">con-</span>
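I think we can try to replace these lines with something like
<xsl:template match="String">
  <span class="ocrx_word" id="{mf:getId(@ID,'word',.)}" title="{mf:getBox(@HEIGHT,@WIDTH,@VPOS,@HPOS)}">
    <xsl:value-of select="@CONTENT"/>
    <!-- In ALTO the HYP element follows the String it belongs to (at the end of a TextLine),
         so look at the next sibling rather than at preceding ones. -->
    <xsl:value-of select="following-sibling::*[1][self::HYP]/@CONTENT"/>
  </span>
</xsl:template>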
But this is untested...
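To make the intended result concrete, here is a minimal Python sketch of the merging step, not the stylesheet itself. It assumes that a HYP element directly follows the String it belongs to inside a TextLine. The ALTO fragment is made up and namespace-free, and `words_with_hyphens` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Made-up, namespace-free ALTO fragment for illustration only.
ALTO_LINE = """<TextLine>
  <String CONTENT="a"/>
  <SP/>
  <String CONTENT="con"/>
  <HYP CONTENT="-"/>
</TextLine>"""


def words_with_hyphens(text_line):
    """Return the word texts of a TextLine, appending the CONTENT of a
    HYP element to the String that immediately precedes it."""
    words = []
    children = list(text_line)
    for i, child in enumerate(children):
        if child.tag != "String":
            continue  # SP and HYP do not become words of their own
        text = child.get("CONTENT", "")
        nxt = children[i + 1] if i + 1 < len(children) else None
        if nxt is not None and nxt.tag == "HYP":
            text += nxt.get("CONTENT", "")
        words.append(text)
    return words


print(words_with_hyphens(ET.fromstring(ALTO_LINE)))  # ['a', 'con-']
```

The point of the sketch is that the hyphen ends up inside the word ("con-") instead of in a separate node, so no extra space appears between them.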
Ping @kba. Here are two examples of ALTO files with hyphenations (just search for the HYP node):
It looks like CONTENT is empty in our HYP tags, i.e. we may have to use something like a conditional statement.
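One possible reading of that conditional, sketched in Python rather than XSLT: use the CONTENT of the HYP element when it is non-empty, and fall back to a default character otherwise. The fallback "-" and the name `hyphen_text` are assumptions for illustration and are not taken from the stylesheet.

```python
import xml.etree.ElementTree as ET


def hyphen_text(hyp, fallback="-"):
    """Return the hyphen to append for a HYP element.

    Uses the CONTENT attribute when it is present and non-empty,
    otherwise the assumed fallback character.
    """
    content = hyp.get("CONTENT")
    return content if content else fallback


print(hyphen_text(ET.fromstring('<HYP CONTENT="="/>')))  # CONTENT is used
print(hyphen_text(ET.fromstring('<HYP CONTENT=""/>')))   # empty: fallback
print(hyphen_text(ET.fromstring('<HYP/>')))              # missing: fallback
```

Whether a plain hyphen is the right fallback depends on the material. As noted elsewhere in this thread, the printed character is not always a simple "-".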
Here's a small change to the alto2hocr.xsl script that should accomplish this: https://github.com/kba/hOCR-to-ALTO/commit/f447acea8cc90f8085e031d058626968475f4a0a
The commit https://github.com/kba/hOCR-to-ALTO/commit/f447acea8cc90f8085e031d058626968475f4a0a has been merged, so this issue can probably be closed.
I wasn't sure anymore if this was already included. Thank you for the confirmation @filak !
Example (the dash is actually normally an em dash):
will be transformed into
and the dash is missing. For correct presentation (and error computation) it should be part of the first word here. Can this be done easily with the XSLT approach here?