UB-Mannheim / ocr-fileformat

Validate and transform various OCR file formats (hOCR, ALTO, PAGE, FineReader)
https://digi.bib.uni-mannheim.de/ocr-fileformat/
MIT License

alto2hocr: Hyphenation sign is not handled correctly #6

Closed zuphilip closed 4 years ago

zuphilip commented 8 years ago

Example (the dash is normally an em dash):

   ...
   <String WC="0.6233333349" CONTENT="con" HEIGHT="26" WIDTH="79" VPOS="132" HPOS="596" SUBS_TYPE="HypPart1" SUBS_CONTENT="conservation"/>
   <HYP CONTENT="­-­"/>
</TextLine>
<TextLine HEIGHT="43" WIDTH="679" VPOS="175" HPOS="12">
   <String WC="0.7411110997" CONTENT="servation" HEIGHT="43" WIDTH="194" VPOS="175" HPOS="12" SUBS_TYPE="HypPart2" SUBS_CONTENT="conservation"/>
   ...

will be transformed into

   ...
   <span class="ocrx_word" id="word_d1e73" title="bbox 596 132 675 158">con</span>
</span>
<span class="ocr_line" id="line_d1e76" title="bbox 12 175 691 218">
   <span class="ocrx_word" id="word_d1e77" title="bbox 12 175 206 218">servation</span>
   ...

and the dash is missing. For correct presentation (and error computation), it should be part of the first word here. Can this be done easily with the XSLT approach here?

stweil commented 8 years ago

Obviously ALTO uses a <HYP> tag which is currently not handled by the transformation. I did not find a similar tag in the hOCR specification.

http://digi.bib.uni-mannheim.de/~stweil/ocr-praxis/Testseiten/hocr/417576986_0012.hocr is the result of an extended style sheet where I used a quick hack to produce something useful.

zuphilip commented 8 years ago

Hm, I see an extra node in your result, which introduces an additional space. Our goal is to get

<span class="ocrx_word" id="word_d1e73" title="bbox 596 132 675 158">con-</span>

I think we can try to replace these lines with something like

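<xsl:template match="String">
    <span class="ocrx_word" id="{mf:getId(@ID,'word',.)}" title="{mf:getBox(@HEIGHT,@WIDTH,@VPOS,@HPOS)}">
        <xsl:value-of select="@CONTENT"/>
        <!-- In ALTO the HYP element follows the first half of a hyphenated word,
             so look at the next sibling rather than a preceding one -->
        <xsl:value-of select="following-sibling::*[1][self::HYP]/@CONTENT"/>
    </span>
</xsl:template>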

But this is untested...

zuphilip commented 8 years ago

Ping @kba. Here are two examples of ALTO files with hyphenation (just search for the HYP node):

It looks like CONTENT is empty in our HYP tags, so we may need something like a conditional statement.
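
A minimal, untested sketch of what such a conditional could look like, based on the template proposed earlier in this thread. It assumes the stylesheet's existing `mf:getId` and `mf:getBox` helpers and its namespace setup, that `HYP` is the element directly after the first word half (as in the ALTO example at the top), and that a plain ASCII hyphen is an acceptable fallback when `CONTENT` is empty or missing. It is an illustration only, not necessarily how the project handles this.

```xml
<xsl:template match="String">
    <span class="ocrx_word" id="{mf:getId(@ID,'word',.)}" title="{mf:getBox(@HEIGHT,@WIDTH,@VPOS,@HPOS)}">
        <xsl:value-of select="@CONTENT"/>
        <!-- Empty node set unless the very next sibling element is a HYP -->
        <xsl:variable name="hyp" select="following-sibling::*[1][self::HYP]"/>
        <xsl:if test="$hyp">
            <xsl:choose>
                <!-- Use the hyphenation sign recorded in the ALTO file, if any -->
                <xsl:when test="normalize-space($hyp/@CONTENT) != ''">
                    <xsl:value-of select="$hyp/@CONTENT"/>
                </xsl:when>
                <!-- Assumed fallback: a plain hyphen when CONTENT is empty or absent -->
                <xsl:otherwise>-</xsl:otherwise>
            </xsl:choose>
        </xsl:if>
    </span>
</xsl:template>
```

With the example above, the first word would come out as `con-`, and words that are not followed by a `HYP` element are left unchanged.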

kba commented 8 years ago

Here's a small change to the alto2hocr.xsl script that should accomplish this: https://github.com/kba/hOCR-to-ALTO/commit/f447acea8cc90f8085e031d058626968475f4a0a

filak commented 4 years ago

The commit https://github.com/kba/hOCR-to-ALTO/commit/f447acea8cc90f8085e031d058626968475f4a0a has been merged, so this issue can probably be closed.

zuphilip commented 4 years ago

I was no longer sure whether this had already been included. Thank you for the confirmation, @filak!