filak / hOCR-to-ALTO

Convert between Tesseract hOCR and ALTO XML using XSL stylesheets
MIT License

[BUG] Incomplete transformation from alto to hocr #27

Closed: michaelkubina closed this issue 1 year ago

michaelkubina commented 1 year ago

Converting ALTO to hOCR with alto__hocr.xsl produces incomplete output when a ComposedBlock element has one or more ComposedBlock child elements. Those children never reach the generated hOCR file, because the stylesheet does not recurse into them and only TextBlock child nodes are extracted. We get files structured like this from, e.g., ABBYY Recognition Server 4.0.

A ComposedBlock can consist of both a TextBlock and a ComposedBlock (and other elements that are not relevant for text).
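To check whether a given ALTO file is affected, one could look for ComposedBlock elements that directly contain another ComposedBlock. The following is a minimal sketch in Python using only the standard library. The function names and the trimmed sample document are hypothetical and not part of the repository. The sample's namespace is assumed to be the ALTO v2 one, but the check ignores namespaces, so other schema versions should work as well.

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Strip a '{namespace}' prefix so elements match regardless of ALTO version."""
    return tag.rsplit("}", 1)[-1]


def nested_composed_blocks(alto_xml: str) -> list:
    """Return the IDs of ComposedBlock elements with a direct ComposedBlock child."""
    root = ET.fromstring(alto_xml)
    hits = []
    for elem in root.iter():
        if local_name(elem.tag) != "ComposedBlock":
            continue
        if any(local_name(child.tag) == "ComposedBlock" for child in elem):
            hits.append(elem.get("ID", "(no ID)"))
    return hits


# Hypothetical, heavily trimmed sample; real ALTO files carry many more attributes.
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace>
    <ComposedBlock ID="outer">
      <ComposedBlock ID="inner"><TextBlock ID="t1"/></ComposedBlock>
      <TextBlock ID="t2"/>
    </ComposedBlock>
    <ComposedBlock ID="flat"><TextBlock ID="t3"/></ComposedBlock>
  </PrintSpace></Page></Layout>
</alto>"""

print(nested_composed_blocks(SAMPLE))  # ['outer']
```

Only "outer" is reported, because "inner" and "flat" contain TextBlock children only.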

filak commented 1 year ago

"A ComposedBlock can consist of both a TextBlock and a ComposedBlock ..." - Would you mind sharing a test file that contains these elements?

michaelkubina commented 1 year ago

You can find one example here: https://img.sub.uni-hamburg.de/kitodo/PPN1699277745_19140222/00000001.xml

ComposedBlock ID="Page1_Block14" consists of one shape and two ComposedBlock children, which in turn have their own TextBlock elements. I only found this because a PDF was missing two columns on its first page:

[Image: alot_hocr_incomplete]

But there are certainly more cases, for example page 3 with ComposedBlock ID="Page1_Block15": https://img.sub.uni-hamburg.de/kitodo/PPN1699277745_19140222/00000003.xml

[Image: alot_hocr_incomplete_2]

The French national library (BnF) shows this diagram for the ALTO schema's block group (https://www.loc.gov/standards/alto/alto.xsd):

[Image: bnf]
Source: http://bibnum.bnf.fr/alto_prod/documentation/alto_prod.html

filak commented 1 year ago

Thank you. This might be easy to fix in alto__hocr.xsl by selecting *:TextBlock|*:ComposedBlock in the ComposedBlock template:

  <xsl:template match="*:ComposedBlock">
    <div class="ocr_carea" id="{mf:getId(@ID,'block',.)}" title="{mf:getBox(@HEIGHT,@WIDTH,@VPOS,@HPOS,@WC)}">
      <xsl:apply-templates select="*:TextBlock|*:ComposedBlock"/>
    </div>
  </xsl:template>

I ran the modified XSL with your test file and the output seems fine to me.

Could you please confirm that the output is OK for you?
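For readers who want to see why descending into ComposedBlock children matters, here is a rough Python sketch that contrasts the two selection rules on a made-up block. It is not the stylesheet itself. The function names, the sample XML, and its IDs are hypothetical, and the sample only imitates the structure described for Page1_Block14 (one shape and two nested ComposedBlock children).

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def text_blocks_direct(block):
    """TextBlock IDs among direct children only (no descent into nested blocks)."""
    return [c.get("ID") for c in block if local_name(c.tag) == "TextBlock"]


def text_blocks_recursive(block):
    """TextBlock IDs reached when ComposedBlock children are processed as well."""
    ids = []
    for child in block:
        name = local_name(child.tag)
        if name == "TextBlock":
            ids.append(child.get("ID"))
        elif name == "ComposedBlock":
            ids.extend(text_blocks_recursive(child))
    return ids


# Hypothetical block: a shape plus two nested ComposedBlocks holding the text.
SAMPLE = """<ComposedBlock ID="outer">
  <Shape/>
  <ComposedBlock ID="col1"><TextBlock ID="t1"/></ComposedBlock>
  <ComposedBlock ID="col2"><TextBlock ID="t2"/></ComposedBlock>
</ComposedBlock>"""

outer = ET.fromstring(SAMPLE)
print(text_blocks_direct(outer))     # []
print(text_blocks_recursive(outer))  # ['t1', 't2']
```

With direct children only, both columns are lost. Once nested ComposedBlock elements are followed, both TextBlocks are reached, which matches the missing-columns symptom reported above.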

michaelkubina commented 1 year ago

Thank you very much. I can confirm that the output is now complete, with no more missing text.

This fixed it.