Closed michaelkubina closed 1 year ago
"A composedBlock can consist of both a textBlock and a composedBlock ..." - Would you mind sharing a test file that contains the mentioned elements?
You can find one example here: https://img.sub.uni-hamburg.de/kitodo/PPN1699277745_19140222/00000001.xml
ComposedBlock ID="Page1_Block14" consists of one shape and two composedBlock children, which in turn have their own textBlock elements. I only found this because a PDF was missing two columns on its first page.
But there are certainly more cases, e.g. page 3 with ComposedBlock ID="Page1_Block15": https://img.sub.uni-hamburg.de/kitodo/PPN1699277745_19140222/00000003.xml
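To look for further cases like the two above, something along these lines could list every ComposedBlock that has a ComposedBlock child. This is only a rough sketch using Python's standard library: the function names and the sample fragment are made up for illustration, and the namespace in the sample is an assumption (the check compares local names only, so the actual ALTO version should not matter).

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified ALTO fragment; the namespace is an assumption.
SAMPLE = """
<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace>
    <ComposedBlock ID="flat">
      <TextBlock ID="tb_a"/>
    </ComposedBlock>
    <ComposedBlock ID="nested">
      <ComposedBlock ID="column_1"><TextBlock ID="tb_b"/></ComposedBlock>
      <ComposedBlock ID="column_2"><TextBlock ID="tb_c"/></ComposedBlock>
    </ComposedBlock>
  </PrintSpace></Page></Layout>
</alto>
""".strip()


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]


def nested_composed_block_ids(alto_xml):
    """Return the IDs of ComposedBlock elements with a ComposedBlock child."""
    root = ET.fromstring(alto_xml)
    hits = []
    for elem in root.iter():
        if local_name(elem.tag) != "ComposedBlock":
            continue
        if any(local_name(child.tag) == "ComposedBlock" for child in elem):
            hits.append(elem.get("ID"))
    return hits


print(nested_composed_block_ids(SAMPLE))
```

For a real file, the content read from disk would be passed in place of SAMPLE. Any ID that shows up is a block whose nested text would be affected by the problem described in this issue.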
The French national library shows this diagram for the ALTO schema's block group (https://www.loc.gov/standards/alto/alto.xsd):
Source: http://bibnum.bnf.fr/alto_prod/documentation/alto_prod.html
Thank you. This might be easy to fix in alto__hocr.xsl by selecting *:TextBlock|*:ComposedBlock in the ComposedBlock template:
<xsl:template match="*:ComposedBlock">
  <div class="ocr_carea" id="{mf:getId(@ID,'block',.)}" title="{mf:getBox(@HEIGHT,@WIDTH,@VPOS,@HPOS,@WC)}">
    <xsl:apply-templates select="*:TextBlock|*:ComposedBlock"/>
  </div>
</xsl:template>
I ran the modified XSL with your test file and the output seems fine to me.
Could you please confirm that the output is OK for you?
Thank you very much. I can confirm that the output is now complete, with no more missing text.
This fixed it.
A conversion from ALTO to hOCR through alto__hocr.xsl turns out incomplete when a composedBlock element has one or more other composedBlock child elements. Those children do not find their way into the newly generated hOCR file, because there is no recursion for those elements and only textBlock child nodes get extracted. We get files structured like this from e.g. the ABBYY Recognition Server 4.0.
A composedBlock can consist of both a textBlock and a composedBlock (and others, that are not important for text).
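The difference between extracting only direct textBlock children and recursing into nested composedBlock elements can be illustrated with a small Python sketch. It is not the stylesheet's logic, just a model of the behaviour described above: the sample fragment, its IDs, and both function names are made up, and the ALTO namespace is omitted for brevity.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified ALTO fragment (namespace omitted for brevity).
ALTO_SAMPLE = """
<PrintSpace>
  <ComposedBlock ID="outer">
    <TextBlock ID="tb_direct"/>
    <ComposedBlock ID="inner_1">
      <TextBlock ID="tb_nested_1"/>
    </ComposedBlock>
    <ComposedBlock ID="inner_2">
      <TextBlock ID="tb_nested_2"/>
    </ComposedBlock>
  </ComposedBlock>
</PrintSpace>
""".strip()


def text_blocks_direct_only(composed):
    """Only TextBlock children of this block; nested ComposedBlocks are skipped."""
    return [child.get("ID") for child in composed if child.tag == "TextBlock"]


def text_blocks_recursive(composed):
    """TextBlock children plus those of nested ComposedBlocks, in document order."""
    ids = []
    for child in composed:
        if child.tag == "TextBlock":
            ids.append(child.get("ID"))
        elif child.tag == "ComposedBlock":
            ids.extend(text_blocks_recursive(child))
    return ids


outer = ET.fromstring(ALTO_SAMPLE).find("ComposedBlock")
print(text_blocks_direct_only(outer))  # ['tb_direct']
print(text_blocks_recursive(outer))    # ['tb_direct', 'tb_nested_1', 'tb_nested_2']
```

The first function corresponds to the incomplete output (the nested text blocks are lost); the second corresponds to what a recursive treatment of composedBlock would yield.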