Closed wrznr closed 5 years ago
I do not see the ocr_carea in your output - why do you think that its absence causes the error?
Sorry, misunderstanding: Actually, I am complaining about the missing ocr_carea. The error is most likely not related!
filipk notifications@github.com wrote on Wed, 3 Apr 2019 at 18:05:
I do not see the ocr_carea in your output - why do you think that its absence causes the error?
Well, your input ALTO file does not contain any ComposedBlock elements - so there is no content to transform into ocr_carea...
<xsl:template match="PrintSpace">
  <xsl:apply-templates select="ComposedBlock"/>
  <xsl:apply-templates select="TextBlock"/>
</xsl:template>
<xsl:template match="ComposedBlock">
  <div class="ocr_carea" id="{mf:getId(@ID,'block',.)}" title="...">
    <xsl:apply-templates select="TextBlock"/>
  </div>
</xsl:template>
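To check whether a given ALTO file has anything the `ComposedBlock` template above could match, a quick count of block elements can help. The following is only a minimal sketch, not part of hOCR-to-ALTO: the function names `local_name` and `count_alto_blocks` and the inline sample document are made up for illustration, and the namespace is stripped on the assumption that the file may use any ALTO version.

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Strip a '{namespace}' prefix so the check works for any ALTO version."""
    return tag.rsplit("}", 1)[-1]


def count_alto_blocks(alto_xml: str) -> dict:
    """Count ComposedBlock and TextBlock elements in an ALTO document string."""
    root = ET.fromstring(alto_xml)
    counts = {"ComposedBlock": 0, "TextBlock": 0}
    for element in root.iter():
        name = local_name(element.tag)
        if name in counts:
            counts[name] += 1
    return counts


# Hypothetical minimal ALTO document: two TextBlocks, no ComposedBlock.
alto_sample = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#">
  <Layout><Page><PrintSpace>
    <TextBlock ID="b1"/>
    <TextBlock ID="b2"/>
  </PrintSpace></Page></Layout>
</alto>"""

print(count_alto_blocks(alto_sample))
```

If the `ComposedBlock` count is zero, as in this sample, the stylesheet as quoted has nothing to turn into `ocr_carea`, which would match the behaviour reported here.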
That's exactly the point! Sorry for not making this clear in the first place: As far as I understand the hOCR specs, every text segment has to be enclosed by an ocr_carea (not only composed blocks). Though maybe I am wrong about this... (@kba FYI)
The docs are a bit unclear. But IMHO you cannot create ocr_carea without the respective ALTO elements.
From my point of view, it is not a bug in the transformation, so I am closing this.
As far as I understand the hOCR specs, every text segment has to be enclosed by an ocr_carea
That should be the case but as @filak said, it is underspecified at least. Do not rely on that for transformations :-(
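Since the spec is underspecified on this point, one way to avoid relying on it is to test a given hOCR file directly. The sketch below lists paragraph or line elements that have no `ocr_carea` ancestor. It is an illustration only: the function name `find_unenclosed`, the choice of `ocr_par` and `ocr_line` as the classes to check, and the sample markup are assumptions, and it expects well-formed XHTML without a DOCTYPE or HTML-only entities.

```python
import xml.etree.ElementTree as ET


def find_unenclosed(hocr_xml: str, classes=("ocr_par", "ocr_line")) -> list:
    """Return ids of elements with one of `classes` that lack an ocr_carea ancestor."""
    root = ET.fromstring(hocr_xml)
    missing = []

    def walk(element, inside_carea):
        element_classes = element.get("class", "").split()
        if "ocr_carea" in element_classes:
            inside_carea = True
        elif not inside_carea and any(c in classes for c in element_classes):
            missing.append(element.get("id"))
            return  # report the outermost offender only, not its descendants
        for child in element:
            walk(child, inside_carea)

    walk(root, False)
    return missing


# Hypothetical hOCR fragment: one paragraph inside an ocr_carea, one outside.
hocr_sample = """<div class="ocr_page" id="page_1">
  <div class="ocr_carea" id="block_1">
    <p class="ocr_par" id="par_1"><span class="ocr_line" id="line_1">a</span></p>
  </div>
  <p class="ocr_par" id="par_2"><span class="ocr_line" id="line_2">b</span></p>
</div>"""

print(find_unenclosed(hocr_sample))
```

A non-empty result means a downstream tool that assumes every text segment sits inside an `ocr_carea` may skip those elements.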
@wrznr I am also getting an essentially empty hOCR file (when running ocr-transform) from an ALTO file produced by ABBYY. Did you manage to find a way to do the conversion? Thanks!
@jtlz2 No. But this is actually not my use case. I plan to go from ALTO to TEI, and since I have a method to convert hOCR to TEI, I thought I could use this script as an intermediate step. Given the ambiguity of the hOCR documentation (cf. above), I abandoned this idea.
For your use case, maybe https://gist.github.com/tfmorris/5977784 helps?
Using alto2hocr.xsl on this ALTO file via ocr-fileformat results in probably invalid hOCR since it lacks ocr_carea. According to the specification, all parts of the text should be contained in such an element. Validating the resulting hOCR file results in
(which is most likely a different problem).