filak / hOCR-to-ALTO

Convert between Tesseract hOCR and ALTO XML using XSL stylesheets
MIT License

Add "ocr_carea" to hOCR output (of alto2hocr.xsl) #10

Closed wrznr closed 5 years ago

wrznr commented 5 years ago

Running alto2hocr.xsl on this ALTO file via ocr-fileformat produces hOCR that is probably invalid, because it contains no ocr_carea element. According to the specification, all parts of the text should be contained in such an element.

Validating the resulting hOCR file gives:

$ ocr-validate hocr 00000011.html 
[WARN] STDIN Recommended metadata field 'ocr-langs' missing
[ERROR] STDIN:13 Error parsing properties for "<div class="ocr_page" id="Page1" title="image ; bbox 0 0  ; ppageno 0">" : (property need more than 1 value to unpack)

(which is most likely a different problem).
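
For readers trying to narrow down that parse error: the `title` attribute in the message has an `image` property with no value and a `bbox` with only two numbers, whereas hOCR expects four (x0 y0 x1 y1). The following is a minimal sketch of a check for both conditions. It is not how ocr-validate works internally; `parse_title` and `find_problems` are hypothetical helper names, and the parsing is deliberately naive (it would mishandle file names containing spaces or semicolons).

```python
def parse_title(title):
    """Split an hOCR title attribute into {property_name: [values]}.

    Naive on purpose: splits on ';' and whitespace only.
    """
    props = {}
    for chunk in title.split(";"):
        parts = chunk.split()
        if parts:
            props[parts[0]] = parts[1:]
    return props


def find_problems(title):
    """Return a list of human-readable problems found in a title string."""
    problems = []
    for name, values in parse_title(title).items():
        if not values:
            problems.append(f"{name}: no value")
        elif name == "bbox":
            # A bounding box needs exactly four non-negative integers.
            if len(values) != 4 or not all(v.isdigit() for v in values):
                problems.append(f"bbox: expected 4 integers, got {values}")
    return problems


# The title string from the validator message above.
print(find_problems("image ; bbox 0 0  ; ppageno 0"))
```

Either of the two reported problems could plausibly be what trips the validator; the sketch only shows that the attribute is malformed independently of the ocr_carea question.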

filak commented 5 years ago

I do not see the ocr_carea in your output - why do you think that its absence causes the error?

wrznr commented 5 years ago

Sorry, misunderstanding: Actually, I am complaining about the missing ocr_carea. The error is most likely not related!

filak commented 5 years ago

Well, your input ALTO file does not contain any ComposedBlock elements - so there is no content to transform into ocr_carea...

    <xsl:template match="PrintSpace">
        <xsl:apply-templates select="ComposedBlock"/>
        <xsl:apply-templates select="TextBlock"/>
    </xsl:template>

    <xsl:template match="ComposedBlock">
        <div class="ocr_carea" id="{mf:getId(@ID,'block',.)}" title="...">
            <xsl:apply-templates select="TextBlock"/>
        </div>
    </xsl:template>

wrznr commented 5 years ago

That's exactly the point! Sorry for not making this clear in the first place: as far as I understand the hOCR specs, every text segment has to be enclosed by an ocr_carea (not only composed blocks). I may be wrong about this, though... (@kba FYI)

filak commented 5 years ago

The docs are a bit unclear, but IMHO you cannot create ocr_carea without the corresponding ALTO elements.

From my point of view this is not a bug in the transformation, so I am closing this.
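
To check up front whether a given ALTO file would yield any ocr_carea under the templates quoted earlier in this thread, one can count its ComposedBlock and TextBlock elements. This is a minimal sketch using only the Python standard library; `count_blocks` is a hypothetical helper, and the sample document is made up for illustration (it is not the file from this issue).

```python
import xml.etree.ElementTree as ET
from collections import Counter


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]


def count_blocks(alto_xml):
    """Count ComposedBlock and TextBlock elements in an ALTO string,
    regardless of which ALTO namespace version the file declares."""
    root = ET.fromstring(alto_xml)
    counts = Counter(local_name(el.tag) for el in root.iter())
    return {name: counts[name] for name in ("ComposedBlock", "TextBlock")}


# Made-up sample: TextBlocks sit directly under PrintSpace, with no ComposedBlock.
sample = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace>
    <TextBlock ID="b1"/>
    <TextBlock ID="b2"/>
  </PrintSpace></Page></Layout>
</alto>"""

print(count_blocks(sample))
```

A ComposedBlock count of zero means the ComposedBlock template never fires, so the output contains no ocr_carea, which matches the behaviour reported here.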

kba commented 5 years ago

As far as I understand the hOCR specs, every text segment has to be enclosed by an ocr_carea

That should be the case, but as @filak said, it is underspecified at the very least. Do not rely on that for transformations :-(

jtlz2 commented 5 years ago

@wrznr I am also getting an essentially empty hOCR file (when running ocr-transform) from an ABBYY-generated ALTO file. Did you manage to find a way to do the conversion? Thanks!

wrznr commented 5 years ago

@jtlz2 No. But this is actually not my use case. I plan to go from ALTO to TEI, and since I have a method to convert hOCR to TEI, I thought I could use this script as an intermediate step. Because the hOCR documentation is unclear (cf. above), I abandoned this idea.

For your use case, maybe https://gist.github.com/tfmorris/5977784 helps?