OCR4all / LAREX

A semi-automatic open-source tool for Layout Analysis and Region EXtraction on early printed books.
MIT License

AlternativeImages are all moved to page level when saving #290

Closed bertsky closed 1 month ago

bertsky commented 2 years ago

I just found that the new PAGE-XML editing facility still has one bug: it does not retain AlternativeImage elements under their original segment (which could be a region, line, word or even glyph), but moves them all to the top (page) level.

This is a big problem for OCR-D especially. (It means a region or line crop will be mistaken for a page image.)
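
To check whether a given file is affected, something like the following may help. It is only a minimal sketch using Python's standard library: it counts AlternativeImage elements per parent element type, so a file where everything shows up under Page, although segments had their own derived images before, has probably been flattened. The function names and the sample document are made up for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter

NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"


def local(tag):
    """Strip the '{namespace}' part that ElementTree prepends to tag names."""
    return tag.rsplit("}", 1)[-1]


def alternative_image_parents(page_xml):
    """Count AlternativeImage elements per parent element type."""
    root = ET.fromstring(page_xml)
    counts = Counter()
    for parent in root.iter():
        for child in parent:
            if local(child.tag) == "AlternativeImage":
                counts[local(parent.tag)] += 1
    return counts


# Hypothetical sample: the region crop sits on page level, as described above.
sample = f"""<PcGts xmlns="{NS}">
  <Page imageFilename="p.png">
    <AlternativeImage filename="p_bin.png"/>
    <AlternativeImage filename="p_region0001.png"/>
    <TextRegion id="region0001">
      <Coords points="0,0 1,0 1,1 0,1"/>
    </TextRegion>
  </Page>
</PcGts>"""

print(alternative_image_parents(sample))
```

The count alone cannot tell a legitimate page-level image from a misplaced crop, so it is best compared against the same file before editing.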

bertsky commented 2 years ago

Also, I don't think LAREX should interfere with the namespace prefix. If the input used one, reuse it; otherwise, keep the default namespace. If at all possible, even reuse the indentation width. (Think git controlling your data: you'd always want minimal changesets to be able to detect errors.)
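
To illustrate what "reuse it" could look like, here is a rough sketch in Python's standard library (not LAREX or PRImA code). It reads the prefix bound to the PAGE namespace from the input and registers the same prefix before serializing. The two function names are hypothetical, `ET.register_namespace` changes a global registry, and the sketch only covers the prefix, not the indentation.

```python
import io
import xml.etree.ElementTree as ET

# All PAGE content namespaces share this stem; the version date follows it.
PAGE_NS_STEM = "http://schema.primaresearch.org/PAGE/gts/pagecontent/"


def page_namespace_prefix(xml_bytes):
    """Return (prefix, uri) for the PAGE namespace; prefix '' means default."""
    events = ET.iterparse(io.BytesIO(xml_bytes), events=("start-ns",))
    for _event, (prefix, uri) in events:
        if uri.startswith(PAGE_NS_STEM):
            return prefix, uri
    return None


def rewrite_keeping_prefix(xml_bytes):
    """Parse and serialize again, reusing the input's PAGE namespace prefix."""
    found = page_namespace_prefix(xml_bytes)
    root = ET.fromstring(xml_bytes)
    if found:
        # Global side effect: later serializations in this process use it too.
        ET.register_namespace(*found)
    return ET.tostring(root, encoding="unicode")


NS = PAGE_NS_STEM + "2019-07-15"
prefixed = f'<pc:PcGts xmlns:pc="{NS}"><pc:Page/></pc:PcGts>'.encode()
default = f'<PcGts xmlns="{NS}"><Page/></PcGts>'.encode()

print(rewrite_keeping_prefix(prefixed))
print(rewrite_keeping_prefix(default))
```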

maxnth commented 2 years ago

The behavior regarding AlternativeImage elements and namespace prefixes is indeed kinda weird and definitely not how it's supposed to be. I just ran some tests and, if I'm not mistaken, both are introduced directly by prima-core-libs. Running e.g. JPageConverter also removes namespace prefixes and moves all AlternativeImage elements to the page level. At first glance, there don't seem to be any parameters to change this behavior as of now. I'll look into creating a PR with the desired fixes / changes for prima-core-libs.

bertsky commented 2 years ago

> I'll look into creating a PR with the desired fixes / changes for prima-core-libs.

Fantastic! (Let me know if you have trouble setting up build env for the PRImA stuff.)

BTW, here's a repair script in XSL for anyone who has edited data this way. It assumes your AlternativeImage/@filename contains the segment @id though, and that identifiers contain region or line. (The latter can be adapted easily, the former would be very hard to improve.)

<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:pc="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <xsl:output method="xml" version="1.0" omit-xml-declaration="no" encoding="UTF-8" indent="yes"/>
  <xsl:strip-space elements="*"/>

<!-- suppress AlternativeImage on page level for sub-segments
     TODO: needs more elaborate filter than just fixed region/line strings
-->
<xsl:template match="/pc:PcGts/pc:Page/pc:AlternativeImage[
                     contains(@filename,'region') or
                     contains(@filename,'line')]"/>
<!-- copy AlternativeImage by matching filename
     TODO: needs more elaborate regex if identifiers are not directly contained in filenames or may clash
-->
<xsl:template match="*|@*">
  <xsl:copy>
    <xsl:apply-templates select="@*"/>
    <xsl:if test="@id">
      <xsl:variable name="identifier" select="@id"/>
      <xsl:for-each select="/pc:PcGts/pc:Page/pc:AlternativeImage">
        <xsl:if test="contains(@filename,$identifier)">
          <xsl:copy-of select="."/>
        </xsl:if>
      </xsl:for-each>
    </xsl:if>
    <xsl:apply-templates select="node()|text()"/>
  </xsl:copy>
</xsl:template>
</xsl:stylesheet>

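
For anyone who would rather script the repair in Python, the same filename heuristic can be sketched with the standard library as below. This is not a drop-in equivalent of the stylesheet: it moves every page-level AlternativeImage whose filename contains some segment @id (no fixed region/line strings), and on a clash it picks the longest matching id, which is an assumption of mine and will still fail for ids like r1 vs. r10. The function name and sample data are made up.

```python
import xml.etree.ElementTree as ET

NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"
ALT = f"{{{NS}}}AlternativeImage"


def reattach_alternative_images(root):
    """Move page-level AlternativeImage elements under the segment whose
    @id is contained in their @filename. Returns the number moved."""
    page = root.find(f"{{{NS}}}Page")
    segments = [el for el in page.iter() if el.get("id")]
    moved = 0
    for image in list(page.findall(ALT)):
        filename = image.get("filename", "")
        matches = [seg for seg in segments if seg.get("id") in filename]
        if not matches:
            continue  # no segment id in the name: leave it on page level
        # Assumption: the longest matching id is the most specific segment.
        target = max(matches, key=lambda seg: len(seg.get("id")))
        page.remove(image)
        # Keep AlternativeImage ahead of Coords, after any existing ones.
        target.insert(len(target.findall(ALT)), image)
        moved += 1
    return moved


# Hypothetical flattened file: both crops ended up on page level.
sample = f"""<PcGts xmlns="{NS}">
  <Page imageFilename="p.png">
    <AlternativeImage filename="p_bin.png"/>
    <AlternativeImage filename="p_r1.png"/>
    <AlternativeImage filename="p_r1_l1.png"/>
    <TextRegion id="r1">
      <Coords points="0,0 1,0 1,1 0,1"/>
      <TextLine id="r1_l1">
        <Coords points="0,0 1,0 1,1 0,1"/>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>"""

root = ET.fromstring(sample)
print(reattach_alternative_images(root))
```
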
maxnth commented 1 month ago

Finally got the time to look into it and the PR here should hopefully fix this. Will push a custom build of PrimaDla to the LAREX dev branch after some further testing.

maxnth commented 1 month ago

Updated PrimaDla in 31258ab to include the fix for this issue. Alternative images below the page level should now be kept when updating / writing the PAGE XML. They are not yet kept when changing region types (e.g. from TextRegion to ImageRegion), but that is part of another issue we have to address.

If this doesn't work with your material, or doesn't work as expected, please let me know and I'll reopen the issue.