clarin-eric / ParlaMint

ParlaMint: Comparable Parliamentary Corpora
https://clarin-eric.github.io/ParlaMint/

ParlaMint2release breaks CZ named entities #702

Closed: matyaskopp closed this issue 1 year ago

matyaskopp commented 1 year ago

This template in parlamint2release.xsl tries to fix invalid named entities (names that contain no words), but it also strips the outer tag of nested entities: https://github.com/clarin-eric/ParlaMint/blob/c68b7fe2bc4f27a85e8e8dbe08545035d0dc179e/Scripts/parlamint2release.xsl#L433-L439

should be:

  <!-- Bug where a name contains no words, but only a transcriber comment: remove <name> tag -->
  <xsl:template mode="comp" match="tei:body//tei:name[not(.//tei:w)]">
    <xsl:message select="concat('WARN ', /tei:TEI/@xml:id,
                         ': removing name tag as name ', normalize-space(.),
                         ' contains no words for ', ancestor-or-self::tei:*[@xml:id][1]/@xml:id)"/>
    <xsl:apply-templates mode="comp"/>
  </xsl:template>

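The only change is in the match predicate: tei:w is changed to .//tei:w. The original tei:w tests only direct children, so an outer name whose words all sit inside nested names was treated as empty and lost its tag; .//tei:w looks at all descendants.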
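To illustrate the difference between the two predicates, here is a minimal sketch in Python. It uses the standard library's ElementTree as a stand-in for the XSLT match pattern; the sample fragment, the n attributes, and the helper names_without_words are hypothetical and not taken from the corpus or the script.

```python
import xml.etree.ElementTree as ET

TEI = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical fragment: one nested name and one name holding only a note.
SAMPLE = """
<body xmlns="http://www.tei-c.org/ns/1.0">
  <s>
    <name type="PER" n="outer">
      <name type="PER" n="first"><w>Jan</w></name>
      <name type="PER" n="last"><w>Novak</w></name>
    </name>
    <name type="ORG" n="empty"><note>unintelligible</note></name>
  </s>
</body>
"""

body = ET.fromstring(SAMPLE.strip())


def names_without_words(root, descendants):
    """Return the n labels of <name> elements that the predicate flags as wordless.

    descendants=False mimics tei:name[not(tei:w)]    (direct children only)
    descendants=True  mimics tei:name[not(.//tei:w)] (any depth)
    """
    path = ".//tei:w" if descendants else "tei:w"
    return [
        name.get("n")
        for name in root.iterfind(".//tei:name", TEI)
        if name.find(path, TEI) is None
    ]


# Child-only test also flags the outer nested name, which is the bug.
print(names_without_words(body, descendants=False))
# Descendant test flags only the name that really has no words.
print(names_without_words(body, descendants=True))
```

With the child-only test the outer name is flagged along with the genuinely empty one, so its tag would be dropped; with the descendant test only the name containing just a note is flagged.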

TomazErjavec commented 1 year ago

Thanks for noticing, fixed now. As discussed, it might be a good idea to remove the special CZ NEs anyway.