dracor-org / georgdracor

Georgian Drama Corpus
Creative Commons Zero v1.0 Universal

xml:ids SHOULD use Latin characters #1

Open ingoboerner opened 1 month ago

ingoboerner commented 1 month ago

The current test file already has @xml:id attributes on the character elements (<person>) in the <particDesc>. They are in Georgian script, which does not seem to be a problem for the well-formedness of the XML. However, I validated the file with our upcoming schema, which includes some Schematron checks, and these report problems with the link between the character references in the @who attributes of <sp> and the (non-)corresponding IDs of the elements in the <particDesc>.

For example, there is a speech:

<sp who="#ანდუყაფარ">
            <speaker>ანდუყაფარ</speaker>
            <p>რა ვქნა, ძმაო, არა შვრება. რა გაეწყობა, ვერა მომიხერხებია რა, რუსული კარგად იცის,
              რუსის დიდკაცებთან დადის, მერე ჩემი რძალიც იმისკენ არის.</p>
          </sp>

The schema check reports:

Description: A speech act SHOULD link to a 'person' or 'personGrp' element in 'particDesc'. Use a valid character ID and provide it as a pointer by prepending it with a hash '#'.

If I search (Ctrl+F) for ანდუყაფარ, I cannot find that exact string among the IDs in the particDesc. There is only a close match:

<person xml:id="ანდუყაფარი" sex="MALE">
            <persName>ანდუყაფარი</persName>
          </person>

Still, since I can read neither the text nor the IDs, it is hard for me to tell whether that is simply a spelling error in the ID.
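A check like the one the schema performs can also be approximated outside the schema, which helps when the script itself is unreadable to the person validating. The following is only a minimal sketch, not the DraCor Schematron rule: the function name `unresolved_speakers` and the `SAMPLE` fragment are made up for illustration, and the fragment is a heavily simplified stand-in for a real TEI file.

```python
import xml.etree.ElementTree as ET

# Fully qualified name of the xml:id attribute as ElementTree reports it.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def local_name(tag: str) -> str:
    """Strip a '{namespace}' prefix so the check works with or without the TEI namespace."""
    return tag.rsplit("}", 1)[-1]


def unresolved_speakers(tei_xml: str) -> dict[str, int]:
    """Return each @who pointer on <sp> that matches no person/personGrp xml:id, with a count."""
    root = ET.fromstring(tei_xml)
    known_ids = {
        el.get(XML_ID)
        for el in root.iter()
        if local_name(el.tag) in ("person", "personGrp") and el.get(XML_ID)
    }
    missing: dict[str, int] = {}
    for sp in root.iter():
        if local_name(sp.tag) != "sp":
            continue
        # @who may hold several whitespace-separated pointers.
        for pointer in sp.get("who", "").split():
            if not (pointer.startswith("#") and pointer[1:] in known_ids):
                missing[pointer] = missing.get(pointer, 0) + 1
    return missing


# Hypothetical minimal fragment reproducing the mismatch from this issue.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <particDesc>
    <listPerson>
      <person xml:id="ანდუყაფარი" sex="MALE"><persName>ანდუყაფარი</persName></person>
    </listPerson>
  </particDesc>
  <sp who="#ანდუყაფარ"><speaker>ანდუყაფარ</speaker><p>...</p></sp>
</TEI>"""

if __name__ == "__main__":
    print(unresolved_speakers(SAMPLE))
```

For the sample, the only pointer reported is `#ანდუყაფარ`, because the declared ID carries a trailing ი that the pointer lacks.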

It would be easier to check if the values of the @xml:id attributes were Latin transliterations of the Georgian names. In other corpora that use non-Latin scripts, e.g. the Russian and the Ukrainian Drama Corpus, we have transliterated IDs, e.g.

<person xml:id="hlestakov" sex="MALE">
            <persName>Хлестаков</persName>
            <persName xml:lang="ger">Chlestakow</persName>
          </person>
<personGrp sex="UNKNOWN" xml:id="khor_ditej">
            <name>Хор дітей</name>
          </personGrp>

It is also the case for the Hebrew Drama Corpus:

<person xml:id="mkhls_ildim">
      <persName>
       מקהלת ילדים
      </persName>
</person>

cmil commented 3 weeks ago

Just to prevent misunderstandings: the Hebrew example actually features an invalid ID. XML IDs must be valid XML names, which cannot contain space characters, among other restrictions (see the full spec at https://www.w3.org/TR/REC-xml/#NT-Name). Technically they can consist of characters from various scripts, but for DraCor we follow the convention of using only ASCII characters. The reason is that IDs may be used to construct URLs, and anything beyond the ASCII range would have to be URI-encoded, which would make the URLs less readable.
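To illustrate both points, here is a small sketch, not DraCor's actual validation code. The helper names `is_ascii_xml_id` and `url_form` are hypothetical, and the regular expression is an assumed ASCII-only subset of the XML name rules (letter or underscore first, then letters, digits, `_`, `.` or `-`, no colon). `urllib.parse.quote` from the Python standard library shows what happens to a non-ASCII ID once it ends up in a URL.

```python
import re
from urllib.parse import quote

# Assumed ASCII-only subset of XML names without colons.
ASCII_ID = re.compile(r"[A-Za-z_][A-Za-z0-9_.-]*")


def is_ascii_xml_id(value: str) -> bool:
    """True if the whole value is an ASCII-only XML name without a colon."""
    return ASCII_ID.fullmatch(value) is not None


def url_form(value: str) -> str:
    """Percent-encode the ID as it would appear in a URL path segment."""
    return quote(value)


if __name__ == "__main__":
    for candidate in ("hlestakov", "khor_ditej", "ანდუყაფარი"):
        print(candidate, is_ascii_xml_id(candidate), url_form(candidate))
```

The transliterated IDs pass the check and stay unchanged in a URL, while each Georgian letter turns into three percent-encoded bytes, so a ten-letter ID grows to ninety characters.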