Open ingoboerner opened 1 month ago
Just to prevent misunderstandings: the Hebrew example actually features an invalid ID. XML IDs must be valid XML names (NCNames in the case of `xml:id`), which cannot contain space characters, among other restrictions (see the full spec at https://www.w3.org/TR/REC-xml/#NT-Name). Technically they can consist of characters from various scripts, but for DraCor we follow the convention of using only ASCII characters. The reason is that IDs may be used to construct URLs, and anything beyond the ASCII range would have to be percent-encoded, which would make the URLs less readable.
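To illustrate the convention, here is a minimal Python sketch. The names `is_ascii_xml_id` and `url_form` are made up for this example, and the pattern is only my assumption of what "ASCII-only ID" means (the ASCII subset of an XML NCName); the actual DraCor schema may be stricter.

```python
import re
from urllib.parse import quote

# Assumption: an "ASCII-only" ID is the ASCII subset of an XML NCName:
# it starts with a letter or underscore, followed by letters, digits,
# '.', '-' or '_'. No spaces, no colons.
ASCII_ID = re.compile(r"[A-Za-z_][A-Za-z0-9._-]*")


def is_ascii_xml_id(value: str) -> bool:
    """Return True if the whole value matches the ASCII-only ID pattern."""
    return ASCII_ID.fullmatch(value) is not None


def url_form(value: str) -> str:
    """Return the value as it would appear percent-encoded in a URL path segment."""
    return quote(value, safe="")
```

An ASCII ID passes through `url_form` unchanged, while every Georgian letter becomes three `%XX` escapes, which is what makes such URLs hard to read.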
In the current test file there are already `@xml:id` attributes for the characters (`<person>`) in the `<particDesc>`. They are in Georgian script, which does not seem to be a problem for the well-formedness of the XML, though. I validated the file with our upcoming schema, which includes some Schematron checks, and they report issues with the connection between the character labels in `<speaker>`/`@who` attributes and the (non-)corresponding IDs of the elements in the `<particDesc>`.

For example, there is a speech:
The schema check reports:
If I search (Ctrl+F) for `ანდუყაფარ`, I cannot find the exact string among the IDs in the `particDesc`. There is a somewhat close match:

Still, since I can read neither the text nor the IDs, it is hard for me to tell whether that is simply a spelling error in the ID.
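The manual search described above can be approximated with a small script. This is only a sketch, not the Schematron rule from our schema: the function name `unresolved_who_refs` is made up, it assumes a TEI file in the usual TEI namespace, and it treats every `@xml:id` found inside `<particDesc>` as a valid target for `@who`.

```python
import difflib
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def unresolved_who_refs(tei_xml: str, cutoff: float = 0.6) -> dict[str, list[str]]:
    """Map each @who target without a matching @xml:id to its closest declared IDs."""
    root = ET.fromstring(tei_xml)

    # Collect every xml:id declared inside <particDesc>.
    declared = set()
    for partic in root.iter(f"{{{TEI_NS}}}particDesc"):
        for el in partic.iter():
            xml_id = el.get(XML_ID)
            if xml_id:
                declared.add(xml_id)

    # Check every @who value; it may hold several space-separated "#id" pointers.
    problems = {}
    for el in root.iter():
        for ref in (el.get("who") or "").split():
            target = ref.lstrip("#")
            if target not in declared and target not in problems:
                problems[target] = difflib.get_close_matches(
                    target, sorted(declared), n=3, cutoff=cutoff
                )
    return problems
```

For every dangling reference the result lists the most similar declared IDs, which at least narrows down likely spelling errors. Someone who can read the script still has to confirm them.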
It would be easier to check if the values of the `@xml:id`s were Latin transliterations of the Georgian ones. In other corpora that use non-Latin scripts, e.g. the Russian and the Ukrainian Drama Corpus, we have transliterated IDs, e.g.

It is also the case for the Hebrew Drama Corpus: