Open ingoboerner opened 3 months ago
On TEI Panorama we have 7 types of entities, which we encode with <name>, <phr> and <title>.
They are annotated manually, so entities are often marked even when they are not mentioned explicitly by name but, for example, through an invective or a phrase such as "our dearest friend".
In the drama Samuel Zborowski we have entities of the types person, place and organisation, but the next plays will have more types. All of them are external to the data you are already collecting (for example, characters in this play talk about Poland and Kraków, but so far you don't gather data about places). We have IDs for them in our database on TEI Panorama, and people also have a WIKIID where possible (other types will follow in the future ;))
So the question is whether we should wipe out this data (so that it is lost for DraCor) during the transformation from the TEI Panorama schema to the DraCor schema, or whether it should be converted via a Python script to <rs> elements and added to the ODD?
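To make the second option concrete, here is a minimal sketch of such a conversion using only Python's standard library. It is not the actual TEI Panorama export logic: it assumes that the entity elements carry a ref attribute, that keeping the original element name in a type attribute is acceptable, and the IDs in the sample (#place_1, #person_7) are hypothetical.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
ET.register_namespace("", TEI_NS)  # serialise TEI as the default namespace

# Assumption: these are the elements that carry entity annotations.
ENTITY_TAGS = {"name", "phr", "title"}


def entities_to_rs(root):
    """Rename entity elements that carry a ref attribute to <rs>, in place.

    Returns the number of converted elements.
    """
    converted = 0
    for el in root.iter():
        if not isinstance(el.tag, str) or not el.tag.startswith("{" + TEI_NS + "}"):
            continue
        local = el.tag.split("}", 1)[1]
        if local in ENTITY_TAGS and "ref" in el.attrib:
            # Assumption: remember the original element name in @type
            # unless the source already provides a type.
            el.attrib.setdefault("type", local)
            el.tag = "{" + TEI_NS + "}rs"
            converted += 1
    return converted


# Hypothetical sample; the ref values are made up for illustration.
sample = (
    '<p xmlns="http://www.tei-c.org/ns/1.0">We ride to '
    '<name ref="#place_1">Kraków</name> with '
    '<phr ref="#person_7">our dearest friend</phr>.</p>'
)
root = ET.fromstring(sample)
count = entities_to_rs(root)
result = ET.tostring(root, encoding="unicode")
```

Elements without a ref attribute (for example a <title> in the teiHeader) are left untouched, so only annotated mentions are rewritten.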
@aszulinska maybe have a look at this corpus that is derived from the German Drama Corpus: https://github.com/quadrama/gerdracor-coref
In this corpus (and in the tools tested on German text annotation) they use this encoding:
<sp who="#sara">
<rs ref="#sara"><speaker>SARA.</speaker></rs>
<p> Ich habe <rs ref="#ein_gewisses_vermächtnis"><rs ref="#ein_gewisses_vermächtnis">es</rs></rs> nicht vergessen, <rs ref="#mellefont"><name ref="#mellefont">Mellefont</name></rs>. <rs ref="#mellefont"><rs ref="#mellefont">Sie</rs></rs> wollen vorher <rs ref="#ein_gewisses_vermächtnis"><rs xml:id="ein_gewisses_vermächtnis">ein gewisses Vermächtnis</rs></rs> retten. – <rs ref="#Sie-wollen-vorher-zeitliche-Güter-retten"><rs ref="#mellefont"><rs ref="#mellefont">Sie</rs></rs> wollen vorher <rs xml:id="zeitliche_güter">zeitliche Güter</rs> retten</rs>, und mich vielleicht <rs ref="#ewige_güter"><rs xml:id="ewige_güter">ewige</rs></rs> <rs ref="#Sie-wollen-vorher-zeitliche-Güter-retten">darüber</rs> verscherzen lassen.</p>
</sp>
From this play
It's hard for me to tell what they annotate apart from speakers (what type of data), but the encoding with <rs ref="#sara"> is exactly what @ingoboerner proposed in #55, so is it fine with us?
This is not valid TEI, since the sp element cannot have rs elements as direct children (https://tei-c.org/release/doc/tei-p5-doc/en/html/ref-sp.html). And in the case of sp/speaker, TEI already provides a way to identify the speaker by means of the who attribute, so I don't even see the need for an extra rs element.
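A quick way to spot this particular problem without running a full schema validation is sketched below. It only looks for rs elements that are direct children of sp; the helper name rs_directly_in_sp is made up for this example, and the sample is a shortened version of the snippet quoted above.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"


def rs_directly_in_sp(root):
    """Return the ref values of <rs> elements that are direct children of <sp>."""
    return [
        rs.get("ref")
        for sp in root.iter(TEI + "sp")   # iter() also visits root itself
        for rs in sp.findall(TEI + "rs")  # findall() with a bare tag: direct children only
    ]


# Shortened from the quoted snippet: one <rs> wraps <speaker>, one sits inside <p>.
sample = (
    '<sp xmlns="http://www.tei-c.org/ns/1.0" who="#sara">'
    '<rs ref="#sara"><speaker>SARA.</speaker></rs>'
    '<p>Ich habe es nicht vergessen, <rs ref="#mellefont">Mellefont</rs>.</p>'
    '</sp>'
)
offenders = rs_directly_in_sp(ET.fromstring(sample))
```

Only the rs around speaker is reported; the rs inside p is not a direct child of sp and is therefore ignored by this check.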
Also, the frequent double wrapping of text within rs elements with the same ref attribute looks like an artefact of poor automation to me. And I get 1381 errors when I open https://raw.githubusercontent.com/quadrama/gerdracor-coref/gold/tei/Sara.xml in Oxygen. All in all, this particular document does not seem like a good example.
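If someone wanted to clean up that double wrapping, a rough sketch could look like the following. It is based on my reading of the pattern in the quoted snippet, not on how the corpus was produced: an outer rs is treated as redundant only if it has no attribute other than ref, contains no text of its own, and wraps exactly one rs with the same ref. Function names are made up for this example.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
ET.register_namespace("", TEI_NS)
RS = "{" + TEI_NS + "}rs"


def is_redundant_wrapper(el):
    """True if el is an <rs> that only wraps one <rs> with the same ref."""
    if el.tag != RS or len(el) != 1 or set(el.attrib) != {"ref"}:
        return False
    inner = el[0]
    return (
        inner.tag == RS
        and inner.get("ref") == el.get("ref")
        and not (el.text or "").strip()     # no text before the inner <rs>
        and not (inner.tail or "").strip()  # no text after the inner <rs>
    )


def collapse_double_rs(parent):
    """Replace redundant <rs> wrappers below parent by their inner <rs>.

    Returns the number of wrappers removed.
    """
    removed = 0
    for index, child in enumerate(list(parent)):
        while is_redundant_wrapper(child):
            inner = child[0]
            inner.tail = child.tail  # keep the text that followed the wrapper
            parent[index] = inner
            child = inner
            removed += 1
        removed += collapse_double_rs(child)
    return removed


# Taken from the quoted snippet: the first pair is redundant, the second
# pair is kept because the inner <rs> carries xml:id rather than ref.
sample = (
    '<p xmlns="http://www.tei-c.org/ns/1.0">'
    '<rs ref="#mellefont"><rs ref="#mellefont">Sie</rs></rs> wollen vorher '
    '<rs ref="#ein_gewisses_vermächtnis"><rs xml:id="ein_gewisses_vermächtnis">'
    'ein gewisses Vermächtnis</rs></rs> retten.</p>'
)
root = ET.fromstring(sample)
removed = collapse_double_rs(root)
result = ET.tostring(root, encoding="unicode")
```

The check is deliberately conservative, so wrappers that carry an xml:id or sit around a name element stay as they are.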
That said, I would agree to enable the rs element in the DraCor schema in places where TEI-all would allow it.
There might be scenarios in which someone would want to encode "mentions" of things/entities in the text of a play.