dracor-org / dracor-schema

ODD and schemas for dracor.org files
https://dracor.org/doc/odd

Referencing entities external to the text #55

Open ingoboerner opened 3 months ago

ingoboerner commented 3 months ago

There might be scenarios in which someone would want to encode "mentions" of things/entities in the text of a play.

aszulinska commented 3 months ago

On TEI Panorama we have 7 types of entities, which we encode with the <name>, <phr> and <title> elements.

They are annotated manually, so entities are very likely to be marked even when they are not mentioned explicitly by name but, for example, through an invective or a phrase such as "our dearest friend".

In the drama Samuel Zborowski we have entities of the types person, place and organisation; later plays will have more types. All of them are external to the data you are already collecting (for example, characters in this play talk about Poland and Kraków, but so far you don't gather data about places). We have IDs for them in our database on TEI Panorama, and people also have a WIKIID where possible (other types will follow in the future ;)).

So the question is whether we should wipe out this data (so that it is lost for DraCor) when transforming from the TEI Panorama schema to the DraCor schema, or whether it should be converted via a Python script to <rs> elements, with <rs> added to the ODD?
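
For illustration, the conversion option could look roughly like the sketch below. It is not an existing TEI Panorama or DraCor script. It assumes that mentions are <name>, <phr> or <title> elements in the TEI namespace that carry a ref attribute pointing at the entity; the IDs #krakow and #zborowski and the sample sentence are made up for the example.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
ET.register_namespace("", TEI_NS)  # serialize TEI as the default namespace

# Assumption: these are the elements that carry entity mentions in the source.
MENTION_TAGS = {f"{{{TEI_NS}}}{name}" for name in ("name", "phr", "title")}


def mentions_to_rs(root):
    """Rename entity-bearing elements to <rs> in place and return the count."""
    converted = 0
    for el in root.iter():
        # Only touch elements that actually point at an entity via @ref.
        if el.tag in MENTION_TAGS and el.get("ref"):
            el.tag = f"{{{TEI_NS}}}rs"
            converted += 1
    return converted


# Hypothetical input; attribute values are invented for this sketch.
sample = (
    f'<p xmlns="{TEI_NS}">They spoke of '
    '<name type="place" ref="#krakow">Kraków</name> and '
    '<phr type="person" ref="#zborowski">our dearest friend</phr>.</p>'
)
root = ET.fromstring(sample)
count = mentions_to_rs(root)
result = ET.tostring(root, encoding="unicode")
print(count)
print(result)
```

Elements without a ref attribute (for example a <title> in the teiHeader) are left alone, and the existing type and ref attributes are carried over unchanged onto the <rs> element.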

ingoboerner commented 3 months ago

@aszulinska maybe have a look at this corpus that is derived from the German Drama Corpus: https://github.com/quadrama/gerdracor-coref

aszulinska commented 3 months ago

In this corpus (and in the tools tested on annotating German texts) they use this encoding:

<sp who="#sara">
                <rs ref="#sara"><speaker>SARA.</speaker></rs>
                <p> Ich habe <rs ref="#ein_gewisses_vermächtnis"><rs ref="#ein_gewisses_vermächtnis">es</rs></rs> nicht vergessen, <rs ref="#mellefont"><name ref="#mellefont">Mellefont</name></rs>. <rs ref="#mellefont"><rs ref="#mellefont">Sie</rs></rs> wollen vorher <rs ref="#ein_gewisses_vermächtnis"><rs xml:id="ein_gewisses_vermächtnis">ein gewisses Vermächtnis</rs></rs> retten. – <rs ref="#Sie-wollen-vorher-zeitliche-Güter-retten"><rs ref="#mellefont"><rs ref="#mellefont">Sie</rs></rs> wollen vorher <rs xml:id="zeitliche_güter">zeitliche Güter</rs> retten</rs>, und mich vielleicht <rs ref="#ewige_güter"><rs xml:id="ewige_güter">ewige</rs></rs> <rs ref="#Sie-wollen-vorher-zeitliche-Güter-retten">darüber</rs> verscherzen lassen.</p>
              </sp>

From this play it is hard for me to tell what they annotate apart from speakers (what type of data), but the encoding with <rs ref="#sara"> is exactly what @ingoboerner proposed in #55, so is it fine with us?

cmil commented 3 months ago

This is not valid TEI, since the sp element cannot have rs elements as direct child elements (https://tei-c.org/release/doc/tei-p5-doc/en/html/ref-sp.html). And in the case of sp/speaker, TEI already provides a way to identify the speaker by means of the who attribute, so I don't even see the need to use an extra rs element.

Also, the frequent double wrapping of text within rs elements with the same ref attribute looks like an artefact of poor automation to me. And I get 1381 errors when I open https://raw.githubusercontent.com/quadrama/gerdracor-coref/gold/tei/Sara.xml in Oxygen. All in all this particular document does not seem like a good example.
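
The two structural problems named above (rs as a direct child of sp, and rs wrapped in another rs with the same ref) can be located with a short script. This is only a rough sketch and no substitute for validating against a schema; the function names are made up, and the sample is a shortened version of the snippet quoted in this thread.

```python
import xml.etree.ElementTree as ET


def local(tag):
    """Strip a '{namespace}' prefix so the check works with or without the TEI namespace."""
    return tag.rsplit("}", 1)[-1]


def find_rs_problems(root):
    """Return (rs elements directly under sp, rs elements whose parent rs has the same ref)."""
    direct_under_sp = []
    redundant_nesting = []
    for parent in root.iter():
        for child in parent:
            if local(child.tag) != "rs":
                continue
            if local(parent.tag) == "sp":
                direct_under_sp.append(child)
            elif (local(parent.tag) == "rs"
                  and child.get("ref") is not None
                  and child.get("ref") == parent.get("ref")):
                redundant_nesting.append(child)
    return direct_under_sp, redundant_nesting


# Shortened from the encoding quoted earlier in the thread.
sample = """<sp who="#sara">
  <rs ref="#sara"><speaker>SARA.</speaker></rs>
  <p>Ich habe <rs ref="#ein_gewisses_vermächtnis"><rs ref="#ein_gewisses_vermächtnis">es</rs></rs>
  nicht vergessen, <rs ref="#mellefont"><name ref="#mellefont">Mellefont</name></rs>.</p>
</sp>"""

under_sp, redundant = find_rs_problems(ET.fromstring(sample))
print(len(under_sp), len(redundant))
```

On the shortened sample this reports the rs around the speaker element and the inner rs around "es"; the rs around the name element is not flagged, because its child is not an rs.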

That said, I would agree to enable the rs element in the DraCor schema in places where TEI-all would allow it.