dracor-org / dracor-schema

ODD and schemas for dracor.org files
https://dracor.org/doc/odd

Revamp <sourceDesc> #38

Closed lehkost closed 2 years ago

lehkost commented 2 years ago

To even out differences in the encoding of <sourceDesc>, we propose the following structure (the example data used is not meaningful):

<sourceDesc>
  <bibl type="digitalSource">
    <publisher>TextGrid Repository</publisher>
    <ref target="http://www.textgridrep.org/textgrid:rksp.0" />
    <availability status="free">
      <p>The full text of this work is in the public domain.</p>
    </availability>
  </bibl>
  <bibl type="printEdition">
    <title>Gotthold Ephraim Lessing: Werke. Herausgegeben von Herbert G. Göpfert in Zusammenarbeit mit Karl Eibl, Helmut Göbel, Karl S. Guthke, Gerd Hillen, Albert von Schirmding und Jörg Schönert, Band 1–8, München: Hanser, 1970 ff.</title>
  </bibl>
  <bibl type="firstEdition">
    <date type="print" when="1656" />
    <date type="premiere" />
    <date type="written" when="1589" />
  </bibl>
</sourceDesc>

So, for GerDraCor (and most other corpora) this would mean renaming <bibl type="originalSource"> to <bibl type="printEdition">. It would also mean introducing a new <bibl type="firstEdition"> element that holds all the date fields (to decouple them from the "printEdition" information). In addition, <name>TextGrid Repository</name> + <idno type="URL">http://www.textgridrep.org/textgrid:rksp.0</idno> would change to <publisher>TextGrid Repository</publisher> + <ref target="http://www.textgridrep.org/textgrid:rksp.0" />.
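The renames described above could be scripted roughly as follows. This is only a minimal sketch, assuming Python's standard xml.etree.ElementTree and an old-style header in which the "originalSource" <bibl> is nested inside the "digitalSource" <bibl>; the helper names q and migrate_source_desc and the OLD sample are hypothetical, and the sketch does not un-nest the <bibl> elements or move the dates into a "firstEdition" <bibl>.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
ET.register_namespace("", TEI_NS)


def q(tag):
    """Return the namespace-qualified name of a TEI element (hypothetical helper)."""
    return f"{{{TEI_NS}}}{tag}"


def migrate_source_desc(source_desc):
    """Apply only the renames from the proposal, in place (hypothetical helper)."""
    for bibl in source_desc.iter(q("bibl")):
        # <bibl type="originalSource"> becomes <bibl type="printEdition">
        if bibl.get("type") == "originalSource":
            bibl.set("type", "printEdition")
        if bibl.get("type") == "digitalSource":
            for child in list(bibl):
                if child.tag == q("name"):
                    # <name> becomes <publisher>
                    child.tag = q("publisher")
                elif child.tag == q("idno") and child.get("type") == "URL":
                    # <idno type="URL">...</idno> becomes <ref target="..."/>
                    child.tag = q("ref")
                    child.set("target", (child.text or "").strip())
                    del child.attrib["type"]
                    child.text = None
    return source_desc


# Assumed old-style input, modelled on the GerDraCor example in this issue.
OLD = """<sourceDesc xmlns="http://www.tei-c.org/ns/1.0">
  <bibl type="digitalSource">
    <name>TextGrid Repository</name>
    <idno type="URL">http://www.textgridrep.org/textgrid:rksp.0</idno>
    <bibl type="originalSource">
      <title>Gotthold Ephraim Lessing: Werke.</title>
    </bibl>
  </bibl>
</sourceDesc>"""

migrated = migrate_source_desc(ET.fromstring(OLD))
print(ET.tostring(migrated, encoding="unicode"))
```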

An interesting case for comparison is the documentation of the ELTeC header.

ingoboerner commented 2 years ago

The example above contains two elements <bibl> of the @type "printEdition". Presumably, one should be firstEdition as proposed via E-Mail:

The main idea is that we introduce a new element into which all three <date> elements move. I am totally fine with including <date type="print"/> and <date type="written" />. Would including <date type="premiere" /> explicitly mean that the "source" for the first staging of a play IS actually the first edition? This might not always be the case, and we had that problem before as well.

ingoboerner commented 2 years ago

... and how would we normalize, if corpora included very detailed encoding of the sources. @peertrilcke proposed to use <biblStruct>.

ingoboerner commented 2 years ago

I just checked where this would probably affect the code of the still unfinished rdf module. Renaming will probably affect the extractor functions in the util module, for example dutil:get-play-info: https://github.com/dracor-org/dracor-api/blob/main/modules/util.xqm#L856-L857. Would the key of the map "originalSource" change as well? See: https://github.com/dracor-org/dracor-api/blob/main/modules/util.xqm#L944. That would in turn affect modules that use the extractor functions, e.g. https://github.com/dracor-org/dracor-api/blob/clscor_rdf/modules/rdf.xqm#L1428 https://github.com/dracor-org/dracor-api/blob/clscor_rdf/modules/rdf.xqm#L1058 ...

lehkost commented 2 years ago

The example above contains two elements of the @type "printEdition".

Yes, my mistake, it's corrected above, the other is indeed "firstEdition".

lehkost commented 2 years ago

... and how would we normalize, if corpora included very detailed encoding of the sources. @peertrilcke proposed to use <biblStruct>.

Quote: "<bibl> contains a loosely-structured bibliographic citation of which the sub-components may or may not be explicitly tagged." (TEI Guidelines) – so I think we can stick to <bibl> because it could be semantically tagged/structured as well. I'm not really sure about what <biblStruct> would add except for a specified order?

lehkost commented 2 years ago

I am totally fine with including <date type="print"/> and <date type="written" />; Would including <date type="premiere" /> explicitly mean, that the "source" for the first staging of a play IS actually the first edition? This might not always be the case and we had that problem before as well..

Yes, true. Semantically, except for the firstEdition date, the other dates would be information not about the firstEdition, but about the work itself (when was it written, when was it staged for the first time). Not sure where else to put this data.

Same goes for the work-related Wikidata ID. Right now, we place it under <publicationStmt>, but it actually is a pointer to the actual work as-such, not our version of it.

So, premiere date, written date and Wikidata ID are information independent from our TEI version of a play. Where would be the natural place for such info without confusing things too much?

lehkost commented 2 years ago

would the key of the map "originalSource" change as well? see: https://github.com/dracor-org/dracor-api/blob/main/modules/util.xqm#L944

Good question. I think this is where things start to get complicated. ;) We could leave "originalSource" as is if the impact of changing it is too big, of course…

ingoboerner commented 2 years ago

Don't know for the TEI, really. rdf serialization will create:

Work-Level (here I add owl:sameAs to wikidata):

<https://dracor.org/entity/ger000088/work> a frbroo:F14_Individual_Work ;
    rdfs:label "Emilia Galotti [Work]" ;
    frbroo:R19i_was_realised_through <https://dracor.org/entity/ger000088/creation/0> ;
    frbroo:R40_has_representative_expression <https://dracor.org/entity/ger000088/expression/2> ;
    frbroo:R66i_had_a_performed_version_through <https://dracor.org/entity/ger000088/performance/premiere> ;
    frbroo:R9_is_realised_in <https://dracor.org/entity/ger000088/expression/1>,
        <https://dracor.org/entity/ger000088/expression/2> ;
    owl:sameAs <http://www.wikidata.org/entity/Q782653> .

At least two entities on expression level:

First publication:

<https://dracor.org/entity/ger000088/expression/1> a frbroo:F22_Self-Contained_Expression ;
    rdfs:label "Emilia Galotti [Text of first publication; Expression]" ;
    frbroo:R9i_realises <https://dracor.org/entity/ger000088/work> ;
    crm:P165i_is_incorporated_in <https://dracor.org/entity/ger000088/publication-expression/1> ;
    crm:P3_has_note "The text of 'Emilia Galotti' as found in its first publication." .

The "text" of the printed source that is included in DraCor (could be the same, but need not be):

<https://dracor.org/entity/ger000088/expression/2> a frbroo:F22_Self-Contained_Expression ;
    rdfs:label "Emilia Galotti [Expression]" ;
    frbroo:R40i_is_representative_expression_for <https://dracor.org/entity/ger000088/work> ;
    frbroo:R4_carriers_provided_by <https://dracor.org/entity/ger000088/manifestation/2> ;
    frbroo:R9i_realises <https://dracor.org/entity/ger000088/work> ;
    crm:P165i_is_incorporated_in <https://dracor.org/entity/ger000088>,
        <https://dracor.org/entity/ger000088/digitalsource/1>,
        <https://dracor.org/entity/ger000088/file/tei/in>,
        <https://dracor.org/entity/ger000088/file/tei/out>,
        <https://dracor.org/entity/ger000088/publication-expression/2> ;
    crm:P3_has_note "The text of 'Emilia Galotti' as found in the edition: Gotthold Ephraim Lessing: Werke. Herausgegeben von Herbert G. Göpfert in Zusammenarbeit mit Karl Eibl, Helmut Göbel, Karl S. Guthke, Gerd Hillen, Albert von Schirmding und Jörg Schönert, Band 1–8, München: Hanser, 1970 ff." .

The Performance is an entity itself, which R66_included_performed_version_of the work. There is no direct link to the text included in DraCor:

<https://dracor.org/entity/ger000088/performance/premiere> a frbroo:F31_Performance ;
    rdfs:label "Premiere of Lessing, Gotthold Ephraim: Emilia Galotti. Ein Trauerspiel in fünf Aufzügen" ;
    frbroo:R66_included_performed_version_of <https://dracor.org/entity/ger000088/work> ;
    crm:P2_has_type <https://dracor.org/entity/type/performance/premiere> ;
    crm:P4_has_time-span <https://dracor.org/entity/ger000088/performance/premiere/ts> .

The "written"-info is super difficult, if I want to date it. I use two "activities" for that:

<https://dracor.org/entity/ger000088/creation/0> a frbroo:F28_Expression_Creation ;
    rdfs:label "Writing of 'Emilia Galotti', until it was first published." ;
    frbroo:R19_created_a_realization_of <https://dracor.org/entity/ger000088/work> ;
    crm:P134i_was_continued_by <https://dracor.org/entity/ger000088/creation/0/end> ;
    crm:P14_carried_out_by <https://dracor.org/entity/Q34628> ;
    crm:P2_has_type <https://dracor.org/entity/type/activity/writing> ;
    crm:P4_has_time-span <https://dracor.org/entity/ger000088/creation/0/ts> .

I can only really "date" the finishing activity, not really the whole process of "writing":

<https://dracor.org/entity/ger000088/creation/0/end> a crm:E7_Activity ;
    rdfs:label "Finishing writing 'Emilia Galotti', resulting in the text being ready for it's first publication." ;
    crm:P134_continued <https://dracor.org/entity/ger000088/creation/0> ;
    crm:P2_has_type <https://dracor.org/entity/type/activity/finishing> ;
    crm:P4_has_time-span <https://dracor.org/entity/ger000088/creation/0/end/ts> .

Not sure, if this really can/should go into the TEI. It's probably not the very best format to capture information about the work and surrounding events, e.g. performances.

lehkost commented 2 years ago

Thanks for the rdf serializations… I agree, it is difficult info to place in a TEI document. But for our standard analyses we need this info, especially to put plays in chronological order. I think we shouldn't just rely on other sources for this (plus, a lot of work has gone into looking all this up on our side).

So maybe, for lack of better ideas, we can put all three date types into the bibl-"firstEdition", we should just document it well. Wikidata ID for plays can also stay where it is, as long as there's documentation, too, and our API knows where to find it…

ingoboerner commented 2 years ago

I was thinking about the encoding, and I have to say: to me the nested <bibl> elements make perfect sense, because they explicitly convey that there is a "digitalSource" that was transformed into the DraCor file (outer <bibl>). This source is based on a printed edition of a text, hence called "originalSource" (inner <bibl>). If we put the <bibl> elements on the same level, we would lose this explicit relation between the sources. That would be an argument for the nesting.

Concerning the renaming of "originalSource" into "firstEdition": I agree it makes sense if it's always the first edition that has been digitized and used as the source for the <text> included in the file, but this might not always be the case. So basically, by adding information on the "first edition" to the header as a "source", we might be obfuscating the provenance of the text. In my opinion, the <sourceDesc> should only record the actual sources of the text, and if the first edition was not used (because another edition was the source of the "digitalSource"), it shouldn't be put there just to retrieve metadata on the "work". Naive as I am, I would say that the most general case actually is:

Most general case:

<bibl type="digitalSource">
<!-- based on: -->
    <bibl type="printedSource">
    </bibl>
</bibl>

Special case, when the "first edition" was used as a source in the digitization:

<bibl type="digitalSource">
<!-- based on: -->
    <bibl type="printedSource" subtype="firstEdition">
    </bibl>
</bibl>

an alternative would be to "tag" the nested <bibl> making use of the @ana-Attribute. @subtype could also hold other values, e.g. "criticalEdition" or whatever other printed sources we might encounter.
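To illustrate what the nesting buys a consumer of the header, here is a minimal sketch, assuming Python's standard xml.etree.ElementTree, that walks the nested <bibl> elements from the outer one inward and returns the provenance chain. The function name provenance_chain and the SAMPLE string are hypothetical; the @type and @subtype values are the ones proposed above.

```python
import xml.etree.ElementTree as ET

TEI = {"tei": "http://www.tei-c.org/ns/1.0"}

# The "special case" from above: the first edition was the digitized source.
SAMPLE = """<bibl xmlns="http://www.tei-c.org/ns/1.0" type="digitalSource">
  <!-- based on: -->
  <bibl type="printedSource" subtype="firstEdition"/>
</bibl>"""


def provenance_chain(bibl):
    """Follow nested <bibl> elements and collect (type, subtype) pairs.

    Hypothetical helper: each entry is "based on" the entry that follows it.
    """
    chain = []
    current = bibl
    while current is not None:
        chain.append((current.get("type"), current.get("subtype")))
        # Only a direct child <bibl> counts as the next link in the chain.
        current = current.find("tei:bibl", TEI)
    return chain


chain = provenance_chain(ET.fromstring(SAMPLE))
print(chain)
```

With sibling <bibl> elements this "based on" relation would have to be reconstructed from the @type values alone, which is the loss of explicitness argued above.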

As for the dates and all the information on the work, I think, we don't have to rush and look up other alternatives. It could also be a <note> in a <notesStmt> that would hold the information on the relevant dates. In this case, we wouldn't have to somewhat abuse the <sourceDesc> for that. Or look into the <profileDesc>, e.g. into <creation> more closely. Maybe the <profileDesc> would be more apt to hold the dates.

ingoboerner commented 2 years ago

https://tei-c.org/release/doc/tei-p5-doc/en/html/ref-standOff.html could be another option for the "works"; would go after the <teiHeader> and include a (to be defined) <listWork> (should check, I saw this invented element in Peter Stadler's TEIs) containing the <work>; there we could include all the <date> elements.

see issue: https://github.com/dracor-org/dracor-schema/issues/39

lehkost commented 2 years ago

Thanks for all the good ideas & sorry for the long pause on my end… These suggestions make perfect sense to me:

The latter would include our three date types plus the Wikidata link to the work (in general, all information on a work, not on a specific edition). Thinking about it, maybe also genre information (<textClass>) could go there, but not necessarily.

Proposition:

<idno type="wikidata" xml:base="http://www.wikidata.org/entity/">Q51529377</idno>
<date type="print" when="1813"/>
<date type="premiere" when="1811"/>
<date type="written" when="1811">wahrscheinlich im Winter 1811</date>

We could include this between <fileDesc> and <profileDesc>, but I'd prefer somewhere in the <teiHeader> as this is where information like this will be looked up. Any other thoughts on placing this?

The big advantage of this solution would also be that we can now better structure bibliographic information: we can either stick to a string within <title> or have detailed info like in ItaDraCor, which is no longer mixed up with information on premiere/print/written.

cmil commented 2 years ago

I like the idea of using nested <bibl>s in sourceDesc and keeping it strictly bibliographic.

I also think it's a good idea to keep the date type information separate from the bibliographical information, and standOff may be an appropriate choice. It seems to come with a few restrictions though.

The content model of the <standOff> container would not allow the above proposed elements idno and date as immediate children. For the dates maybe a listEvent could be used like this

<listEvent>
  <event type="print" when="1813">
    <label>Druck</label>
  </event>
  <event type="premiere" when="1811">
    <label>Premiere</label>
  </event>
  <event type="written" when="1811">
    <desc>geschrieben wahrscheinlich im Winter 1811</desc>
  </event>
</listEvent>

Admittedly this would be a bit verbose, but unfortunately the event elements cannot be empty.
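For comparison, reading the dates back out of such a <listEvent> could look like the following. This is only a minimal sketch, assuming Python's standard xml.etree.ElementTree and assuming the <listEvent> sits inside a <standOff> container; the function name event_dates and the SAMPLE string are hypothetical, and the fallback from @when to @notAfter and @notBefore is an assumption, not a documented DraCor rule.

```python
import xml.etree.ElementTree as ET

TEI = {"tei": "http://www.tei-c.org/ns/1.0"}

SAMPLE = """<standOff xmlns="http://www.tei-c.org/ns/1.0">
  <listEvent>
    <event type="print" when="1813"><label>Druck</label></event>
    <event type="premiere" when="1811"><label>Premiere</label></event>
    <event type="written" when="1811">
      <desc>geschrieben wahrscheinlich im Winter 1811</desc>
    </event>
  </listEvent>
</standOff>"""


def event_dates(stand_off):
    """Map each event @type to its date string (hypothetical helper).

    Assumption: an exact @when is preferred; otherwise the upper bound
    (@notAfter) and then the lower bound (@notBefore) is used.
    """
    dates = {}
    for event in stand_off.findall("tei:listEvent/tei:event", TEI):
        kind = event.get("type")
        value = event.get("when") or event.get("notAfter") or event.get("notBefore")
        if kind and value:
            dates[kind] = value
    return dates


dates = event_dates(ET.fromstring(SAMPLE))
print(dates)
```

The date values are returned as strings so that no assumption is made about their precision or era.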

I'm not sure what would be a good alternative for idno to put into standOff.

We could include this between <fileDesc> and <profileDesc>, but I'd prefer somewhere in the <teiHeader> as this is where information like this will be looked up. Any other thoughts on placing this?

The <standOff> container can only be a sibling of the <teiHeader>, see https://tei-c.org/release/doc/tei-p5-doc/en/html/SA.html#SASOstdf.

As an alternative to standOff we could also consider xenoData which would allow us to use an XML vocabulary of our choice. This is what the EarlyPrint TEIs are using.

lehkost commented 2 years ago

Thanks for assembling these possibilities!

So with <xenoData> we could be maximally brief:

<xenoData>
  <wikidata>Q782653</wikidata>
  <date type="print" when="1766">1766</date>
  <date type="premiere" />
  <date type="written" notBefore="1757" notAfter="1758" />
</xenoData>

We could place it as next sibling of <teiHeader>.

The <standOff> solution really looks a bit verbose for our purposes, plus there's the question of how to appropriately squeeze in the Wikidata ID.

cmil commented 2 years ago

The xenoData element is supposed to be a child of teiHeader. Apparently it can contain only one direct child element and this element cannot be in the TEI namespace (at least that's what Oxygen seems to enforce). So something like this could work:

<xenoData>
  <work xmlns="">
    <wikidata-id>Q782653</wikidata-id>
    <date type="print" when="1766"/>
    <date type="written" notBefore="1757" notAfter="1758"/>
  </work>
</xenoData>

I would suggest to omit date elements for which we don't have actual data.

Instead of using the empty namespace we could formally define our own. And we should probably add appropriate checks to the Schematron.

On the other hand we could shorten the standOff a bit by just inserting empty desc elements to satisfy the schema. And we could maybe use the link element for the Wikidata ID:

<standOff>
  <listEvent>
    <event type="print" when="1813"><desc/></event>
    <event type="premiere" when="1811"><desc/></event>
    <event type="written" when="1811">
      <desc>geschrieben wahrscheinlich im Winter 1811</desc>
    </event>
  </listEvent>
  <link type="wikidata" target="http://www.wikidata.org/entity/Q51529377"/>
</standOff>

It would have the advantage that we don't have to invent our own meta data markup.

ingoboerner commented 2 years ago

I was puzzled about this new <link> element in the <standOff> container. I think, it's not possible to use it that way. Please look at the available example in the TEI here: https://tei-c.org/release/doc/tei-p5-doc/en/html/ref-link.html You would need at least two references to establish a link, see <link target="#R1 #R3 #R4"/>; this is explicitly enforced by schematron in tei-all, see the "Schematron" box:

<sch:assert test="contains(normalize-space(@target),' ')">You must supply at least two values for @target or on <sch:name/>
</sch:assert>
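
The effect of this assertion on the proposed single-target <link> can be reproduced outside a validator. The following is a minimal sketch in Python that mirrors the XPath test contains(normalize-space(@target),' '); the function names normalize_space and link_target_is_valid are hypothetical.

```python
import re

# Whitespace characters as understood by XPath's normalize-space().
XPATH_WS = " \t\r\n"


def normalize_space(value):
    """Strip leading/trailing whitespace and collapse internal runs to one space."""
    return re.sub(r"[ \t\r\n]+", " ", value.strip(XPATH_WS))


def link_target_is_valid(target):
    """Mirror of the Schematron test: @target must hold at least two values."""
    return " " in normalize_space(target)


# A single Wikidata URI, as in the <link> proposed above, fails the test.
print(link_target_is_valid("http://www.wikidata.org/entity/Q51529377"))
# The example from the TEI documentation passes.
print(link_target_is_valid("#R1 #R3 #R4"))
```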

lehkost commented 2 years ago

Oh, okay! So let's revamp our solution another time. 😊

Propositions:

  1. Add xml:id to TEI root element: <TEI xmlns="http://www.tei-c.org/ns/1.0" xml:id="ger000171" xml:lang="ger">, then integrate this ID in the standOff part: <link target="#ger000171 http://www.wikidata.org/entity/Q42187688"/>.
  2. Use <relation active="#ger000171" passive="http://www.wikidata.org/entity/Q42187688" name="wikidata"/> within <listRelation>, which is allowed in standOff.
  3. Another one?

cmil commented 2 years ago

After (re)reading the TEI documentation for link and relation I would say relation would be more appropriate. The link element really seems to be meant for linking elements inside a document which is not what we are doing here. (Not sure how I missed that the first time.)

If we want to use the short form active="#ger000171", I think, we would still have to add the xml:id to the TEI element. Alternatively we could probably use a full URI. Would that be e.g. https://dracor.org/entity/ger000171 then?

lehkost commented 2 years ago

Full DraCor URL would be okay, too, I think. But in this case it should be /id/, not /entity/, for DraCor links, shouldn't it? (i.e., https://dracor.org/id/ger000171)

But just to be sure, would there be another advantage beyond solving our current problem if we gave the TEI root element our xml:id? A somewhat unwelcome side effect would be that we would double the information in our own files since play IDs are already stored within <publicationStmt>, no?

cmil commented 2 years ago

I would assume we would relate the DraCor entity to the Wikidata one, that's why /entity/ in both cases.

And if we add the DraCor ID to the root element we might even drop the idno from publicationStmt.

But @ingoboerner should probably have the last word on that.

lehkost commented 2 years ago

Ok, after a quick discussion, this seems to be our final verdict:

<TEI xmlns="http://www.tei-c.org/ns/1.0" xml:id="ger000171" xml:lang="ger">

and

<listRelation>
  <relation active="https://dracor.org/entity/ger000171" passive="http://www.wikidata.org/entity/Q42187688" name="wikidata"/>
</listRelation>
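
A consumer could then pick up both identifiers roughly like this. This is only a minimal sketch, assuming Python's standard xml.etree.ElementTree and assuming the <listRelation> is placed inside a <standOff> sibling of the <teiHeader>, as discussed above; the function name play_ids and the SAMPLE document (with an empty <teiHeader> for brevity) are hypothetical.

```python
import xml.etree.ElementTree as ET

TEI = {"tei": "http://www.tei-c.org/ns/1.0"}
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"
WIKIDATA_PREFIX = "http://www.wikidata.org/entity/"

# Abbreviated document; a real file has a full <teiHeader> and a <text>.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0" xml:id="ger000171" xml:lang="ger">
  <teiHeader/>
  <standOff>
    <listRelation>
      <relation active="https://dracor.org/entity/ger000171" passive="http://www.wikidata.org/entity/Q42187688" name="wikidata"/>
    </listRelation>
  </standOff>
</TEI>"""


def play_ids(tei_root):
    """Return (DraCor ID, Wikidata ID) for a play (hypothetical helper).

    The DraCor ID is read from @xml:id on the root element, the Wikidata ID
    from the first <relation name="wikidata"> whose @passive is an entity URI.
    """
    dracor_id = tei_root.get(XML_ID)
    wikidata_id = None
    for rel in tei_root.findall("tei:standOff/tei:listRelation/tei:relation", TEI):
        passive = rel.get("passive", "")
        if rel.get("name") == "wikidata" and passive.startswith(WIKIDATA_PREFIX):
            wikidata_id = passive[len(WIKIDATA_PREFIX):]
            break
    return dracor_id, wikidata_id


ids = play_ids(ET.fromstring(SAMPLE))
print(ids)
```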

cmil commented 2 years ago

@lehkost @ingoboerner Would we then deprecate <idno type="dracor" xml:base="https://dracor.org/id/">ger000171</idno> in publicationStmt or make it optional?

lehkost commented 2 years ago

Yes, we would take this out so as not to double the information. Is there anything we would lose by doing so in your opinion?

cmil commented 2 years ago

I can't think of anything at the moment.

lehkost commented 2 years ago

Then let's go. 😬