dracor-org / dracor-schema

ODD and schemas for dracor.org files
https://dracor.org/doc/odd

Split ODD into separate files and then combine in the compilation process? #66

Open ingoboerner opened 3 months ago

ingoboerner commented 3 months ago

As @cmil, @lehkost and I briefly discussed at the CCLS conference, the current ODD file feels cluttered and is quite hard to maintain. We may want to rework and modularize it so that we can adapt it to certain corpora more easily:

1) Maybe use "tei_all" as the base in the first place (minimal requirement: validate against TEI All) and include everything that is there, i.e. except="" on moduleRef; maybe only from the relevant modules, i.e.

<moduleRef key="core" except=""/>
<moduleRef key="tei" except=""/>
<moduleRef key="header" except=""/>
<moduleRef key="textstructure" except=""/>
<moduleRef key="drama" except=""/>
<moduleRef key="namesdates" except=""/>
<moduleRef key="corpus" except=""/>
<moduleRef key="linking" except=""/>
<moduleRef key="figures" except=""/>
<moduleRef key="analysis" except=""/>

compare to current version: https://github.com/dracor-org/dracor-schema/blob/00fb7ea86c11f47a0b871bc8fff9c30f891008fa/dracor.odd#L527-L550

This file might do nothing else.

2) We then need to restrict the usage of some elements, or change the content model or the values of some attributes that are relevant to the API. These changes affect elements and attributes that we need to restrict because the API to some degree expects to find certain things (e.g. @xml:id on the root <TEI>) and is confused by unexpected things, e.g. multiple <text> elements as in some swedracor files.

3) Take (2) and add examples (<exemplum>) where we have them for certain elements, e.g. currently https://github.com/dracor-org/dracor-schema/blob/00fb7ea86c11f47a0b871bc8fff9c30f891008fa/dracor.odd#L771-L869 In the "examples ODD" file we would do the following (maybe rework @source and use something that is based on a prefix defined in a <prefixDef> inside <listPrefixDef>):

<!-- author -->
                    <elementSpec ident="author" module="core" mode="change">
                        <exemplum source="#ger000546">
                            <egXML xmlns="http://www.tei-c.org/ns/Examples">
                                <author>
                                    <persName>
                                        <forename>Andreas</forename>
                                        <surname>Gryphius</surname>
                                    </persName>
                                    <idno type="wikidata">Q77214</idno>
                                    <idno type="pnd">118543032</idno>
                                </author>
                            </egXML>
                            <ab> Encoding of the author "Andreas Gryphius" of the play <ref
                                    target="https://dracor.org/id/ger000546">Leo Armenius oder
                                    Fürsten-Mord</ref>. </ab>
                        </exemplum>
                        <exemplum source="#rus000205">
                            <egXML xmlns="http://www.tei-c.org/ns/Examples">
                                <author>
                                    <persName>
                                        <forename>Владимир</forename>
                                        <forename type="patronym">Иванович</forename>
                                        <surname>Бельский</surname>
                                    </persName>
                                    <persName xml:lang="eng">
                                        <forename>Vladimir</forename>
                                        <surname>Belsky</surname>
                                    </persName>
                                    <idno type="wikidata">Q1259652</idno>
                                </author>
                            </egXML>
                            <ab>Encoding of the author "Владимир Иванович Бельский" of the play <ref
                                    target="https://dracor.org/id/rus000205">Сказание о невидимом
                                    граде Китеже и деве Февронии</ref>.</ab>
                        </exemplum>
                        <remarks>
                            <ab>For additional information on the encoding of author names and the
                                rationale also see the following GitHub issues:
                                <list>
                                    <item>
                                        <ref type="githubissue"
                                            target="https://github.com/dracor-org/dracor-api/issues/119"
                                            >https://github.com/dracor-org/dracor-api/issues/119</ref>
                                    </item>
                                    <item>
                                        <ref type="githubissue"
                                            target="https://github.com/dracor-org/dracor-schema/issues/21"
                                            >https://github.com/dracor-org/dracor-schema/issues/21</ref>
                                    </item>
                                </list>
                            </ab>
                        </remarks>
                    </elementSpec>
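
For the reworked @source, a minimal sketch, assuming a hypothetical prefix "dracor" declared in the teiHeader of the ODD itself. The prefix name and both patterns are assumptions and not part of the current ODD. With such a declaration an example could point to its play with source="dracor:ger000546" instead of source="#ger000546":

```xml
<!-- Goes into <encodingDesc> of the ODD's teiHeader.
     The prefix name "dracor" and the patterns are hypothetical. -->
<listPrefixDef>
  <prefixDef ident="dracor"
             matchPattern="([a-z]+[0-9]{6})"
             replacementPattern="https://dracor.org/id/$1">
    <p>Private URIs with the prefix "dracor" point to plays on dracor.org,
       e.g. dracor:ger000546 stands for https://dracor.org/id/ger000546.</p>
  </prefixDef>
</listPrefixDef>

<!-- Usage inside an elementSpec: -->
<exemplum source="dracor:ger000546">
  <egXML xmlns="http://www.tei-c.org/ns/Examples">
    <author><!-- content as in the Gryphius example --></author>
  </egXML>
</exemplum>
```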

But the question remains how we put that together in the end?
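
One possible shape for the restrictions in (2), as a sketch only. It assumes that the API needs exactly one <text> per file and an @xml:id on the root element, as described above. The constraintSpec ident and the message are made up, and the sch prefix is assumed to be bound to the Schematron namespace as in the other snippets in this thread:

```xml
<elementSpec ident="TEI" module="textstructure" mode="change">
  <!-- Schematron check: the API is confused by more than one <text> -->
  <constraintSpec ident="single_text_element" scheme="schematron" mode="add">
    <desc>A DraCor file should contain exactly one <gi>text</gi> element.</desc>
    <constraint>
      <sch:rule context="tei:TEI">
        <sch:assert test="count(tei:text) eq 1">Expected exactly one text
          element as child of TEI.</sch:assert>
      </sch:rule>
    </constraint>
  </constraintSpec>
  <!-- should make @xml:id mandatory on the root element -->
  <attList>
    <attDef ident="xml:id" mode="change" usage="req"/>
  </attList>
</elementSpec>
```

The attribute requirement ends up in the generated RelaxNG, while the <text> count is only checked where the Schematron rules are evaluated.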

ingoboerner commented 3 months ago

TEI stylesheet for merging TEI ODD specification with source to make a new source document. https://tei-c.org/release/doc/tei-xsl/odds/odd2odd0.html#odd2odd.xsl

This little guide is intended to explain the mechanism of ODD chaining. An ODD file specifies a particular view of the TEI, by selecting particular elements, attributes, etc. from the whole of the TEI. But you can also refine such a specification further, making your ODD derive from another one. In principle you can chain together ODDs in this way as much as you like. You can use this feature in several different ways:

• you can add additional restrictions to an existing ODD, for example by changing the value list of an attribute
• you can further reduce the subset of elements provided by an existing ODD
• you can add new elements or modules to an existing ODD

One [@source] with the value ‘mySuperODD.subset.xml’ will go looking for declarations in a file of that name in the current source tree. And one with the value ‘http://example.com/superODDs/anotherSubset.xml’ will go looking for it at the URL indicated.

https://teic.github.io/PDF/howtoChain.pdf
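
Applied to this repository, chaining could look roughly like the sketch below. The file name dracor-base.compiled.xml is hypothetical. As far as I understand the guide, @source has to point to a compiled ODD (the output of odd2odd.xsl), not to the raw ODD source, so the base ODD would have to be compiled first:

```xml
<!-- dracor.odd: derived ODD that refines a base ODD.
     The file name in @source is hypothetical. -->
<schemaSpec ident="dracor" start="TEI" source="dracor-base.compiled.xml">
  <!-- take over what the base ODD provides for these modules -->
  <moduleRef key="tei"/>
  <moduleRef key="header"/>
  <moduleRef key="core"/>
  <moduleRef key="textstructure"/>
  <moduleRef key="drama"/>

  <!-- further restrictions go here, e.g. changed value lists,
       constraints or examples -->
  <elementSpec ident="author" module="core" mode="change">
    <!-- exemplum, remarks, constraintSpec ... -->
  </elementSpec>
</schemaSpec>
```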

cmil commented 3 months ago

How about using the TEI Drama ODD provided by the TEI Consortium (also available with TEI Roma) as the source for the DraCor ODD? We would have to add some elements like particDesc, standOff and listEvent, which seem to be omitted there, and adjust some content models. But then we would perhaps already have a reasonable starting point.

ingoboerner commented 3 months ago

So we would need to use the https://tei-c.org/release/xml/tei/custom/odd/tei_drama.odd in the @source of <schemaSpec> and hope for the best? The old <schemaSpec> already included a good subset of elements I think. Will test it with the drama ODD though.

The legacy ODD included 82 elements; if I included all modules that were in the legacy ODD, we would end up with 315 elements.

The TEI Drama ODD includes the following modules:

<schemaSpec ident="tei_drama" start="TEI teiCorpus">
        <moduleRef key="header"/>
        <moduleRef key="core"/>
        <moduleRef key="tei"/>
        <moduleRef key="textstructure"/>
        <moduleRef key="linking"/>
        <moduleRef key="drama"/>
<!-- ... -->

The schema contains 226 elements.

cmil commented 3 months ago

> So we would need to use the https://tei-c.org/release/xml/tei/custom/odd/tei_drama.odd in the @source of <schemaSpec> and hope for the best? The old <schemaSpec> already included a good subset of elements I think. Will test it with the drama ODD though.

We could use Roma to start from the TEI Drama ODD, add the missing elements there and then use the resulting ODD for further refinement to our purposes.

ingoboerner commented 3 months ago

I already copied it together in my local draft of the ODD. It seems to work without @source, by explicitly re-using the module selection of the Drama ODD:

<div xml:id="div_schema">
                <head>Schema</head>
                <schemaSpec ident="dracor-api" docLang="en" prefix="tei_" xml:lang="en" start="TEI">

                    <!-- modules included in the tei_drama ODD:
                header, core, tei, textstructure, linking, drama
                -->
                    <moduleRef key="header"/>
                    <moduleRef key="core"/>
                    <moduleRef key="tei"/>
                    <moduleRef key="textstructure" except="div1 div2 div3 div4 div5 div6 div7"/>
                    <moduleRef key="linking"/>
                    <moduleRef key="drama"/>

                    <!-- The dracor-legacy ODD also included additional elements from the following modules: -->

                    <moduleRef key="namesdates"
                        include="event forename genName listEvent listPerson listRelation nameLink person personGrp persName relation surname"/>
                    <moduleRef key="corpus" include="particDesc"/>
                    <moduleRef key="figures" include="figure"/>
<!-- ... -->
</schemaSpec>

This results in 233 elements. Maybe we can later go through the element list and kick some of them out again. The next step would be to look into the requirements of the API, e.g. the specific encoding of the digital and original sources in the <bibl> elements in <sourceDesc>. I would do that with Schematron, e.g.

<!-- sourceDesc -->
                    <elementSpec ident="sourceDesc" module="header" mode="change">
                        <constraintSpec ident="digital_source_in_sourceDesc" scheme="schematron"
                            mode="add">
                            <desc>Checks if a digital source is present in the
                                <gi>sourceDesc</gi></desc>
                            <constraint>
                                <sch:rule context="tei:sourceDesc">
                                    <sch:assert test="tei:bibl[@type eq 'digitalSource']">Digital
                                        source is missing </sch:assert>
                                </sch:rule>
                            </constraint>
                        </constraintSpec>
                        <constraintSpec ident="original_source_in_sourceDesc" scheme="schematron"
                            mode="add">
                            <desc>Checks if an original source for a digital source is
                                available</desc>
                            <constraint>
                                <sch:rule
                                    context="tei:sourceDesc/tei:bibl[@type eq 'digitalSource']">
                                    <sch:assert test="tei:bibl[@type eq 'originalSource']">Original
                                        Source for digital source is missing </sch:assert>
                                </sch:rule>
                            </constraint>
                        </constraintSpec>
                    </elementSpec>
ingoboerner commented 3 months ago

OK, I would propose the following:

  1. Proceed with the base schema/odd as agreed above.
  2. Define the "feature" (see D7.1 Report On Programmable Corpora) inside the ODD, e.g.
<div xml:id="play_id">
                        <head>Play ID</head>
                        <p>Feature <idno type="feature-no">P2</idno> <idno type="feature-id">play_id</idno>: <name>DraCor ID</name> of the play, e.g. <val>ger000171</val>.</p>
                        <p>In the TEI source file the <name>DraCor ID</name> is contained in the attribute <att>xml:id</att> on the root element <gi>TEI</gi>.</p>
                        <p>The identifier SHOULD match the Regular Expression <val>^[a-z]+[0-9]{6}$</val>.</p>
                    </div>
  3. Add Schematron rules to check if the API will manage to return data for a feature, e.g.
<constraintSpec ident="valid_dracor_ids_on_root_tei_element"
                            scheme="schematron" mode="add" corresp="#play_id">
                            <desc>DraCor identifiers should consist of lower case letters followed by a six-digit number. The value is returned as feature
                            <ref target="#play_id">play_id</ref> in the API response object.</desc>
                            <constraint>
                                <sch:rule context="tei:TEI" role="warning">
                                    <sch:assert test="matches(./@xml:id,'^[a-z]+[0-9]{6}$')"> For
                                        DraCor IDs we recommend the pattern ^[a-z]+[0-9]{6}$
                                    </sch:assert>
                                </sch:rule>
                            </constraint>
                        </constraintSpec>

The result in the rendered HTML ODD:

[Screenshot 2024-06-19 at 13:39:43]

The Schematron rule links to the feature via <ref> and @corresp:

[Screenshot 2024-06-19 at 13:40:04]

The generated RelaxNG contains the Schematron rules and can be used in Oxygen to validate a file. In the example it now produces a warning:

[Screenshot 2024-06-19 at 13:41:08]
ingoboerner commented 3 months ago

There is another, additional option to check whether a TEI file supports certain API features. In <schemaSpec> we can include <constraintSpec> elements with Schematron rules that explicitly report (not assert!) that a certain condition in the encoding is met. An example: if the file contains <title type="main">Whatever Main Title</title>, the API will be able to return the title info in the response objects. We can now include a Schematron rule/constraintSpec that checks exactly for that and reports that the feature is supported:

[Screenshot 2024-06-26 at 15:02:56] [Screenshot 2024-06-26 at 15:03:58]

If it is not supported, I provide a "Warning", which might help encoders to add the elements that are needed for a feature to be supported:

[Screenshot 2024-06-26 at 15:04:50]
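
Since the rules themselves are only visible in the screenshots, here is a sketch of what such a report/assert pair could look like. The ident, the rule context and the messages are assumptions for illustration, not the actual rules from the ODD. The role value "information" is also an assumption; "warning" is the value already used for the DraCor ID rule.

```xml
<!-- child of <schemaSpec>; ident, context and messages are hypothetical -->
<constraintSpec ident="feature_title_main" scheme="schematron" mode="add">
  <desc>Reports whether the API can return the main title of a play.</desc>
  <constraint>
    <sch:rule context="tei:fileDesc/tei:titleStmt">
      <!-- report fires when the test is true: feature is supported -->
      <sch:report test="tei:title[@type eq 'main']" role="information">
        Main title found; the API can return the title. </sch:report>
      <!-- assert fires when the test is false: hint for encoders -->
      <sch:assert test="tei:title[@type eq 'main']" role="warning">
        No title with type="main" found; the API will not be able to
        return the title. </sch:assert>
    </sch:rule>
  </constraint>
</constraintSpec>
```

Both checks use the same test, so every file gets exactly one of the two messages for this feature.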