distributed-text-services / specifications

Specifications for the DTS API
https://w3id.org/dts

Citation Model and TEI #110

Closed hcayless closed 3 years ago

hcayless commented 6 years ago

Discussion of #101 has surfaced a limitation in TEI itself: it does not expose a sufficient method for declaring the citation model of a document's content. This ticket is intended as a gathering place for an eventual proposal to the TEI for a revised or new citation model declaration.

The requirement as far as DTS is concerned is that an endpoint (either the collections or navigation endpoint—tbd) should be able to expose a machine-readable data structure that can be used to generate a full or partial table of contents for a given text. This means knowing how a text should be "chunked" for delivery. For example, should it be divided into chapters and sections, or books, or books and poems, etc.? Beyond that, some degree of identification of the levels of citation is important, because it will be commonplace to deal with the lowest level of citation differently than its "parents"—we will want to cite lines in a poem, but will not want to generate a ToC that lists each line. From the TEI point of view, what we would like is to be able to say to encoders "put a structure like x in your document, and DTS will automatically know how to deal with it and resolve citations into it".

TEI provides a couple of mechanisms in the refsDecl element for describing how standard ("canonical") citations should be resolved: canonical reference patterns and reference states.

Canonical reference patterns function by providing a regular expression pattern and replacement that can be used to transform a canonical reference form (e.g. "1.2.3") into something that can address the XML structure of the document (e.g. an XPointer like #xpath(//div[@n='1']/ab[@n='2']/l[@n='3'])). This is useful, but fails to actually describe how canonical references are constructed. It also leaves unspecified which pattern applies to which section of the document, which may make things difficult in cases where document structures are not homogeneous.
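To make the mechanism concrete, here is a minimal Python sketch of how a processor might apply one match/replacement pair. The function name `resolve_cref` is hypothetical, and the handling of `$1`-style group references is an assumption about how an implementation would interpret the replacement string; TEI itself does not prescribe any code.

```python
import re

def resolve_cref(cref, match_pattern, replacement_pattern):
    """Turn a canonical reference into a pointer string, or None if it does not match.

    Hypothetical helper: "$1", "$2", ... in the replacement are substituted
    with the corresponding capture groups of the match pattern.
    """
    m = re.fullmatch(match_pattern, cref)
    if m is None:
        return None
    # Replace each $N with the text captured by group N.
    return re.sub(r"\$(\d+)", lambda g: m.group(int(g.group(1))), replacement_pattern)

pointer = resolve_cref(
    "1.2.3",
    r"(\d+)\.(\d+)\.(\d+)",
    "#xpath(//div[@n='$1']/ab[@n='$2']/l[@n='$3'])",
)
print(pointer)
```

Note that nothing in this process tells the caller which references exist in the document; it only rewrites a reference that is already known.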

Reference states address some of these concerns, but are explicitly limited in the TEI documentation to documents that use milestones to indicate document citation structures. Presumably this is envisioned as a method to be used when the document's XML structure does not match its citation structure. refState might prove more useful as a declaration of citation structure, since it can specify what type of unit is being referenced (using the @unit attribute, e.g. "book", "line", etc.). It lacks, however, cRefPattern's ability to map a citation level to an XML structure.

A third possible mechanism in TEI would be for each document to add an explicit table of contents in some standard form that DTS could leverage. This might be too verbose for simple cases and might be too much to ask of encoders. An upside is that it would make naming of sections straightforward for cases where that is desirable.

hcayless commented 6 years ago

A further note and a suggestion: some mechanism like this will be necessary (absent prior knowledge) for the construction of a passage listing in the Navigation endpoint. It answers the question of how a DTS navigation endpoint knows what references are available in a document.

Wondering about a structure like the following examples (all element and attribute names up for grabs).

Ref system for http://papyri.info/ddbdp/bgu;1;3/source (which has recto/verso and then lines):

<refsDecl>
    <thing unit="side" match="//div[@type='textpart']" key="@n">
        <thing unit="line" match="//lb" key="@n"/>
    </thing>
</refsDecl>

Ref system for a hypothetical edition of Ovid's Tristia:

<refsDecl>
    <thing unit="book" match="//div[@type='edition']/div[@type='textpart']" key="@n">
        <thing unit="poem" match="//div[@type='edition']/div[@type='textpart']/div[@type='textpart']" key="@n">
            <thing unit="line" match="//div[@type='edition']//l" key="@n"/>
        </thing>
    </thing>
</refsDecl>

The idea is that a program reading this structure could construct a ToC at any level by querying the document with the XPath at that level and getting the "name" to be used for it. There may be no need to separate the structural chunk XPath from the name one and there may be no need to nest them. It strikes me that we could almost do this with refState now.

PonteIneptique commented 6 years ago

I am unsure about the role of this discussion in DTS specs itself (though I agree this is an interesting discussion and it is relevant for implementors).

I think I would add another thing : the ability to tell whether this is a side reference or the main citation system. Being able to cite note[@n="45"] does not have the same semantics as poem[@n="1"]. I'd like to have this kind of ability, like main="true" ? I don't know...

hcayless commented 6 years ago

I don't think it has direct relevance to the spec, but it will be very important to implementations and it may help us decide on DTS's view of the citation model, so it's worth doing here.

If I can try to restate and extend your requirement: we need a way to specify where in the document (e.g. front matter, main text, appendices) a particular refsDecl applies and perhaps along with that what its function is (referencing notes or apparatus entries rather than the main text, e.g.).

PonteIneptique commented 6 years ago

> I don't think it has direct relevance to the spec, but it will be very important to implementations and it may help us decide on DTS's view of the citation model, so it's worth doing here.

Agreed. This might find its place in our doc ?

> we need a way to specify where in the document (e.g. front matter, main text, appendices) a particular refsDecl applies

Just to make sure : this is the current behavior you are defining here, right ?

> perhaps along with that what its function is (referencing notes or apparatus entries rather than the main text, e.g.).

Exactly : being able to make a distinction between editorial content (intro, apparatus, notes, bibliography, person list even ?) and the text for primary sources (Book>Poem>line).

Edit for wrong markdown markup

PietroLiuzzo commented 6 years ago

In my test implementation for Beta Masaheft I used the following

{ "dts:citePattern" : "(\\d+)", "dts:level" : 1, "label" : "chapter" }

where the information is retrieved from the TEI structure as we regulate it in our project guidelines, i.e. I know that a manuscript always has divs, pb and cb elements, and a work record will always have nested divs and can have l etc., so I can provide this information. However, I also have cases of parallel structuring, which would not fit the current implementation. For example, I have several pb elements (marking the end of the page in the edition or in a witness manuscript, which I use to link the correct photos) alongside divs: the divs determine the canonical structure, but the pb elements are more often the most used citation form.

<div type="textpart" subtype="chapter" n="1" xml:id="chapter1">
               <ab>
                  <pb n="1" corresp="#frisk"/>                  
                  <pb n="257" corresp="#mueller"/>
                  <pb n="40v" corresp="#P" facs="https://digi.ub.uni-heidelberg.de/diglit/iiif/cpgraec398/canvas/0084.json"/>
                  Τῶν ἀποδεδειγμένων ὅρμων τῆς <placeName ref="pleiades:39290">Ἐρυθρᾶς θαλάσσης</placeName> καὶ τῶν περὶ αὐτὴν ἐμπορίων πρῶτός ἐστιν λιμὴν <add place="margin">Μυὸς ὅρμος</add>
                  <metamark>⸏</metamark> τῆς Αἰγύπτου <placeName ref="pleiades:786069">Μυὸς</placeName> ὅρμος,
                  μετὰ δὲ αὐτὸν εἰσπλεόντων ἀπὸ χιλίων ὀκτακοσίων σταδίων ἐν δεξιᾷ <add place="margin">Βερ-νίκη</add> ἡ <placeName ref="pleiades:785986">Βερνίκη</placeName>· ἀμφοτέρων <supplied reason="omitted">δὲ</supplied>
                  oἱ λιμένες ἐν τῷ ἐσχάτῳ <metamark>⸏</metamark>τῆς Αἰγύπτου κόλποι <surplus>δὲ</surplus> τῆς <placeName ref="pleiades:39290">Ἐρυθρᾶς θαλάσσης</placeName> κεῖνται.
               </ab>
            </div>

#frisk is a <bibl> with the reference to the edition, and so is #mueller; #P is a (the) <witness>.

My implementation at the moment for cases like this simply ignores the pb structure and provides only the canonical one.

It would be nice to be able to say that to cite according to #frisk one needs to use the pb[@corresp='#frisk'], and that to cite according to the manuscript page one should look at #P (which would have here a different dts:citePattern), etc. And it would be nice to be able to exchange, i.e. to be able to say Chapter 1 = Frisk 1 = Mueller 257 = P 40v (even nicer if one also had the lines of the edition, which I do not have in this). This would give the same effect as the different numerations in a printed edition.
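As an illustration of the equivalence described above, a rough Python sketch could collect, for each chapter div, the pb numbers keyed by their @corresp value. The function name `concordance` and the trimmed-down sample are hypothetical, and the sketch assumes every relevant pb sits inside the chapter div, which will not hold when a page begins before the chapter does.

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for the chapter shown above (text content omitted).
SAMPLE = """
<body>
  <div type="textpart" subtype="chapter" n="1">
    <ab>
      <pb n="1" corresp="#frisk"/>
      <pb n="257" corresp="#mueller"/>
      <pb n="40v" corresp="#P"/>
      text of the chapter
    </ab>
  </div>
</body>
"""

def concordance(root):
    """Map each chapter number to {reference system: page number}."""
    table = {}
    for div in root.iter("div"):
        if div.get("subtype") != "chapter":
            continue
        # Only the pb elements found inside this chapter are recorded.
        table[div.get("n")] = {pb.get("corresp"): pb.get("n") for pb in div.iter("pb")}
    return table

table = concordance(ET.fromstring(SAMPLE))
print(table)
```

For the sample, chapter "1" maps to Frisk 1, Mueller 257 and P 40v, which is the row of a concordance table one would want to expose.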

PietroLiuzzo commented 6 years ago

(this is from the Periplus of the Erythraean Sea, which we happen to have in the collection of text for the Adulis attestation)

PonteIneptique commented 6 years ago

  1. I actually think that TEI should provide a way to express whether your target xpath is a milestone element or a container. Or we can expect knowledge from our user.
  2. We need to discuss when we want to move this discussion over to https://github.com/TEIC/TEI/issues
  3. @hcayless I had not noticed that your example does not reuse the current matchPattern / replacementPattern but rather moves to key / path. Any reason for this ? I actually find matchPattern and replacementPattern particularly useful.
  4. @PietroLiuzzo Last week we discussed the potential need for an API "spec" for concordance tables. We agreed upon its need, we are not sure it should be part of DTS at this point (but could become a working group down the line ?)

hcayless commented 6 years ago

@PonteIneptique I deliberately avoided it because, while it works well for the use case "I have a citation (e.g. "1.2.3") and I want to resolve that to part of my TEI document", it doesn't work well for the use case "I want to discover how I can cite parts of my document." Unless I'm mistaken, the latter is the point of a citation model. Really, we should support both use cases, which is why I used path / key. The latter could be used to construct an XPath with predicates, given a citation. A slightly better restatement of my made up example is:

<refsDecl>
    <thing unit="book" match="//div[@type='edition']/div[@type='textpart']" key="@n">
        <thing unit="poem" match="div[@type='textpart']" key="@n">
            <thing unit="line" match="ab//l" key="@n"/>
        </thing>
    </thing>
</refsDecl>

Where the successive "steps" give relative paths. Given a citation like 1.2.3 you could follow the steps to build an XPath that would get you line 3 of poem 2 of book 1. And you could also figure out that there are 5 books, compute the number of poems for each book, see that book 2 doesn't have poems, just lines, etc.
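The two uses described here (resolving a citation and discovering which citations exist) can be sketched in Python with the standard library. Everything below is an assumption for illustration: the `STEPS` list hard-codes the unit / match / key triples instead of reading them from a refsDecl, the sample document is invented, and the line step uses `ab/l` rather than `ab//l` to stay within the XPath subset that `xml.etree.ElementTree` is known to handle.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<TEI><text><body>
  <div type="edition">
    <div type="textpart" n="1">
      <div type="textpart" n="1"><ab><l n="1">a</l><l n="2">b</l></ab></div>
      <div type="textpart" n="2"><ab><l n="1">c</l></ab></div>
    </div>
    <div type="textpart" n="2">
      <div type="textpart" n="1"><ab><l n="1">d</l></ab></div>
    </div>
  </div>
</body></text></TEI>
"""

# (unit, relative path from the previous level, attribute used as key)
STEPS = [
    ("book", ".//div[@type='edition']/div[@type='textpart']", "n"),
    ("poem", "div[@type='textpart']", "n"),
    ("line", "ab/l", "n"),
]

def resolve(root, citation, delim="."):
    """Follow the steps to find the element a citation such as 1.2.3 points to."""
    node = root
    for (unit, match, key), part in zip(STEPS, citation.split(delim)):
        hits = [el for el in node.findall(match) if el.get(key) == part]
        if not hits:
            return None
        node = hits[0]
    return node

def list_refs(node, level=0, prefix=""):
    """Enumerate every (unit, reference) pair available below a node."""
    if level == len(STEPS):
        return []
    unit, match, key = STEPS[level]
    refs = []
    for el in node.findall(match):
        ref = prefix + el.get(key, "")
        refs.append((unit, ref))
        refs.extend(list_refs(el, level + 1, ref + "."))
    return refs

root = ET.fromstring(SAMPLE)
print(resolve(root, "1.1.2").text)
print([ref for unit, ref in list_refs(root) if unit == "book"])
```

The same step list drives both functions, which is the point being argued: the declaration describes how references are built, not just how to rewrite one that is already in hand.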

hcayless commented 6 years ago

Wondering whether <refState> could be perverted to our cause:

<refsDecl type="canonical" corresp="#edition">
  <refState unit="book" match="//div[@type='edition']/div[@type='textpart'][@subtype='book']" use="@n" delim="."/>
  <refState unit="poem" match="div[@type='textpart'][@subtype='poem']" use="@n" delim="."/>
  <refState unit="line" match="ab//l" use="@n"/>
</refsDecl>

which would mean borrowing @match from att.scoping and inventing a new attribute @use analogous to xsl:key/@use. It would also require that TEI stop insisting it can only be used with milestone elements.
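A small Python sketch of how a processor might use such a declaration to split a citation into named units follows. The attributes @match and @use are the invented ones from the example above, the function name `split_citation` is hypothetical, and treating @delim as a literal string (rather than a regex) is an assumption.

```python
import xml.etree.ElementTree as ET

DECL = """
<refsDecl type="canonical" corresp="#edition">
  <refState unit="book" match="//div[@type='edition']/div[@type='textpart'][@subtype='book']" use="@n" delim="."/>
  <refState unit="poem" match="div[@type='textpart'][@subtype='poem']" use="@n" delim="."/>
  <refState unit="line" match="ab//l" use="@n"/>
</refsDecl>
"""

def split_citation(decl_xml, citation):
    """Assign each part of a citation to the unit declared at that position."""
    parts = {}
    rest = citation
    for state in ET.fromstring(decl_xml).findall("refState"):
        delim = state.get("delim")
        if delim and delim in rest:
            value, rest = rest.split(delim, 1)
        else:
            # Last declared level, or a citation that stops early.
            value, rest = rest, ""
        parts[state.get("unit")] = value
        if not rest:
            break
    return parts

print(split_citation(DECL, "1.2.3"))
print(split_citation(DECL, "1.2"))
```

Because the order of the refState elements is read as largest to smallest unit, this sketch only covers flat, left-to-right schemes; branching or reversed schemes would need more than the declaration shown.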

PonteIneptique commented 6 years ago

My issue with the refState example (might be an oversight) is that it is missing nesting.

My issue with both @match and @use is that they assume an understanding of identifier collation, unlike matchPattern, which allows for understanding and parsing multiple kinds of identifiers... Knowing that 1.1.1 should be resolved by splitting it into three units using @delim feels like it is going to prevent different implementations (definitely right now in the refState example: without nesting, I don't know what to do with the previous identifier item). It also expects only left-to-right identifiers to some extent...

hcayless commented 6 years ago

It's a thought experiment. What does nesting add?

I may be being dense (it wouldn't be the first time), but I don't understand your objection. Are you suggesting we should be able to support 1,1,1 or 1;1;1 as well? Maybe @delim could be a regex then. Splitting on a regex is a pretty standard function. If you had very different citation schemes, wouldn't you want different <refsDecl>s to support them though? There's a limit to what you can do even with regex replacements.

To my eye, this supports the same functionality as <cRefPattern> as well as telling me how to do things like extract all the references from a document (which I can't see how to do with <cRefPattern>).

Direction might be an issue, but I can imagine ways to deal with it, like numbering my <refStates/>.

PonteIneptique commented 6 years ago

I'll try to explain tomorrow more deeply if the following is not enough (I am quite tired I have to say :) ).

My first question would be : without nesting, and with the current refState example, how do you support the following system :

- book
  - poem
     - line
  - chapter
     - paragraph

My second question, with regard to not using matchPattern, is : if a citation system uses right-to-left, how do I know how to dispatch 1.1.1 (note that I am not sure it exists, but I see that as a potential limitation here) ?

hcayless commented 6 years ago

Note that I'm not against nesting, just trying to think through what a minimal set of changes to TEI might support with the thought that the fewer changes are needed, the easier it will be for the Council to approve it. You're right that in the case of a citation system that "branches", nesting might be a more natural way to deal with it.

I think for citation schemes where the order isn't larger structure -> smaller structure (and I don't know of any) we'd have to do something additional, like specify the order of evaluation of the units.

PonteIneptique commented 6 years ago

I know, I hope I do not seem blunt, because that is not my goal :). Just trying to find use cases where I think the proposals hit limitations...

PonteIneptique commented 6 years ago

To me, @type on refsDecl, @unit on cRefPattern and allowing nesting feel like the least dangerous option in terms of uncovered ground. This does not read too much into identifier structure and allows for clear nesting.

Btw, I love the idea of type="canonical". I just did not see it before.

hcayless commented 6 years ago

I'd like to see an example parallel to the ones above that uses <cRefPattern>. Ok to wait until you feel awake again!

PonteIneptique commented 6 years ago

Another Use Case

Before I continue, another example that calls for a more robust splitter : Matthew 22:7. I am actually wondering if Biblical studies can also have the equivalent Matt. 22:7 or something like that.

Example for Book-Poem-Line

<refsDecl type="canonical" corresp="#edition">
  <cRefPattern unit="book" matchPattern="(\d+)" 
  replacementPattern="#xpath(/TEI/text/body/div[@n='$1'])">
    <p>I also like the fact that cRefPattern allows for paragraphs</p>
    <cRefPattern unit="poem" matchPattern="(\d+)\.(\d+)" 
    replacementPattern="#xpath(/TEI/text/body/div[@n='$1']/div[@n='$2'])">
        <cRefPattern unit="line" matchPattern="(\d+)\.(\d+)\.(\d+)" 
        replacementPattern="#xpath(/TEI/text/body/div[@n='$1']/div[@n='$2']/l[@n='$3'])">
        </cRefPattern>
    </cRefPattern>
  </cRefPattern>
</refsDecl>

Example : More complex Citation Tree

<refsDecl type="canonical" corresp="#edition">
  <cRefPattern unit="book" matchPattern="(\d+)" 
  replacementPattern="#xpath(/TEI/text/body/div[@n='$1'])">
    <p>I also like the fact that cRefPattern allows for paragraphs</p>
    <cRefPattern unit="poem" matchPattern="(\d+)\.(\d+)" 
    replacementPattern="#xpath(/TEI/text/body/div[@n='$1']/div[@n='$2' and @subtype='poem'])">
      <cRefPattern unit="line" matchPattern="(\d+)\.(\d+)\.(\d+)" 
      replacementPattern="#xpath(/TEI/text/body/div[@n='$1']/div[@n='$2']/l[@n='$3'])">
      </cRefPattern>
    </cRefPattern>
    <cRefPattern unit="chapter" matchPattern="(\d+)\.(\w+)" 
    replacementPattern="#xpath(/TEI/text/body/div[@n='$1']/div[@n='$2' and @subtype='chapter'])">
      <cRefPattern unit="section" matchPattern="(\d+)\.(\d+)\.(\d+)" 
      replacementPattern="#xpath(/TEI/text/body/div[@n='$1']/div[@n='$2']/p[@n='$3'])">
      </cRefPattern>
    </cRefPattern>
  </cRefPattern>
</refsDecl>

Example : Right-to-left Identifier

<refsDecl type="canonical" corresp="#edition">
  <cRefPattern unit="book" matchPattern="(\d+)" 
  replacementPattern="#xpath(/TEI/text/body/div[@n='$1'])">
    <p>I also like the fact that cRefPattern allows for paragraphs</p>
    <cRefPattern unit="poem" matchPattern="(\d+)\.(\d+)" 
    replacementPattern="#xpath(/TEI/text/body/div[@n='$2']/div[@n='$1' and @subtype='poem'])">
      <cRefPattern unit="line" matchPattern="(\d+)\.(\d+)\.(\d+)" 
      replacementPattern="#xpath(/TEI/text/body/div[@n='$3']/div[@n='$2']/l[@n='$1'])">
      </cRefPattern>
    </cRefPattern>
  </cRefPattern>
</refsDecl>

Example : Biblical Example

Sorry for the units, I am really not versed in religious studies or modern religions at all..

<refsDecl type="canonical" corresp="#edition">
  <cRefPattern unit="?" matchPattern="(\w+)" 
    replacementPattern="#xpath(/TEI/text/body/div[@n='$1'])">
    <p>I also like the fact that cRefPattern allows for paragraphs</p>
    <cRefPattern unit="?" matchPattern="(\w+)\s+(\d+)" 
        replacementPattern="#xpath(/TEI/text/body/div[@n='$1']/div[@n='$2'])">
      <cRefPattern unit="?" matchPattern="(\w+)\s+(\d+):(\d+)" 
        replacementPattern="#xpath(/TEI/text/body/div[@n='$1']/div[@n='$2']/l[@n='$3'])">
      </cRefPattern>
    </cRefPattern>
  </cRefPattern>
</refsDecl>

PonteIneptique commented 6 years ago

BTW, one of the things that bugs me is (edited typo) the #xpath() requirement for @replacementPattern. I just discovered while writing this comment that it's most probably because of http://www.tei-c.org/release/doc/tei-p5-doc/fr/html/ref-prefixDef.html . But still

hcayless commented 6 years ago

So there are two things I don't like about this. The first is your own complaint about the #xpath() syntax. The reasoning, I assume, follows from the decision that the result of processing the regex replacement is to be a URI, and XPaths are not URIs, but there is an XPointer syntax to insert an XPath into a URI. It's a bit of a hack.

The second, is that, while this will let me resolve a reference, generating references is going to involve parsing the hacked up XPath and figuring out from it what the significant features are for the citation.

Maybe I'm missing the point here though: I'd like to be able not just to explain in a machine-actionable way, how to resolve citations to my document, but to explain how those are constructed (book.poem.line, e.g.) and how to extract the citations DTS will expose in the Navigation endpoint. I'd like to enable a system using TEI docs and DTS to have no "magic" in it—i.e. no special knowledge of how citations to this edition work and how they are instantiated in the TEI doc.

PonteIneptique commented 6 years ago

While I get your first point, I think this is definitely the cost of not creating too many attributes. And this is one of the things that makes me hesitate about your current refState proposal, which leads to 4 attribute creations instead of 2 in cRefPattern.

What you do not like, i.e. the generating, is not an issue to me, or at least I do not see it this way. Replacing capturing groups in a replacement pattern is rather easy. I also think that this is the only system that will allow the elasticity to deal with more kinds of identifiers. My feeling is that the delim and the expected order of things are going to constrain people much more than they free them.

Regardless of cRefPattern or refState, I do definitely think nesting is a requirement for coverage of the use cases.

The xPath also allows for things that could feel better for some users and developers such as :

<refsDecl type="canonical" corresp="#edition">
  <cRefPattern unit="book" matchPattern="(\d+)" 
  replacementPattern="#xpath(/TEI/text/body/div[@n='$1'])">
    <p>I also like the fact that cRefPattern allows for paragraphs</p>
    <cRefPattern unit="poem" matchPattern="(\d+)\.(\d+)" 
    replacementPattern="#xpath(/TEI/text/body/div/div[@n='$1.$2'])">
        <cRefPattern unit="line" matchPattern="(\d+)\.(\d+)\.(\d+)" 
        replacementPattern="#xpath(/TEI/text/body/div/div/l[@n='$1.$2.$3'])">
        </cRefPattern>
    </cRefPattern>
  </cRefPattern>
</refsDecl>

which is the most cost-efficient way of implementing a citation system, although I believe it to be potentially not foolproof and hard to maintain (hence the original Capitains decisions).

PonteIneptique commented 6 years ago

I edited all my proposal to remove tei: prefixes that are actually not necessary and completely obfuscate the original code.

hcayless commented 6 years ago

So is it the case that knowing how to get references from a document in a DTS system is just a priori knowledge we expect the implementor or system manager to have? If the following is not a use case we care about then your solution is probably fine.

Use case: "As the manager of a DTS instance, I want to be able to add a properly-configured TEI document to my system without having to manually generate a citation-to-document chunk mapping."

Using the cRefPattern method, I actually can't think of an algorithm to automatically extract passage identifiers for a given level. That will get super messy and (I think) require an XPath parser in whatever language you're using. But if it's not a requirement, then maybe it's not a problem.

PonteIneptique commented 6 years ago

Well, there are two or three things in your previous comments that I'd like to reply to, sometimes agreeing, sometimes not.

warning : I feel like there are two readings of your last paragraph. I address both of them in the second part of my comment but not in the first.

I feel like we can reach a consensus though. The differences between both approaches are limited.

  1. Nesting is a possibility for both, so there is that.
  2. The second important - to me - question is the freedom of identifier pattern.
  3. The last is some freedom of implementation : while Capitains pushes for a level-identifier system (book has 1, poem has 1 : identifier is 1.1), most people implementing passage identifier would do 1.1 in poem, 1 in book because it's consuming less energy.
  4. One of your major concerns is, if I understand correctly, not citation matching but citation list building ? An algorithm for that can be built.
  5. Another reading I have of your comment above is identifier building from a specific place in the document ? i.e. I go to some /TEI/text/body/div/div/l, which citation am I in ? I have to say, I could see this fixed by something like a simplified xPath pattern alongside the main one :
        <cRefPattern unit="line" matchPattern="(\d+)\.(\d+)\.(\d+)" 
        replacementPattern="#xpath(/TEI/text/body/div[@n='$1']/div[@n='$2']/l[@n='$3'])"
        recognitionPattern="#xpath(/TEI/text/body/div/div/l)"/>

which would allow, from bottom to top of the tree, to check for the deepest match and then reconstruct the identifier. Feels repetitive though.

I might not have completely understood the requirement. My (limited) experience is that most of the time, it's a top to bottom process : I get a list of identifiers, I pick identifiers, I retrieve nodes identified by the identifier. Not much the other way around. I feel like recognitionPattern would be an answer to this issue by simplifying the awful need to parse.

PS : For the joke, I actually understood part of your argument in one of those developers dreams. I need other hobbies...

hcayless commented 6 years ago

@PonteIneptique I really apologize if I'm making you dream in XML. I think we're getting closer. Here's the perspective I'm thinking of: Imagine I'm running the DLL DTS instance. I get submissions of new editions. Can I get each edition to declare its citation model in such a way that I don't have to do any extra work to ingest it (e.g. establish a custom set of XPaths to extract citations)? Do I have to examine every text to know how I should chunk it, or can the text tell the system how it thinks it should be chunked? So that's what my thought experiment encoding is trying to address:

  1. This is my citation scheme — @unit on <thing> or <refState>
  2. These are my chunks — @match
  3. These are my chunk identifiers — @use or, previously, @key

and my concern with cRefPattern is that it says only "this is my citation scheme and this is how it maps to chunks." I can see how you could extract chunk XPaths and identifiers from the @replacementPattern in the simple case, but I'm worried that someone could write a perfectly valid one that would break everything.
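To show both the simple case and the kind of breakage being worried about, here is a deliberately naive Python sketch that derives a chunk XPath and key attributes from a @replacementPattern. The function name `derive_recognition` is hypothetical, and the regex-based parsing is an assumption about what a lightweight implementation might try; it is not a real XPath parser.

```python
import re

def derive_recognition(replacement_pattern):
    """Return (generic chunk XPath, [(attribute, group number), ...]).

    Naive approach: strip predicates of the exact form [@attr='$N'] and
    remember which attribute carried which capture group.
    """
    m = re.fullmatch(r"#xpath\((.*)\)", replacement_pattern)
    if m is None:
        raise ValueError("not an #xpath() pointer")
    xpath = m.group(1)
    keys = re.findall(r"\[@(\w+)='\$(\d+)'\]", xpath)
    generic = re.sub(r"\[@\w+='\$\d+'\]", "", xpath)
    return generic, [(attr, int(group)) for attr, group in keys]

# Simple case: one predicate per level, one group per predicate.
simple = derive_recognition(
    "#xpath(/TEI/text/body/div[@n='$1']/div[@n='$2']/l[@n='$3'])"
)
print(simple)

# A perfectly valid pattern the naive parser cannot interpret.
tricky = derive_recognition("#xpath(/TEI/text/body/div/div[@n='$1.$2'])")
print(tricky)
```

In the simple case the sketch recovers /TEI/text/body/div/div/l with three @n keys; for the second pattern it finds no keys at all and leaves the predicate in place, so anything more robust would need real XPath parsing.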

PonteIneptique commented 6 years ago

I'll come back to the whole thing later, but I have to say :

> I'm worried that someone could write a perfectly valid one that would break everything.

Whatever we come up with, I feel like this will happen :D

Hence the practices we have today in Capitains to have CI to ensure things are okay :)

Maybe we can have a call now that examples are clear for both of us ?

PonteIneptique commented 6 years ago

Result of the conversation : moving to wiki

<tocPattern
  unit="book"
  matchPattern="(\d+)"
  xpath="/TEI/text/body/div[@n='$1']">

  <tocAbout unit="title" xpath="./title" />

  <tocPattern
    unit="book"
    matchPattern="(\d+)"
    xpath="/TEI/text/body/div[@n='$1']" /> <!-- Default ? -->

</tocPattern>

PonteIneptique commented 6 years ago

Examples and stuff : https://github.com/distributed-text-services/tei-proposal

PonteIneptique commented 5 years ago

Dear all, after a meeting with @hcayless , we kinda agreed on a starting point for our proposal to TEI Technical Council, the new elements are available here : https://github.com/distributed-text-services/tei-proposal/blob/master/NewProposal.md

We'd like it to be discussed during the next meeting if possible.

balmas commented 5 years ago

Could one of you provide a very brief summary of how the proposal relates to the current TEI P5 spec? At a quick glance it looks like it introduces a new child element for refsDecl named tocElement which has a child element named metadataDecl which is also new. Is that correct?

PonteIneptique commented 3 years ago

We can close this ! WOOOOOOOOHOOOOOOOO @hcayless :D