What's the citation model for DTS Resource ?

PonteIneptique commented 6 years ago

In our earlier discussion, we talked about exposing tei:refsDecl in the metadata of a Resource in the Collection API. My examples covered it this way:

{
    "@id" : "urn:cts:latinLit:phi1103.phi001.lascivaroma-lat1",
    "@type": "Resource",
    "...": "...",
    "tei:refsDecl": [
        {
            "tei:matchPattern":  "(\\w+)",
            "tei:replacementPattern": "#xpath(/tei:TEI/tei:text/tei:body/tei:div/tei:div[@n='$1'])",
            "@type": "poem"
        },
        {
            "tei:matchPattern":  "(\\w+)\\.(\\w+)",
            "tei:replacementPattern": "#xpath(/tei:TEI/tei:text/tei:body/tei:div/tei:div[@n='$1']//tei:l[@n='$2'])",
            "@type": "line"
        }
    ]
}

This would have the following consequences:

- the object is citable by "poem" or by "line" (because of the type);
- passages should always match one of the matchPattern values;
- the replacementPattern should lead to the beginning (in the case of milestones?) or to the container(s) of the element.
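
A minimal sketch of how a client or server could apply these patterns, assuming the two refsDecl entries from the example above are held in memory. The `REFS_DECL` list and the `resolve` helper are hypothetical names used only for illustration; they are not part of any DTS or TEI library.

```python
import re

# Hypothetical in-memory copy of the two refsDecl entries shown above.
REFS_DECL = [
    {
        "matchPattern": r"(\w+)",
        "replacementPattern": "#xpath(/tei:TEI/tei:text/tei:body/tei:div/tei:div[@n='$1'])",
        "type": "poem",
    },
    {
        "matchPattern": r"(\w+)\.(\w+)",
        "replacementPattern": "#xpath(/tei:TEI/tei:text/tei:body/tei:div/tei:div[@n='$1']//tei:l[@n='$2'])",
        "type": "line",
    },
]


def resolve(reference):
    """Return (type, pointer) for the first pattern matching the whole reference."""
    for decl in REFS_DECL:
        match = re.fullmatch(decl["matchPattern"], reference)
        if match is None:
            continue
        pointer = decl["replacementPattern"]
        # Substitute $1, $2, ... with the captured groups.
        for index, value in enumerate(match.groups(), start=1):
            pointer = pointer.replace(f"${index}", value)
        return decl["type"], pointer
    raise ValueError(f"no matchPattern matches {reference!r}")
```

With this sketch, `resolve("1")` yields the "poem" entry and `resolve("1.2")` yields the "line" entry, because each reference must match a pattern in full.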

I am opening this issue because we went over this very quickly in our talks and agreed on it only in general terms.

I have the following questions:

1. Should we state the citation level explicitly? For example, the second citation is at depth 2 (lines within a poem), which the match pattern alone cannot express. In the CapiTainS draft guidelines I used the tei:corresp attribute for that, but maybe we should use something like dts:depth?
2. Should we make tei:replacementPattern optional? Or is there anything in it that we feel goes too far? (Note that at least the citation structure is important for CTS compatibility.)
3. I also think we should move @type to tei:type in the examples.
4. We could also make things more complicated (but more straightforward) and allow people to build a "graph" of the citation system (the property names were chosen to be expressive for the example):

[{
   "dts:citation_id": "1",
   "tei:matchPattern":  "(\\w+)",
   "tei:replacementPattern": "#xpath(/tei:TEI/tei:text/tei:body/tei:div/tei:div[@n='$1'])",
   "@type": "poem"
},
{
   "dts:citation_id": "2",
   "dts:citation_parent": "1",
   "tei:matchPattern":  "(\\w+)\\.(\\w+)",
   "tei:replacementPattern": "#xpath(/tei:TEI/tei:text/tei:body/tei:div/tei:div[@n='$1']//tei:l[@n='$2'])",
   "@type": "line"
}]

Note that while some of this might seem excessive, each item is a partial response to a real problem that a parser faces when trying to understand the structure of the text.

PonteIneptique commented 6 years ago

I'd also add that I would recommend changing the namespace to "http://www.tei-c.org/ns/1.0#" instead of "http://www.tei-c.org/ns/1.0"; otherwise prefix expansion produces "http://www.tei-c.org/ns/1.0matchPattern", and that's awful.

Or "http://www.tei-c.org/ns/1.0/", by the way. Both are fine with me.
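
The concatenation problem can be sketched as follows. The `expand` helper is a hypothetical stand-in for prefix expansion by plain string concatenation, which is how compact IRIs are expanded in a JSON-LD context; it is not a call into any real JSON-LD library.

```python
def expand(term, context):
    """Expand a prefixed term such as 'tei:matchPattern' by plain concatenation."""
    prefix, _, local_name = term.partition(":")
    return context[prefix] + local_name


# Namespace without a trailing separator: the local name is glued to "1.0".
without_separator = expand("tei:matchPattern", {"tei": "http://www.tei-c.org/ns/1.0"})

# Namespace ending in "#": the expanded IRI stays readable.
with_hash = expand("tei:matchPattern", {"tei": "http://www.tei-c.org/ns/1.0#"})
```

Here `without_separator` is "http://www.tei-c.org/ns/1.0matchPattern", while `with_hash` is "http://www.tei-c.org/ns/1.0#matchPattern".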

balmas commented 6 years ago

1. I think it might be good to be explicit about level, but the @corresp attribute doesn't really seem appropriate for that to me. Using dts:level might be better, if it validates. But I think we could also infer this from the position in the refsDecl list. Either way would probably be OK.
2. If tei:replacementPattern is optional, then the matchPattern seems a little meaningless to me.
3. Agreed (assuming it validates).
4. I don't understand where this would be declared.

PonteIneptique commented 6 years ago

I updated number 4 so that it's clearer.

My answer here is mostly aimed at your point 1: we actually can't infer the level, because people may have a complex citation tree rather than a citation "line". Most CTS texts have a wonderful "book->poem->line" structure, but what about:

book
- poem
  - stanza
    - line
- paragraph
  - segment

Here, the CTS model would most probably fail. You can have different match patterns (let's say poems are numbered while paragraphs match [a-zA-Z]+ as a regexp), and you would not be able to infer the level of the citation. We could do so for a nicely structured CTS object exposed through DTS, because the dot (.) denotes hierarchy, but for any other text with a complex tree we would be powerless to understand the relationship between citation nodes.
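
To illustrate why an explicit parent declaration helps, here is a minimal sketch of the tree above expressed as child-to-parent links (in the spirit of the "dts:citation_parent" idea). The `CITATION_PARENT` mapping and both helpers are hypothetical names chosen for the example.

```python
# Hypothetical citation tree from the example above: each unit maps to its parent.
CITATION_PARENT = {
    "book": None,
    "poem": "book",
    "stanza": "poem",
    "line": "stanza",
    "paragraph": "book",
    "segment": "paragraph",
}


def ancestry(unit):
    """Return the chain of units from the root down to `unit`."""
    chain = []
    while unit is not None:
        chain.append(unit)
        unit = CITATION_PARENT[unit]
    return list(reversed(chain))


def depth(unit):
    """Depth of a unit in the tree, with the root at depth 1."""
    return len(ancestry(unit))
```

In this sketch "stanza" and "segment" both sit at depth 3, yet they belong to different branches. A bare level number cannot tell them apart, whereas the parent links can.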

emmamorlock commented 6 years ago

My 2 cents:

1. Don't you think a "paragraph" could contain anything, and not just [a-zA-Z]+?
2. In question 4, isn't the example less a graph than a straightforward hierarchy (declared via "dts:citation_parent")?
3. What I have is:
   - div
     - ab with mixed content and two types of milestones:
       - lb (with @n)
       - milestone (with @unit and @corresp)

   NB: the @corresp is essential to establish a relation with the corresponding abstract textual units that are declared in msContents/msItem...

PonteIneptique commented 6 years ago

Quick answers:

1. That was only an example to show that lines could be identified by numbers while paragraphs could be identified by letters. I was just showing that we might have this kind of diversity.
2. Technically, a hierarchy is a graph, but I don't think that is the question. Yes, I definitely gave a simple example in ex. 4, but https://github.com/distributed-text-services/collection-api/issues/101#issuecomment-390110183 shows that we might have more complex ones.
3. Noted. Unfortunately, I have not seen any attribute in TEI that could cover the depth or the type of a citation scheme, and this is also an issue for the future CapiTainS guidelines.

hcayless commented 6 years ago

I'm confused about what this is meant to achieve (possibly I just haven't had enough coffee yet). Canonical References in TEI allow you to construct a custom URI referencing system, which is fine and good. But I'm missing the point of them here. Shouldn't the Reference API just tell you what sorts of references you can have? Why should the Collection API bother telling you how they're constructed?

balmas commented 6 years ago

And @hcayless's comment makes me realize I misunderstood the point of this issue. I thought we were talking about the TEI refsDecl structure... I clearly had had either not enough or too much coffee myself at that point :-)

To respond to Hugh's point, I could see it being useful for the DTS API to make this information available, for the purposes of a chain of provenance or reproducibility.

To reframe my answers to the above in the correct context:

1. dts:depth makes the most sense to me here, in the context of the DTS API.
2. If I am correct that the point of this is reproducibility, then I think replacementPattern should be present.
3. Does using tei:type make too many assumptions about the textual markup? What if the citation doesn't correspond to something that was identified that way?
4. The graph approach is tempting, but I'm a little worried it would increase the complexity of implementation.

PonteIneptique commented 6 years ago

The issue with the Reference API is that it throws references at you. For example, one of the very common things I do with CTS APIs is: retrieve the text metadata -> retrieve all references at the deepest level (thanks to the text metadata) -> retrieve passages based on those results.

Right now, our system cannot support this kind of workflow, because we have no place to state how the references of the text are structured.
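
The workflow can be sketched roughly as below. Everything here is an assumption made for illustration: the `citation_levels` metadata key, the in-memory `PASSAGES` data, and the `list_references` and `get_passage` functions are hypothetical stand-ins, not actual CTS or DTS endpoints.

```python
# Hypothetical, in-memory stand-ins for the text metadata and the passages.
METADATA = {
    "citation_levels": [
        {"level": 1, "label": "poem"},
        {"level": 2, "label": "line"},
    ]
}
PASSAGES = {"1": "poem 1", "1.1": "first line", "1.2": "second line"}


def list_references(level):
    """Stand-in for a references call: identifiers whose dot-depth equals `level`."""
    return [ref for ref in PASSAGES if ref.count(".") + 1 == level]


def get_passage(reference):
    """Stand-in for a passage retrieval call."""
    return PASSAGES[reference]


def passages_at_deepest_level(metadata):
    """Metadata -> references at the deepest level -> passages."""
    deepest = max(entry["level"] for entry in metadata["citation_levels"])
    return {ref: get_passage(ref) for ref in list_references(deepest)}
```

The first step is only possible if the metadata states how deep the citation structure goes, which is the gap described above.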

hcayless commented 6 years ago

Ok. I see the point of that use case, but I don't see yet how having the Collection API give you TEI cRefPatterns helps. Maybe I'm being dense. I see the problem, but I don't see how this is a solution.

Wouldn't it be better to come up with some declarative representation of the available levels and how citations to them are constructed? Put another way, I see the point of the matchPattern, but not the replacementPattern. As a client, I don't care how you're getting the chunk of text I want, and I wouldn't care unless I wanted to grab the document and do it myself.

What about something like:

{
    "@id" : "urn:cts:latinLit:phi1103.phi001.lascivaroma-lat1",
    "@type": "Resource",
    "...": "...",
    "dts:citeStructure": [
        {
            "dts:citePattern":  "(\\w+)",
            "dts:level": 1,
            "label": "poem"
        },
        {
            "dts:citePattern":  "(\\w+)\\.(\\w+)",
            "dts:level": 2,
            "label": "line"
        }
    ]
}

Seems like IRI templates might be better for this than regex patterns though...
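
As a rough illustration of the regex variant, a client could classify a reference like this. The sketch assumes the "dts:citeStructure" shape proposed above; the `describe` helper is a hypothetical name. Note that the doubled backslashes in the JSON text decode to a single backslash in the resulting pattern.

```python
import json
import re

# The proposed structure as JSON text; "\\w" in JSON decodes to the regex "\w".
RESOURCE_JSON = r'''
{
    "dts:citeStructure": [
        {"dts:citePattern": "(\\w+)", "dts:level": 1, "label": "poem"},
        {"dts:citePattern": "(\\w+)\\.(\\w+)", "dts:level": 2, "label": "line"}
    ]
}
'''

RESOURCE = json.loads(RESOURCE_JSON)


def describe(reference, resource=RESOURCE):
    """Return (level, label) of the first citePattern matching the whole reference."""
    for entry in resource["dts:citeStructure"]:
        if re.fullmatch(entry["dts:citePattern"], reference):
            return entry["dts:level"], entry["label"]
    return None
```

The client learns the level and label of a reference without knowing anything about how the server locates the passage.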

PonteIneptique commented 6 years ago

I'd be totally for it. It's just that we talked about it being based on cRefPattern, but your proposed structure is good to me.

hcayless commented 6 years ago

I probably failed to properly think through the implications when we talked about it, but now I think it's better to just tell the client how citations are constructed than to give it implementation details it can't really use.

PonteIneptique commented 6 years ago

I am not completely sure about the use of the match pattern and replacement pattern (whatever the namespace or implementation is). On the other hand, having information about the "citation graph" structure, and metadata about it, seems important to me as well :) I think we have an agreement here, right?

jonathanrobie commented 6 years ago

How about a URI template along these lines:

 {
  "tei:replacementPattern": "#xpath(/tei:TEI/tei:text/tei:body/tei:div/tei:div[@n='{&n}'])"
 }

PonteIneptique commented 6 years ago

A recursive, graph-like description of the citation scheme:

{
    "@id" : "urn:cts:latinLit:phi1103.phi001.lascivaroma-lat1",
    "@type": "Resource",
    "...": "...",
    "dts:citeStructure": [
        {
            "label": "poem",
            "dts:citeStructure": [
                {
                    "label": "line"
                }
            ]
        }
    ]
}
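
A minimal sketch of how a client might walk such a nested structure, assuming the recursive "dts:citeStructure" shape shown above. The `cite_paths` generator is a hypothetical helper, not part of any specification.

```python
# The nested citation structure from the example above.
CITE_STRUCTURE = [
    {
        "label": "poem",
        "dts:citeStructure": [
            {"label": "line"},
        ],
    },
]


def cite_paths(structures, prefix=()):
    """Yield every citation path, as a tuple of labels, in depth-first order."""
    for node in structures:
        path = prefix + (node["label"],)
        yield path
        # Recurse into child levels; leaf nodes have no nested structure.
        yield from cite_paths(node.get("dts:citeStructure", []), path)


paths = list(cite_paths(CITE_STRUCTURE))
max_depth = max(len(path) for path in paths)
```

The nesting itself carries the hierarchy, so the maximum depth and the parent of each level can be derived without a separate level property.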

jonathanrobie commented 6 years ago

I need to understand the requirements and use case better. If I am in a client, what is the sequence of steps I take when I encounter this data, and what do I want to do with it? I assume we have to be able to handle any kind of reference the same way, supporting CTS and other references that may be quite different.

Are you looking for a way to describe the citation structure for a given resource? What do you want the client to do with it?

A set of use cases written down in this issue would be helpful.

PonteIneptique commented 6 years ago

    "dts:citeStructure": [
        {
            "dts:level": 1,
            "label": ["poem", "section"]
        },
        {
            "dts:level": 2,
            "label": ["line", "paragraph"]
        }
    ]

PonteIneptique commented 6 years ago

Three simple use cases:

- As a presenting app, I want to be able to make general decisions about how the text should be shown to the client depending on its structure. E.g., if a text is book-poem|chapter-line|paragraph, I want to show the text by poem|chapter, so at level 2.
- As a collection curator, I want to be able to specify the structure of my text (which is just another piece of metadata).
- As a corpus researcher, I want to be able to know where narrative breaks may occur, i.e. where the co-occurrence of words across passage boundaries is irrelevant (the last word of poem 1 is not a relevant co-occurrence of the first word of poem 2).

mromanello commented 6 years ago

I'd like to add a further use case from a citation-matching perspective. It derives directly from what I'm doing with the CTS API via CapiTainS resolvers to build HuCit, a knowledge base of classical texts and citable text units.

- As a citation matching system, I want to retrieve information about text structures from a DTS collection. Knowing how many hierarchical levels a given text has, and what they are, is useful information that can be exploited when resolving ambiguous references.

I give a concrete example of this use case at p. 108 of my PhD dissertation:

hcayless commented 6 years ago

I still have some misgivings about this. The example I mentioned in our last meeting was Ovid's Tristia, where you have a general structure of book, poem, line, but Book 2 is a single, almost 600-line poem. You'll note if you go to Book 2 in Perseus, that it doesn't bother to chunk it the way it does (e.g.) the Aeneid Book 1 (despite their similar length).

I understand wanting to tell a client what the levels are, but I'd want to be able to do that in a useful way. As an API client, if I were deciding how to chunk things, I could certainly do it on Book / poem for most of the Tristia, but I'd want to (maybe) do it on Book / 20-30 lines for Book 2.
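
The fallback chunking described here could be sketched as follows. The `chunk_lines` helper and the line counts used with it are illustrative assumptions only; they do not reflect the actual length of any poem.

```python
def chunk_lines(first, last, size):
    """Split the inclusive line range first..last into consecutive (start, end) chunks."""
    if size < 1:
        raise ValueError("size must be positive")
    return [
        (start, min(start + size - 1, last))
        for start in range(first, last + 1, size)
    ]


# Illustrative only: a 100-line poem shown in chunks of at most 30 lines.
chunks = chunk_lines(1, 100, 30)
```

A client can only choose between chunking by poem and chunking by line range if it knows which levels exist and how long each unit is, which is why the level information alone may not be enough.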

PonteIneptique commented 6 years ago

This is becoming more and more complicated, right? :) One option would be to allow several schemes to be declared:

    "dts:citeStructure": [
        {"@value": ["book", "poem", "line"]},
        {"@value": ["book", "line"]}
    ]

But this would definitely start to make things complicated if you have more than, say, 3 or 4 different schemes. Again, if we want full details, maybe this should be left to the Navigation endpoint?

PonteIneptique commented 6 years ago

Option I back for next week is https://github.com/distributed-text-services/specifications/issues/101#issuecomment-395418918

PonteIneptique commented 6 years ago

Action item: open a pull request based on the comment above, with citeDepth added on top of it?

PonteIneptique commented 6 years ago

Fixed in #104

distributed-text-services / specifications

What's the citation model for DTS Resource ? #101