distributed-text-services / specifications

Specifications for the DTS API
https://w3id.org/dts

Distinguish Passages, Documents, and References requests? #64

Closed balmas closed 6 years ago

balmas commented 7 years ago

From @jonathanrobie on September 13, 2017 20:36

In the current API, we distinguish /collections, which contains metadata for both documents and collections, from /documents, which contains the text of documents per se.

In a conversation today, Hugh suggested that we break this down as follows:

The following operations would be supported:

If we take this approach, we probably need a URI parameter to specify the granularity for a passage request. Do we want to support granularity that corresponds to TEI elements, linguistic units, or both? If we decide to take this approach, I will open a separate issue for that question.

Copied from original issue: distributed-text-services/distributed-text-services.github.io#9

balmas commented 7 years ago

From @PonteIneptique on September 14, 2017 7:13

We need to be careful about overlapping issues: this issue covers #3 and #5 at the same time.

Apart from the search endpoint, which I think should remain separate from documents/passages, I actually think that splitting up documents and passages is an unnecessary burden. If you have a passage keyword, then you reduce your document size, as simple as that.

If we take this approach, we probably need a URI parameter to specify the granularity for a passage request. Do we want to support granularity that corresponds to TEI elements, linguistic units, or both? If we decide to take this approach, I will open a separate issue for that question.

I honestly am not sure I understand this part. We already have, in your proposal, a specification for passage ID.

The identification system is up to the implementation: if one wishes to use word identifiers, one can. If one wishes to use XPath, one can. All in all, those identifiers should be identified through the reference passage, so a 1-to-1 link with TEI elements, for example, is absolutely unnecessary and would most probably make it harder for other people to engage with our system.

balmas commented 7 years ago

From @hcayless on September 14, 2017 12:48

My argument to @jonathanrobie was that passages actually have a different return type than documents. Even if we do restrict supported documents to TEI-only, a TEI fragment is not a TEI document. It may not even be XML, or at least not well-formed XML. So, given that they're doing different things, we should consider splitting them up.

balmas commented 7 years ago

From @jonathanrobie on September 14, 2017 12:57

By granularity, I mean this: for a given match, do you want to return the passage? the sentence? the paragraph? something else? A URI parameter could specify that.
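
As a rough illustration of the granularity idea, here is a minimal sketch of how a client might build such a request. The base URL, the `id` and `passage` parameter names, the `granularity` parameter and its values are all assumptions made for this example; nothing here has been agreed in this thread.

```python
from typing import Optional
from urllib.parse import urlencode

# Hypothetical endpoint, for illustration only.
BASE_URL = "https://example.org/dts/documents"


def passage_url(doc_id: str, passage: str, granularity: Optional[str] = None) -> str:
    """Build a passage request URL, optionally asking for a wider unit of text."""
    params = {"id": doc_id, "passage": passage}
    if granularity is not None:
        # Hypothetical parameter: e.g. "sentence" or "paragraph".
        params["granularity"] = granularity
    return f"{BASE_URL}?{urlencode(params)}"


print(passage_url("urn:example:doc1", "1.8", granularity="sentence"))
```

Whether the allowed values would name TEI elements, linguistic units, or both is exactly the open question above.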

balmas commented 7 years ago

From @PonteIneptique on September 14, 2017 12:58

Now I see the rationale behind the argument. I am not sold on it though.

I can definitely see the main reason: depending on the quoting system offered, it might be good to be able to send simplified resources (i.e. between two milestones, regardless of the structure of the XML). But the idea of XML that is not well-formed is a bit... well, let's say I am afraid of the consequences of such freedom, and I can only have nightmares for the people who will need to use the API as clients and try to parse such things. Think about old HTML websites (not XHTML) that you try to open. In Python, they even had to build a parser and a library just for this kind of resource...

I would argue that even if it does not keep the original structure of the document, the general structure of the response should at least look like a TEI document, even if it contains only an extract, i.e.:

<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
      <fileDesc>
         <titleStmt>
            <title>Title</title>
         </titleStmt>
         <publicationStmt>
            <p>Information about publication or distribution</p>
         </publicationStmt>
         <sourceDesc>
            <p>Information about the source</p>
         </sourceDesc>
      </fileDesc>
  </teiHeader>
  <text><!-- And here the fragment -->
  </text>
</TEI>

And I am not even sure the teiHeader would be required in this context.
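
A minimal sketch of what producing such a response could look like on the server side, assuming the fragment is itself a single well-formed element. The function name `wrap_fragment` is hypothetical, and the sketch only addresses well-formedness; it says nothing about whether the result validates against a TEI schema.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"


def wrap_fragment(fragment_xml: str) -> ET.Element:
    """Wrap one well-formed TEI element in a bare <TEI><text> shell (no teiHeader)."""
    # Serialize the TEI namespace as the default namespace (no prefix).
    ET.register_namespace("", TEI_NS)
    tei = ET.Element(f"{{{TEI_NS}}}TEI")
    text = ET.SubElement(tei, f"{{{TEI_NS}}}text")
    text.append(ET.fromstring(fragment_xml))
    return tei


fragment = f'<l xmlns="{TEI_NS}" n="1">Lorem ipsum</l>'
wrapped = wrap_fragment(fragment)
print(ET.tostring(wrapped, encoding="unicode"))
```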

balmas commented 7 years ago

From @PonteIneptique on September 14, 2017 12:59

@jonathanrobie Could you provide one or two clear examples (with mocked-up data) to help us understand? I am really having difficulty understanding what you mean.

balmas commented 7 years ago

From @hcayless on September 14, 2017 13:03

@PonteIneptique: If you deliver a TEI fragment between two milestones, it probably won't be well-formed unless you supply a wrapper element. I really, really dislike the idea of dressing up fragments as full documents. Not simply because it's overly verbose, but because you're actually making stuff up. You're advertising this thing as a full TEI document when it's not that at all.

balmas commented 7 years ago

From @PonteIneptique on September 14, 2017 13:06

@jonathanrobie There might be one thing that I understand: in certain types of content (translations, for example) we might want to use ranges. I.e.:

<l n="1">Lorem ipsum</l>
<l n="2">Video ipsum</l>
<l n="3">Audio ipsum</l>

could be translated as a single line by a 19th-century poet:

<l n="1-3">Lorem ipsum</l>

And we might need to indicate, somehow, that the result is a fuzzy match (i.e. it is not exactly what was asked for).
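
To make the range case concrete, here is a small sketch of how a server might resolve a requested line against the units a translation actually has, and flag when the answer is wider than the request. The `find_reference` helper and the "n" / "a-b" reference syntax are assumptions of this example, taken from the `n` attributes above.

```python
from typing import List, Optional, Tuple


def find_reference(requested: int, available: List[str]) -> Optional[Tuple[str, bool]]:
    """Return (matched_ref, is_exact) for the unit covering `requested`, or None."""
    for ref in available:
        # "2" -> ("2", "", ""), "1-3" -> ("1", "-", "3")
        start, _, end = ref.partition("-")
        low = int(start)
        high = int(end) if end else low
        if low <= requested <= high:
            # A single-line unit is an exact match; a range is a fuzzy one.
            return ref, low == high
    return None


print(find_reference(2, ["1", "2", "3"]))  # ('2', True)
print(find_reference(2, ["1-3"]))          # ('1-3', False)
```

The boolean is what a response would need to carry, in some form, so the client knows it received more than it asked for.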

balmas commented 7 years ago

From @PonteIneptique on September 14, 2017 13:09

@hcayless I do completely understand your argument, be sure of that :) And I see the positive point behind this (really I do)

But I also see the consequences for building clients, and these are terrible...

Note that

<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><!-- And here the fragment -->
  </text>
</TEI>

is not that overly verbose :)

But at least I think we should require valid XML. Are there any fragment guidelines for TEI?

balmas commented 7 years ago

From @hcayless on September 14, 2017 13:31

None whatsoever! I expect by "valid" you mean well-formed. Validity is a whole other question, but a schema is required there. There are ways to validate well-formed chunks of TEI, but the question becomes: if you're going to fake a wrapper for your result, what does that look like?

Your suggestion won't work (or at least may not be valid), because <text> has a limited content model (can't contain <p>, for example). We could wrap it in something else non-TEI, but then you can't really claim it's a TEI document. Or we could tell clients that they should wrap the results themselves before parsing. Or, we could produce a more realistic faked-up, but valid TEI result. This would require some knowledge of the individual document structure, so we know what sort of wrapper to provide, but it's possible (indeed, I've done it before). But again, I'm uncomfortable, because we've faked a document.

If I ask for line 8 of a random papyri.info document, I'll probably get something like:

<lb n="8"/>ἔτει Ἀμοῦνις Κιαλὴ καὶ <supplied reason="lost">Ὧ</supplied>ρος <expan>Ταθή<ex cert="low">μιος</ex></expan> καὶ

I'm not opposed to wrapping results like this, but we'd need to decide how to do it. We might want to propose some fragment guidelines to TEI, in fact.
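
The well-formedness problem is easy to demonstrate with a standard parser. In this sketch the `<wrapper>` element is a made-up placeholder, not a TEI element and not a proposal; it only shows that any single enclosing element is enough to make the line parse.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

# Line 8 as quoted above: an empty <lb/> followed by mixed text and elements.
LINE_8 = (
    '<lb n="8"/>ἔτει Ἀμοῦνις Κιαλὴ καὶ <supplied reason="lost">Ὧ</supplied>ρος '
    '<expan>Ταθή<ex cert="low">μιος</ex></expan> καὶ'
)


def is_well_formed(xml_text: str) -> bool:
    """Return True if the text parses as a single-rooted XML document."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


# Hypothetical wrapper element, used only to give the fragment a single root.
wrapped = f'<wrapper xmlns="{TEI_NS}">{LINE_8}</wrapper>'

print(is_well_formed(LINE_8))   # False: text follows the first (root) element
print(is_well_formed(wrapped))  # True
```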

balmas commented 7 years ago

From @hcayless on September 14, 2017 13:34

I could see proposing a new TEI element (<fragment>?) that could contain any TEI element or text. I'm sure it would be accepted if we could make a good argument for it.

balmas commented 7 years ago

From @PonteIneptique on September 14, 2017 13:35

Having lived through both the provider's problem (and headache) and the client's problem (come on, not well-formed XML???), I think your comment shows that we have a common understanding.

I know about the limitation of the text node; technically, the cost is to build a little more (i.e. basically everything can be contained in a div). But I also see the argument against fake documents (though I do not agree with it).

But I do love the idea of a <fragment> node. Really.

And then the document endpoint (which would merge document and passage) could return either a TEI document or a TEI fragment.

You might need a fragmentCollection on top of fragment, for search results. Though I am largely advocating for a JSON/XML convertible scheme for search results, with some kind of ontology if needed... Search results in JSON seem totally natural to me :)

balmas commented 7 years ago

From @jonathanrobie on September 14, 2017 13:43

I asked the TEI list if there is a standard element for this:

http://tei-l.970651.n3.nabble.com/Returning-fragments-from-TEI-documents-td4030077.html

balmas commented 7 years ago

From @jonathanrobie on September 14, 2017 13:44

Search result in JSON seems totally natural to me :)

With the corresponding XML as text? That could certainly work.

balmas commented 7 years ago

From @PonteIneptique on September 14, 2017 13:47

Search result in JSON seems totally natural to me :)

With the corresponding XML as text? That could certainly work.

Well, with whatever kind of content actually for this one. The important part is to give you the identifier of the passage/document that contains the match, way more than the XML. This is search, not retrieval :)
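
For instance, a search response along these lines would do. This is only a sketch: every field name (`query`, `hits`, `document`, `passage`, `snippet`) and every identifier is invented for the example.

```python
import json

# Hypothetical search response: identifiers first, content as an optional extra.
search_response = {
    "query": "ipsum",
    "hits": [
        {
            "document": "urn:example:doc1",
            "passage": "1",
            # The matching XML carried as plain text; could be any kind of content.
            "snippet": '<l n="1">Lorem ipsum</l>',
        },
        {"document": "urn:example:doc1", "passage": "2", "snippet": None},
    ],
}

encoded = json.dumps(search_response, ensure_ascii=False, indent=2)
decoded = json.loads(encoded)

# What the client really needs from search: where to go and retrieve the text.
identifiers = [(hit["document"], hit["passage"]) for hit in decoded["hits"]]
print(identifiers)
```

The client would then pass those identifiers to the retrieval endpoint to get the actual text.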

jeffreycwitt commented 7 years ago

I think my comment on #68 is relevant here, at least to an earlier part of the discussion about the distinction between passages and documents: See https://github.com/distributed-text-services/collection-api/issues/68#issuecomment-332840383

jonathanrobie commented 7 years ago

In today's call, we agreed on the following:

jonathanrobie commented 7 years ago

I think we need a clearly stated organizing principle that tells us the relationship between endpoints, resources, and affordances (operations you can perform on resources). Defining that relationship clearly will tell us how many endpoints we need and what each does, and will help us converge on our design.

Here are some issues I see in the current design:

To me, this is the heart of the current issue, and we cannot answer it one endpoint at a time.

PonteIneptique commented 7 years ago

My first comment would be that separating the search endpoint was decided in discussion at meeting 1.

My second remark would be the following: our objects seem to be only Metadata and Texts. But it is our reading that makes this separation. One could argue that we have one object only, Text, with derived metadata (references, catalog metadata) and with derived chunking (search hits, passage, full document).

Such a division, though, would be harmful, and we started dividing things to allow for a more result-type-oriented approach: catalog metadata, search hits, document nodes, references. Those four result types differ enough to be found in independent endpoints.

On top of that, I would like to state a little more strongly that, to me, browsing, retrieval and search are really different activities because they do not cover the same mass of information:

Such different types of objects and such different types of activities force us, for clarity, to separate what can be separated: i.e. References, TextualNodes and Collection Metadata retrieval and browsing on one side, search on the other. From a client-side perspective, such a separation is much more newcomer-friendly and allows for a clear separation of concerns.

I would like to add that, as Vincent Jolivet has told me many times (see, Vincent, I listen!), "search" is a really fuzzy word that covers many, many realities (tag search, text search; what does text search mean? are the attributes of the original source searchable text if the source is in XML?) and as such is really complex to define for all the projects that might adopt our API. On top of that, I'd add that implementing text search, even plain text search, has costs that are much, much bigger than those of content retrieval and browsing. As such, it is important that it can be separated and clearly turned off for small projects.

PonteIneptique commented 6 years ago

For cleanup reasons, we are closing old issues. Please feel free to comment if you think we should reopen.