Closed balmas closed 6 years ago
From @PonteIneptique on September 14, 2017 7:13
We need to be careful about overlapping issues: this issue takes over both #3 and #5.
Outside of the search endpoint, which I think should remain separate from documents/passages, I actually think that splitting up documents and passages is an unnecessary burden. If you have a passage keyword, then you reduce your document size, as simple as that.
> If we take this approach, we probably need a URI parameter to specify the granularity for a passage request. Do we want to support granularity that corresponds to TEI elements, linguistic units, or both? If we decide to take this approach, I will open a separate issue for that question.
I honestly am not sure I understand this part. We already have, in your proposal, a specification for passage IDs.
The identification system is up to the implementation: if one wishes to use word identifiers, one can; if one wishes to use XPath, one can. All in all, those identifiers should be resolvable through the references for a passage, so a 1-to-1 link with TEI elements, for example, is absolutely unnecessary and would most probably prove harmful for other people trying to engage with our system.
From @hcayless on September 14, 2017 12:48
My argument to @jonathanrobie was that passages actually have a different return type than documents. Even if we do restrict supported documents to TEI-only, a TEI fragment is not a TEI document. It may not even be XML, or at least not well-formed XML. So, given that they're doing different things, we should consider splitting them up.
From @jonathanrobie on September 14, 2017 12:57
By granularity, I mean this: for a given match, do you want to return the passage? the sentence? the paragraph? something else? A URI parameter could specify that.
From @PonteIneptique on September 14, 2017 12:58
Now I see the rationale behind the argument. I am not sold on it, though.
I can definitely see the main reason: depending on the citation scheme offered, it might be good to be able to send simplified resources (i.e. everything between two milestones, regardless of the structure of the XML). But the idea of not-well-formed XML is a bit... well, let's say I am afraid of the consequences of such freedom, and I can only imagine the nightmares for people who will need to use the API as clients and try to parse such things. Think about old HTML websites (not XHTML) that you try to open: in Python, they even had to build a dedicated parser and library just for this kind of resource...
I would argue that even if the response does not retain the original build of the document, at least its general shape should look like a TEI document, even if it contains only an extract, i.e.:

```xml
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>Title</title>
      </titleStmt>
      <publicationStmt>
        <p>Information about publication or distribution</p>
      </publicationStmt>
      <sourceDesc>
        <p>Information about the source</p>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
  <text><!-- And here the fragment -->
  </text>
</TEI>
```
And I am not even sure the teiHeader would be required in this context.
From @PonteIneptique on September 14, 2017 12:59
@jonathanrobie Could you provide one or two clear examples (with mocked-up data) for us to understand? I am having real difficulty understanding what you mean.
From @hcayless on September 14, 2017 13:03
@PonteIneptique: If you deliver a TEI fragment between two milestones, it probably won't be well-formed unless you supply a wrapper element. I really, really dislike the idea of dressing up fragments as full documents. Not simply because it's overly verbose, but because you're actually making stuff up. You're advertising this thing as a full TEI document when it's not that at all.
From @PonteIneptique on September 14, 2017 13:6
@jonathanrobie There might be one thing that I do understand: for certain types of content (translations, for example) we might want to use ranges. E.g.:

```xml
<l n="1">Lorem ipsum</l>
<l n="2">Video ipsum</l>
<l n="3">Audio ipsum</l>
```

could be translated into a single line by a 19th-century poet:

```xml
<l n="1-3">Lorem ipsum</l>
```

And we might need to indicate, somehow, that the result is a fuzzy match (i.e. it is not exactly what was asked for).
From @PonteIneptique on September 14, 2017 13:9
@hcayless I do completely understand your argument, be sure of that :) And I see the positive point behind it (really, I do).
But I also see the consequences for client building, and they are terrible...
Note that

```xml
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><!-- And here the fragment -->
  </text>
</TEI>
```

is not that overly verbose :)
But at least I think we should require valid XML. Are there any fragment guidelines for TEI?
From @hcayless on September 14, 2017 13:31
None whatsoever! I expect by "valid" you mean well-formed. Validity is a whole other question, and a schema is required for it. There are ways to validate well-formed chunks of TEI, but the question becomes: if you're going to fake a wrapper for your result, what does that look like?
Your suggestion won't work (or at least may not be valid), because `<text>` has a limited content model (it can't contain `<p>`, for example). We could wrap it in something else, non-TEI, but then you can't really claim it's a TEI document. Or we could tell clients that they should wrap the results themselves before parsing. Or we could produce a more realistic faked-up, but valid, TEI result. This would require some knowledge of the individual document structure, so we know what sort of wrapper to provide, but it's possible (indeed, I've done it before). But again, I'm uncomfortable, because we've faked a document.
If I ask for line 8 of a random papyri.info document, I'll probably get something like:

```xml
<lb n="8"/>ἔτει Ἀμοῦνις Κιαλὴ καὶ <supplied reason="lost">Ὧ</supplied>ρος <expan>Ταθή<ex cert="low">μιος</ex></expan> καὶ
```
I'm not opposed to wrapping results like this, but we'd need to decide how to do it. We might want to propose some fragment guidelines to TEI, in fact.
From @hcayless on September 14, 2017 13:34
I could see proposing a new TEI element (`<fragment>`?) that could contain any TEI element or text. I'm sure it would be accepted if we could make a good argument for it.
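To make the idea concrete, a hypothetical `<fragment>` wrapper around the papyri.info line quoted above might look like the sketch below. Note that no such element exists in TEI today; the name and shape here are purely illustrative:

```xml
<fragment xmlns="http://www.tei-c.org/ns/1.0">
  <lb n="8"/>ἔτει Ἀμοῦνις Κιαλὴ καὶ <supplied reason="lost">Ὧ</supplied>ρος
  <expan>Ταθή<ex cert="low">μιος</ex></expan> καὶ
</fragment>
```

The client would get a single well-formed root element it can parse directly, without the response pretending to be a complete TEI document.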
From @PonteIneptique on September 14, 2017 13:35
Having lived both the provider's problem (and headaches) and the client's problem (come on, not well-formed XML???), I think your comment shows a common understanding.
I know about the limitation of the `<text>` node; technically, the cost is just building a little more (basically, everything can be contained in a `<div>`). But I also see the argument against fake documents (though I do not adhere to it).
But I do love the `<fragment>` node idea. Really.
And then the document endpoint (which would merge document and passage) could return either a TEI document or a TEI fragment.
You might need a `fragmentCollection` on top of `fragment` for search results. Though I am largely advocating for a JSON/XML-convertible scheme for search results, with some kind of ontology if needed... Search results in JSON seem totally natural to me :)
From @jonathanrobie on September 14, 2017 13:43
I asked the TEI list if there is a standard element for this:
http://tei-l.970651.n3.nabble.com/Returning-fragments-from-TEI-documents-td4030077.html
From @jonathanrobie on September 14, 2017 13:44
> Search result in JSON seems totally natural to me :)

With the corresponding XML as text? That could certainly work.
From @PonteIneptique on September 14, 2017 13:47
> > Search result in JSON seems totally natural to me :)
>
> With the corresponding XML as text? That could certainly work.

Well, with whatever kind of content, actually. The important part is to give you the identifier of the passage/document that contains the match, much more than the XML. This is search, not retrieval :)
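For illustration only, a JSON search result along these lines could be as small as the sketch below. Every field name here is hypothetical; nothing in it is part of any agreed specification:

```json
{
  "query": "lascivi",
  "hits": [
    {
      "document": "urn:example:document-1",
      "passage": "1.8",
      "context": "… text surrounding the match …"
    }
  ]
}
```

The point is that each hit resolves to an identifier the client can feed back into the retrieval endpoint; the matched text itself is an optional convenience.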
I think my comment on #68 is relevant here, at least to an earlier part of the discussion about the distinction between passages and documents: See https://github.com/distributed-text-services/collection-api/issues/68#issuecomment-332840383
In today's call, we agreed on the following:
I think we need a clearly stated organizing principle that tells us the relationship between endpoints, resources, and affordances (operations you can perform on resources). Defining that relationship clearly will tell us how many endpoints we need and what each does, and will help us converge on our design.
Here are some issues I see in the current design:
To me, this is the heart of the current issue, and we cannot answer it one endpoint at a time.
My first comment would be that separating the search endpoint was decided after discussion at meeting 1.
My second remark would be the following: our objects seem to be only Metadata and Texts. But it is our reading that makes this separation. One could argue that we have only one object, `Text`, with derived metadata (references, catalog metadata) and with derived chunking (search hits, passage, full document).
Such a division would be harmful, though, and we started dividing things to allow for a more result-type-oriented approach: catalog metadata, search hits, document's nodes, references. Those four result types differ enough to deserve independent endpoints.
`Document's nodes` (i.e. full `Text` and `Passage`) have the specificity of being drawn from a single `Document` and being retrieved one by one. They are the product of retrieval only; this covers use cases such as:
`Collection metadata` follow their own scheme: they are potentially hierarchical, and they define objects that are completely different from any text. They allow for browsing the catalog of a repository. They cover use cases such as:
`References` are objects of their own type. They possess a document identifier, their own identifier, and potentially a type. They could include, in a later round of the API, information such as token counts and the like. As such, they are completely unrelated to the kind of content `Collection` and `Text` have. They are easily represented as a list of strings. They cover use cases such as:
`SearchHits` differ in what they are: they are neither a `Document's Node`, because they do not necessarily carry a passage identifier and, even when they do, are not 1-to-1 equal to a given passage, nor a collection of `Document's Node`s, in the sense that they could be drawn from different documents. On top of that, the activity of search and its realities might change a lot from one repository to another, most of all because this specific endpoint can be turned off / not made available. Use case: searching for `lascivi` in a repository.
On top of that, I would like to stress that, to me, browsing, retrieval and search are really different activities, because they do not cover the same mass of information:
Such different types of objects and such different types of activities force us, for clarity's sake, to separate what can be separated: i.e. References, TextualNodes and Collection Metadata (retrieval and browsing) on one side, search on the other. From a client-side perspective, such a separation is much more newcomer-proof and allows for a clear separation of concerns.
I would like to add that, as Vincent Jolivet has told me many times (see, Vincent, I listen!), "search" is a really fuzzy word that covers many realities (tag search, text search; what does text search even mean? if the original source is in XML, do its attributes count as searchable text?) and as such is really complex to define for all the projects that might adopt our API. On top of that, I'd add that implementing text search, even plain-text search, has costs that are much bigger than those of content retrieval and browsing. As such, it is important that it can be separated and clearly turned off for small projects.
As part of housekeeping, we are closing old issues. Please feel free to comment if you think we should reopen.
From @jonathanrobie on September 13, 2017 20:36
In the current API, we distinguish `/collections`, which contains metadata for both documents and collections, from `/documents`, which contains the text of documents per se. In a conversation today, Hugh suggested that we break this down as follows:

- `collections`: metadata for documents and collections
- `documents`: complete documents, which are returned as a unit
- `passages`: passages from documents, which are returned as partial documents with paging
- `references`: references associated with documents

The following operations would be supported:

- Keyword search (`kw=pair`): supported for collections. Searches metadata and returns the metadata for a document or collection, which includes the URI needed to retrieve it.
- Full-text search (`q=string`): supported for collections, documents, and passages. For collections, full text is applied to metadata, and metadata for a collection or document is returned. For documents, full text is applied to document text, and the document is returned. For passages, full text is applied to document text, and passages that match are returned (see note on granularity below).
- Paging: `next` and `prev` links in the response header.

If we take this approach, we probably need a URI parameter to specify the granularity for a passage request. Do we want to support granularity that corresponds to TEI elements, linguistic units, or both? If we decide to take this approach, I will open a separate issue for that question.
Copied from original issue: distributed-text-services/distributed-text-services.github.io#9