Closed balmas closed 6 years ago
From @PonteIneptique on September 14, 2017 7:13
We need to be careful about overlapping issues: this issue takes over both #3 and #5.
Outside of the search endpoint, which I think should remain separate from documents/passages, I actually think that splitting up documents and passages is an unnecessary burden. If you have a passage keyword, then you reduce your document size, as simple as that.
> If we take this approach, we probably need a URI parameter to specify the granularity for a passage request. Do we want to support granularity that corresponds to TEI elements, linguistic units, or both? If we decide to take this approach, I will open a separate issue for that question.
I honestly am not sure I understand this part. We already have, in your proposal, a specification for passage IDs.
The identification system is up to the implementation: if one wishes to use word identifiers, one can; if one wishes to use XPath, one can. All in all, those identifiers should be resolvable through the references for a passage, so a 1-to-1 link with TEI elements, for example, is absolutely unnecessary and would most probably prove harmful for other people trying to engage with our system.
From @hcayless on September 14, 2017 12:48
My argument to @jonathanrobie was that passages actually have a different return type than documents. Even if we do restrict supported documents to TEI-only, a TEI fragment is not a TEI document. It may not even be XML, or at least not well-formed XML. So, given that they're doing different things, we should consider splitting them up.
From @jonathanrobie on September 14, 2017 12:57
By granularity, I mean this: for a given match, do you want to return the passage? the sentence? the paragraph? something else? A URI parameter could specify that.
From @PonteIneptique on September 14, 2017 12:58
Now I see the rationale behind the argument. I am not sold on it, though.
I can definitely see the main reason: depending on the citation scheme offered, it might be good to be able to send simplified resources (i.e. everything between two milestones, regardless of the structure of the XML). But the idea of not-well-formed XML is a bit... well, let's say I am afraid of the consequences of such freedom, and I can only imagine the nightmares for people who will need to use the API as clients and try to parse such things. Think about old HTML websites (not XHTML) that you try to open: in Python, they even had to build a dedicated parser and library just for this kind of resource...
I would argue that even if the response does not retain the original build of the document, at least its general shape should look like a TEI document, even if it contains only an extract, i.e.:

```xml
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>Title</title>
      </titleStmt>
      <publicationStmt>
        <p>Information about publication or distribution</p>
      </publicationStmt>
      <sourceDesc>
        <p>Information about the source</p>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
  <text><!-- And here the fragment -->
  </text>
</TEI>
```
And I am not even sure the teiHeader would be required in this context.
From @PonteIneptique on September 14, 2017 12:59
@jonathanrobie Could you provide one or two clear examples (with mocked-up data) for us to understand? I am having real difficulty understanding what you mean.
From @hcayless on September 14, 2017 13:03
@PonteIneptique: If you deliver a TEI fragment between two milestones, it probably won't be well-formed unless you supply a wrapper element. I really, really dislike the idea of dressing up fragments as full documents. Not simply because it's overly verbose, but because you're actually making stuff up. You're advertising this thing as a full TEI document when it's not that at all.
From @PonteIneptique on September 14, 2017 13:6
@jonathanrobie There might be one thing that I do understand: for certain types of content (translations, for example) we might want to use ranges. E.g.:

```xml
<l n="1">Lorem ipsum</l>
<l n="2">Video ipsum</l>
<l n="3">Audio ipsum</l>
```

could be translated into a single line by a 19th-century poet:

```xml
<l n="1-3">Lorem ipsum</l>
```

And we might need to indicate, somehow, that the result is a fuzzy match (i.e. it is not exactly what was asked for).
From @PonteIneptique on September 14, 2017 13:9
@hcayless I do completely understand your argument, be sure of that :) And I see the positive point behind it (really, I do).
But I also see the consequences for client building, and they are terrible...
Note that

```xml
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><!-- And here the fragment -->
  </text>
</TEI>
```

is not that overly verbose :)
But at least I think we should require valid XML. Are there any fragment guidelines for TEI?
From @hcayless on September 14, 2017 13:31
None whatsoever! I expect by "valid" you mean well-formed. Validity is a whole other question, and a schema is required for it. There are ways to validate well-formed chunks of TEI, but the question becomes: if you're going to fake a wrapper for your result, what does that look like?
Your suggestion won't work (or at least may not be valid), because `<text>` has a limited content model (it can't contain `<p>`, for example). We could wrap it in something else, non-TEI, but then you can't really claim it's a TEI document. Or we could tell clients that they should wrap the results themselves before parsing. Or we could produce a more realistic faked-up, but valid, TEI result. This would require some knowledge of the individual document structure, so we know what sort of wrapper to provide, but it's possible (indeed, I've done it before). But again, I'm uncomfortable, because we've faked a document.
If I ask for line 8 of a random papyri.info document, I'll probably get something like:

```xml
<lb n="8"/>ἔτει Ἀμοῦνις Κιαλὴ καὶ <supplied reason="lost">Ὧ</supplied>ρος <expan>Ταθή<ex cert="low">μιος</ex></expan> καὶ
```
I'm not opposed to wrapping results like this, but we'd need to decide how to do it. We might want to propose some fragment guidelines to TEI, in fact.
From @hcayless on September 14, 2017 13:34
I could see proposing a new TEI element (`<fragment>`?) that could contain any TEI element or text. I'm sure it would be accepted if we could make a good argument for it.
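To make the idea concrete, a hypothetical `<fragment>` wrapper around the papyri.info line quoted above might look like the sketch below. Note that no such element exists in TEI today; the name and shape here are purely illustrative:

```xml
<fragment xmlns="http://www.tei-c.org/ns/1.0">
  <lb n="8"/>ἔτει Ἀμοῦνις Κιαλὴ καὶ <supplied reason="lost">Ὧ</supplied>ρος
  <expan>Ταθή<ex cert="low">μιος</ex></expan> καὶ
</fragment>
```

The client would get a single well-formed root element it can parse directly, without the response pretending to be a complete TEI document.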
From @PonteIneptique on September 14, 2017 13:35
Having lived both the provider's problem (and headaches) and the client's problem (come on, not well-formed XML???), I think your comment shows a common understanding.
I know about the limitation of the `<text>` node; technically, the cost is just building a little more (basically, everything can be contained in a `<div>`). But I also see the argument against fake documents (though I do not adhere to it).
But I do love the `<fragment>` node idea. Really.
And then the document endpoint (which would merge document and passage) could return either a TEI document or a TEI fragment.
You might need a `fragmentCollection` on top of `fragment` for search results. Though I am largely advocating for a JSON/XML-convertible scheme for search results, with some kind of ontology if needed... Search results in JSON seem totally natural to me :)
From @jonathanrobie on September 14, 2017 13:43
I asked the TEI list if there is a standard element for this:
http://tei-l.970651.n3.nabble.com/Returning-fragments-from-TEI-documents-td4030077.html
From @jonathanrobie on September 14, 2017 13:44
> Search result in JSON seems totally natural to me :)

With the corresponding XML as text? That could certainly work.
From @PonteIneptique on September 14, 2017 13:47
> > Search result in JSON seems totally natural to me :)
>
> With the corresponding XML as text? That could certainly work.

Well, with whatever kind of content, actually. The important part is to give you the identifier of the passage/document that contains the match, much more than the XML. This is search, not retrieval :)
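For illustration only, a JSON search result along these lines could be as small as the sketch below. Every field name here is hypothetical; nothing in it is part of any agreed specification:

```json
{
  "query": "lascivi",
  "hits": [
    {
      "document": "urn:example:document-1",
      "passage": "1.8",
      "context": "… text surrounding the match …"
    }
  ]
}
```

The point is that each hit resolves to an identifier the client can feed back into the retrieval endpoint; the matched text itself is an optional convenience.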
I think my comment on #68 is relevant here, at least to an earlier part of the discussion about the distinction between passages and documents: See https://github.com/distributed-text-services/collection-api/issues/68#issuecomment-332840383
In today's call, we agreed on the following:
I think we need a clearly stated organizing principle that tells us the relationship between endpoints, resources, and affordances (operations you can perform on resources). Defining that relationship clearly will tell us how many endpoints we need and what each does, and will help us converge on our design.
Here are some issues I see in the current design:
To me, this is the heart of the current issue, and we cannot answer it one endpoint at a time.
My first comment would be that separating the search endpoint was decided after discussion at meeting 1.
My second remark would be the following: our objects seem to be only Metadata and Texts. But it is our reading that makes this separation. One could argue that we have only one object, `Text`, with derived metadata (references, catalog metadata) and with derived chunking (search hits, passage, full document).
Such a division would be harmful, though, and we started dividing things to allow for a more result-type-oriented approach: catalog metadata, search hits, document's nodes, references. Those four result types differ enough to deserve independent endpoints.
`Document's nodes` (i.e. full `Text` and `Passage`) have the specificity of being drawn from a single `Document` and being retrieved one by one. They are the product of retrieval only; this covers use cases such as:
`Collection metadata` follow their own scheme: they are potentially hierarchical, and they define objects that are completely different from any text. They allow for browsing the catalog of a repository. They cover use cases such as:
`References` are objects of their own type. They possess a document identifier, their own identifier, and potentially a type. They could include, in a later round of the API, information such as token counts and the like. As such, they are completely unrelated to the kind of content `Collection` and `Text` have. They are easily represented as a list of strings. They cover use cases such as:
`SearchHits` differ in what they are: they are neither a `Document's Node`, because they do not necessarily carry a passage identifier and, even when they do, are not 1-to-1 equal to a given passage, nor a collection of `Document's Node`s, in the sense that they could be drawn from different documents. On top of that, the activity of search and its realities might change a lot from one repository to another, most of all because this specific endpoint can be turned off / not made available. Use case: searching for `lascivi` in a repository.
On top of that, I would like to stress that, to me, browsing, retrieval and search are really different activities, because they do not cover the same mass of information:
Such different types of objects and such different types of activities force us, for clarity's sake, to separate what can be separated: i.e. References, TextualNodes and Collection Metadata (retrieval and browsing) on one side, search on the other. From a client-side perspective, such a separation is much more newcomer-proof and allows for a clear separation of concerns.
I would like to add that, as Vincent Jolivet has told me many times (see, Vincent, I listen!), "search" is a really fuzzy word that covers many realities (tag search, text search; what does text search even mean? if the original source is in XML, do its attributes count as searchable text?) and as such is really complex to define for all the projects that might adopt our API. On top of that, I'd add that implementing text search, even plain-text search, has costs that are much bigger than those of content retrieval and browsing. As such, it is important that it can be separated and clearly turned off for small projects.
As part of housekeeping, we are closing old issues. Please feel free to comment if you think we should reopen.
From @jonathanrobie on September 13, 2017 20:36
In the current API, we distinguish `/collections`, which contains metadata for both documents and collections, from `/documents`, which contains the text of documents per se. In a conversation today, Hugh suggested that we break this down as follows:

- `collections`: metadata for documents and collections
- `documents`: complete documents, which are returned as a unit
- `passages`: passages from documents, which are returned as partial documents with paging
- `references`: references associated with documents

The following operations would be supported:

- Keyword search (`kw=pair`): supported for collections. Searches metadata and returns the metadata for a document or collection, which includes the URI needed to retrieve it.
- Full-text search (`q=string`): supported for collections, documents, and passages. For collections, full text is applied to metadata, and metadata for a collection or document is returned. For documents, full text is applied to document text, and the document is returned. For passages, full text is applied to document text, and passages that match are returned (see note on granularity below).
- Paging: `next` and `prev` links in the response header.

If we take this approach, we probably need a URI parameter to specify the granularity for a passage request. Do we want to support granularity that corresponds to TEI elements, linguistic units, or both? If we decide to take this approach, I will open a separate issue for that question.
Copied from original issue: distributed-text-services/distributed-text-services.github.io#9