Closed TEITechnicalCouncil closed 4 years ago
This issue was originally assigned to SF user: stadlerpeter Current user is: peterstadler
TEI already provides many elements for adding various kinds of standoff annotations (<link>, <certainty>, <join>, <alt>, <fs>, etc.). It doesn't provide any particular place for storing all such annotations, though this has been proposed at various times (in the days of SGML there was a proposal for something called a "LinkDataBlock" or <ldb>, which I quite liked). I think the basic question would be: what advantage is there in creating such a special block? What does it provide that simply putting a <div type="links"> inside the <front>, <body> or <back> doesn't? What use cases are there?
Original comment by: @lb42
The massive arguments against <div type="links"> inside the <front>, <body> or <back> are that
a) it relies on assumptions about values of @type, which must be entirely abhorrent to us. We cannot expect processors to "just know" things like that. "No Magic Here" must be our mantra.
b) a <div> is a "subdivision of the front, body, or back of a text", NOT an arbitrary container as HTML's <div> is. This bunch of standoff stuff is not a subdivision of the text we are encoding.
If you add a <div> full of links, and then ask "so how many subdivisions of the text are there", the answer will be a spurious 1 more than expected.
So I am much in favour of the entirely unambiguous freestanding container for this stuff. It costs us nothing, makes life much easier for processors, and provides part of the much-needed better guidance and support for standoff-ish people.
Original comment by: @sebastianrahtz
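To make argument (b) concrete, here is a minimal sketch contrasting the two options. The xml:id values, the chapter divisions and the link are invented for illustration, and <standoff> is only the name under discussion in this thread, not an element TEI defined at the time. The two variants are wrapped in a <teiCorpus> purely so that the block is a single well-formed document.

```xml
<teiCorpus xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><!-- corpus-level header, elided --></teiHeader>

  <!-- Option A: links stored in a <div type="links">.
       A naive count of <div> elements gives 3, although the text has 2 chapters,
       and a processor must "just know" that type="links" is special. -->
  <TEI>
    <teiHeader><!-- elided --></teiHeader>
    <text>
      <body>
        <div type="chapter"><p xml:id="a.p1">...</p></div>
        <div type="chapter"><p xml:id="a.p2">...</p></div>
      </body>
      <back>
        <div type="links">
          <linkGrp>
            <link target="#a.p1 #a.p2"/>
          </linkGrp>
        </div>
      </back>
    </text>
  </TEI>

  <!-- Option B: the same link in a freestanding (hypothetical) container.
       The <div> count is now 2, matching the structure of the text. -->
  <TEI>
    <teiHeader><!-- elided --></teiHeader>
    <standoff>
      <linkGrp>
        <link target="#b.p1 #b.p2"/>
      </linkGrp>
    </standoff>
    <text>
      <body>
        <div type="chapter"><p xml:id="b.p1">...</p></div>
        <div type="chapter"><p xml:id="b.p2">...</p></div>
      </body>
    </text>
  </TEI>
</teiCorpus>
```

Option B is not valid against the TEI schema as it stood when this was discussed; it only shows where the container would sit.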
I would be also against the idea of having <div type="..."> as a place for the stand-off annotations. The stand-off annotations are becoming more and more an essential piece of the encoding of any piece of textual information. Following the same philosophy as used for <facsimile> and for <sourceDoc>, the stand-off annotations (that are not a piece of the text themselves) must be stored in a "separate" place, different than the text. This makes much more clear the nature of the information and helps to encode and process it.
Original comment by: sf_user_posejavier
I tend to be with Sebastian and Javier and would definitely support the introduction of a new element. It is in line with similar mechanisms for embedding representations external to the text proper, and it would bring so much fresh air for corpus linguistics people constantly striving to find a decent solution as to where to put such data.
Original comment by: @laurentromary
OK, I agree that Sebastian's arguments are persuasive. Do we want this element to be another sibling of <text>, or would it be more plausible to put it inside <encodingDesc> or elsewhere in the <teiHeader>?
Original comment by: @lb42
I see this as a child of text, since it is not metadata but an additional layer on the data represented in the body (a little like a facsimile is preliminary data to the transcribed content).
Original comment by: @laurentromary
I would tend to put it as a new element between the <teiHeader> and the <text>, similarly to the <facsimile> or <sourceDoc>, for the following reasons:
1) I wouldn't include the stand-off annotations as part of the teiHeader, because the teiHeader must supply the descriptive and declarative information making up an electronic title page prefixed to every TEI-conformant text. The stand-off annotations comprise "extra" information, like the annotations themselves, that is not part of the text being described or metadata associated with it. Therefore, in order to clarify this difference, I would suggest putting the annotations outside the <teiHeader>.
2) I wouldn't include the stand-off annotations as part of the text, because the <text> contains a single text of any kind, whether unitary or composite, for example a poem or drama, a collection of essays, a novel, a dictionary, or a corpus sample. The stand-off annotations are text, but not the text being encoded; they are metadata and text related to the text being encoded. Therefore, in order to clarify this difference, I would suggest putting the annotations outside the <text>.
Taking these reasons into account, I tend to think that the best place for the standoff annotations (metadata + text associated with the text being encoded) would be in a new "area" between the <teiHeader> and the <text>. This is similar to the idea of, for example, <facsimile>, which contains information that is neither directly the text being encoded nor its metadata, but a new "object" related to said text, i.e. the information about the facsimile of the text.
Original comment by: sf_user_posejavier
For the record. I meant like Javier: an element similar to <facsimile>, hence between header and text, namely a member of model.resourceLike
Original comment by: @laurentromary
The Council meeting of 2012-09 agreed to the underlying request, to create this sibling of <teiHeader>; but without a decision about what to call it (<standoff>?) or what the content model should be. The latter needs a working party to agree a detailed spec.
Original comment by: @sebastianrahtz
<annotations>? The thing should have a very simple content model based on a model class, so that external vocabularies can be included easily as a customization. From a TEI internal point of view we should have a couple of examples where we gather typical annotation examples from the guidelines (spanGrp's etc.)
Original comment by: @laurentromary
Hi Sebastian (or the responsible person), regarding the working party to agree a detailed specification: would it be possible for me to take part in such a working team? I would be very interested. Regards,
Original comment by: sf_user_posejavier
Yes indeed, this will be an open group, I think. We didn't decide at the meeting who would convene it.
Original comment by: @sebastianrahtz
Thanks!!! Then I will wait for feedback about the final composition of the team. Do you have an idea how long it will take to build such a group and start working?
Original comment by: sf_user_posejavier
Dear Javier, thanks for bringing the idea to the fore, I should have done that after publishing the paper, but waited for a "good moment", i.e. for an official opening of a LingSIG space at Sourceforge, but it's much better to have run this across the Council earlier, and get the green light.
In a broad sense, the group that you might want to join to elaborate on the content of <standOff> [1] is the LingSIG:
http://wiki.tei-c.org/index.php/SIG:TEI_for_Linguists
I will update the page with new info, at the latest during the upcoming TEI Conference, possibly earlier.
Original comment by: @bansp
As for naming, I vaguely recall that we may have decided to go for <standoff> (no camel case), after Sebastian pointed out that it functions as a single word.
I have now become convinced that I must have applied a lot of wishful thinking when I interpreted our decision as "go ahead and put it in, we'll worry about the exact content later". Now I believe that I may have been the only one who was prepared to see this in the upcoming release, possibly because others were actually using their brains.
OK, so now I interpret our decision as, basically, "take this to LingSIG, possibly under the supervision of the Council group enumerated in the minutes1". I will -- after tomorrow's release, I'll open the LingSIG space on SF.
(Everyone OK with that?)
Original comment by: @bansp
Independently of how fast LingSIG will move ahead with this, I have the feeling that we should implement this <standoff> step by step and be very pragmatic. For a start I would just define <standoff> with a simple content model based on a class model.standoffPart. The next step is to feed the class with the TEI elements that are relevant there, namely things like spanGrp, interpGrp, linkGrp. Then see how the community implements this and requires further content (prediction: we'll have to deal with internal organisation of the thing, à la <div>; I promise a good can of worms, but not a reason to move ahead now).
Original comment by: @laurentromary
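The two steps described above could be sketched in ODD roughly as follows. This is an illustration under stated assumptions, not an agreed specification: the names <standoff>, model.standoffPart and model.resourceLike come from this thread, while the schema ident, the module list and the membership of att.global are assumptions made here.

```xml
<!-- Sketch only. The ident "tei_standoff_sketch" and the module selection
     are assumptions for illustration. -->
<schemaSpec xmlns="http://www.tei-c.org/ns/1.0" ident="tei_standoff_sketch" start="TEI">
  <moduleRef key="tei"/>
  <moduleRef key="core"/>
  <moduleRef key="header"/>
  <moduleRef key="textstructure"/>
  <moduleRef key="analysis"/>
  <moduleRef key="linking"/>

  <!-- Step 1a: a new, initially empty model class. -->
  <classSpec ident="model.standoffPart" type="model" mode="add">
    <desc>groups elements that may appear inside the proposed standoff container</desc>
  </classSpec>

  <!-- Step 1b: the container itself, whose content is just that class.
       Membership of model.resourceLike is what would place it alongside
       facsimile and sourceDoc, between the header and the text. -->
  <elementSpec ident="standoff" mode="add">
    <desc>(hypothetical) container for standoff annotations</desc>
    <classes>
      <memberOf key="att.global"/>
      <memberOf key="model.resourceLike"/>
    </classes>
    <content>
      <classRef key="model.standoffPart" minOccurs="1" maxOccurs="unbounded"/>
    </content>
  </elementSpec>

  <!-- Step 2: feed the class with existing elements. -->
  <elementSpec ident="spanGrp" module="analysis" mode="change">
    <classes mode="change">
      <memberOf key="model.standoffPart" mode="add"/>
    </classes>
  </elementSpec>
  <elementSpec ident="interpGrp" module="analysis" mode="change">
    <classes mode="change">
      <memberOf key="model.standoffPart" mode="add"/>
    </classes>
  </elementSpec>
  <elementSpec ident="linkGrp" module="linking" mode="change">
    <classes mode="change">
      <memberOf key="model.standoffPart" mode="add"/>
    </classes>
  </elementSpec>
</schemaSpec>
```

Keeping the content model down to a single class reference is what would let a customization add external vocabularies later by declaring further members of model.standoffPart, without touching <standoff> itself.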
My only concern is putting this can of worms open into the upcoming release. Can't recall anything like this done before (though I bet you might!).
Original comment by: @bansp
It will not be a can of worms at this stage and we badly need the mechanism. I would suggest implementing it for the next release as suggested in my previous post. General agreement?
Original comment by: @laurentromary
But a single element is not a mechanism...
I'm going to open LingSIG space on SF after this release, basically copying the tagged branch, to be experimented on. How about focusing on the mechanism there, first, and proposing a sketch of the contents at the next F2F (which I may be present at or not, depending on the elections, but will still inform as a LingSIG convener).
For one thing, I very strongly disagree with Javier's idea of putting metadata into <annotation> located under <standoff>. This is what the header is for...
(Incidentally, I'm really glad you're saying that we need this badly, would you read the paper, too? I'm citing Nancy and you like crazy there, you're gonna like it, I'm calling you there my favourite French angel, for example)
Original comment by: @bansp
Hi Piotr, regarding your last comment: what type of metadata do you not want to have under <standoff>? What are you thinking of having in <standoff>? As far as I understood, the annotation should have information like the pointer to what it refers to, the data of the annotation, and other information like author, date... I guess (correct me if I'm wrong) that you are of the opinion that information like author, date, and possibly other items (?) that you refer to as metadata should not be under <standoff>. If this is the case, I don't agree. As far as I understand, the <header> supplies the descriptive and declarative information making up an electronic title page prefixed to every TEI-conformant text. This element should contain all the metadata referring to the text itself. Now, the <annotation> is another kind of "object" related to the text (like <facsimile> in some sense), and (in my opinion) it would be better to keep the metadata associated with each one separate. If we put the metadata in the <header> we will end up mixing different information (metadata) under the same structure, which can be very confusing. Since the metadata of the annotations is basically author, date and a couple more elements, I would be more in favour of keeping it all encapsulated under the annotation.
Original comment by: sf_user_posejavier
Hi Javier (and Laurent),
You're making my point, in a way: we're still far from a uniform vision of this. Part of the issue is that we are making different initial assumptions. In the article that Javier has quoted, my starting point (well, one of them) is simplifying the overall handling of annotation documents, by allowing <teiHeader, standoff> structures ('<>' for an ordered pair, this time). I mentioned that, for full flexibility and parallelism, we could also allow for <teiHeader, standoff, text>.
Your concern regarding the header is partly valid. I don't see the problem with a 'doubled' author information as forcing a solution whereby some formal metadata is squeezed below the header -- one solution could be: if there's a conflict, keep them separated and linked virtually, in a fully standoff manner (after all, standoff was a solution to, among others, information container overlap).
This is just a sample of the kind of discussion that we may have on this issue, and the possible compromises that we can come up with. Which is, as I said at the very beginning of this note, something that I see as a strong argument against implementing this fast and worrying later.
I don't buy the argument that "we need it NOW". We have needed it NOW for the past, roughly counting, 17 years -- since the CES, if not before. Well, we will have it in a few days, in the LingSIG part of SF, open to experimentation and discussion. I believe that this does mean progress, without putting the entire architecture at risk, also from the point of view of the community's response to the Council's doings.
Original comment by: @bansp
For the content model of <standoff>, don't forget the recently-added <listApp>.
Original comment by: @martindholmes
Hi, has the Working Group been created? How can I join it? Is there a mailing list?
Original comment by: sf_user_posejavier
I entirely agree that this new element doesn't belong inside <text>. However, I am less convinced that it doesn't belong inside <teiHeader>. Javier says above "The stand-off annotations comprise extra information, ... that are not part of the text being described or metadata associated to it". But surely that extra information is precisely "metadata" -- information about the text? And we already have a lot of "standoffish" elements in the header, which are nothing to do with the "titlepage" aspects, e.g. <particDesc>, <listPlace> etc. I think there's a big difference between things like <facsimile> or <sourceDoc>, which are different views of the same textual object, and the header, which is at a different "meta" level.
Original comment by: @lb42
I don't think ancillary textual content is the same thing as metadata at all. Editorial annotations, prosopographical information etc. is not part of the core source document, but it is part of the text in another sense. It would not normally be put in a library catalogue, and it would traditionally be printed either on the page (as footnotes etc.) or in appendices. So I don't think it belongs in teiHeader.
Original comment by: @martindholmes
As I already indicated, I think that the annotations shouldn't be part of the teiHeader. It is important to establish a conceptual distinction between the metadata of the text (i.e. information about the core nature of the source text) and the annotations (i.e. added information NOT "naturally" linked to the original text, which has been created, probably later on, with some specific meaning not directly related to the specific nature of the text). In a way the annotations could be seen as "post-it" markers that provide information about the source text or parts of it. It would be somewhat strange to put the information from these post-its in the teiHeader of the source document (!!!). Following this simile, one could think of the annotations as very small document-like objects linked to the source text. Now, one interesting question would be: when does an "annotation" have enough substance to be considered an independent document? This issue could be clarified later on. For the moment, and in view of the previous reasons, I think that the annotations shouldn't be part of the teiHeader.
Original comment by: sf_user_posejavier
A related question in my mind is that of other kinds of descriptive data or annotation, such as RDF triples defining properties of the text content in some ontology. Would those be included within the proposed element?
[feature-requests:#378] http://sourceforge.net/p/tei/feature-requests/378/ Encoding of Standoff annotations
Status: open Labels: TEI: New or Changed Element Created: Sun Aug 26, 2012 10:01 PM UTC by Javier Pose Last Updated: Tue Jun 18, 2013 12:35 PM UTC Owner: Piotr Banski
Original comment by: @lb42
Hello, I have been thinking these days about a possible structure for the "standoff" element and a general framework for encoding standoff annotations in TEI.
In order to have a more clear proposal, I wrote a working document explaining in detail a proposed structure for encoding standoff annotations in
Regards,
Original comment by: sf_user_posejavier
Can I suggest you publish this more widely, Javier, and tell people on TEI-L about it? this really deserves careful reading by many people.
Original comment by: @sebastianrahtz
Got a notification of Sebastian's post, but not of Javier's. Earlier today, I received Javier's document and I find it impressive. We'll try to open it for discussion from the SIG pages, if that is OK with everyone.
Original comment by: @bansp
Thanks Javier for your very detailed description of the proposal. I have a couple of immediate responses to it:
- <standoff>. There's no reason to abbreviate, because this is not an element that's going to crop up dozens of times in a document, and it's really not clear what <stf> might mean if you don't already know.
- <teiHeader> inside <teiCorpus>. I would imagine that a lot of the kind of data appearing in this element would be applicable to all the documents in a corpus.
Original comment by: @martindholmes
Hi Martin, regarding your comments:
Original comment by: sf_user_posejavier
(Resetting the priority back to "5", with apologies for the incident, while pointing my finger at the SF maintainers.)
Original comment by: @bansp
There is going to be a meeting devoted to this issue, in Jan/Feb 2014 in Berlin, with, minimally, Javier, Laurent and Piotr, hopefully also Andreas Witt, hopefully a Council representative, and surely several other colleagues interested in the topic from various angles.
The Council is obviously going to be informed about the results of that meeting, either via its representative, or with a report from us. We hope that these results will be taken into consideration when the content of <standOff> (or <standoff>, or whatever it ends up called) is decided on.
Original comment by: @bansp
Some quick comments …
I very much like the idea of a new linked-data-block kind of container element, although I'm not sure that <standoff> is sufficiently generic. One can imagine that a lot of useful stuff would get tucked into this space. The already mentioned <spanGrp>, <interpGrp>, <linkGrp>, of course; but also contextual information (<listPerson>, <listPlace>, etc.), specialized annotations (<listChange>, <witDetail>), and “phantom” text (e.g., <castItem>s that do not appear in the source text, the words spelled out by an acrostic).
I mildly prefer making this new container an optional child of <text> before <front>, but could easily be talked into the “between <teiHeader> and <text>” idea. I shy away from putting it in the <teiHeader>.
I have not figured out what the difference is between the suggested <annotation> element and the TEI <note> element. At first blush, they look the same.
I have not paid attention to the OAC for some time now, but we should take a peek at what they're doing and also brush up on XStandoff before settling on anything.
Original comment by: @sydb
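The comparison between the suggested <annotation> and the existing <note> can be put side by side. The sketch below is only an illustration: <standoff> and <annotation> are the proposed names from this thread, and the type value, the xml:id targets and the resp pointer are all hypothetical. What it relies on from TEI itself is that <note> already carries @target and @resp.

```xml
<standoff xmlns="http://www.tei-c.org/ns/1.0">
  <!-- Proposed form: author, date and pointer are child elements. -->
  <annotation type="commentary">
    <author>...</author>
    <date>...</date>
    <ptr target="#p1"/>
    <p>...</p>
  </annotation>

  <!-- Existing element: the pointer and the responsible party are
       attributes; there is no dedicated slot for a date. -->
  <note type="commentary" target="#p1" resp="#ed1">
    <p>...</p>
  </note>
</standoff>
```

Seen this way, the visible difference is mostly where the authorship and dating information lives, which ties back to the earlier disagreement about keeping such metadata with the annotation or in the header.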
This is just to note that we have just finished a very successful (imnsho) meeting devoted to these issues, hosted and chaired by Laurent Romary at HUB, with life-maintaining infrastructural support from Carolin Odebrecht. The results will be reported on to the Council by Peter Stadler, who has been of great help, by keeping the minutes and editing the ODD, and raising valid issues all at the same time. Members of the LingSIG, which was happy (well, eager) to provide the small-scale "institutional" umbrella for the relevant post-meeting activities, will be notified and queried as well. The same goes for the TEI community at large, in due time (which the Council will probably determine for us).
The ball is rolling and we will be hoping for the Council's comments and hints after the upcoming F2F meeting, if not earlier.
Original comment by: @bansp
From Nov. 2014 F2F: The last information we have is the minutes of the WG meeting in Berlin from PWS: https://docs.google.com/document/d/1QqJK08sff4ral0tadmNXs0j0VN5Yp-O7_mIsv2f-WjQ/edit#heading=h.odkayr64o4dg
Council is waiting for a proposal - pls link information so we can review it.
Original comment by: @emylonas
Dear all, I'm very interested in using TEI-standoff for linguistic annotation for historic data (manuscripts, 15th–17th century). I'd like to ask what's the current status on that proposal?
I put together a GitHub project with ODD spec and examples for review and discussion: https://github.com/laurentromary/stdfSpec It would be good to move ahead slowly with this.
Original comment by: @laurentromary
@PeterHinkelmanns: please provide examples that could serve as possible applications of the proposed element.
Original comment by: @laurentromary
Original comment by: @hcayless
Assigning to Peter to get this moving again.
Original comment by: @hcayless
Standoff has moved to a proposed implementation now available at https://github.com/laurentromary/stdfSpec
Original comment by: @lb42
Council working group (PFS, LB, MH, FC, SM, PWS) created an alternative proposal as the "Ann Arbor" branch at https://github.com/laurentromary/stdfSpec/tree/AnnArbor
Original comment by: @peterstadler
Council suggests @laurentromary and @peterstadler continue work on this, and to keep #539 in mind as they do to make sure these two proposals are not duplicating one another.
The annotation of documents using standoff annotations is a very useful and flexible methodology. Nevertheless, TEI does not have any specific elements for encoding this information. In most cases, the standoff annotations are stored as external TEI files linked to the text being annotated. However, this way of storing the standoff annotations is very rigid and presents numerous problems, for example for indexing or searching the corpus of documents using the information in the annotations. In these cases, it would be very useful to have the standoff annotations INSIDE the TEI documents being annotated (!!!).
Therefore, it is suggested to define a new set of TEI elements specifically dedicated to the encoding of standoff annotations.
The idea would be to store the standoff annotations between the <teiHeader> and the <text>, following the same philosophy as used for the <facsimile> and for <sourceDoc> (in some way these two elements could also be considered as a "type" of annotation).
For the standoff annotation, the structure could be:
    <TEI>
      <teiHeader> ... </teiHeader>
      <standoff>
        [information of the annotations]
      </standoff>
      <text> ... </text>
    </TEI>
This structure would provide the extra advantage of allowing information to be annotated at different TEI levels in a natural manner. So for more complicated TEI documents having different hierarchical levels, the standoff annotations could be encoded as follows:
    <teiCorpus>
      <teiHeader> ... </teiHeader>
      <TEI>
        <teiHeader> ... </teiHeader>
        <standoff> ... </standoff>
        <text> ... </text>
      </TEI>
      <TEI>
        <teiHeader> ... </teiHeader>
        <standoff> ... </standoff>
        <text> ... </text>
      </TEI>
    </teiCorpus>
This structure would also provide the extra advantage of allowing annotation not only of the text of the document, but also of the metadata at the different hierarchical levels of the TEI document.
The specific encoding of the annotations inside <standoff> could be as follows:
    <standoff>
      <annotation type="..." subtype="...">
        <author>...</author>
        <date>...</date>
        <ptr>...</ptr>
        [other data needed]
      </annotation>
    </standoff>
As a last remark it is also suggested to allow inside the <annotation> the TEI element <figure> in order to facilitate the annotation not only of textual information, but also of images and formulas.
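As an illustration of that last remark, an annotation carrying an image might look roughly like this. It is a sketch only: the type and subtype values and the target "#p1" are hypothetical, <ptr> is written with a target attribute as in existing TEI practice rather than with content, and all elided values are left as placeholders.

```xml
<standoff xmlns="http://www.tei-c.org/ns/1.0">
  <annotation type="commentary" subtype="image">
    <author>...</author>
    <date>...</date>
    <!-- hypothetical pointer to the annotated passage -->
    <ptr target="#p1"/>
    <!-- the image or formula attached to the annotation -->
    <figure>
      <graphic url="..."/>
      <figDesc>...</figDesc>
    </figure>
  </annotation>
</standoff>
```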
Conclusion: the proposed structure for the encoding of standoff annotations in TEI provides the following advantages:
- allows standoff annotations to be encoded in TEI in a natural manner, which is not the case at the moment
searching said documents
Remark: this idea has been already suggested by Piotr Bański in his article "Why TEI stand-off annotation doesn't quite work and why you might want to use it nevertheless", in http://www.balisage.net/Proceedings/vol5/html/Banski01/BalisageVol5-Banski01.html
Original comment by: sf_user_posejavier