annotation / stam

Stand-off Text Annotation Model (STAM) is a data model for stand-off-text annotation where any information on a text is represented as an annotation. This repository contains the model's full specification, extensions, schemas, examples and documentation.
https://annotation.github.io/stam/
Creative Commons Attribution Share Alike 4.0 International

TEI serializer #30

Open tenzin3 opened 1 month ago

tenzin3 commented 1 month ago

Is there a TEI serializer for stam-python?

We currently plan to develop a text API based on the DTS specification. That specification requires the response text to be in TEI format, and since we are using STAM as our data format, a built-in TEI serializer would be very helpful.

proycon commented 1 month ago

On Wed Sep 4, 2024 at 11:11 AM CEST, Tenzin Tsundue wrote:

Is there a TEI serializer for stam-python?

We currently plan to develop a text API based on the DTS specification. That specification requires the response text to be in TEI format, and since we are using STAM as our data format, a built-in TEI serializer would be very helpful.

No, a serialisation from STAM to TEI would only be feasible if a STAM model were strictly constrained to TEI's vocabulary. STAM by definition allows any kind of vocabulary and doesn't predefine anything, whereas TEI defines a lot of vocabulary. So this is something that needs to be implemented at a higher level, and it depends greatly on your use case and on how you decide to map whatever vocabulary you use to TEI. You could build a library that does this (for your particular use case) using stam-python.

We do have a tool for the reverse, mapping formats like TEI XML to STAM (via stam fromxml in stam-tools). That doesn't help you much here, but the TEI configuration there (https://github.com/annotation/stam-tools/blob/master/config/fromxml/tei.toml) may give you an idea what vocabulary mappings could look like.
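To make the mapping idea concrete, here is a minimal sketch in plain Python. It deliberately does not use the stam-python API: annotations are stand-ins represented as (begin, end, key, value) tuples over a plain text, and the TEI_MAPPING table and the serialize function are hypothetical names made up for this illustration. It assumes the annotations are already properly nested and refuses to serialise anything else.

```python
from xml.sax.saxutils import escape, quoteattr

# Hypothetical mapping from a project's own (key, value) vocabulary to a
# TEI element name plus attributes. Nothing here is predefined by STAM.
TEI_MAPPING = {
    ("structure", "paragraph"): ("p", {}),
    ("entity", "person"): ("persName", {}),
    ("style", "italic"): ("hi", {"rend": "italic"}),
}


def serialize(text, annotations, mapping=TEI_MAPPING):
    """Serialise properly nested (begin, end, key, value) spans to a TEI fragment."""
    # Outer spans first: begin ascending, then end descending.
    spans = sorted(annotations, key=lambda a: (a[0], -a[1]))
    out = []
    stack = []  # (end offset, tag) of the elements that are currently open
    pos = 0     # offset up to which the text has been written

    def close_until(limit):
        # Close every open element that ends at or before `limit`.
        nonlocal pos
        while stack and stack[-1][0] <= limit:
            end, tag = stack.pop()
            out.append(escape(text[pos:end]))
            pos = end
            out.append(f"</{tag}>")

    for begin, end, key, value in spans:
        if (key, value) not in mapping:
            raise KeyError(f"no TEI mapping for {key}={value}")
        close_until(begin)
        if stack and end > stack[-1][0]:
            raise ValueError(f"span {begin}-{end} crosses an open element")
        out.append(escape(text[pos:begin]))
        pos = begin
        tag, attrs = mapping[(key, value)]
        attr_text = "".join(f" {name}={quoteattr(val)}" for name, val in attrs.items())
        out.append(f"<{tag}{attr_text}>")
        stack.append((end, tag))

    close_until(len(text))
    out.append(escape(text[pos:]))
    return "".join(out)


print(serialize(
    "Dear Galileo, thanks.",
    [(0, 21, "structure", "paragraph"), (5, 12, "entity", "person")],
))
# prints: <p>Dear <persName>Galileo</persName>, thanks.</p>
```

In a real project the tuples would be derived from the STAM model, and the mapping table is where all the use-case-specific decisions would live.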

dirkroorda commented 1 month ago

Not to speak of how to serialize annotations whose textual targets do not form a clean hierarchy!

awagner-mainz commented 1 month ago

TL;DR: +1

More of a comment than a question/issue, I apologize:

I would also be interested in TEI XML roundtripping. So far, I know of these tools which each have their own approach or internal implementation, but they are not as generic as STAM:

I wish there was some standardization here. STAM seems to go in this direction and identifies as a "pivot model", but does not address the TEI XML serialization.

Perhaps an example of a mapping configuration and processor would be feasible for the STAM project? Or checking whether a STAM object is serializable to TEI XML as a precondition for such processing (e.g. with stam-vocab or via JSON Schema)? Hopefully at some point someone will come up with such things.

proycon commented 1 month ago

@awagner-mainz Thanks for your comment! That is most welcome. I'll have a closer look at all the links you provided.

I wish there was some standardization here. STAM seems to go in this direction and identifies as a "pivot model", but does not address the TEI XML serialization.

Perhaps an example of a mapping configuration and processor would be feasible for the STAM project? Or checking whether a STAM object is serializable to TEI XML as a precondition for such processing (e.g. with stam-vocab or via JSON Schema)? Hopefully at some point someone will come up with such things.

Yes, a mapping and processor are definitely feasible on top of STAM. I do wonder to what extent it can really be done generically, considering that one TEI often differs from the other and there are often use-case-specific issues. But a base template sounds feasible.

Stam-vocab might indeed provide the means to then test the vocabulary in a STAM model programmatically, once a mapping is defined. Aside from the vocabulary there is the overlap/hierarchy issue @DirkRoorda raised above, but that too is something that could be checked.
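A rough sketch of what such a precondition check could look like, in plain Python. It does not use stam-vocab or stam-python: annotations are again assumed to be simplified to (begin, end, key, value) tuples with half-open offsets, and tei_problems, crosses and the known_vocabulary argument are hypothetical names for this illustration.

```python
from itertools import combinations


def crosses(a, b):
    """True if two half-open spans partially overlap, i.e. neither contains the other."""
    (a0, a1), (b0, b1) = a, b
    return a0 < b0 < a1 < b1 or b0 < a0 < b1 < a1


def tei_problems(annotations, known_vocabulary):
    """Return a list of reasons why the annotations cannot be serialised as nested XML."""
    problems = []
    # Vocabulary check: every (key, value) pair must have a mapping.
    for begin, end, key, value in annotations:
        if (key, value) not in known_vocabulary:
            problems.append(f"unmapped vocabulary: {key}={value}")
    # Hierarchy check: no two spans may cross (pairwise comparison, O(n^2)).
    for a, b in combinations(annotations, 2):
        if crosses(a[:2], b[:2]):
            problems.append(f"crossing spans: {a[:2]} and {b[:2]}")
    return problems


vocabulary = {("structure", "paragraph"), ("entity", "person")}
annotations = [
    (0, 10, "structure", "paragraph"),
    (5, 15, "entity", "person"),
    (2, 4, "style", "bold"),
]
for problem in tei_problems(annotations, vocabulary):
    print(problem)
# prints:
# unmapped vocabulary: style=bold
# crossing spans: (0, 10) and (5, 15)
```

An empty result would mean the model passes both checks. The pairwise comparison is only meant for illustration and would need a smarter approach for large models.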

I can't promise I myself can get to this anytime soon, as I'd need a proper use case to justify it for my employer (the project that funded STAM thus far has reached its end, but I have every intention of continuing). But I do recognize the value (and the challenges) in a TEI serialisation.

dirkroorda commented 1 month ago

Related to this, I am pondering the following question:

We have 10,000 pages of 17th-century Italian letters, upconverted from Word + Excel to TEI (with customisations). After that we convert it to Text-Fabric and from there we use machinery to mark up ca. 12,000 entities. The entities are delivered as a TSV file and then baked into a new Text-Fabric dataset.

But what about porting the entities back to TEI? The difficult thing is that the entities may span other elements, e.g. note, hi, pb, lb and possibly even p elements.

It is hard to weave those entities in. I think we have to fragment them around intervening markup, while we can also allow some markup within the entities. This is getting messy.

Also, there is every reason to assume that future runs of entity detection will result in different entities, so we would have to regenerate all those TEI files.
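To illustrate what that fragmenting amounts to, here is a small sketch under simplifying assumptions. Both the entity and the existing elements are reduced to half-open (begin, end) character offsets in the plain text, the elements are assumed to come from well-formed XML (so they nest among themselves), and fragment is a hypothetical helper, not part of any of the tools mentioned here.

```python
def fragment(entity, elements):
    """Split an entity span at the boundaries of elements that cross it.

    Elements nested wholly inside the entity, or wholly containing it,
    need no cut: they can stay as markup within or around the entity.
    """
    begin, end = entity
    cuts = {begin, end}
    for e_begin, e_end in elements:
        if begin < e_begin < end < e_end:
            # The element starts inside the entity and ends after it.
            cuts.add(e_begin)
        elif e_begin < begin < e_end < end:
            # The element starts before the entity and ends inside it.
            cuts.add(e_end)
    points = sorted(cuts)
    return list(zip(points, points[1:]))


# An entity running from the end of one paragraph into the next,
# with a <hi> element nested inside its first part.
paragraphs = [(0, 50), (50, 100)]
hi = (42, 45)
print(fragment((40, 60), paragraphs + [hi]))
# prints: [(40, 50), (50, 60)]
```

Each fragment would then become its own element in the TEI, and the fragments would need some linking mechanism to mark them as one entity, which is part of what makes this messy.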

I prefer to leave the TEI as is, and deliver the entities as stand-off annotations to the TEI instead.

A way to do that could be to address text in a concrete TEI file by means of the stack of its containing elements, e.g.

tei[0]/text[0]/body[0]/div[2]/p[3]/_text_[4] = "foo"

where _text_[4] refers to the fourth text node of the element. You get multiple text nodes inside an element if the content is interrupted by other elements. Text nodes are always taken maximally.

Then we can address every piece of text in Text-Fabric in this way, and from there we can generate the exact places in the XML file where the entities start and end, in terms of XPath-like expressions, so that XML processing tools can find them again.
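A minimal sketch of how such addresses could be generated with Python's standard xml.etree.ElementTree. Several details are assumptions made for this illustration, since the scheme above is only outlined: indices are zero-based (as tei[0] suggests), element indices count preceding siblings with the same tag, empty text nodes are not counted, and text_node_paths is a hypothetical function name. Real TEI files are namespaced, so the tags would carry the namespace URI unless it is stripped first.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def text_node_paths(root):
    """Yield (path, text) for every non-empty text node, in document order."""

    def walk(elem, path):
        children = list(elem)
        # ElementTree keeps the text before the first child in .text and
        # the text following each child in that child's .tail, so these
        # chunks are exactly the maximal text nodes of the element.
        chunks = [elem.text] + [child.tail for child in children]
        seen = Counter()  # how many children with each tag came before
        text_index = 0
        for i, chunk in enumerate(chunks):
            if chunk:
                yield f"{path}/_text_[{text_index}]", chunk
                text_index += 1
            if i < len(children):
                child = children[i]
                yield from walk(child, f"{path}/{child.tag}[{seen[child.tag]}]")
                seen[child.tag] += 1

    yield from walk(root, f"{root.tag}[0]")


root = ET.fromstring("<p>Dear <hi>Sir</hi>, see <lb/>below</p>")
for path, text in text_node_paths(root):
    print(path, "=", repr(text))
# prints:
# p[0]/_text_[0] = 'Dear '
# p[0]/hi[0]/_text_[0] = 'Sir'
# p[0]/_text_[1] = ', see '
# p[0]/_text_[2] = 'below'
```

An entity boundary could then be expressed as such a path plus a character offset within the text node, leaving the TEI file itself untouched.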

I really do think that TEI is good for archiving, but that the results of processing TEI are not always fit to land in TEI format again, or, for that matter, in XML. It is much better to use plain text plus stand-off annotations.