annotation / stam

Stand-off Text Annotation Model (STAM) is a data model for stand-off text annotation, where any information on a text is represented as an annotation. This repository contains the model's full specification, extensions, schemas, examples and documentation.
https://annotation.github.io/stam/
Creative Commons Attribution Share Alike 4.0 International

Annotate existing XML resources? #6

Closed eduarddrenth closed 1 year ago

eduarddrenth commented 1 year ago

STAM might be very useful for existing XML resources, of which there are many. This could be left to extenders, or of course not be considered at all, leaving STAM purely text based.

Instead of converting XML to text first and using that as the basis for annotation (a conversion that I expect must often be tailor-made, given whitespace handling and the tags that need to be turned into text), you could consider using XPath for pointing, perhaps analogous to Cursor, using XpathBegin and XpathEnd.

At the moment, XML extenders of STAM must provide their own model and implementation for this part (except datasetselector).

I think making Cursor abstract would simplify adding support for XML/XPath (and perhaps more).
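
To make the XPath idea concrete, here is a minimal sketch in Python. It is not part of STAM: the class XPathSpan and the function resolve are hypothetical names, and the begin/end fields only mirror the XpathBegin/XpathEnd suggestion above. It relies on the limited XPath subset supported by the standard library's xml.etree.ElementTree.

```python
from dataclasses import dataclass
import xml.etree.ElementTree as ET


@dataclass
class XPathSpan:
    """Hypothetical selector (not in STAM): paths to the elements
    where the annotated span begins and ends."""
    begin: str
    end: str


def resolve(root, span):
    """Look up both ends of the span in an already parsed XML tree."""
    begin = root.find(span.begin)
    end = root.find(span.end)
    if begin is None or end is None:
        raise LookupError("XPath expression matched no element")
    return begin, end


root = ET.fromstring("<text><p>First</p><p>Second <hi>one</hi></p></text>")
begin, end = resolve(root, XPathSpan(begin="./p[1]", end="./p[2]/hi"))
```

The annotation then points into the XML tree itself, so no plain-text conversion is needed, but the result is no longer a pure character-offset selection.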

proycon commented 1 year ago

Such a selector would indeed be conceivable as an extension; it's deliberately not in the core model, as we aim for a minimalistic base model that contains all that is needed, but not more.

When it comes to working with XML, we envision 'untangle' software that takes the existing XML (e.g. TEI, FoLiA) as input, and converts it to plain text and fully stand-off annotations using STAM (in whatever further vocabulary one wants). As you already remark though, such conversion software is not trivial and must often be custom-made for specific formats. It does result in fully stand-off notation; otherwise you end up with a hybrid mix, which may complicate matters.
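
As a rough illustration of what such untangling involves, here is a minimal sketch in Python using only the standard library. It is not an existing STAM tool: the function name untangle and the dictionary layout of the annotations are assumptions made for this example, and a real converter would emit STAM annotations and apply format-specific whitespace rules. Here all text, including whitespace, is copied verbatim.

```python
import xml.etree.ElementTree as ET


def untangle(xml_string):
    """Split XML into plain text plus stand-off annotations.

    Each annotation is a plain dict (an assumed layout, not STAM JSON)
    with the element tag, its attributes, and the begin/end character
    offsets of the element's content in the plain text.
    """
    root = ET.fromstring(xml_string)
    chunks = []        # pieces of plain text, in document order
    annotations = []   # filled in post-order: children before parents
    position = 0       # current character offset in the plain text

    def visit(element):
        nonlocal position
        begin = position
        if element.text:
            chunks.append(element.text)
            position += len(element.text)
        for child in element:
            visit(child)
            # Tail text follows the child but belongs to this element.
            if child.tail:
                chunks.append(child.tail)
                position += len(child.tail)
        annotations.append({
            "tag": element.tag,
            "attrib": dict(element.attrib),
            "begin": begin,
            "end": position,
        })

    visit(root)
    return "".join(chunks), annotations


text, annotations = untangle('<p>Hello <hi rend="bold">world</hi>!</p>')
```

For this input, text is "Hello world!" and the hi element becomes an annotation on offsets 6 to 11. Even this toy version shows why converters tend to be format-specific: deciding which whitespace and which tags count as text is left entirely to the caller.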

Models such as W3C Web Annotations already provide a wider variety of selectors, including for XML. Similarly, they have selectors for imagery and I guess even audio (one can envision a selector on timestamps rather than character offsets). It may be tempting to make STAM more generic, but we do intend to focus on the core business of representing stand-off annotation on text, and doing that as well and as efficiently as we can.