CLARIAH / clariah-plus

This is the project planning repository for the CLARIAH-PLUS project. It groups all technical documents and discussions pertaining to CLARIAH-PLUS in a central place and should facilitate findability, transparency and project planning, for the project as a whole.

Formulate a Stand-off Text Annotation Model #137

Closed proycon closed 1 year ago

proycon commented 1 year ago

In the scope of FAIR Annotation, we started formulating a Stand-off Text Annotation Model (STAM) which could serve as the low-level and standalone foundation for various applications that deal with text annotation, and as a solution for researchers/developers to use with their own annotation paradigms (whatever they may be).

The project can be found here: https://github.com/annotation/stam (note that at this point everything is still very fresh and subject to change at any time!). It is currently being discussed internally at KNAW HuC's "Team Text", as we are in the initial conceptual phase, but of course everybody is welcome to comment and join in. Please comment in this thread on what you think of such a project (or just leave a thumbs-up or thumbs-down if you want to be brief).

I'm envisioning a text annotation model with a high-performance low-level library implementation and high reuse potential. It could act as a central pivot model in which more complex models used in CLARIAH, like FoLiA and Text Fabric, but also TEI and W3C Web Annotations, could be expressed (STAM itself is agnostic to vocabularies, after all). It might also be a suitable model to hold data after parsing and 'untangling' more complex models. Easy serialisation to W3C Web Annotations, and therefore by definition to RDF, is also one of the aims of this model (or rather of an extension of this model; RDF is not a prerequisite for STAM!).
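To make the stand-off idea a bit more concrete, here is a minimal Python sketch. It is not the STAM model or its API (neither exists in stable form yet); `TextResource`, `Annotation`, `annotate`, `text_of` and `to_web_annotation` are hypothetical names chosen purely for illustration. The text is kept untouched, annotations only point into it by character offsets and carry free-form key/value data, and one possible mapping to a W3C Web Annotation uses a `TextPositionSelector`.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Annotation:
    """Hypothetical stand-off annotation: offsets plus vocabulary-agnostic data."""
    begin: int  # inclusive character offset into the text
    end: int    # exclusive character offset into the text
    data: dict  # arbitrary key/value pairs; no vocabulary is imposed


@dataclass
class TextResource:
    """Hypothetical container: the text itself is never modified by annotations."""
    id: str
    text: str
    annotations: list = field(default_factory=list)

    def annotate(self, begin, end, **data):
        # Reject offsets that do not select a non-empty span of the text.
        if not 0 <= begin < end <= len(self.text):
            raise ValueError("offsets out of range")
        ann = Annotation(begin, end, data)
        self.annotations.append(ann)
        return ann

    def text_of(self, ann):
        # Resolve the annotation back to the text it points at.
        return self.text[ann.begin:ann.end]


def to_web_annotation(resource, ann):
    """One possible (assumed) mapping to the W3C Web Annotation JSON-LD shape."""
    return {
        "@context": "http://www.w3.org/ns/anno.jsonld",
        "type": "Annotation",
        # Each key/value pair becomes a simple textual body (an illustrative choice).
        "body": [
            {"type": "TextualBody", "value": f"{key}={value}"}
            for key, value in sorted(ann.data.items())
        ],
        "target": {
            "source": resource.id,
            "selector": {
                "type": "TextPositionSelector",
                "start": ann.begin,
                "end": ann.end,
            },
        },
    }


resource = TextResource("urn:example:hello", "Hello world")
ann = resource.annotate(6, 11, pos="noun")
print(resource.text_of(ann))
print(json.dumps(to_web_annotation(resource, ann), indent=2))
```

The point of the sketch is only that the core model needs very little: a text, offsets, and opaque data. Anything RDF-specific stays in the serialisation step, which fits the idea that RDF is not a prerequisite for the model itself.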

An initial library implementation for STAM would focus on time and memory efficiency. Once we have such a solid foundation, researchers/developers no longer have to reimplement the same thing over and over again. Instead, they can rely on a single shared library to do the heavy lifting (I'm proposing a Rust implementation with a Python binding). It should also enable some simple command-line tools to be written for basic parsing, serialisation, conversion and querying tasks.

If this project turns out to be feasible, this task might eventually subsume some of the others I formulated for FAIR Annotation, like proycon/folia#102 and #81, as there is overlap there.

I would suggest we also propose this idea to WP3 and WP6 after initial internal deliberations; WP5 may also be interested, even though our primary focus is text.