Teanga is a database and system designed for NLP with pretrained language models.
Install with pip as follows:
pip install git+https://github.com/teangaNLP/teanga2
For persistent storage, you can install the Rust version from https://github.com/teangaNLP/teanga.rs
The core idea of Teanga2 is the data model, which describes how data is represented, processed by services, and stored in Teanga2 backends.
The Teanga2 data model consists of a set of layers that provide annotations. Layers are typed into the following kinds: characters, span, seq, div and element.
An example of each layer type is given in the above image and can be represented in YAML as follows:
_meta:
    text:
        type: characters
    tokens:
        type: span
        base: text
    upos:
        type: seq
        base: tokens
        data: ["ADJ", ... "X"]
    document:
        type: div
        base: text
        default: [[0]]
    author:
        type: element
        base: document
        data: string
VC90:
    text: "Teanga2 data model"
    tokens: [[0,7], [8,12], [13,18]]
    upos: ["PROPN", "NOUN", "NOUN"]
    author: [[0, "John P. McCrae"], [0, "Somebody Else"]]
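To make the layer types concrete, here is a minimal sketch in plain Python that resolves the example document by hand. It does not use the Teanga2 library. It assumes, as the offsets in the example suggest, that span annotations are half-open [start, end) character offsets into the base layer, that a seq layer carries one value per annotation of its base, and that element annotations are [index, value] pairs. The variable names are illustrative only.

```python
# Plain-Python sketch; this is not the Teanga2 API.
doc = {
    "text": "Teanga2 data model",
    "tokens": [[0, 7], [8, 12], [13, 18]],
    "upos": ["PROPN", "NOUN", "NOUN"],
    "author": [[0, "John P. McCrae"], [0, "Somebody Else"]],
}

# span layer: each annotation is assumed to be [start, end) over the base layer (text)
token_strings = [doc["text"][start:end] for start, end in doc["tokens"]]

# seq layer: one value per annotation of the base layer (tokens)
tagged = list(zip(token_strings, doc["upos"]))

# element layer: each annotation is [index into base layer, value]
authors = [name for _index, name in doc["author"]]

print(tagged)
print(authors)
```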
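Each annotation in a Teanga2 layer can have data. The following types of data are available: none, string, an enumerated list of values, and link. A link layer lists its link types with the link_types property and names the layer it points into with the target property.
As an example, consider this (simplified) encoding of Universal Dependencies data: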
_meta:
    text:
        type: characters
    words:
        type: span
        base: text
        data: none
    upos:
        type: seq
        base: words
        data: ["DET","NOUN","VERB"]
    dep:
        type: seq
        base: words
        data: link
        link_types: ["root","nsubj","dobj","det"]
        target: dep
kOJl:
    text: "this is an example"
    words: [[0,4], [5,7], [8,10], [11,18]]
    upos: ["DET", "VERB", "DET", "NOUN"]
    dep: [[1, "nsubj"], [1, "root"], [2, "det"], [1, "dobj"]]
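As a rough illustration of how link data might be read, the sketch below turns the dep layer into labelled edges between words. It is plain Python, not the Teanga2 API. It assumes each link annotation is a [target index, link type] pair and that word offsets are half-open, so the last word is taken to span [11, 18].

```python
# Plain-Python sketch; this is not the Teanga2 API.
text = "this is an example"
words = [[0, 4], [5, 7], [8, 10], [11, 18]]  # assumed half-open character offsets
dep = [[1, "nsubj"], [1, "root"], [2, "det"], [1, "dobj"]]

word_strings = [text[start:end] for start, end in words]

# Each dep annotation is assumed to be [target index, link type]. The dep layer
# has one annotation per word, so the target index also picks out a word.
edges = [
    (word_strings[i], label, word_strings[head])
    for i, (head, label) in enumerate(dep)
]

for dependent, label, head in edges:
    print(f"{dependent} --{label}--> {head}")
```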
In addition, the metadata may define a default value for the layer. In this case, the layer does not need to be specified in the document and is assumed to take the default value. The primary use for this is in defining document layers, as above.
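The effect of a default can be sketched as follows. This is a hypothetical helper written for illustration (with_defaults is not part of Teanga2), assuming the metadata and document have already been parsed into Python dictionaries.

```python
import copy


def with_defaults(meta, doc):
    """Return a copy of doc with missing layers filled in from `default` in meta."""
    filled = dict(doc)
    for name, desc in meta.items():
        if name not in filled and isinstance(desc, dict) and "default" in desc:
            # Copy so that documents do not share one mutable default value
            filled[name] = copy.deepcopy(desc["default"])
    return filled


meta = {
    "text": {"type": "characters"},
    "document": {"type": "div", "base": "text", "default": [[0]]},
}
doc = {"text": "Teanga2 data model"}

filled = with_defaults(meta, doc)
print(filled["document"])
```

A layer that is given explicitly in the document is left untouched; only absent layers receive the default.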
The corpus model of Teanga2 consists of an (ordered) sequence of documents, each of which in turn consists of an (unordered) set of layers. In addition, there are two meta properties, _meta and _order, which give the layer descriptions and the order of the documents in the corpus.
Each document is indexed by the initial characters of the Base64 encoding of the SHA-256 hash of the UTF-8 representation of the text. The text representation consists of all character layers ordered by their key, with the key prepended to the text. Keys and text should be separated by a zero byte (\u0000).
For example, consider the following document:
en: Hello!
de: Guten Tag!
The string to encode is as follows:
from base64 import b64encode
from hashlib import sha256

# Character layers sorted by key; each key and each text is followed by a zero byte
rep = "de\x00Guten Tag!\x00en\x00Hello!\x00"
b64encode(sha256(rep.encode("utf-8")).digest()).decode("ascii")
# 'SpKHmfUJ1IkFXito5Me/ssLZ0Xx+ma5jjXTDb2qXs88='
By default, only the first 4 characters of the key are used, so the representation of this document would be
SpKH:
    en: Hello!
    de: Guten Tag!
All keys in the document should be unique and are used to check the validity of the input.
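The key derivation described above can be wrapped in a small function. The following is a sketch under stated assumptions rather than the library's own implementation: the function names are hypothetical, the caller says which layers are character layers, and the key is simply truncated to a fixed length.

```python
from base64 import b64encode
from hashlib import sha256


def text_representation(doc, character_layers):
    """Join key, zero byte, text, zero byte for each character layer, sorted by key."""
    parts = []
    for key in sorted(character_layers):
        if key in doc:
            parts.append(f"{key}\x00{doc[key]}\x00")
    return "".join(parts)


def document_id(doc, character_layers, length=4):
    """Return the first `length` characters of the Base64-encoded SHA-256 digest."""
    rep = text_representation(doc, character_layers)
    digest = sha256(rep.encode("utf-8")).digest()
    return b64encode(digest).decode("ascii")[:length]


doc = {"en": "Hello!", "de": "Guten Tag!"}
rep = text_representation(doc, ["en", "de"])
doc_id = document_id(doc, ["en", "de"])
print(repr(rep))
print(doc_id)
```

Because the layers are sorted by key before hashing, the order in which they are written in the document does not affect the resulting key.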
These keys are used by the _order meta property to give the order of documents. In many serializations this may be omitted, and the order of the keys in the document is used instead of an explicit order.
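The fallback from an explicit _order to the order of the keys can be sketched as below. The helper name is hypothetical, and the code assumes the corpus has been parsed into a dictionary that preserves key order and that every key starting with an underscore is a meta property.

```python
def document_order(corpus):
    """Return document keys in corpus order.

    Uses `_order` when present, otherwise the order in which the keys appear.
    """
    if "_order" in corpus:
        return list(corpus["_order"])
    # Assumption: keys beginning with "_" are meta properties, not documents
    return [key for key in corpus if not key.startswith("_")]


corpus = {
    "_meta": {"text": {"type": "characters"}},
    "_order": ["kOJl", "VC90"],
    "VC90": {"text": "Teanga2 data model"},
    "kOJl": {"text": "this is an example"},
}
explicit = document_order(corpus)

# Without `_order`, the order of the keys themselves is used
implicit = document_order({k: v for k, v in corpus.items() if k != "_order"})

print(explicit)
print(implicit)
```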
Teanga2 is linked-data-aware and this can be used to provide documentation to the user. This can be done with the special _uri property, which can appear at several points in the document:
_meta:
    _uri: https://jmccrae.github.io/teanga2/meta/basic.yaml
    author:
        base: document
        data: string
        _uri: https://jmccrae.github.io/teanga2/props/author.html
jjVi:
    _uri: corpus/doc1.yaml
As a property directly under _meta, this indicates that this format builds on another model and includes all the layers of that corpus in this corpus.
As a property of a layer, it indicates a description of the property. This should ideally refer to an HTML page with embedded Turtle or RDFa annotation.
If given directly in a document, this indicates that the document is stored in another file, and that file's YAML content is effectively copied in as this document.
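One possible way to expand these references is sketched below. Everything here is an assumption made for illustration: resolve_uris is a hypothetical helper, load is a caller-supplied function that returns the parsed YAML found at a URI, the referenced model is assumed to keep its layers under its own _meta key, and locally declared layers are assumed to take precedence over inherited ones. The contents of the files dictionary are invented stand-ins, not the real content at those URIs.

```python
def resolve_uris(corpus, load):
    """Expand `_uri` references under `_meta` and in documents.

    `load` is a hypothetical, caller-supplied function mapping a URI to parsed YAML.
    """
    meta = dict(corpus.get("_meta", {}))
    base_uri = meta.pop("_uri", None)
    if base_uri is not None:
        # Layers of the referenced model come first; local layers override them
        inherited = load(base_uri).get("_meta", {})
        meta = {**inherited, **meta}
    resolved = {"_meta": meta}
    for key, doc in corpus.items():
        if key == "_meta":
            continue
        if isinstance(doc, dict) and "_uri" in doc:
            resolved[key] = load(doc["_uri"])  # document stored in another file
        else:
            resolved[key] = doc
    return resolved


# Invented stand-ins for the referenced files, used instead of real I/O
files = {
    "https://jmccrae.github.io/teanga2/meta/basic.yaml": {
        "_meta": {"text": {"type": "characters"}}
    },
    "corpus/doc1.yaml": {"text": "Hello!"},
}
corpus = {
    "_meta": {
        "_uri": "https://jmccrae.github.io/teanga2/meta/basic.yaml",
        "author": {
            "base": "document",
            "data": "string",
            "_uri": "https://jmccrae.github.io/teanga2/props/author.html",
        },
    },
    "jjVi": {"_uri": "corpus/doc1.yaml"},
}
resolved = resolve_uris(corpus, files.__getitem__)
print(resolved)
```

The _uri on the author layer is left in place, since at that level it only points to documentation and does not change the data.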