TeangaNLP / teanga2

Teanga a dó
Apache License 2.0

Teanga Core and DB

Documentation

Teanga is a database and system designed for NLP with pretrained language models.

Install

Install with pip as follows:

pip install git+https://github.com/teangaNLP/teanga2

For persistent storage, you can install the Rust version from https://github.com/teangaNLP/teanga.rs

Teanga 2 Data Model

The core idea of Teanga2 is the data model, which describes how data is represented, how it is processed by services, and how it is stored in Teanga2 backends.

Layers

The Teanga2 data model consists of a set of layers that provide annotations. Each layer has one of the following types: characters, span, seq, div, or element.

Examples of Teanga 2 Image Types

An example of each layer type is given in the above image and can be represented in YAML as follows:

_meta:
    text:
        type: characters
    tokens:
        type: span
        base: text
    upos:
        type: seq
        base: tokens
        data: ["ADJ", ... "X"]
    document:
        type: div
        base: text
        default: [[0]]
    author:
        type: element
        base: document
        data: string
VC90:
    text: "Teanga2 data model"
    tokens: [[0,7], [8,12], [13,18]]
    upos: ["PROPN", "NOUN", "NOUN"]
    author: [[0, "John P. McCrae"], [0, "Somebody Else"]]
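
To make the offsets concrete, here is a minimal Python sketch that treats the document above as plain dictionaries and recovers the token strings and their tags. It does not use the Teanga API; the helper name `span_texts` is hypothetical, and the sketch assumes that span offsets are character positions with an exclusive end, which is what the example values suggest.

```python
def span_texts(text, spans):
    """Return the substring covered by each [start, end) span."""
    return [text[start:end] for start, end in spans]


# Values copied from the example document above.
doc = {
    "text": "Teanga2 data model",
    "tokens": [[0, 7], [8, 12], [13, 18]],
    "upos": ["PROPN", "NOUN", "NOUN"],
}

tokens = span_texts(doc["text"], doc["tokens"])
# Assumption: a seq layer carries one value per annotation in its base layer.
tagged = list(zip(tokens, doc["upos"]))
print(tagged)
```

Reading the layers this way shows why `upos` needs no offsets of its own: its position in the list is enough to tie each tag to a token.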

Data

Each annotation in a Teanga 2 layer can have data. The data types that appear in the examples here are none, string, link, and an enumerated list of permitted values.

As an example, consider this (simplified) encoding of Universal Dependencies data:

_meta:
  text:
    type: characters
  words:
    type: span
    base: text
    data: none
  upos:
    type: seq
    base: words
    data: ["DET","NOUN","VERB"]
  dep:
    type: seq
    base: words
    data: link
    link_types: ["root","nsubj","det","dobj"]
    target: dep
kOJl:
  text: "this is an example"
  words: [[0,4], [5,7], [8,10], [11,18]]
  upos: ["DET", "VERB", "DET", "NOUN"]
  dep: [[1, "nsubj"], [1, "root"], [3, "det"], [1, "dobj"]]
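
The relationship between a layer's `data` declaration and its values can be checked mechanically. The following Python sketch is an illustration only, not part of Teanga: the function `check_seq_layer` is hypothetical, and it assumes that a seq layer holds exactly one value per annotation in its base layer and that a list under `data` enumerates the permitted values.

```python
def check_seq_layer(meta, doc, layer):
    """Return a list of problems found in one seq layer of one document."""
    spec = meta[layer]
    values = doc[layer]
    base = doc[spec["base"]]
    problems = []
    # Assumption: a seq layer has one value per annotation in its base layer.
    if len(values) != len(base):
        problems.append(
            f"{layer}: {len(values)} values for {len(base)} {spec['base']}"
        )
    # Assumption: a list under `data` enumerates the permitted values.
    allowed = spec.get("data")
    if isinstance(allowed, list):
        for i, value in enumerate(values):
            if value not in allowed:
                problems.append(f"{layer}[{i}]: {value!r} not in {allowed}")
    return problems


meta = {
    "words": {"type": "span", "base": "text"},
    "upos": {"type": "seq", "base": "words", "data": ["DET", "NOUN", "VERB"]},
}
doc = {
    "text": "this is an example",
    "words": [[0, 4], [5, 7], [8, 10], [11, 18]],  # end offsets are exclusive
    "upos": ["DET", "VERB", "DET", "NOUN"],
}
print(check_seq_layer(meta, doc, "upos"))
```

An empty list means the layer passed both checks; each string in a non-empty list names the layer, the position, and the offending value.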

In addition, the metadata may define a default value for a layer. In this case, the layer does not need to be specified in each document; when it is absent, it is assumed to have the default value. The primary use for this is in defining document layers, as in the first example above, where the document layer has the default [[0]].
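
As a rough illustration of this rule, the Python sketch below fills in missing layers from the metadata. It works on plain dictionaries rather than the Teanga API, and the function name `with_defaults` is hypothetical.

```python
import copy


def with_defaults(meta, doc):
    """Return a copy of doc with missing layers filled from `default` in meta."""
    filled = dict(doc)
    for name, spec in meta.items():
        if name not in filled and "default" in spec:
            # Copy the default so documents do not share one mutable list.
            filled[name] = copy.deepcopy(spec["default"])
    return filled


meta = {
    "text": {"type": "characters"},
    "document": {"type": "div", "base": "text", "default": [[0]]},
}
doc = {"text": "Teanga2 data model"}
print(with_defaults(meta, doc))
```

A layer that is present in the document is left untouched, so an explicit value always wins over the default.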

Corpus Model

The corpus model of Teanga2 consists of an (ordered) sequence of documents, each of which consists of an (unordered) set of layers. In addition, there are two meta properties, _meta and _order, which give the layer descriptions and the order of the documents in the corpus.

Each document is indexed by the initial characters of the Base64 encoding of the SHA-256 hash of the UTF-8 representation of its text. The text representation consists of all character layers, ordered by their key, with each key placed before its text. Each key and each text is followed by a zero byte (\u0000). For example, consider the following document:

en: Hello!
de: Guten Tag!

The string to encode and its hash can be computed as follows:

from base64 import b64encode
from hashlib import sha256

rep = "de\x00Guten Tag!\x00en\x00Hello!\x00"
b64encode(sha256(rep.encode("utf-8")).digest()).decode("ascii")
'SpKHmfUJ1IkFXito5Me/ssLZ0Xx+ma5jjXTDb2qXs88='

By default, only the first 4 characters of the key are used, so the representation of this document would be:

SpKH:
    en: Hello!
    de: Guten Tag!
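
The procedure described above can be wrapped into a small helper. This is a sketch under the rules stated in this section, not the code used by Teanga itself; the function names `text_representation` and `document_id` and the `length` parameter are hypothetical.

```python
from base64 import b64encode
from hashlib import sha256


def text_representation(char_layers):
    """Join character layers, sorted by key, as key + NUL + text + NUL."""
    return "".join(
        f"{key}\x00{char_layers[key]}\x00" for key in sorted(char_layers)
    )


def document_id(char_layers, length=4):
    """Return the first `length` characters of the Base64 SHA-256 digest."""
    rep = text_representation(char_layers)
    digest = sha256(rep.encode("utf-8")).digest()
    return b64encode(digest).decode("ascii")[:length]


layers = {"en": "Hello!", "de": "Guten Tag!"}
print(repr(text_representation(layers)))
print(document_id(layers))
```

Because the layers are sorted by key before hashing, the order in which they are written in the file does not affect the identifier.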

All document keys should be unique. Because each key is derived from the text, it can also be used to check the validity of the input.

These keys are used by the _order meta property to give the order of the documents. Many serializations may omit _order, in which case the order in which the keys appear in the file is used instead.
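
A reader for such a file might resolve the document order along the following lines. This Python sketch assumes the corpus has already been parsed into a dictionary that preserves key order, and that every key beginning with an underscore is a meta property; the function name `document_order` is hypothetical.

```python
def document_order(corpus):
    """Return document keys in corpus order, preferring an explicit `_order`."""
    if "_order" in corpus:
        return list(corpus["_order"])
    # Fallback: the order in which keys appear, skipping meta properties.
    # Base64 text never contains "_", so document keys cannot start with it.
    return [key for key in corpus if not key.startswith("_")]


implicit = {"_meta": {}, "VC90": {}, "kOJl": {}}
explicit = {"_meta": {}, "_order": ["kOJl", "VC90"], "VC90": {}, "kOJl": {}}
print(document_order(implicit))
print(document_order(explicit))
```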

Documentation and RDF

Teanga2 is linked-data-aware, and this can be used to provide documentation to the user. This is done with the special _uri property, which can appear at several points in the document:

_meta:
    _uri: https://jmccrae.github.io/teanga2/meta/basic.yaml
    author:
        base: document
        data: string
        _uri: https://jmccrae.github.io/teanga2/props/author.html
jjVi:
    _uri: corpus/doc1.yaml

As a property directly under _meta, it indicates that this corpus builds on another model and includes all the layers of that model.

As a property of a layer, it points to a description of that layer. This should ideally refer to an HTML page with embedded Turtle or RDFa annotation.

As a property placed directly under a document key, it indicates that the document is stored in another file; the YAML content of that file is effectively copied in as this document.