proycon closed this issue 1 year ago
In order to get a feel for this API, I translated Dirk's example code from https://nbviewer.org/github/ETCBC/bhsa/blob/master/programs/stam-nu.ipynb to it. This is code to find "all phrases where the first and the last words have the same grammatical number":
First the Text Fabric code:
```python
results = []
for p in F.otype.s("phrase"):
    ws = L.d(p, otype="word")
    if len(ws) < 2:
        continue
    fi = ws[0]
    la = ws[-1]
    if F.nu.v(fi) != F.nu.v(la):
        continue
    results.append((p, fi, la))
```
Now almost the exact same structure with the new STAM API using the python binding (proof of concept, details may vary still):
```python
results = []
for phrase in store.annotations_by_data(set="someset", key="type", value="phrase", textual_order=True):
    words = phrase.annotations_by_data_in_targets(set="someset", key="type", value="word", textual_order=True)
    if len(words) < 2:
        continue
    firstword = words[0]
    lastword = words[-1]
    for data, annotation in firstword.data_about(set="someset", key="nu"):
        if lastword.test_data_about(data):
            results.append((phrase, firstword, lastword))
```
I also reformulated the pseudo-query code from one of the query proposals to the new proposed API, using Python. This is a complex query to select specific noun phrases followed by specific verb phrases within a specific context (chapter, sentence, book):
```python
for book in store.resources_by_data(set="someset", key="name",
        value=DataOperator.any(DataOperator.equals("genesis"), DataOperator.equals("exodus"))):
    for chapter in book.text_by_data(set="someset", key="type", value="chapter"):
        if chapter.test_data_about(set="someset", key="number", value=2):
            for sentence in chapter.text_by_data(set="someset", key="type", value="sentence"):
                if chapter.test_related_text(TextSelectionOperator.EMBEDS, sentence):
                    for nn in sentence.related_text(TextSelectionOperator.EMBEDS):
                        if nn.test_data_about(set="someset", key="type", value="word") and \
                           nn.test_data_about(set="someset", key="pos", value="noun") and \
                           nn.test_data_about(set="someset", key="gender", value="feminine") and \
                           nn.test_data_about(set="someset", key="number", value="singular"):
                            for vb in nn.related_text(TextSelectionOperator.PRECEDES):
                                if sentence.test_related_text(TextSelectionOperator.EMBEDS, vb) and \
                                   vb.test_data_about(set="someset", key="type", value="word") and \
                                   vb.test_data_about(set="someset", key="pos", value="verb") and \
                                   vb.test_data_about(set="someset", key="gender", value="feminine") and \
                                   vb.test_data_about(set="someset", key="number", value="plural"):
                                    yield book, chapter, sentence, nn, vb
```
I am a bit worried by the verbosity. You prefer to work with methods on annotation objects. Then you have to repeat the argument `set="someset"` all the time.

If you have an object that exposes the higher-level methods independent of the annotations, say `F`, you could say `F.setSet("someset")` before doing many calls to retrieve annotation values.
Then it would be nice if you could say:

```python
fData = F.getData(key="type")
lookup = fData.lookup
support = fData.support
targets = fData.targets
```
`lookup` is a function that, given a target `t`, delivers the value of an annotation in "someset" with key "type" and target `t`.

`support` is a function that, given a value `v`, delivers all targets `t` of annotations in "someset" with key "type" and value `v`.

`targets` is a function that, given an annotation and a value `v`, delivers all targets `t` of that annotation, provided there is an annotation in "someset" that has target `t`, key "type", and value `v`.
It is also handy to assume that `textual_order` is `True` by default.
With this, you could shorten the phrase lookup like so:
```python
F.setSet("someset")
tpData = F.getData("type")
tpSupport = tpData.support
tpTargets = tpData.targets
nuData = F.getData("nu")
nuLookup = nuData.lookup

results = []
for phrase in tpSupport("phrase"):
    words = tpTargets(phrase, "word")
    if len(words) < 2:
        continue
    firstword = words[0]
    lastword = words[-1]
    if nuLookup(firstword) != nuLookup(lastword):
        continue
    results.append((phrase, firstword, lastword))
```
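For concreteness, here is a minimal pure-Python sketch of such an `F` facade over a toy in-memory store. The store layout (annotations as `(set, key, value, target)` tuples with plain ids as targets) and all class names are illustrative assumptions, not part of any proposed STAM API; `targets` is omitted for brevity:

```python
class ToyStore:
    """Toy stand-in for an annotation store: each annotation is a
    (set, key, value, target) tuple; targets are plain ids."""
    def __init__(self, annotations):
        self.annotations = annotations


class FacadeData:
    """Bundles the lookup/support functions for one (set, key) pair."""
    def __init__(self, store, aset, key):
        self._rows = [a for a in store.annotations
                      if a[0] == aset and a[1] == key]

    def lookup(self, target):
        # value of an annotation with this set/key and the given target
        for _aset, _key, value, t in self._rows:
            if t == target:
                return value
        return None

    def support(self, value):
        # all targets of annotations with this set/key and the given value
        return [t for _aset, _key, v, t in self._rows if v == value]


class Facade:
    """The F object: fixes a set once, hands out per-key data bundles."""
    def __init__(self, store):
        self.store = store
        self._set = None

    def setSet(self, aset):
        self._set = aset

    def getData(self, key):
        return FacadeData(self.store, self._set, key)


store = ToyStore([
    ("someset", "type", "word", 1),
    ("someset", "type", "word", 2),
    ("someset", "nu", "sg", 1),
    ("someset", "nu", "pl", 2),
])
F = Facade(store)
F.setSet("someset")
print(F.getData("nu").lookup(1))          # -> sg
print(F.getData("type").support("word"))  # -> [1, 2]
```

The point of the sketch is only that fixing the set and key once turns the repeated keyword arguments into cheap closures over a pre-filtered slice of the store.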
> I am a bit worried by the verbosity. You prefer to work with methods on annotation objects. Then you have to repeat the argument `set="someset"` all the time.
Yes, the underlying idea is that you have all kinds of objects with distinct methods to travel the edges.
> If you have an object that exposes the higher-level methods independent of the annotations, say `F`, you could say `F.setSet("someset")`
That would already be possible with the proposed API: it exposes methods to travel edges in almost every direction. If you want to be invariant over the set/key, just grab a dataset and datakey instance and work from there. So there are often multiple ways of doing things with this API, which does come with the disadvantage that the API is bigger than it could be, but this flexibility should hopefully match the flexibility the model itself provides and give some freedom to the choices of the modeller:
```python
set = store.dataset("someset")
key = set.key("nu")
for phrase in key.annotations_by_data(value="phrase", textual_order=True):
    ...
```
> It is also handy to assume that `textual_order` is `True` by default.
Yeah, I'm not entirely sure how I'm going to incorporate that parameter yet. If it's gonna have an extra cost (temporary buffer allocation) I don't like doing it by default.
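One way to avoid paying that cost by default is to materialise and sort only when textual order is actually requested, and otherwise stay lazy. A sketch under assumed names, using `(begin, end)` character offsets as a stand-in for text selections and a plain dict as a stand-in for a reverse index on annotation data:

```python
def annotations_by_data(index, value, textual_order=False):
    """Toy stand-in: index maps a data value to the (begin, end)
    character-offset pairs of the annotations carrying it."""
    hits = index.get(value, ())
    if textual_order:
        # only now do we allocate a buffer and sort by text position
        return sorted(hits)
    # otherwise: lazy iteration in index order, no extra allocation
    return iter(hits)

index = {"word": [(10, 14), (0, 4), (5, 9)]}
print(annotations_by_data(index, "word", textual_order=True))
# -> [(0, 4), (5, 9), (10, 14)]
```

With this shape, callers who don't care about order never trigger the buffer allocation, which keeps the default cheap while still letting `textual_order=True` be a one-argument opt-in.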
> `lookup` is a function that, given a target `t`, delivers the value of an annotation in "someset", with key "type" and target `t`.
That'd be `t.find_data_about(set, key, value_test)`. If you already have a `DataKey` instance like in my previous example, you should be able to pass it without the set (because it will know what set it belongs to).
You can also do `key.find_data(value_test)` to get the annotation data, and then `data.annotations()` to get annotations referencing the data.
> `support` is a function that, given a value `v`, delivers all targets `t` of annotations in "someset" with key "type" and value `v`.
Depending on the type of target you're looking for, that'd be `data.annotations()`, `data.resources()`, `data.dataset()`, etc. In STAM it's harder to consider the targets a heterogeneous bunch (also due to it being implemented in a strongly typed language); usually you have to be explicit about which kind of target you want (annotations, textselections, resources, etc.).
> `targets` is a function that, given an annotation and a value `v`, delivers all targets `t` of that annotation, provided there is an annotation in "someset" that has target `t`, key "type", and value `v`.
That'd be `annotation.annotations_by_data_in_targets(set, key, value_test)` for annotations, though I suppose I also need methods for the other target types then. This one feels a bit too contrived still; not to say that it isn't a valid function, but the DIY route where it's split into two calls feels a bit more natural/understandable to me:

```python
for annotation in annotation.annotations_in_targets():
    if annotation.test_data_about(set, key, value_test):
        ...
```
The first method can be replaced with `resources()`, `datasets()`, `textselections()` for the other types.
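The two-call route composes easily. A toy sketch of building the one-call convenience from the two calls, where `ToyAnnotation` and its two methods are stand-ins for the proposed API, not actual STAM classes:

```python
class ToyAnnotation:
    """Toy stand-in for an annotation with targets and attached data."""
    def __init__(self, targets=(), data=()):
        self._targets = list(targets)
        self._data = set(data)  # (set, key, value) triples

    def annotations_in_targets(self):
        # first call: iterate the annotations this one targets
        return iter(self._targets)

    def test_data_about(self, aset, key, value):
        # second call: does any data about this annotation match?
        return (aset, key, value) in self._data


def annotations_by_data_in_targets(annotation, aset, key, value):
    """The one-call convenience, expressed as the two-call DIY route."""
    for a in annotation.annotations_in_targets():
        if a.test_data_about(aset, key, value):
            yield a


noun = ToyAnnotation(data=[("someset", "pos", "noun")])
verb = ToyAnnotation(data=[("someset", "pos", "verb")])
phrase = ToyAnnotation(targets=[noun, verb])
hits = list(annotations_by_data_in_targets(phrase, "someset", "pos", "verb"))
print(hits == [verb])  # -> True
```

Keeping the convenience as a thin generator over the two primitive calls means the primitives stay the documented surface and the one-liner adds no new semantics.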
My method naming style may be a bit more verbose than you're accustomed to, but I feel the names have to be self-documenting to a certain extent so I'd rather be a bit explicit.
Over the summer period, this API has been implemented (in git master, both for stam-rust and stam-python, but not released yet).
Released in stam-rust 0.8.0
I want to take the next step towards designing a good high-level API for STAM. In the current implementation, things have grown somewhat organically, but we've reached a stage where things are becoming cluttered or confusing if not well designed, and where some expected high-level methods are still clearly missing.
Please read my API proposal and comment here in this issue. The document is not normative for STAM itself (any implementation may decide to do things differently); STAM as such prescribes only a data model and expected functionality for implementations, but not an API.
I also want to more clearly separate the internal API in stam-rust from the higher-level API that is exposed; right now too many internals are exposed publicly in the library. This means I want to close off parts of the low-level API: such a decoupling layer allows for easier internal changes without affecting the outside world.
It does imply there's going to be a fairly big API breakage for next stam-rust and stam-python releases, but that was coming anyway because of other changes, and at this stage that is still manageable. I hope to cover most breaking changes in a single release.
The high-level API design also relates to our aim to formulate a query language (#12) and implementation thereof (annotation/stam-rust#14), because most of the methods are related to searching. The proposed API sits at one level below a full query implementation (which was already underway), but if done right, the query implementation itself becomes less urgent and can delegate a lot to the new high-level API methods.