annotation / stam

Stand-off Text Annotation Model (STAM) is a data model for stand-off-text annotation where any information on a text is represented as an annotation. This repository contains the model's full specification, extensions, schemas, examples and documentation.
https://annotation.github.io/stam/
Creative Commons Attribution Share Alike 4.0 International
17 stars 2 forks source link

Formulate a STAM Query Language #12

Closed proycon closed 9 months ago

proycon commented 1 year ago

A query language should be formulated to effectively query a STAM model. The query language will be formulated as an extension to STAM and effectively provides a higher-level interface that can be directly exposed to end-users as the primary means of interacting with a STAM model.

The query language should be able to express (non-exhaustive):

The query language should be accessible enough for (technical) researchers.

proycon commented 1 year ago

@dirkroorda I took the liberty of also assigning you in this, because I think your query language on Text Fabric provides a good starting point for the kind of functionality we want in a query language.

dirkroorda commented 1 year ago

A few additional things to think of:

dirkroorda commented 1 year ago

The logical operations pose a problem, but it depends on the concept of result.

If you can shape the result by shaping the query, what do and and or mean?

If query A collects results of shape Sa, and query B collects results of shape Sb, what does A & B mean? It will often be the empty set because of difference in shape. Likewise, A | B is likely to be the union, which will become a heterogeneous set of results, shapewise.

Maybe you need a projection operator that maps results to the text strings they contain. Then you can apply boolean logic on the projections, but you loose information.

dirkroorda commented 1 year ago

Example:

Without projection, A & B is empty. But if we imply a projection operation before doing the &, then A & B is the set of verbs that have a preceding noun and a following noun.

And the nouns themselves got lost from the results.

Likewise, A | B delivers all verbs that either have a noun preceding or following or both. And the nouns involved are also in the results.

dirkroorda commented 1 year ago

All this points into a direction where you have two layers in your query language:

The structural layer delivers structural results. The logical layer flattens the structural results and applies set theory on them.

dirkroorda commented 1 year ago

Concerning data manipulation queries: that would be useful and dangerous. Without it STAM would not be complete (self-contained).

The query language must be able to retrieve the components of annotations (bodies, targets) in order to manipulate them.

That influences what you define as the result of queries. Not only the texts that annotations target, but also the annotations themselves can be results.

proycon commented 1 year ago

Thanks for your pointers @dirkroorda , gives interesting food for thought!

proycon commented 1 year ago

I just put my plans and thoughts regarding a query language for STAM in writing here. Focus here is more on the structure of the language and its implementation than on the actual language.

Feedback much appreciated!

dirkroorda commented 1 year ago

I think your examples become instantly much more readable if you employ some conventions, e.g. omitting the types of operator, i.e. the xxx:: that preceeds things like Embed. I think in the end the language should be such that that kind of typing will be inferred.

dirkroorda commented 1 year ago

You start bottom up: given the structure of the annotation data and texts, you design primitives and constructions that can query for annotations and text. After that you intend to define a higher level language that can be translated into this.

dirkroorda commented 1 year ago

What about starting at the other end? Go back to what these texts and annotations mean to humans. Take into account the shortcuts we humans make when we express our search intentions. Then wrap a simple, pleasing language around it. And then worry how this translates to something that can be executed.

Well, in the end I think it does not matter along which road you arrive at the query language, bottom up or top down. It can be worthwhile to do both.

dirkroorda commented 1 year ago

An idea following the lines above:

you can exploit that annotations always target text in the end.

So you could say in shorthand

type=VP ]] pos=noun

This looks for annotations with body type=VP that embed annotations with body pos=noun.

And that means: the text targeted by the first annotation embeds the text targeted by the second annotation.

dirkroorda commented 1 year ago

Another thing are the quotes.

There are several vocabularies: primitives of STAM, primitives of the query language itself, names of annotation sets, and values that annotation bodies can take, which often have their own vocabularies.

Normally, you quote values outside the vocabulary, and you do not quote values inside the vocabulary.

But with so many vocabularies that becomes tedious. And end users tend to not make very conscious distinctions between the vocabularies.

So I propose to not use quoting at all, but make it dependent on the syntax and semantics how to interpret the tokens in the query.

dirkroorda commented 1 year ago

Let's see what we can make of your example

SELECT TextSelection ?sentence WHERE
    AnnotationData "someset","type" DataOperator::Equals("sentence")

SELECT TextSelection ?word WHERE
    TextRelation ?sentence TextSelectionOperator::Embeds
    AnnotationData "someset","type" DataOperator::Equals("word")
    ALL AnnotationData "someset","freqLex" DataOperator::Not(DataOperator::LessThan(100))
dirkroorda commented 1 year ago
use AnnotationData someset

SELECT type=sentence ?sentence
WHERE ALL SELECT type=word ?word
      WHERE ?sentence ]] ?word
      HAVE freqLex < 100
dirkroorda commented 1 year ago

This is logically different from your statement, because I cannot see how your statement expresses that all words in the sentence have freqLex < 100.

Note the use statement. This sets the context and unloads a lot of repeated terms from the query.

dirkroorda commented 1 year ago

I think that queries with quantifiers should logically not deliver the quantified parts.

If I ask for things x such that for all related y something holds, I want to see the instances of x in the result set, but not the instances of y.

dirkroorda commented 1 year ago

But maybe I just don't understand your ALL.

Ah, it means that all words that satisfy the conditions so far, should also satisfy the following condition.

And that is exactly what my translation says. I think it is important to mark the conditions over which is quantified clearly.

So I translate your

X1
X2
X3
ALL
Y1
Y2

into

ALL
X1
X2
X3
HAVE
Y1
Y2

And after that there should also come an END so that you can add other conditions outside the quantifier.

dirkroorda commented 1 year ago

So that you can say

U1
U2
ALL
X1
X2
HAVE
Y1
Y2
END
V2
V3
proycon commented 1 year ago

Thanks again for all your feedback Dirk!

I think your examples become instantly much more readable if you employ some conventions, e.g. omitting the types of operator, i.e. the xxx:: that preceeds things like Embed. I think in the end the language should be such that that kind of typing will be inferred.

Agreed, the resulting language will definitely not have things like these operator types. Right now it's just a pseudo-language and those are there so I can directly tie things to internal constructs (and identify where new constructs are needed). But I concede it makes the reading a bit more complicated for now.

This issue may be a bit of a misnomer, as right now I'm not too concerned about the actual form of the final language yet, but more about how to formulate an executable query plan and build a query engine that is expressive and performant enough.

What about starting at the other end? Go back to what these texts and annotations mean to humans. Take into account the shortcuts we humans make when we express our search intentions. Then wrap a simple, pleasing language around it. And then worry how this translates to something that can be executed.

This is a good point and definitely something to take into account; we indeed shouldn't lose the perspective from the user part. I do find it hard to reason about queries from that perspective if I can't envision the implementation. If you go top-down you're also more likely to require things like automatic query optimisation, which is a step I would like to skip at first and defer to a later stage. I think working bottom-up makes it easier to deliver some results quickly, without needing to go all the way.

Well, in the end I think it does not matter along which road you arrive at the query language, bottom up or top down. It can be worthwhile to do both.

Agreed, it's good to try both and meet in the middle.

Another thing are the quotes.

Right, don't focus to much on those at this stage, it's just a pseudo language. Perhaps I should provide an example of how a query would look like in real code (Rust or Python), as that will be my first aim. Writing a parser to parse a query language is for a later step then.

I think that queries with quantifiers should logically not deliver the quantified parts. If I ask for things x such that for all related y something holds, I want to see the instances of x in the result set, but not the instances of y.

Good point, that is something I hadn't accounted for yet. The whole notion of quantifiers may still be a bit underdeveloped actually.

you can exploit that annotations always target text in the end.

In the vast majority of cases, yes, but not always.

Note the use statement. This sets the context and unloads a lot of repeated terms from the query.

I like the idea of a use statement yes, a bit analogous also to PREFIX in SPARQL. This will come in handy for the sets, which are indeed tedious to repeat.

proycon commented 1 year ago

The biggest question and concern at this point is if the overall query structure I propose is powerful enough to express all kinds of queries. With query structure I mean division into multiple SELECT clauses, each yielding a single variable, and each having one or more constraints.

Take this pseudo language query from the proposal:

SELECT TextResource ?book WHERE
    AnnotationData "someset","name" DataOperator::Any(DataOperator::Equals("Genesis"), DataOperator::Equals("Exodus"))

SELECT TextSelection ?chapter WHERE 
    TextResource ?book
    AnnotationData "someset","type" DataOperator::Equals("chapter")
    AnnotationData "someset","number" DataOperator::EqualsInt(2)

SELECT TextSelection ?sentence WHERE 
    AnnotationData "someset","type" DataOperator::Equals("sentence")
    TextRelation ?chapter TextSelectionOperator::Embeds

SELECT TextSelection ?nn WHERE
    TextRelation ?sentence TextSelectionOperator::Embeds
    AnnotationData "someset","type" DataOperator::Equals("word")
    AnnotationData "someset","pos" DataOperator::Equals("noun")
    AnnotationData "someset","gender" DataOperator::Equals("feminine")
    AnnotationData "someset","number" DataOperator::Equals("singular")

SELECT TextSelection ?vb WHERE
    TextRelation ?nn TextSelectionOperator::LeftAdjacent
    TextRelation ?sentence TextSelectionOperator::Embeds
    AnnotationData "someset","type" DataOperator::Equals("word")
    AnnotationData "someset","pos" DataOperator::Equals("verb")
    AnnotationData "someset","gender" DataOperator::Equals("feminine")
    AnnotationData "someset","number" DataOperator::Equals("plural")

I agree this is very verbose and we'd want to make this much more concise before exposing it to the user, but will be doable I think after we decide whether the overall structure is good enough.

Execution-wise, this query would translate to something like (Python pseudo code, not all you see here is in the STAM library):

for book in store.resources_by_data("someset","name", DataOperator.Any(DataOperator.Equals("genesis"),DataOperator.Equals("exodus")):
    for chapter in store.get_textselections_by_resource(book): #fictitious
        if not chapter.has_data("someset","type", "chapter"): continue
        if not chapter.has_data("someset","number",2): continue
        for sentence in chapter.get_textselections_by_data(set="someset",type="sentence"): #fictitious
            if not chapter.textselections().test(Embeds, sentence): continue
            for nn in sentence.find_textselections(Embeds):
                if not nn.has_data("someset","type","word"): continue
                if not nn.has_data("someset","pos", "noun"): continue
                if not nn.has_data("someset","gender","feminine"): continue
                if not nn.has_data("someset","number","singular"): continue
                for vb in nn.find_textselections(LeftAdjacent):
                    if not sentence.embeds(vb): continue
                    if not vb.has_data("someset","type","word"): continue
                    if not vb.has_data("someset","pos", "verb"): continue
                    if not vb.has_data("someset","gender","feminine"): continue
                    if not vb.has_data("someset","number","plural"): continue
                    yield book, chapter, sentence, nn, vb

(some of the methods in here are fictitious and already higher-level than the actual methods implemented in the STAM library, which is why I came to the conclusion we need a Query Language implementation after all, writing it out in full gets too complicated)

What I'm effectively doing is chaining together various iterators (each for loop) in a depth-first fashion. Each iterator (ideally) follows a path in one of the (reverse) indices that have already been built. One of the main questions I'd like to ask is if you think this is sufficiently expressive and performant enough. The best is probably to come with a counter example; are there types of queries which we want but can not express in such a way?

In the actual (Rust) implementation which I started, the whole query will be embodied in a single iterator (the query execution engine) which dynamically instantiates and invokes the appropriate iterators under the hood (it carries a stack of iterators and keeps track of their current 'head' result). So the whole thing uses lazy-evaluation throughout and has a minimal memory footprint. Getting this right in Rust is not without significant challenges though.

Do we think this is good route to take or should we go in a fundamentally different direction?

proycon commented 1 year ago

But maybe I just don't understand your ALL. Ah, it means that all words that satisfy the conditions so far, should also satisfy the following condition. And that is exactly what my translation says. I think it is important to mark the conditions over which is quantified clearly. .. And after that there should also come an END so that you can add other conditions outside the quantifier.

When moving ALL to before SELECT, you're saying that the quantifier is not a property of a single constraint, but of the select statement as a whole?

But what you said about marking conditions so you can also have conditions outside the quantifier suggests more than the quantifier is a property of one or more constraints?

The approach I had where it is just a property of a single constraint seems not right indeed.

proycon commented 1 year ago

Just for argument's sake. Here is an attempt at making the actual language a bit more approachable, following some of your suggestions without changing any of the underlying ideas:

USE someset

SELECT RESOURCE ?book WHERE 
    name = genesis|exodus

SELECT TEXT ?chapter WHERE
    IN ?book
    type = chapter
    number = 2

SELECT TEXT ?sentence WHERE
    type = sentence
    IN ?chapter

SELECT TEXT ?nn WHERE
    IN ?sentence
    type = word
    pos = noun
    gender = feminine
    number = singular

SELECT TEXT ?vb WHERE
    SUCCEEDS ?nn
    IN ?sentence
    type = word
    pos = verb
    gender = feminine
    number = plural
proycon commented 1 year ago

Another variant that does justice to the special nature of the first constraint (it is the thing that gets looped over, in SQL terms it would be a join and an ON statement), here I move it in front of the WHERE clause, also somewhat related to an earlier suggestion where you did something similar:

USE someset

SELECT RESOURCE ?book name = genesis|exodus

SELECT TEXT ?chapter IN ?book WHERE
    type = chapter
    number = 2

SELECT TEXT ?sentence type = sentence WHERE
    IN ?chapter

SELECT TEXT ?nn IN ?sentence WHERE
    type = word
    pos = noun
    gender = feminine
    number = singular

SELECT TEXT ?vb SUCCEEDS ?nn WHERE
    IN ?sentence
    type = word
    pos = verb
    gender = feminine
    number = plural
proycon commented 1 year ago

Here's also some draft on what query formulation could look when done in Rust respectively Python:

Rust

let queries = [
    Query::select_resource("book"))
        .data("someset", "name", DataOperator::Any(DataOperator::Equals("genesis")), DataOperator::Equals("exodus")))),
    Query::select_text("chapter")
        .resource_var("book")
        .data("someset", "type", DataOperator::Equals("chapter"))
    Query::select_text("sentence")
        .data("someset", "type", DataOperator::Equals("sentence"))
        .textrelation_var("chapter", TextRelationOperator::Embeds),
    Query::select_text("nn")
        .textrelation_var("sentence", TextRelationOperator::Embeds),
        .data("someset", "type", DataOperator::Equals("word"))
        .data("someset", "pos", DataOperator::Equals("noun"))
        .data("someset", "gender", DataOperator::Equals("feminine"))
        .data("someset", "number", DataOperator::Equals("singular"))
    Query::select_text("vb")
        .textrelation_var("nn", TextRelationOperator::Precedes),
        .data("someset", "type", DataOperator::Equals("word"))
        .data("someset", "pos", DataOperator::Equals("verb"))
        .data("someset", "gender", DataOperator::Equals("feminine"))
        .data("someset", "number", DataOperator::Equals("plural"))
];

and then to turn all this into an iterator you'd do something like:

let iter = store.query(queries.into_iter())?;
//nothing is queried yet at this point
for results in iter {
    //only now we get results... (results will be a complex object containing all of the vars from the queries)
}

Python

queries = [
    Query.select_resource("book")
        .data("someset", "name", any=("genesis", "exodus")),
    Query.select_text("chapter")
        .resource(var="book")
        .data("someset", "type", "chapter")
    Query.select_text("sentence")
        .data("someset", "type", "sentence")
        .textrelation(var="chapter", TextRelationOperator.embeds),
    Query.select_text("nn")
        .textrelation(var="sentence", TextRelationOperator.embeds),
        .data("someset", "type", "word")
        .data("someset", "pos", "noun")
        .data("someset", "gender", "feminine")
        .data("someset", "number", "singular")
    Query.select_text("vb")
        .textrelation(var="nn", TextRelationOperator.precedes),
        .data("someset", "type", "word")
        .data("someset", "pos", "verb")
        .data("someset", "gender", "feminine")
        .data("someset", "number", "plural")
]

And to actually query you'd do store.query(queries), which would probably return a list of dicts or tuples for all results in the Python case (leveraging the native Rust part to do its job uninterrupted).

dirkroorda commented 1 year ago

I like your shorthand notations already, also the second one. It leaves me with one question though:

Consider these fragments:

SELECT TEXT ?chapter IN ?book WHERE
    type = chapter
    number = 2

and

SELECT TEXT ?sentence type = sentence WHERE
    IN ?chapter

It looks as if you may put a condition between the SELECT and the WHERE.

And the remaining conditions you put after the WHERE.

In the first case you put IN ?book before the WHERE and type = chapter after it.

In the second case you do it the other way round:

type = sentence before the WHERE and IN ?chapter behind it.

Is there a preference for some conditions to put in front of the where and some others after the where?

dirkroorda commented 1 year ago

Returning to your question whether your proposed language is complete enough ...

The sequence of SELECT statements allow you to select tuples of items, satisfying all kinds of conditions. Later SELECT statements may have conditions that refer to items selected in earlier SELECT statements.

All this gives you a lot, but I think one thing is still missing:

a list of statements after the SELECT statements that express additional conditions on the objects found by the SELECT statements.

Maybe you could add those conditions already to the individual SELECT statements, but that will enforce the writing of more complicated SELECT clauses than needed.

For example, you might want to search for three kinds of phrases in a sentence, each with complicated conditions, and then state that they need to occur in a few specified orders only:

SELECT TEXT ?s WHERE
  type = sentence

SELECT TEXT ?p1 IN ?s WHERE
  type = phrase
  ... many other conditions ...

SELECT TEXT ?p2 IN ?s WHERE
  type = phrase
  ... many other conditions ...

SELECT TEXT ?p3 IN ?s WHERE
  type = phrase
  ... many other conditions ...

?p1 BEFORE ?p3
?p1 UNEQUAL ?p2
?p2 UNEQUAL ?p3
proycon commented 1 year ago

Is there a preference for some conditions to put in front of the where and some others after the where?

I'm still on the fence on whether we want to move the first constraint before the WHERE clause or not. The motivation behind this is that the first constraint is treated a bit differently than the remaining constraints. The first constraint determines the target item and the way it is retrieved in a loop, whereas the remaining constraints typically just test the target item against some criteria. You can best see this in the pseudo-python code I listed earlier.

I'm not sure whether we want to expose this implementation detail to the user in a query language. It may be easier not to.

The larger underlying issue here is the ordering of constraints, the choice of which constraint comes first is important, but also the ordering of further constraints. This determines how effective the search will be. Ideally you want the most-constraining constraint first, achieving a maximum reduction of the search space as early as possible. Eventually, this kind of optimisation can be done by a query optimiser which reorders constraints automatically, but in this early stage, you have to spell it all out explicitly.

dirkroorda commented 1 year ago

Ah. I think the query writer will not have enough insight to determine the most efficient order. Even in Text-Fabric, I do not know the most efficient order in advance. Before running the query for real, the system performs some experiments to gather metrics on the conditions, and based on those metrics a query plan is made.

If you want to give hints for the query execution, maybe it is better to do that by means of very different syntax, separated from the logical query itself.

proycon commented 1 year ago

All this gives you a lot, but I think one thing is still missing: a list of statements after the SELECT statements that express additional conditions on the objects found by the SELECT statements. Maybe you could add those conditions already to the individual SELECT statements, but that will enforce the writing of more complicated SELECT clauses than needed. For example, you might want to search for three kinds of phrases in a sentence, each with complicated conditions, and then state that they need to occur in a few specified orders only:

It's indeed the question whether we can already express that in the current structure (inside the select statements), and in case of your example I think we can. I'm not sure if that's even much more complex. But even if it is and we like that notation better, we can in a later stage map from a more free-form convenient notation to the form I have now. That would again be the job of the query optimiser. At this stage I'm interpreting the queries more strictly as complete execution plans, so they're more rigid in their nature.

But, your example query could also be interpreted as a different execution plan, than I've been proposing, namely:

  1. do the select queries
  2. hold the results in memory
  3. do another pass and apply the final conditions

I think that would be sub-optimal if the same can be achieved without needing the additional buffer (in the depth-first manner I'm suggesting).

But the open question remains; can we map all conceivable queries we want to do to the form I have now?

proycon commented 1 year ago

Ah. I think the query writer will not have enough insight to determine the most efficient order. Even in Text-Fabric, I do not know the most efficient order in advance. Before running the query for real, the system performs some experiments to gather metrics on the conditions, and based on those metrics a query plan is made.

Agreed, determining the most efficient order is tough and depends greatly on the data. Eventually I also want to implement such a query optimiser which takes a query where the subqueries and constraints are not strictly ordered, computes a dependency tree on it and follows various heuristics (I think that corresponds to things you call 'thickness of the yarn' in text fabric) to compute a decent ordering of constraints. But such an implementation is again a complex project.

I'm trying to separate things into multiple stages, first I'll already be happy if we can get querying done (in code) in the more rigid form, which by the way also has its merit as it allows more fine-grained low-level control to the user (some may want that). Then after that is done the next step would be writing a query optimiser.

I'm also hoping that the current structure with select statements and constraints, is flexible enough to serve both for the rigid structure as for the free-form structure later (so I don't need to invent more primitive and the task just becomes mapping the former to the latter by reordering thing or mapping some things).

dirkroorda commented 1 year ago

Indeed, those constraints after the selects should not be applied as a second filter, because they are important to constrain the search space.

The main importance of separating (relational) constraints at the end is for users that are gradually developing their queries. They often have certain concepts in mind (phrases like this, sentences like that) and want to find them combined, with certain relationships between them. It is nice to have a place where you can vary the relationships and holding the rest of the conditions constant.

But for query execution it is better to see the query as a graph: the nodes are the things you look for, and the edges are the relationships between those things. No matter where in the query you express the relationships, they all yield the same query graph.

dirkroorda commented 1 year ago

As to the question: "can we map all conceivable queries we want to do to the form I have now?"

I think not, and probably for the better.

For example: "give me all pairs of sentences that do not have a word in common"

SELECT TEXT ?s1 WHERE
  type = sentence

SELECT TEXT ?s2 WHERE
  type = sentence
  ALL SELECT TEXT ?w2 WHERE
      ?s2 EMBEDS ?w2
  HAVE
      ALL SELECT TEXT ?w1 WHERE
          ?s1 EMBEDS ?w1
      HAVE
          ?w1.lexeme NOT EQUAL ?w2.lexeme

The meaning of this query is clear, but maybe it is forbiddingly difficult to implement this. The last ?w2 is a variable quantified by an outer quantifier. If you can implement this in general, you have implemented predicate logic, and even prolog cannot do full predicate logic.

But I do think that those chains of SELECT statements define a very reasonable query language, even without the ALL or EXIST. Users that want to look for things that require predicate logic must be enabled to program their own solutions, in my opinion. Not as query language, but as a way to walk over the corpus.

dirkroorda commented 1 year ago

By the way, your intention to write the first version of the query language as a shorthand for Rust code is something that I also had in mind in Text-Fabric. Once the concept of query was working, people quickly used it as a shorthand for Python programming. And that is also how I advertise it sometimes.

The surprising thing was that the queries could compete in speed with hand-written code in many cases.

proycon commented 10 months ago

The query language has been formulated (https://github.com/annotation/stam/tree/master/extensions/stam-query) and implemented. The specification still needs to be checked for thoroughness and clarity.