proycon closed this issue 9 months ago
@dirkroorda I took the liberty of also assigning you in this, because I think your query language on Text Fabric provides a good starting point for the kind of functionality we want in a query language.
A few additional things to think of:
The logical operations pose a problem, but it depends on the concept of result.
If you can shape the result by shaping the query, what do "and" and "or" mean?
If query A collects results of shape Sa, and query B collects results of shape Sb, what does A & B mean? It will often be the empty set because of difference in shape. Likewise, A | B is likely to be the union, which will become a heterogeneous set of results, shapewise.
Maybe you need a projection operator that maps results to the text strings they contain. Then you can apply boolean logic on the projections, but you lose information.
Example:
Without projection, A & B is empty. But if we imply a projection operation before doing the &, then A & B is the set of verbs that have a preceding noun and a following noun.
And the nouns themselves got lost from the results.
Likewise, A | B delivers all verbs that either have a noun preceding or following or both. And the nouns involved are also in the results.
All this points into a direction where you have two layers in your query language:
The structural layer delivers structural results. The logical layer flattens the structural results and applies set theory on them.
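To make the two-layer idea concrete, here is a minimal Python sketch with purely hypothetical toy data (nothing here is the STAM API): structured results are tuples, a projection flattens them to the verbs they contain, and ordinary set logic then applies to the projections.

```python
# Hypothetical illustration of the two layers on toy data.
# Structural layer: query A finds verbs with a preceding noun, shaped
# (noun, verb); query B finds verbs with a following noun, shaped (verb, noun).
results_a = {("dog", "barks"), ("cat", "sleeps")}
results_b = {("barks", "command"), ("eats", "food")}

# On the structural results, A & B is empty because the shapes differ
assert results_a & results_b == set()

# Projection: map each result to the verb it contains, losing the nouns
proj_a = {verb for (_noun, verb) in results_a}
proj_b = {verb for (verb, _noun) in results_b}

# Logical layer: ordinary set theory on the flattened projections
assert proj_a & proj_b == {"barks"}                    # noun on both sides
assert proj_a | proj_b == {"barks", "sleeps", "eats"}  # noun on either side
```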
Concerning data manipulation queries: that would be useful and dangerous. Without it STAM would not be complete (self-contained).
The query language must be able to retrieve the components of annotations (bodies, targets) in order to manipulate them.
That influences what you define as the result of queries. Not only the texts that annotations target, but also the annotations themselves can be results.
Thanks for your pointers @dirkroorda , gives interesting food for thought!
I just put my plans and thoughts regarding a query language for STAM in writing here. Focus here is more on the structure of the language and its implementation than on the actual language.
Feedback much appreciated!
I think your examples become instantly much more readable if you employ some conventions, e.g. omitting the operator types, i.e. the xxx:: that precedes things like Embed. I think in the end the language should be such that that kind of typing will be inferred.
You start bottom up: given the structure of the annotation data and texts, you design primitives and constructions that can query for annotations and text. After that you intend to define a higher level language that can be translated into this.
What about starting at the other end? Go back to what these texts and annotations mean to humans. Take into account the shortcuts we humans make when we express our search intentions. Then wrap a simple, pleasing language around it. And then worry how this translates to something that can be executed.
Well, in the end I think it does not matter along which road you arrive at the query language, bottom up or top down. It can be worthwhile to do both.
An idea following the lines above:
you can exploit that annotations always target text in the end.
So you could say in shorthand
type=VP ]] pos=noun
This looks for annotations with body type=VP that embed annotations with body pos=noun.
And that means: the text targeted by the first annotation embeds the text targeted by the second annotation.
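On the text level, "embeds" can be sketched as simple containment of character offsets; a toy illustration (hypothetical spans, not the STAM API):

```python
# Toy sketch: text-level embedding reduces to offset containment
def embeds(outer, inner):
    """True if the (begin, end) span `outer` fully contains `inner`."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]

vp_span = (10, 35)    # text targeted by the type=VP annotation
noun_span = (16, 20)  # text targeted by the pos=noun annotation
assert embeds(vp_span, noun_span)
assert not embeds(noun_span, vp_span)
```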
Another thing is the quotes.
There are several vocabularies: primitives of STAM, primitives of the query language itself, names of annotation sets, and values that annotation bodies can take, which often have their own vocabularies.
Normally, you quote values outside the vocabulary, and you do not quote values inside the vocabulary.
But with so many vocabularies that becomes tedious. And end users tend to not make very conscious distinctions between the vocabularies.
So I propose not to use quoting at all, but to let the syntax and semantics determine how to interpret the tokens in the query.
Let's see what we can make of your example
SELECT TextSelection ?sentence WHERE
    AnnotationData "someset","type" DataOperator::Equals("sentence")
SELECT TextSelection ?word WHERE
    TextRelation ?sentence TextSelectionOperator::Embeds
    AnnotationData "someset","type" DataOperator::Equals("word")
    ALL AnnotationData "someset","freqLex" DataOperator::Not(DataOperator::LessThan(100))
use AnnotationData someset
SELECT type=sentence ?sentence
WHERE ALL SELECT type=word ?word
WHERE ?sentence ]] ?word
HAVE freqLex < 100
This is logically different from your statement, because I cannot see how your statement expresses that all words in the sentence have freqLex < 100.
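The intended semantics of that ALL ... HAVE construct can be sketched in plain Python (hypothetical data, not the STAM API): a sentence qualifies only if every embedded word satisfies the condition, and the words themselves are not delivered as results.

```python
# Hypothetical data: freqLex values of the words embedded in each sentence
sentences = {
    "s1": [50, 200],
    "s2": [10, 20],
}

# ALL ... HAVE freqLex < 100: keep a sentence only if every one of its
# words satisfies the condition; the quantified words are not in the result
result = [s for s, freqs in sentences.items()
          if all(f < 100 for f in freqs)]
assert result == ["s2"]
```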
Note the use statement. This sets the context and unloads a lot of repeated terms from the query.
I think that queries with quantifiers should logically not deliver the quantified parts.
If I ask for things x such that for all related y something holds, I want to see the instances of x in the result set, but not the instances of y.
But maybe I just don't understand your ALL.
Ah, it means that all words that satisfy the conditions so far should also satisfy the following condition.
And that is exactly what my translation says. I think it is important to clearly mark the conditions over which is quantified.
So I translate your
X1
X2
X3
ALL
Y1
Y2
into
ALL
X1
X2
X3
HAVE
Y1
Y2
And after that there should also come an END, so that you can add other conditions outside the quantifier.
So that you can say
U1
U2
ALL
X1
X2
HAVE
Y1
Y2
END
V2
V3
Thanks again for all your feedback Dirk!
I think your examples become instantly much more readable if you employ some conventions, e.g. omitting the types of operator, i.e. the xxx:: that precedes things like Embed. I think in the end the language should be such that that kind of typing will be inferred.
Agreed, the resulting language will definitely not have things like these operator types. Right now it's just a pseudo-language and those are there so I can directly tie things to internal constructs (and identify where new constructs are needed). But I concede it makes the reading a bit more complicated for now.
This issue may be a bit of a misnomer, as right now I'm not too concerned about the actual form of the final language yet, but more about how to formulate an executable query plan and build a query engine that is expressive and performant enough.
What about starting at the other end? Go back to what these texts and annotations mean to humans. Take into account the shortcuts we humans make when we express our search intentions. Then wrap a simple, pleasing language around it. And then worry how this translates to something that can be executed.
This is a good point and definitely something to take into account; we indeed shouldn't lose the user's perspective. I do find it hard to reason about queries from that perspective if I can't envision the implementation. If you go top-down you're also more likely to require things like automatic query optimisation, which is a step I would like to skip at first and defer to a later stage. I think working bottom-up makes it easier to deliver some results quickly, without needing to go all the way.
Well, in the end I think it does not matter along which road you arrive at the query language, bottom up or top down. It can be worthwhile to do both.
Agreed, it's good to try both and meet in the middle.
Another thing is the quotes.
Right, don't focus too much on those at this stage; it's just a pseudo-language. Perhaps I should provide an example of what a query would look like in real code (Rust or Python), as that will be my first aim. Writing a parser for the query language is then a later step.
I think that queries with quantifiers should logically not deliver the quantified parts. If I ask for things x such that for all related y something holds, I want to see the instances of x in the result set, but not the instances of y.
Good point, that is something I hadn't accounted for yet. The whole notion of quantifiers may still be a bit underdeveloped actually.
you can exploit that annotations always target text in the end.
In the vast majority of cases, yes, but not always.
Note the use statement. This sets the context and unloads a lot of repeated terms from the query.
I like the idea of a use statement, yes; it's a bit analogous to PREFIX in SPARQL. This will come in handy for the sets, which are indeed tedious to repeat.
The biggest question and concern at this point is whether the overall query structure I propose is powerful enough to express all kinds of queries. By query structure I mean the division into multiple SELECT clauses, each yielding a single variable and each having one or more constraints.
Take this pseudo language query from the proposal:
SELECT TextResource ?book WHERE
    AnnotationData "someset","name" DataOperator::Any(DataOperator::Equals("Genesis"), DataOperator::Equals("Exodus"))
SELECT TextSelection ?chapter WHERE
    TextResource ?book
    AnnotationData "someset","type" DataOperator::Equals("chapter")
    AnnotationData "someset","number" DataOperator::EqualsInt(2)
SELECT TextSelection ?sentence WHERE
    AnnotationData "someset","type" DataOperator::Equals("sentence")
    TextRelation ?chapter TextSelectionOperator::Embeds
SELECT TextSelection ?nn WHERE
    TextRelation ?sentence TextSelectionOperator::Embeds
    AnnotationData "someset","type" DataOperator::Equals("word")
    AnnotationData "someset","pos" DataOperator::Equals("noun")
    AnnotationData "someset","gender" DataOperator::Equals("feminine")
    AnnotationData "someset","number" DataOperator::Equals("singular")
SELECT TextSelection ?vb WHERE
    TextRelation ?nn TextSelectionOperator::LeftAdjacent
    TextRelation ?sentence TextSelectionOperator::Embeds
    AnnotationData "someset","type" DataOperator::Equals("word")
    AnnotationData "someset","pos" DataOperator::Equals("verb")
    AnnotationData "someset","gender" DataOperator::Equals("feminine")
    AnnotationData "someset","number" DataOperator::Equals("plural")
I agree this is very verbose and we'd want to make it much more concise before exposing it to the user, but that will be doable, I think, once we decide whether the overall structure is good enough.
Execution-wise, this query would translate to something like (Python pseudo code, not all you see here is in the STAM library):
for book in store.resources_by_data("someset", "name", DataOperator.Any(DataOperator.Equals("genesis"), DataOperator.Equals("exodus"))):
    for chapter in store.get_textselections_by_resource(book): #fictitious
        if not chapter.has_data("someset", "type", "chapter"): continue
        if not chapter.has_data("someset", "number", 2): continue
        for sentence in chapter.get_textselections_by_data(set="someset", type="sentence"): #fictitious
            if not chapter.textselections().test(Embeds, sentence): continue
            for nn in sentence.find_textselections(Embeds):
                if not nn.has_data("someset", "type", "word"): continue
                if not nn.has_data("someset", "pos", "noun"): continue
                if not nn.has_data("someset", "gender", "feminine"): continue
                if not nn.has_data("someset", "number", "singular"): continue
                for vb in nn.find_textselections(LeftAdjacent):
                    if not sentence.embeds(vb): continue
                    if not vb.has_data("someset", "type", "word"): continue
                    if not vb.has_data("someset", "pos", "verb"): continue
                    if not vb.has_data("someset", "gender", "feminine"): continue
                    if not vb.has_data("someset", "number", "plural"): continue
                    yield book, chapter, sentence, nn, vb
(some of the methods in here are fictitious and already higher-level than the actual methods implemented in the STAM library, which is why I came to the conclusion we need a Query Language implementation after all, writing it out in full gets too complicated)
What I'm effectively doing is chaining together various iterators (each for loop) in a depth-first fashion. Each iterator (ideally) follows a path in one of the (reverse) indices that have already been built. One of the main questions I'd like to ask is whether you think this is sufficiently expressive and performant. The best test is probably a counter-example: are there types of queries which we want but cannot express in such a way?
In the actual (Rust) implementation which I started, the whole query will be embodied in a single iterator (the query execution engine) which dynamically instantiates and invokes the appropriate iterators under the hood (it carries a stack of iterators and keeps track of their current 'head' result). So the whole thing uses lazy-evaluation throughout and has a minimal memory footprint. Getting this right in Rust is not without significant challenges though.
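The chaining idea can be illustrated with a toy engine in Python (everything here is hypothetical; none of it is the STAM API). Recursion stands in for the explicit stack of iterators, but the evaluation is equally lazy and depth-first:

```python
# Toy depth-first query engine: each step receives the bindings so far and
# yields candidate values; the walk over steps is lazy, like chained generators.
def execute(steps, bindings=()):
    """steps: list of (variable_name, generator_fn) pairs, where
    generator_fn takes the current bindings dict and yields values."""
    if not steps:
        yield dict(bindings)
        return
    name, gen_fn = steps[0]
    for value in gen_fn(dict(bindings)):
        yield from execute(steps[1:], bindings + ((name, value),))

# Example: select sentences, then their three-letter words
data = {"s1": ["a", "big", "dog"], "s2": ["run"]}
steps = [
    ("sentence", lambda b: iter(sorted(data))),
    ("word", lambda b: (w for w in data[b["sentence"]] if len(w) == 3)),
]
results = list(execute(steps))
assert results == [
    {"sentence": "s1", "word": "big"},
    {"sentence": "s1", "word": "dog"},
    {"sentence": "s2", "word": "run"},
]
```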
Do we think this is a good route to take, or should we go in a fundamentally different direction?
But maybe I just don't understand your ALL. Ah, it means that all words that satisfy the conditions so far should also satisfy the following condition. And that is exactly what my translation says. I think it is important to clearly mark the conditions over which is quantified. .. And after that there should also come an END so that you can add other conditions outside the quantifier.
When moving ALL to before the SELECT, you're saying that the quantifier is not a property of a single constraint, but of the select statement as a whole?
But what you said about marking conditions, so that you can also have conditions outside the quantifier, suggests rather that the quantifier is a property of one or more constraints?
The approach I had, where it is just a property of a single constraint, indeed seems not right.
Just for argument's sake. Here is an attempt at making the actual language a bit more approachable, following some of your suggestions without changing any of the underlying ideas:
USE someset
SELECT RESOURCE ?book WHERE
    name = genesis|exodus
SELECT TEXT ?chapter WHERE
    IN ?book
    type = chapter
    number = 2
SELECT TEXT ?sentence WHERE
    type = sentence
    IN ?chapter
SELECT TEXT ?nn WHERE
    IN ?sentence
    type = word
    pos = noun
    gender = feminine
    number = singular
SELECT TEXT ?vb WHERE
    SUCCEEDS ?nn
    IN ?sentence
    type = word
    pos = verb
    gender = feminine
    number = plural
Another variant that does justice to the special nature of the first constraint (it is the thing that gets looped over; in SQL terms it would be a join with an ON statement). Here I move it in front of the WHERE clause, somewhat related to an earlier suggestion where you did something similar:
USE someset
SELECT RESOURCE ?book name = genesis|exodus
SELECT TEXT ?chapter IN ?book WHERE
    type = chapter
    number = 2
SELECT TEXT ?sentence type = sentence WHERE
    IN ?chapter
SELECT TEXT ?nn IN ?sentence WHERE
    type = word
    pos = noun
    gender = feminine
    number = singular
SELECT TEXT ?vb SUCCEEDS ?nn WHERE
    IN ?sentence
    type = word
    pos = verb
    gender = feminine
    number = plural
Here's also some draft of what query formulation could look like when done in Rust and Python respectively:
let queries = [
    Query::select_resource("book")
        .data("someset", "name", DataOperator::Any(DataOperator::Equals("genesis"), DataOperator::Equals("exodus"))),
    Query::select_text("chapter")
        .resource_var("book")
        .data("someset", "type", DataOperator::Equals("chapter")),
    Query::select_text("sentence")
        .data("someset", "type", DataOperator::Equals("sentence"))
        .textrelation_var("chapter", TextRelationOperator::Embeds),
    Query::select_text("nn")
        .textrelation_var("sentence", TextRelationOperator::Embeds)
        .data("someset", "type", DataOperator::Equals("word"))
        .data("someset", "pos", DataOperator::Equals("noun"))
        .data("someset", "gender", DataOperator::Equals("feminine"))
        .data("someset", "number", DataOperator::Equals("singular")),
    Query::select_text("vb")
        .textrelation_var("nn", TextRelationOperator::Precedes)
        .data("someset", "type", DataOperator::Equals("word"))
        .data("someset", "pos", DataOperator::Equals("verb"))
        .data("someset", "gender", DataOperator::Equals("feminine"))
        .data("someset", "number", DataOperator::Equals("plural")),
];
and then to turn all this into an iterator you'd do something like:
let iter = store.query(queries.into_iter())?;
//nothing is queried yet at this point
for results in iter {
//only now we get results... (results will be a complex object containing all of the vars from the queries)
}
queries = [
    Query.select_resource("book")
        .data("someset", "name", any=("genesis", "exodus")),
    Query.select_text("chapter")
        .resource(var="book")
        .data("someset", "type", "chapter"),
    Query.select_text("sentence")
        .data("someset", "type", "sentence")
        .textrelation(TextRelationOperator.embeds, var="chapter"),
    Query.select_text("nn")
        .textrelation(TextRelationOperator.embeds, var="sentence")
        .data("someset", "type", "word")
        .data("someset", "pos", "noun")
        .data("someset", "gender", "feminine")
        .data("someset", "number", "singular"),
    Query.select_text("vb")
        .textrelation(TextRelationOperator.precedes, var="nn")
        .data("someset", "type", "word")
        .data("someset", "pos", "verb")
        .data("someset", "gender", "feminine")
        .data("someset", "number", "plural"),
]
And to actually query you'd do store.query(queries), which would probably return a list of dicts or tuples for all results in the Python case (leveraging the native Rust part to do its job uninterrupted).
I like your shorthand notations already, also the second one. It leaves me with one question though:
Consider these fragments:
SELECT TEXT ?chapter IN ?book WHERE
    type = chapter
    number = 2
and
SELECT TEXT ?sentence type = sentence WHERE
    IN ?chapter
It looks as if you may put a condition between the SELECT and the WHERE, and the remaining conditions after the WHERE. In the first case you put IN ?book before the WHERE and type = chapter after it. In the second case you do it the other way round: type = sentence before the WHERE and IN ?chapter after it. Is there a preference for putting some conditions in front of the WHERE and others after it?
Returning to your question whether your proposed language is complete enough ...
The sequence of SELECT statements allows you to select tuples of items satisfying all kinds of conditions. Later SELECT statements may have conditions that refer to items selected in earlier SELECT statements.
All this gives you a lot, but I think one thing is still missing:
a list of statements after the SELECT statements that express additional conditions on the objects found by the SELECT statements.
Maybe you could add those conditions to the individual SELECT statements already, but that would force you to write more complicated SELECT clauses than needed.
For example, you might want to search for three kinds of phrases in a sentence, each with complicated conditions, and then state that they need to occur in a few specified orders only:
SELECT TEXT ?s WHERE
    type = sentence
SELECT TEXT ?p1 IN ?s WHERE
    type = phrase
    ... many other conditions ...
SELECT TEXT ?p2 IN ?s WHERE
    type = phrase
    ... many other conditions ...
SELECT TEXT ?p3 IN ?s WHERE
    type = phrase
    ... many other conditions ...
?p1 BEFORE ?p3
?p1 UNEQUAL ?p2
?p2 UNEQUAL ?p3
Is there a preference for putting some conditions in front of the WHERE and others after it?
I'm still on the fence about whether we want to move the first constraint before the WHERE clause or not. The motivation behind this is that the first constraint is treated a bit differently from the remaining constraints. The first constraint determines the target item and the way it is retrieved in a loop, whereas the remaining constraints typically just test the target item against some criteria. You can best see this in the pseudo-Python code I listed earlier.
I'm not sure whether we want to expose this implementation detail to the user in a query language. It may be easier not to.
The larger underlying issue here is the ordering of constraints, the choice of which constraint comes first is important, but also the ordering of further constraints. This determines how effective the search will be. Ideally you want the most-constraining constraint first, achieving a maximum reduction of the search space as early as possible. Eventually, this kind of optimisation can be done by a query optimiser which reorders constraints automatically, but in this early stage, you have to spell it all out explicitly.
Ah. I think the query writer will not have enough insight to determine the most efficient order. Even in Text-Fabric, I do not know the most efficient order in advance. Before running the query for real, the system performs some experiments to gather metrics on the conditions, and based on those metrics a query plan is made.
If you want to give hints for the query execution, maybe it is better to do that by means of very different syntax, separated from the logical query itself.
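The metric-gathering approach described here could be sketched as follows (a hypothetical helper, not Text-Fabric's actual implementation): probe each constraint against a sample of items, then run the most selective (most constraining) one first.

```python
def order_by_selectivity(constraints, probe_items):
    """Order constraints so that the one passing the fewest probe items
    (i.e. the most constraining one) runs first."""
    def pass_rate(constraint):
        return sum(1 for item in probe_items if constraint(item)) / len(probe_items)
    return sorted(constraints, key=pass_rate)

# Toy example: numbers standing in for corpus items
probes = list(range(100))
is_even = lambda n: n % 2 == 0   # passes 50% of the probes
is_small = lambda n: n < 10      # passes 10% of the probes
ordered = order_by_selectivity([is_even, is_small], probes)
assert ordered == [is_small, is_even]  # most selective constraint first
```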
All this gives you a lot, but I think one thing is still missing: a list of statements after the SELECT statements that express additional conditions on the objects found by the SELECT statements. Maybe you could add those conditions to the individual SELECT statements already, but that would force you to write more complicated SELECT clauses than needed. For example, you might want to search for three kinds of phrases in a sentence, each with complicated conditions, and then state that they need to occur in a few specified orders only:
It's indeed the question whether we can already express that in the current structure (inside the select statements), and in the case of your example I think we can. I'm not sure that's even much more complex. But even if it is, and we like that notation better, we can at a later stage map from a more free-form convenient notation to the form I have now. That would again be the job of the query optimiser. At this stage I'm interpreting the queries more strictly, as complete execution plans, so they're more rigid in nature.
But your example query could also be interpreted as a different execution plan than the one I've been proposing, namely one in which the relational constraints at the end are applied as a second filter over buffered intermediate results. I think that would be sub-optimal if the same can be achieved without needing the additional buffer (in the depth-first manner I'm suggesting).
But the open question remains; can we map all conceivable queries we want to do to the form I have now?
Ah. I think the query writer will not have enough insight to determine the most efficient order. Even in Text-Fabric, I do not know the most efficient order in advance. Before running the query for real, the system performs some experiments to gather metrics on the conditions, and based on those metrics a query plan is made.
Agreed, determining the most efficient order is tough and depends greatly on the data. Eventually I also want to implement such a query optimiser, which takes a query where the subqueries and constraints are not strictly ordered, computes a dependency tree on it, and follows various heuristics (I think that corresponds to what you call 'thickness of the yarn' in Text-Fabric) to compute a decent ordering of constraints. But such an implementation is again a complex project.
I'm trying to separate things into multiple stages: first I'll already be happy if we can get querying done (in code) in the more rigid form, which by the way also has its merits, as it allows more fine-grained low-level control for the user (some may want that). After that is done, the next step would be writing a query optimiser.
I'm also hoping that the current structure with select statements and constraints is flexible enough to serve both for the rigid structure and for the free-form structure later (so I don't need to invent more primitives, and the task just becomes mapping the former to the latter by reordering or remapping some things).
Indeed, those constraints after the selects should not be applied as a second filter, because they are important to constrain the search space.
The main importance of separating (relational) constraints at the end is for users that are gradually developing their queries. They often have certain concepts in mind (phrases like this, sentences like that) and want to find them combined, with certain relationships between them. It is nice to have a place where you can vary the relationships while holding the rest of the conditions constant.
But for query execution it is better to see the query as a graph: the nodes are the things you look for, and the edges are the relationships between those things. No matter where in the query you express the relationships, they all yield the same query graph.
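That graph view might be sketched as follows (a hypothetical representation, not part of STAM): variables become nodes and relational constraints become edges, wherever they appear in the query.

```python
from collections import defaultdict

def query_graph(relations):
    """Build an adjacency map from (var_a, operator, var_b) triples."""
    graph = defaultdict(list)
    for a, op, b in relations:
        graph[a].append((op, b))
    return dict(graph)

# The relational constraints from the three-phrase example above
g = query_graph([
    ("?p1", "BEFORE", "?p3"),
    ("?p1", "UNEQUAL", "?p2"),
    ("?p2", "UNEQUAL", "?p3"),
])
assert g["?p1"] == [("BEFORE", "?p3"), ("UNEQUAL", "?p2")]
assert g["?p2"] == [("UNEQUAL", "?p3")]
```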
As to the question: "can we map all conceivable queries we want to do to the form I have now?"
I think not, and probably for the better.
For example: "give me all pairs of sentences that do not have a word in common"
SELECT TEXT ?s1 WHERE
    type = sentence
SELECT TEXT ?s2 WHERE
    type = sentence
ALL SELECT TEXT ?w2 WHERE
    ?s2 EMBEDS ?w2
HAVE
    ALL SELECT TEXT ?w1 WHERE
        ?s1 EMBEDS ?w1
    HAVE
        ?w1.lexeme NOT EQUAL ?w2.lexeme
The meaning of this query is clear, but maybe it is forbiddingly difficult to implement this.
The last ?w2 is a variable quantified by an outer quantifier.
If you can implement this in general, you have implemented predicate logic, and even Prolog cannot do full predicate logic.
But I do think that those chains of SELECT statements define a very reasonable query language, even without the ALL or EXIST. Users that want to look for things that require predicate logic must be enabled to program their own solutions, in my opinion: not as a query language, but as a way to walk over the corpus.
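As an illustration of how easy such a "walk over the corpus" can be in code, the no-common-word query from above is a few lines of Python on toy data (hypothetical, not the STAM API):

```python
from itertools import combinations

# Toy corpus: each sentence as a set of its lexemes
sentences = {
    "s1": {"the", "dog", "barks"},
    "s2": {"a", "cat", "sleeps"},
    "s3": {"the", "cat", "eats"},
}

# All pairs of sentences that have no word in common
pairs = [(a, b) for a, b in combinations(sorted(sentences), 2)
         if sentences[a].isdisjoint(sentences[b])]
assert pairs == [("s1", "s2")]
```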
By the way, your intention to write the first version of the query language as a shorthand for Rust code is something I also had in mind for Text-Fabric. Once the concept of a query was working, people quickly used it as a shorthand for Python programming. And that is also how I advertise it sometimes.
The surprising thing was that the queries could compete in speed with hand-written code in many cases.
The query language has been formulated (https://github.com/annotation/stam/tree/master/extensions/stam-query) and implemented. The specification still needs to be checked for thoroughness and clarity.
A query language should be formulated to effectively query a STAM model. The query language will be formulated as an extension to STAM and effectively provides a higher-level interface that can be directly exposed to end-users as the primary means of interacting with a STAM model.
The query language should be able to express (non-exhaustive):
- relationships between text selections (via TextSelectionOperator)
The query language should be accessible enough for (technical) researchers.