INL / BlackLab

Linguistic search for large annotated text corpora, based on Apache Lucene
http://inl.github.io/BlackLab/
Apache License 2.0

Alternative, JSON-based query language for low-level control and easier querybuilders #422

Closed jan-niestadt closed 8 months ago

jan-niestadt commented 1 year ago

(this comment was superseded, see below)

E.g. add an extension function _posfilter(producer, filter, operation, invert) that simply creates a SpanQueryPositionFilter. Every query's toString() would also be updated to produce a working query, so in this example it would output _posfilter(...) as well.

This makes experimentation with new features and optimizations easier, because you can just try out different low-level queries in the user interface and compare the differences in speed and results.

These functions should start with an underscore to reflect that they're not really intended as a stable, user-friendly CQL extension for end users and may change at any time.
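To make the toString() round-trip idea concrete, here is a minimal Python sketch (not BlackLab's actual Java code). The Raw and PosFilter classes are hypothetical stand-ins for query objects; the argument order follows the signature proposed above, but the literal syntax used for the operation and invert arguments is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Raw:
    """Hypothetical leaf node: a sub-query already written as CQL text."""
    cql: str

    def to_query(self) -> str:
        return self.cql


@dataclass
class PosFilter:
    """Hypothetical stand-in for a position-filter query node."""
    producer: object
    filter: object
    operation: str
    invert: bool = False

    def to_query(self) -> str:
        # Argument order mirrors _posfilter(producer, filter, operation, invert).
        # Quoting of the operation and the true/false spelling are assumptions.
        return "_posfilter({}, {}, '{}', {})".format(
            self.producer.to_query(),
            self.filter.to_query(),
            self.operation,
            "true" if self.invert else "false",
        )


query = PosFilter(Raw("<s/>"), Raw('"cat"'), "containing")
rendered = query.to_query()
print(rendered)
```

The point is only that each node renders itself as text that could be pasted back in as a query, which is what would make the low-level structure easy to tweak by hand.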

jan-niestadt commented 1 year ago

Maybe a better alternative: add a new query language that is just a JSON structure describing the SpanQuery structure to instantiate, e.g. pattlang=jsonq. This gives us a clean way to play around with all the query features. It would also be easier to create a query builder for this, because serializing to and from this JSON structure is easier than serializing to and from CQL.

For example, <s sentiment="happy" /> !containing "whee" is currently not a valid BlackLab CQL query (because of the !containing operator). The alternative <s sentiment="happy" /> & !(<s/> containing "whee") is possible but currently not optimized to the structure suggested by the first query. In jsonq you could just specify exactly the query structure you want:

{
    "type": "posfilter",
    "operation": "containing",
    "invert": true,
    "producer": {
        "type": "tag",
        "tagName": "s",
        "attributes": {
            "sentiment": "happy"
        }
    },
    "leftAdjust": 0,
    "rightAdjust": 0,
    "filter": {
        "type": "term",
        "term": "whee"
    }
}

This would be somewhat implementation-dependent, although usually SpanQuery classes are only added, and if one was ever removed or changed significantly, we could maintain support for its jsonq syntax, rewriting it to the most obvious modern alternative.
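As a rough illustration of why a query builder gets simpler with a JSON structure, the sketch below builds the example above from plain Python dicts and round-trips it through json. The helper names (term, tag, posfilter) are hypothetical, and the field names are copied from the proposal in this comment; they are not checked against whatever was eventually implemented.

```python
import json


def term(value):
    """Node matching a single term (field names taken from the proposal)."""
    return {"type": "term", "term": value}


def tag(name, **attributes):
    """Node matching an XML tag, optionally restricted by attribute values."""
    node = {"type": "tag", "tagName": name}
    if attributes:
        node["attributes"] = attributes
    return node


def posfilter(producer, filter_, operation, invert=False,
              left_adjust=0, right_adjust=0):
    """Position-filter node wrapping a producer and a filter sub-query."""
    return {
        "type": "posfilter",
        "operation": operation,
        "invert": invert,
        "producer": producer,
        "leftAdjust": left_adjust,
        "rightAdjust": right_adjust,
        "filter": filter_,
    }


# Sentences with sentiment="happy" that do NOT contain "whee".
query = posfilter(tag("s", sentiment="happy"), term("whee"),
                  "containing", invert=True)

serialized = json.dumps(query, indent=4)   # builder -> text
restored = json.loads(serialized)          # text -> builder
print(serialized)
```

Because the query is already a tree of dicts, a UI can edit any node in place and re-serialize, without needing a CQL parser or pretty-printer.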

jan-niestadt commented 8 months ago

This is now possible on dev. The JSON structure for a BCQL query is returned in summary.pattern.json, and the same JSON structure can be passed in the patt parameter as well. See https://inl.github.io/BlackLab/server/rest-api/corpus/parse-pattern/get.html#json-query-structure
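A small sketch of how a client might use this round trip, without making any network call. The base URL, corpus name and the /hits path are assumptions about a local setup, and the sample structure reuses the shape from the earlier proposal rather than the documented one; only the patt parameter and the summary.pattern.json location come from the comment above.

```python
import json
from urllib.parse import parse_qs, urlencode, urlsplit

BASE_URL = "http://localhost:8080/blacklab-server"  # assumption: local server
CORPUS = "my-corpus"                                 # hypothetical corpus name


def build_hits_url(json_query):
    """Put a JSON query structure in the patt parameter (URL-encoded)."""
    params = urlencode({"patt": json.dumps(json_query)})
    return f"{BASE_URL}/{CORPUS}/hits?{params}"


def pattern_json(response):
    """Pull the JSON query structure out of a parsed server response."""
    return response["summary"]["pattern"]["json"]


# Placeholder structure; see the linked documentation for the real format.
sample = {"type": "term", "term": "whee"}

url = build_hits_url(sample)
print(url)

# Mocked response containing only the key path mentioned in the comment.
mock_response = {"summary": {"pattern": {"json": sample}}}
print(pattern_json(mock_response))
```

The structure taken from one response can be modified and passed straight back through build_hits_url, which is the experimentation loop the issue was after.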