StractOrg / stract

web search done right
https://stract.com
GNU Affero General Public License v3.0
1.94k stars 43 forks

[Proposal] Optics language rework #140

Open SekoiaTree opened 4 months ago

SekoiaTree commented 4 months ago

The optics language, as it is, is usable, but not especially useful as a DSL; it could be expressed entirely as JSON. Additionally, the current syntax is strangely restrictive in some aspects: actions always need to come at the end of a rule, for example, and commas are... semi-optional. These are mostly quirks of how the LR(1) grammar is written, because Optics is built to not really be compiled at all, just parsed.

A future objective is to make the language actually work like a language, to allow for more flexible filtering. I would like this issue to serve as a central repository for this rework, both as a place to think this out cleanly and as a sort of initial documentation. I'm going to write

Background

The current optics pipeline parses RawRules and similar types into their parsed versions. The parsed versions are mostly identical to their raw counterparts, with a few simplifications and reductions to make them easier to manage. Optics are then converted into Tantivy queries every time a request is made. Additionally, this conversion to Tantivy queries is done in core.

Several other places in core convert things to optics as well.

Internal architecture

Optics should be kept in their own crate as much as possible. The crate itself should only output Tantivy queries, and optics should be opaque from core's perspective. This allows a much more flexible system, with Optic language reworks, alternative syntaxes, etc. being only another trait implementation away and not requiring touching core at all.

My proposed pipeline is

Optic "program" -Lexing/Parsing-> Executable -Execution-> Unrealized Query -Realization-> Query

Lexing/parsing and execution are done beforehand, and only depend on the optic. Realization is done by the core, which calls into the unrealized query to turn it into a Tantivy query. This is necessary because a Matching requires a schema as well as a FastFieldReader to be turned into a Tantivy query. Realization should be close to simply cloning data directly and filling in some blanks.
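To make this boundary concrete, here is a minimal Rust sketch of what the optics-side API could look like. The names UnrealizedQuery and realize, and the stand-in FastFieldReader type, are hypothetical and chosen only for illustration, not taken from the current codebase:

use tantivy::query::Query;
use tantivy::schema::Schema;

// Stand-in for whatever fast-field reader type core would actually pass in.
pub struct FastFieldReader;

// Output of lexing/parsing + execution. It depends only on the optic itself,
// so it can be built once when the optic is loaded.
pub struct UnrealizedQuery {
    // pre-computed matchings, actions and constants would live here
}

impl UnrealizedQuery {
    // Realization: core fills in the schema- and reader-dependent blanks and
    // gets back an opaque tantivy query. Ideally this is close to a clone
    // plus a few field lookups against the schema.
    pub fn realize(&self, _schema: &Schema, _reader: &FastFieldReader) -> Box<dyn Query> {
        unimplemented!("sketch only")
    }
}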

I haven't fully investigated, but it seems like neither Schemas nor FastFieldReaders ever really change (FastFieldReaders get cloned every time, but they only hold an inner Arc, so... idk?), so realization could potentially be done once, which would allow realization itself to be slower.

The advantage of this structure is that it allows the Optic language to be much more complex, and affords a large margin in the implementation of the execution (meaning said implementation is infinitely easier as it doesn't require nearly as much thinking). Since it removes Optic compilation from the code executed during a search, searches using optics can also be faster.

A possible language

Basing off of the existing language:

  1. A Matching is a condition that can either be true or false for a specific page
  2. A Matching can either be a composite of several other Matchings (more on that later), or a condition on a field of a page.
  3. An Action is something that impacts a search result, such as boosting/demoting or discarding.
  4. A Rule is a Matching and an Action. Only one action, because two different actions can always be combined into one rather easily.
  5. A program (mostly) outputs a list of Rules. Rules are applied independently.

These definitions effectively define all of the non-standard types of almost any possible Optic language: Matching, Action, Rule.
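As a rough Rust sketch of those three types (the variant names are taken from the examples later in this proposal, not from the actual stract source):

pub enum Matching {
    Site(String),       // a condition on a field of a page
    And(Vec<Matching>), // composites of other Matchings
    Or(Vec<Matching>),
    Not(Box<Matching>),
    Always,             // matches every page
}

pub enum Action {
    Boost(f64),
    Downrank(f64),
    Discard,
    Keep,
}

pub struct Rule {
    pub matching: Matching,
    pub action: Action,
}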

Notice that effectively all complex optics are essentially iterations over a list, which would typically be an array of strings. Often these are "iterated" over and turned into a single Matching with a straightforward map and combine. This is why I propose a functional-ish language for Optic, based on Rust's functional features. Here's a syntax example.

let urls = ["https://google.com", "https://yahoo.com"];

let search_engines = urls
    map url -> Match.Site(url)
    combine Match.And;

publish search_engines, Action.Discard;

Please note that I am very much open to better syntaxes.

This language has bools, ints, floats, strings, and lists. Since it's functional, functions are also types.

Note that all variables are immutable.

Standard library: List<I>:
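As a guess at what List<I> could expose (illustrative pseudo-signatures only; map and combine appear in the syntax example above, the rest are assumptions):

map(f: I -> O) -> List<O>
combine(f: (I, I) -> I) -> I
filter(f: I -> bool) -> List<I>
len() -> int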

Implementation-wise, many combinations of these can be simplified to simpler Tantivy queries, since a BooleanQuery is more general than just And, Or or Not.
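For instance, an And over a list of site matches doesn't need a nested tree of binary queries; a single tantivy BooleanQuery with Occur::Must clauses covers it. A rough sketch, with the field and term handling heavily simplified compared to stract's real site matching:

use tantivy::query::{BooleanQuery, Occur, Query, TermQuery};
use tantivy::schema::{Field, IndexRecordOption, Term};

// Sketch: turn an And over a list of site matches into one flat BooleanQuery
// instead of a tree of nested And nodes.
fn and_of_sites(site_field: Field, sites: &[&str]) -> BooleanQuery {
    let clauses: Vec<(Occur, Box<dyn Query>)> = sites
        .iter()
        .map(|site| {
            let term = Term::from_field_text(site_field, site);
            let q: Box<dyn Query> = Box::new(TermQuery::new(term, IndexRecordOption::Basic));
            (Occur::Must, q)
        })
        .collect();
    BooleanQuery::new(clauses)
}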

Action: Boost(value: float), Downrank(value: float), Discard, Keep.

publish is a global function/syntax sugar which takes a Matching and an Action, and applies them as a Rule. Boosts and Downranks combine, and Keep always takes priority over Discard.
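A small sketch of how those combination rules could be resolved when several rules match the same page; summing the boosts/downranks into one adjustment is an assumption here, not something the proposal pins down:

// Same Action enum as in the earlier sketch.
#[derive(Clone, Copy)]
pub enum Action {
    Boost(f64),
    Downrank(f64),
    Discard,
    Keep,
}

// Fold the actions of every rule that matched a page into a final outcome:
// None means the page is discarded, Some(x) means it is kept with score
// adjustment x. A single Keep overrides any number of Discards.
pub fn resolve(actions: &[Action]) -> Option<f64> {
    let mut adjustment = 0.0;
    let mut discarded = false;
    let mut kept = false;
    for action in actions {
        match action {
            Action::Boost(v) => adjustment += *v,
            Action::Downrank(v) => adjustment -= *v,
            Action::Discard => discarded = true,
            Action::Keep => kept = true,
        }
    }
    if discarded && !kept {
        None
    } else {
        Some(adjustment)
    }
}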

NOTE: this means DiscardNonMatching goes from magic to just publish Match.Always, Action.Discard; at the start. However it doesn't seem easy to define as a BooleanQuery. Maybe a custom Query type is necessary for this kind of linear combination. Hey, maybe that can be another kind of Match, linearly applying one rule after another. Food for thought idk.

Something to think about is how to allow large amounts of data to be input without mixing code and data. One possibility is to have some kind of "data section" at the end of the optic, which can contain constants, for dumping lists of websites. Something like data urls;, which promises that urls will be defined in the data section.
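A hypothetical example of what that could look like, reusing the syntax from above (the data declaration and the trailing data section are only a sketch):

data urls;

let search_engines = urls
    map url -> Match.Site(url)
    combine Match.And;

publish search_engines, Action.Discard;

data:
urls = ["https://google.com", "https://yahoo.com"];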

Liking and disliking websites become global functions like publish, taking a string. I don't like it much, but internally liking stuff is really distinct from rules, so... yeah.

mikkeldenker commented 4 months ago

First of all, this is an incredibly well thought out proposal! I very much agree that the ergonomics of optics could be much better.

We have a bit of an odd requirement for optics in the sense that we need to make sure the language is not Turing complete, and that we can prove that any program written in it halts. Otherwise, someone could easily take down the production servers. This imposes a fair number of constraints on the features we can allow directly in the language. I would therefore propose that we essentially use the current structure as 'assembly instructions' and build some nice libraries in existing languages (Python, Prolog, Haskell etc.) that in a way compile the optic. The compiled optic file can then be changed to be a gzipped binary encoding of the current optic structure.
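As an illustration of the "compiled optic" idea, serializing the current optic structure and gzipping it could look roughly like this in Rust; the serde/bincode/flate2 choices and the Optic placeholder type are assumptions, not what stract does today:

use std::io::Write;

use flate2::write::GzEncoder;
use flate2::Compression;
use serde::Serialize;

// Placeholder for the current parsed optic structure; the real type lives in
// the optics crate and would need to derive Serialize.
#[derive(Serialize)]
struct Optic {
    rules: Vec<String>, // stand-in for the real rule representation
}

// Encode the optic as a gzipped binary blob that a client library
// (Python, Prolog, Haskell, ...) could produce and upload instead of source text.
fn compile_optic(optic: &Optic) -> std::io::Result<Vec<u8>> {
    let bytes = bincode::serialize(optic).expect("serializable structure");
    let mut encoder = GzEncoder::new(Vec::new(), Compression::default());
    encoder.write_all(&bytes)?;
    encoder.finish()
}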

I think this approach would have multiple benefits over the approach we use today

Perhaps we should also go ahead and remove the ranking signal adjustments from optics and expose them as a HashMap<Signal, f64> parameter in the API instead. I think this is something that hardly anyone would use in optics anyway, and it makes the optic too tightly coupled with the search engine that executes it (which signals are allowed? what are their value distributions? etc.).

I agree that we should internally move as much as possible out from the core crate into optics so that core just gets a set of tantivy queries to execute. This would indeed make it way easier to maintain and remove the magic around DiscardNonMatching as you proposed.

What do you think about this? I don't really know what the syntax for the libraries should be yet, but I think we can come up with something that's really nice to use.

(huge credits to @oeb25 here)

wezm commented 4 months ago

We have a bit of an odd requirement for optics in the sense that we need to make sure it is not turing complete

This sounds a lot like configuration languages like Nickel, Dhall, Starlark and others. Would it be worth adopting one of these existing languages that already support many/all of the features described in the top post?