Implement the following index optimizations:

Let's suppose we have the following query: `mysterious flying dragonfly`. Let's also assume that `mysterious` is contained within 1000 documents, `flying` within 500 and lastly `dragonfly` within 10.
The default order would first compute the intersection of `mysterious` and `flying`, producing a possibly large result set with hundreds of entries. Only then is it intersected with the much smaller result set of the `dragonfly` term and reduced to fewer than 10 entries.
Instead, a pre-traversal of the search tree should fetch the sizes of the expected result sets. It's still open whether this should be done at any depth, and whether it should be done for large numeric indices, where search time is logarithmic. Another risk is paying for an expensive pre-traversal only to find out that all result sets are small, so the order doesn't matter at all.
The correct execution order of the query should be: fetch all 10 suspected dragonflies first and only then intersect them with the larger result sets. That way we never pass around large intermediate results.
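A minimal sketch of that ordering (the function name and the `HashSet<u64>` representation are assumptions for illustration, not the project's actual types): sort the per-term result sets by size and intersect starting from the smallest one, so intermediate results can only shrink.

```rust
use std::collections::HashSet;

// Hypothetical planner step: intersect the smallest posting lists first.
// `postings` would come from the pre-traversal that asks each node for
// the size of its expected result set.
fn intersect_smallest_first(mut postings: Vec<HashSet<u64>>) -> HashSet<u64> {
    // Start with the smallest set so the intermediate result never grows.
    postings.sort_by_key(|set| set.len());

    let mut iter = postings.into_iter();
    let mut result = match iter.next() {
        Some(first) => first,
        None => return HashSet::new(),
    };
    for set in iter {
        // Probe the larger sets only for the few ids that are still alive.
        result.retain(|id| set.contains(id));
        if result.is_empty() {
            break; // nothing left to intersect
        }
    }
    result
}
```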
Efficient negation
Currently negation is implemented in the most straightforward manner: take the complement of the current result set with respect to the set of all documents (literally negate it).
This makes handling large queries problematic. Let's suppose we have a million documents and the following query is executed: `dragonfly -insect`. The term `dragonfly` still matches our good old 10 documents, `insect` matches 100, so the result set of `-insect` is 999'900 entries 😱
Instead, we should have a more sophisticated result class with a negation flag. The search algorithm will know how to use it, so instead of computing the intersection of 10 and 999'900 values, it will filter the negated values out of the current result set.
It's not clear to what depth this should be supported (at first, only for the first level). Juggling negations around is also a little cumbersome, but surely worth it.
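A rough sketch of what such a result type could look like (the names are illustrative, not the existing API): the negated variant carries the ids to exclude, and applying it filters the current result set instead of materializing the huge complement.

```rust
use std::collections::HashSet;

// Hypothetical result type: either a plain set of matching ids, or a set
// of ids that must be excluded (the negation flag).
enum TermResult {
    Matches(HashSet<u64>),
    Negated(HashSet<u64>),
}

// Applying a negated term filters the current (small) result set instead
// of intersecting it with a ~999'900-entry complement.
fn apply(current: HashSet<u64>, term: TermResult) -> HashSet<u64> {
    match term {
        TermResult::Matches(ids) => {
            current.into_iter().filter(|id| ids.contains(id)).collect()
        }
        TermResult::Negated(ids) => {
            current.into_iter().filter(|id| !ids.contains(id)).collect()
        }
    }
}
```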
Copy elision
Some indices, like numeric indices, need to generate their result set dynamically (find the boundaries in a sorted set and extract the entries). Text indices store the result in a ready-to-use format. Instead of returning an owned list of ids, we can return just a reference to it (make the result set behave much like Rust's `Cow<>`).
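A small sketch of the idea using `Cow<>` directly (the index types and fields here are made up for illustration): the text index hands out a borrowed slice of what it already stores, while the numeric index builds its ids on the fly and hands out an owned vector; callers treat both uniformly.

```rust
use std::borrow::Cow;

struct TextIndex {
    ids: Vec<u64>, // postings already stored in a ready-to-use format
}

struct NumericIndex {
    sorted: Vec<(i64, u64)>, // (value, document id), sorted by value
}

impl TextIndex {
    // No copy: just hand out a reference to the stored list.
    fn lookup(&self) -> Cow<'_, [u64]> {
        Cow::Borrowed(&self.ids)
    }
}

impl NumericIndex {
    // Find the boundaries in the sorted set and extract the entries,
    // so this result has to be owned.
    fn range(&self, lo: i64, hi: i64) -> Cow<'_, [u64]> {
        let start = self.sorted.partition_point(|&(v, _)| v < lo);
        let end = self.sorted.partition_point(|&(v, _)| v <= hi);
        Cow::Owned(self.sorted[start..end].iter().map(|&(_, id)| id).collect())
    }
}
```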