Implement the following index optimizations:

Let's suppose we have the following query: `mysterious flying dragonfly`. Let's also assume that `mysterious` is contained within 1000 documents, `flying` within 500 and lastly `dragonfly` within 10.
The default order would first compute the intersection of `mysterious` and `flying`, producing a possibly large result set with hundreds of entries. Only then is it intersected with the much smaller result set of the `dragonfly` term and reduced to fewer than 10 entries.
Instead, a pre-traversal of the search tree should fetch the sizes of the expected result sets. It's still open whether this should be done at any depth, and whether it should be done for large numeric indices, where search time is logarithmic. Another risk is paying for an expensive pre-traversal only to find out that all result sets are small, so the order doesn't matter at all.
The correct execution order of the query should be: fetch all 10 suspected dragonflies first and only then intersect them with the larger result sets. That way we never pass around large intermediate results.
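A minimal sketch of that ordering (the function name and the `HashSet<u64>` representation are assumptions for illustration, not the project's actual types): sort the per-term result sets by size and intersect starting from the smallest one, so intermediate results can only shrink.

```rust
use std::collections::HashSet;

// Hypothetical planner step: intersect the smallest posting lists first.
// `postings` would come from the pre-traversal that asks each node for
// the size of its expected result set.
fn intersect_smallest_first(mut postings: Vec<HashSet<u64>>) -> HashSet<u64> {
    // Start with the smallest set so the intermediate result never grows.
    postings.sort_by_key(|set| set.len());

    let mut iter = postings.into_iter();
    let mut result = match iter.next() {
        Some(first) => first,
        None => return HashSet::new(),
    };
    for set in iter {
        // Probe the larger sets only for the few ids that are still alive.
        result.retain(|id| set.contains(id));
        if result.is_empty() {
            break; // nothing left to intersect
        }
    }
    result
}
```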
Efficient negation
Currently negation is implemented in the most straightforward manner: take the complement of the current result set with respect to the set of all documents (literally negate it).
This makes handling large queries problematic. Let's suppose we have a million documents and the following query is executed: `dragonfly -insect`. The term `dragonfly` still matches our good old 10 documents, `insect` matches 100, so the result set of `-insect` is 999'900 entries 😱
Instead, we should have a more sophisticated result class with a negation flag. The search algorithm will know how to use it, so instead of computing the intersection of 10 and 999'900 values, it will filter the negated values out of the current result set.
It's not clear to what depth this should be supported (at first, only for the first level). Juggling negations around is also a little cumbersome, but surely worth it.
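A rough sketch of what such a result type could look like (the names are illustrative, not the existing API): the negated variant carries the ids to exclude, and applying it filters the current result set instead of materializing the huge complement.

```rust
use std::collections::HashSet;

// Hypothetical result type: either a plain set of matching ids, or a set
// of ids that must be excluded (the negation flag).
enum TermResult {
    Matches(HashSet<u64>),
    Negated(HashSet<u64>),
}

// Applying a negated term filters the current (small) result set instead
// of intersecting it with a ~999'900-entry complement.
fn apply(current: HashSet<u64>, term: TermResult) -> HashSet<u64> {
    match term {
        TermResult::Matches(ids) => {
            current.into_iter().filter(|id| ids.contains(id)).collect()
        }
        TermResult::Negated(ids) => {
            current.into_iter().filter(|id| !ids.contains(id)).collect()
        }
    }
}
```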
Copy elision
Some indices, like numeric indices, need to generate their result set dynamically (find the boundaries in a sorted set and extract the entries). Text indices store the result in a ready-to-use format. Instead of returning an owned list of ids, we can return just a reference to it (make the result set behave much like Rust's `Cow<>`).
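A small sketch of the idea using `Cow<>` directly (the index types and fields here are made up for illustration): the text index hands out a borrowed slice of what it already stores, while the numeric index builds its ids on the fly and hands out an owned vector; callers treat both uniformly.

```rust
use std::borrow::Cow;

struct TextIndex {
    ids: Vec<u64>, // postings already stored in a ready-to-use format
}

struct NumericIndex {
    sorted: Vec<(i64, u64)>, // (value, document id), sorted by value
}

impl TextIndex {
    // No copy: just hand out a reference to the stored list.
    fn lookup(&self) -> Cow<'_, [u64]> {
        Cow::Borrowed(&self.ids)
    }
}

impl NumericIndex {
    // Find the boundaries in the sorted set and extract the entries,
    // so this result has to be owned.
    fn range(&self, lo: i64, hi: i64) -> Cow<'_, [u64]> {
        let start = self.sorted.partition_point(|&(v, _)| v < lo);
        let end = self.sorted.partition_point(|&(v, _)| v <= hi);
        Cow::Owned(self.sorted[start..end].iter().map(|&(_, id)| id).collect())
    }
}
```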