blevesearch / bleve

A modern text/numeric/geo-spatial/vector indexing library for go
Apache License 2.0
10.13k stars 688 forks source link

optimize +scoring/+location conjunctions by sharing a single roaring iterator #1112

Open steveyen opened 5 years ago

steveyen commented 5 years ago

Tracking an idea here before I forget it. Based on some of the optimization proposals happening over at https://github.com/blevesearch/bleve/pull/1110, I was thinking or wondering if there might be a way to optimize the handling of with-scoring, locations-included conjunctions of scorch term searchers.

Currently, in the classic searcher-conjunction handling codepaths, each of the leaf scorch term searchers have their own roaring iterator instances, which they separately Next() through. That is, if there are N term-searchers in a conjunction, there will be N roaring iterators, with N calls to roaring-iterator.Next() when parent conjunction Next() moves forward one hit.

Instead, we might be able to share a single roaring iterator here for all the N child term-searchers, since these optimized term searcher all will be sharing the same, computed, AND'ed-together "actual" bitmap instance.

Although that'll help for the "actual" side of the house, there would still need to be N roaring-iterators, though, for the "all" bitmaps that are needed to maintain the freq-norm and term-vector readers.

It might also be possible that this idea of a +scoring/+location conjunction optimization might be unify-able with the proposed "unadorned" (-scoring/-location) conjunction codepaths in PR #1110 , but haven't thought that far yet.

steveyen commented 5 years ago

Latest thinking / notes on this, from digging around...

https://github.com/blevesearch/bleve/blob/master/search/scorer/scorer_conjunction.go#L53

Perhaps the non-scoring conjunction of term-searchers is just an edge case of two more slightly general things...

steveyen commented 5 years ago

After more thoughts, looks like it (e.g., a "uniform scoring conjunction") is not going to work without some significant API "plumbing" changes or would only work in a narrow edge case of "keyword" type of analyzers.

Imagine a field had text like... "aa aa bb cc"

Then, when there's a conjunction search like, "+aa +bb +cc", the multiple roaring bitmaps for those terms can by AND'ed together (the conjunction optimization).

But, while Next()'ing through iterators, the "aa" term would have frequency 2, and "bb and "cc" would have frequency 1. So, a simple "uniform" scoring factor or multiplier wouldn't work in this case.

And, the existing TermFieldReader interface doesn't have enough way to pass that kind of signal upstairs -- it might have to be some new kind of interface like a MultiTermFieldReader, which feels like going too far.

Based on that, I think I'll push this chain of thought to the backburner.