This PR introduces a new index type, PositionalIndex, and renames the old single index to FrequencyIndex.
Both now implement Index, which is what most of the supporting classes, such as MultiIndex, IndexSearcher and Scorer, use.
A PositionalIndex, using its PositionalPostingsLists, keeps track of the positions at which each term occurs in a document.
This can then be used at query time to ensure that term matches occur one after another and thus form an exact phrase.
Of note, this shouldn't be too hard to extend to support "slop", but that will be left for another PR.
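As a rough illustration of the exact-phrase check described above, here is a minimal sketch. The `PositionalPostings` alias and the `phrase_match` function are hypothetical names invented for this example and are not the types used in this PR. It assumes each term's positions are stored sorted per document.

```rust
use std::collections::HashMap;

/// Hypothetical positional postings for one term:
/// document id -> sorted positions at which the term occurs.
type PositionalPostings = HashMap<u32, Vec<u32>>;

/// Returns the ids of documents in which the terms occur at consecutive
/// positions. `postings[i]` holds the postings of the i-th phrase term.
fn phrase_match(postings: &[PositionalPostings]) -> Vec<u32> {
    let (first, rest) = match postings.split_first() {
        Some(parts) => parts,
        None => return Vec::new(),
    };
    let mut hits: Vec<u32> = first
        .iter()
        .filter(|(doc, starts)| {
            // The phrase starts at `start` if the i-th following term
            // occurs at `start + i + 1` in the same document.
            starts.iter().any(|&start| {
                rest.iter().enumerate().all(|(i, p)| {
                    p.get(*doc).map_or(false, |positions| {
                        positions.binary_search(&(start + i as u32 + 1)).is_ok()
                    })
                })
            })
        })
        .map(|(doc, _)| *doc)
        .collect();
    hits.sort_unstable();
    hits
}
```

Supporting "slop" would presumably mean relaxing the `start + i + 1` equality into a bounded distance, which is why it should fit into the same structure.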
Because this is the first new type of Index structure, a lot of things needed to be changed to support this.
Here's a brief review of the components:
Index
The Index trait is new and was needed to enable other components to support both PositionalIndex and FrequencyIndex.
Additionally, a lot of the term lookup handling has been broken out into a new TermDictionary that is used by both indexes.
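To make the shape of this concrete, here is a simplified sketch of two index types sharing a term dictionary behind one trait. The method names (`add_document`, `term_frequency`, `has_positions`) and the map-based storage are assumptions made for illustration; the actual trait and postings layout in this PR may differ.

```rust
use std::collections::HashMap;

/// Hypothetical term lookup shared by both index types: term -> dense id.
#[derive(Default)]
struct TermDictionary {
    ids: HashMap<String, usize>,
}

impl TermDictionary {
    /// Returns the id for `term`, assigning the next free id on first sight.
    fn get_or_insert(&mut self, term: &str) -> usize {
        let next = self.ids.len();
        *self.ids.entry(term.to_string()).or_insert(next)
    }

    fn get(&self, term: &str) -> Option<usize> {
        self.ids.get(term).copied()
    }
}

/// Hypothetical common interface used by the supporting components.
trait Index {
    fn add_document(&mut self, doc_id: u32, terms: &[&str]);
    /// Number of occurrences of `term` in document `doc_id`.
    fn term_frequency(&self, term: &str, doc_id: u32) -> u32;
    /// Whether exact phrase queries can be answered from this index.
    fn has_positions(&self) -> bool;
}

/// Stores only a count per (term id, document id).
#[derive(Default)]
struct FrequencyIndex {
    dictionary: TermDictionary,
    postings: HashMap<(usize, u32), u32>,
}

/// Stores every position per (term id, document id).
#[derive(Default)]
struct PositionalIndex {
    dictionary: TermDictionary,
    postings: HashMap<(usize, u32), Vec<u32>>,
}

impl Index for FrequencyIndex {
    fn add_document(&mut self, doc_id: u32, terms: &[&str]) {
        for term in terms {
            let id = self.dictionary.get_or_insert(term);
            *self.postings.entry((id, doc_id)).or_insert(0) += 1;
        }
    }

    fn term_frequency(&self, term: &str, doc_id: u32) -> u32 {
        self.dictionary
            .get(term)
            .and_then(|id| self.postings.get(&(id, doc_id)).copied())
            .unwrap_or(0)
    }

    fn has_positions(&self) -> bool {
        false
    }
}

impl Index for PositionalIndex {
    fn add_document(&mut self, doc_id: u32, terms: &[&str]) {
        for (position, term) in terms.iter().enumerate() {
            let id = self.dictionary.get_or_insert(term);
            self.postings
                .entry((id, doc_id))
                .or_default()
                .push(position as u32);
        }
    }

    fn term_frequency(&self, term: &str, doc_id: u32) -> u32 {
        self.dictionary
            .get(term)
            .and_then(|id| self.postings.get(&(id, doc_id)))
            .map_or(0, |positions| positions.len() as u32)
    }

    fn has_positions(&self) -> bool {
        true
    }
}

/// Code written against the trait works with either index type.
fn total_frequency(index: &dyn Index, term: &str, docs: &[u32]) -> u32 {
    docs.iter().map(|&doc| index.term_frequency(term, doc)).sum()
}
```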
IndexSearch
This used to be called BooleanRetrieval, which was a rough name.
The IndexSearch is where we take in a Query and perform a search against some Index.
We do this by traversing the Query and pattern matching on the various clauses.
When we hit a clause like Phrase, we choose an implementation based on the type of Index we have.
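The traversal can be pictured roughly as follows. This is a toy sketch: `ToyIndex`, `IndexKind`, and every clause other than `Phrase` are hypothetical names, and the fallback used for a `Phrase` on a frequency-only index (require every term, ignore order) is an assumption, not necessarily what this PR does.

```rust
use std::collections::BTreeSet;

/// Hypothetical query tree; only `Phrase` is named in the PR description.
enum Query {
    Term(String),
    Phrase(Vec<String>),
    And(Vec<Query>),
    Or(Vec<Query>),
}

/// Stand-in for the two index types.
enum IndexKind {
    Frequency,
    Positional,
}

/// Toy index that keeps each document as its list of terms.
struct ToyIndex {
    kind: IndexKind,
    docs: Vec<Vec<String>>,
}

impl ToyIndex {
    fn docs_with_term(&self, term: &str) -> BTreeSet<usize> {
        self.docs
            .iter()
            .enumerate()
            .filter(|(_, doc)| doc.iter().any(|t| t == term))
            .map(|(id, _)| id)
            .collect()
    }

    /// Exact phrase: the terms appear as one contiguous window.
    fn docs_with_phrase(&self, terms: &[String]) -> BTreeSet<usize> {
        self.docs
            .iter()
            .enumerate()
            .filter(|(_, doc)| {
                !terms.is_empty() && doc.windows(terms.len()).any(|w| w == terms)
            })
            .map(|(id, _)| id)
            .collect()
    }
}

/// Intersects all sets produced by the iterator (empty input -> empty set).
fn intersect(mut sets: impl Iterator<Item = BTreeSet<usize>>) -> BTreeSet<usize> {
    let first = sets.next().unwrap_or_default();
    sets.fold(first, |acc, s| acc.intersection(&s).copied().collect())
}

/// Walks the query and pattern matches on each clause.
fn search(index: &ToyIndex, query: &Query) -> BTreeSet<usize> {
    match query {
        Query::Term(term) => index.docs_with_term(term),
        Query::Phrase(terms) => match index.kind {
            IndexKind::Positional => index.docs_with_phrase(terms),
            // Assumed fallback: without positions, require all terms.
            IndexKind::Frequency => {
                intersect(terms.iter().map(|t| index.docs_with_term(t)))
            }
        },
        Query::And(clauses) => intersect(clauses.iter().map(|c| search(index, c))),
        Query::Or(clauses) => clauses.iter().flat_map(|c| search(index, c)).collect(),
    }
}
```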
Scorer
Unfortunately, the current implementation of Score is really a bunch of tech debt.
I would like to completely rebuild it in the future so that it does not require a second pass over the query and index.
Because of this, it has changed only minimally to support Phrase scoring.
MultiIndex
Users build an index through MultiIndex, which allows them to specify multiple Fields.
It is in the Field specification that we determine whether or not to index positions.
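One way to picture the per-field choice is the sketch below. The `positions` flag, the `IndexKind` enum, and the `kind_of` lookup are hypothetical and exist only to illustrate the idea; the real Field and MultiIndex APIs in this PR may look different.

```rust
use std::collections::HashMap;

/// Which index structure backs a field (names are illustrative).
#[derive(Debug, PartialEq)]
enum IndexKind {
    Frequency,
    Positional,
}

/// Hypothetical field specification with a flag for positional indexing.
struct Field {
    name: &'static str,
    positions: bool,
}

/// Hypothetical multi-field index that picks an index kind per field.
struct MultiIndex {
    kinds: HashMap<&'static str, IndexKind>,
}

impl MultiIndex {
    fn new(fields: &[Field]) -> Self {
        let kinds = fields
            .iter()
            .map(|field| {
                let kind = if field.positions {
                    IndexKind::Positional
                } else {
                    IndexKind::Frequency
                };
                (field.name, kind)
            })
            .collect();
        MultiIndex { kinds }
    }

    /// Returns the index kind chosen for `field`, if the field exists.
    fn kind_of(&self, field: &str) -> Option<&IndexKind> {
        self.kinds.get(field)
    }
}
```

With a layout like this, only fields that need phrase search pay the storage cost of positions.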
Phrase search and positional indexing!