cozydev-pink / protosearch

prototype search library in pure scala
https://cozydev-pink.github.io/protosearch/
Apache License 2.0

WIP Add support for positional indexes #166

Closed valencik closed 7 months ago

valencik commented 9 months ago

Phrase search and positional indexing!

This PR introduces a new index type PositionalIndex and renames the old single index to FrequencyIndex. Both of these now implement Index which is what most of the supporting classes like MultiIndex, IndexSearcher and Scorer use.

A PositionalIndex, using its PositionalPostingsLists, keeps track of the position at which each term occurred in a document. This can then be used at query time to ensure that term matches occur one after another and thus form an exact phrase.

Of note, this shouldn't be too hard to extend to support "slop", but that will be left for another PR.
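
To make the position check concrete, here is a minimal sketch of exact phrase matching over positional postings. It is a simplified model and not the PositionalIndex or PositionalPostingsList code from this PR: the `PositionalSketch` object, the `Postings` alias, and the `tokenize`, `build`, and `phrase` helpers are all hypothetical names, and tokenization is assumed to be a plain lowercase whitespace split.

```scala
// Hypothetical, simplified model of positional postings.
// None of these names come from protosearch itself.
object PositionalSketch {
  // term -> (docId -> ascending positions of the term in that document)
  type Postings = Map[String, Map[Int, Vector[Int]]]

  // Assumption: lowercase whitespace tokenization, no stemming or stop words.
  def tokenize(text: String): Vector[String] =
    text.toLowerCase.split("\\s+").filter(_.nonEmpty).toVector

  // Record (term, docId, position) for every token, then group into postings.
  def build(docs: Vector[String]): Postings = {
    val entries: Vector[(String, Int, Int)] =
      docs.zipWithIndex.flatMap { case (doc, docId) =>
        tokenize(doc).zipWithIndex.map { case (term, pos) => (term, docId, pos) }
      }
    entries.groupBy(_._1).map { case (term, es) =>
      term -> es.groupBy(_._2).map { case (docId, ps) => docId -> ps.map(_._3) }
    }
  }

  // A document matches when some occurrence of the first term at position p
  // is followed by the i-th remaining term at position p + i + 1.
  def phrase(index: Postings, terms: Vector[String]): Set[Int] = {
    val lists = terms.map(t => index.getOrElse(t, Map.empty[Int, Vector[Int]]))
    lists.headOption match {
      case None => Set.empty[Int]
      case Some(first) =>
        first.collect {
          case (docId, starts) if starts.exists { start =>
                lists.tail.zipWithIndex.forall { case (postings, i) =>
                  postings.get(docId).exists(_.contains(start + i + 1))
                }
              } =>
            docId
        }.toSet
    }
  }
}
```

With documents "the quick brown fox", "brown quick the fox", and "a quick brown dog", the phrase "quick brown" matches the first and third documents only, because in the second the two terms appear in the wrong order. Under this model, slop would amount to relaxing the `start + i + 1` equality to a bounded range.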

Because this is the first new type of Index structure, a lot of things needed to be changed to support this. Here's a brief review of the components:

Index

The Index trait is new and was needed to enable other components to support both PositionalIndex and FrequencyIndex. Additionally, a lot of the term lookup handling has been broken out into a new TermDictionary that is used by both indexes.

IndexSearch

This used to be called BooleanRetrieval, which was a rough name. The IndexSearch is where we take in a Query and perform a search against some Index. We do this by traversing the Query and pattern matching on the various clauses. When we hit a clause like Phrase, we choose an implementation based on the type of Index we have.
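
The traversal described above could look roughly like the following sketch. The `Query` and `Index` types here are cut-down stand-ins that only borrow names from this PR; their fields, the `docsWith` method, and the choice to return a `Left` when a Phrase clause meets a FrequencyIndex are assumptions made for illustration, not the behaviour of the real IndexSearch.

```scala
// Hypothetical stand-ins; the real Query and Index types in protosearch differ.
object SearchSketch {
  sealed trait Query
  final case class Term(value: String) extends Query
  final case class Phrase(terms: List[String]) extends Query
  final case class And(left: Query, right: Query) extends Query
  final case class Or(left: Query, right: Query) extends Query

  sealed trait Index { def docsWith(term: String): Set[Int] }

  // term -> (docId -> term frequency)
  final case class FrequencyIndex(postings: Map[String, Map[Int, Int]]) extends Index {
    def docsWith(term: String): Set[Int] =
      postings.getOrElse(term, Map.empty[Int, Int]).keySet
  }

  // term -> (docId -> positions)
  final case class PositionalIndex(postings: Map[String, Map[Int, Vector[Int]]]) extends Index {
    def docsWith(term: String): Set[Int] =
      postings.getOrElse(term, Map.empty[Int, Vector[Int]]).keySet

    // Documents in which the terms occur at consecutive positions.
    def phrase(terms: List[String]): Set[Int] = terms match {
      case Nil => Set.empty[Int]
      case head :: tail =>
        postings.getOrElse(head, Map.empty[Int, Vector[Int]]).collect {
          case (docId, starts) if starts.exists { p =>
                tail.zipWithIndex.forall { case (t, i) =>
                  postings.get(t).flatMap(_.get(docId)).exists(_.contains(p + i + 1))
                }
              } =>
            docId
        }.toSet
    }
  }

  // Traverse the query, pattern matching on each clause. Phrase is the one
  // clause whose implementation depends on the concrete index type.
  def search(index: Index, query: Query): Either[String, Set[Int]] = query match {
    case Term(t) => Right(index.docsWith(t))
    case And(l, r) =>
      for { a <- search(index, l); b <- search(index, r) } yield a.intersect(b)
    case Or(l, r) =>
      for { a <- search(index, l); b <- search(index, r) } yield a.union(b)
    case Phrase(ts) =>
      index match {
        case p: PositionalIndex => Right(p.phrase(ts))
        // Assumption: without positions we report an error rather than guess.
        case _: FrequencyIndex => Left("phrase queries need a positional index")
      }
  }
}
```

Keeping both index types behind one sealed trait means the Term, And, and Or cases never need to know which index they are talking to; only the Phrase case has to inspect it.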

Scorer

Unfortunately the current implementation of Scorer is really a bunch of tech debt. I would like to completely rebuild it in the future to not require a second pass over the query and index. Because of this, it has only minimally changed to support Phrase scoring.

MultiIndex

Users build an index through MultiIndex, which allows them to specify multiple Fields. It is in the Field specification that we determine whether or not to index positions.
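
As a rough illustration of a per-field choice between the two index kinds, the sketch below indexes the text of a single field either with positions or with frequencies only. `FieldSpec`, its `positions` flag, `FieldIndex`, and `indexField` are hypothetical names invented for this example; the actual Field and MultiIndex APIs are not shown here and may look quite different.

```scala
// Hypothetical field description; not the Field type from protosearch.
object FieldSketch {
  final case class FieldSpec(name: String, positions: Boolean)

  sealed trait FieldIndex
  // term -> number of occurrences in the field text
  final case class Frequencies(byTerm: Map[String, Int]) extends FieldIndex
  // term -> positions of each occurrence in the field text
  final case class Positions(byTerm: Map[String, Vector[Int]]) extends FieldIndex

  // Assumption: one document, lowercase whitespace tokenization.
  def indexField(spec: FieldSpec, text: String): FieldIndex = {
    val tokens = text.toLowerCase.split("\\s+").filter(_.nonEmpty).toVector
    if (spec.positions)
      Positions(tokens.zipWithIndex.groupBy(_._1).map { case (t, ps) => t -> ps.map(_._2) })
    else
      Frequencies(tokens.groupBy(identity).map { case (t, ts) => t -> ts.size })
  }
}
```

Fields that never need phrase search can keep the smaller frequency-only structure, while fields such as a title or body can opt in to positions.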