alexklibisz / elastiknn

Elasticsearch plugin for nearest neighbor search. Store vectors and run similarity search using exact and approximate algorithms.
https://alexklibisz.github.io/elastiknn
Apache License 2.0
368 stars 48 forks source link

Lucene benchmarks for Big-ANN challenge #278

Closed alexklibisz closed 2 years ago

alexklibisz commented 3 years ago

Implement a benchmarking suite for the Big-ANN benchmarks challenge.

alexklibisz commented 2 years ago

Despite the lack of checkmarks, there has been some progress here.

I wrote up a Gist describing the development setup fo this project: https://gist.github.com/alexklibisz/d3e47e30d3c7468f3ee56844e5ee6f7a

The latest metrics are about 10 QPS on the deep-10M dataset running on my laptop using the setup described in the Gist.

I'll continue experimenting with this as time permits.

alexklibisz commented 2 years ago

One interesting avenue for optimization is Index Sorting. I'll document it here for posterity.

In September I attended Adrien Grand's talk "Speed up your Lucene queries by avoiding searching" at the 2021 Virtual ApacheCon.

Index Sorting was one of the methods that Adrien covered. In short, Index Sorting allows the user to customize how a Lucene index sorts documents stored in its segments. For example, an application might sort documents by date if they are always returned in chronological order.

How does Index Sorting apply to vector search? My current hypothesis is that Index Sorting can be used to more intelligently assign vectors to shards. Specifically, if the vector-to-shard assignment can roughly maintain spatial proximity of vectors, we should be able to minimize the number of segments which must be visited find neighbors.

The initial experiment for Index Sorting involved sorting the vectors by two values:

  1. Distance to from the vector to the origin (vector of all zeros).
  2. Cosine of the angle between the vector and the origin vector.

Sorting the index by these two values led to a ~60% query-time speedup on one of the ten million vector datasets. More testing and optimization is needed, but it looks like a fruitful direction.

alexklibisz commented 2 years ago

Closing this. Will open another, more narrowly-scoped and achievable, ticket. I transcribed the idea from my previous comment to a Github discussion: https://github.com/alexklibisz/elastiknn/discussions/375