alexklibisz / elastiknn

Elasticsearch plugin for nearest neighbor search. Store vectors and run similarity search using exact and approximate algorithms.
https://alexklibisz.github.io/elastiknn
Apache License 2.0

Try caching the query vector's FloatVector segments when computing distance #604

Closed · alexklibisz closed this 9 months ago

alexklibisz commented 9 months ago

Background

Here is the l2Distance method from PanamaFloatVectorOps:

https://github.com/alexklibisz/elastiknn/blob/2c20ce595aec19954b58a949fbd5cf4b6123590f/elastiknn-models/src/main/java/com/klibisz/elastiknn/vectors/PanamaFloatVectorOps.java#L71-L89

Note how lines 77 and 78 instantiate a FloatVector from an array. We do this over several iterations because each FloatVector needs to have a specific length (the species length) in order to leverage SIMD operations.

v1 and pv1 correspond to the query vector. v2 and pv2 correspond to the vector stored in the index. In a typical query, we compute l2Distance for exactly one query vector against many stored vectors.
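
For context, here is a simplified sketch of what a method like this looks like with the Panama Vector API (jdk.incubator.vector, so it requires --add-modules jdk.incubator.vector). The class name and some details (e.g., whether the square root is taken) are assumptions, not the exact code from the linked file:

```java
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

public final class L2Sketch {
    private static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

    // Roughly the shape of the linked method: walk both arrays in
    // SPECIES-length chunks, accumulating (v1[i] - v2[i])^2 with SIMD,
    // then finish the remainder with a scalar tail loop.
    public static double l2Distance(float[] v1, float[] v2) {
        FloatVector acc = FloatVector.zero(SPECIES);
        int i = 0;
        int bound = SPECIES.loopBound(v1.length);
        for (; i < bound; i += SPECIES.length()) {
            // These fromArray calls are analogous to lines 77-78 in the
            // linked code: each copies a chunk of the array into a vector.
            FloatVector pv1 = FloatVector.fromArray(SPECIES, v1, i);
            FloatVector pv2 = FloatVector.fromArray(SPECIES, v2, i);
            FloatVector diff = pv1.sub(pv2);
            acc = diff.fma(diff, acc); // acc += diff * diff, lane-wise
        }
        double sum = acc.reduceLanes(VectorOperators.ADD);
        for (; i < v1.length; i++) { // scalar tail for leftover elements
            double d = v1[i] - v2[i];
            sum += d * d;
        }
        // Assumption: the real method may return the squared distance instead.
        return Math.sqrt(sum);
    }
}
```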

So, we should be able to memoize or cache an array of segments for v1 and then avoid re-copying v1 into pv1, i.e., avoid line 77 in the method.
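
As a hypothetical illustration of that caching idea (the actual first pass is in #605 below; this class and its name are illustrative assumptions, not that PR's code):

```java
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

// Hypothetical sketch: slice the query vector into FloatVector segments once,
// then reuse them against every stored vector, skipping the per-call
// fromArray copy for v1.
public final class CachedQueryVector {
    private static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

    private final float[] v1;
    private final FloatVector[] segments; // pre-built pv1 chunks

    public CachedQueryVector(float[] v1) {
        this.v1 = v1;
        int bound = SPECIES.loopBound(v1.length);
        this.segments = new FloatVector[bound / SPECIES.length()];
        for (int i = 0; i < bound; i += SPECIES.length()) {
            segments[i / SPECIES.length()] = FloatVector.fromArray(SPECIES, v1, i);
        }
    }

    // Same loop as the baseline, but pv1 comes from the cache.
    public double l2Distance(float[] v2) {
        FloatVector acc = FloatVector.zero(SPECIES);
        int i = 0;
        for (FloatVector pv1 : segments) {
            FloatVector pv2 = FloatVector.fromArray(SPECIES, v2, i);
            FloatVector diff = pv1.sub(pv2);
            acc = diff.fma(diff, acc);
            i += SPECIES.length();
        }
        double sum = acc.reduceLanes(VectorOperators.ADD);
        for (; i < v1.length; i++) {
            double d = v1[i] - v2[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }
}
```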

It's unclear whether this will have any positive effect, but it's worth a quick try.

Deliverables

Related Issues

No response

alexklibisz commented 9 months ago

I took a first pass at implementing this in #605. The benchmark looks promising on my M1 MacBook, but not so much on my EC2 r6i.8xlarge instance.

On the M1 MacBook:

[info] Benchmark                            Mode  Cnt        Score       Error  Units
[info] CachedL2DistanceBenchmark.baseline  thrpt    6  2584985.052 ±  9945.960  ops/s
[info] CachedL2DistanceBenchmark.cached    thrpt    6  4461915.029 ± 43007.018  ops/s

On the EC2 r6i.8xlarge:

[info] Benchmark                            Mode  Cnt        Score        Error  Units
[info] CachedL2DistanceBenchmark.baseline  thrpt    6  2939742.574 ± 273069.456  ops/s
[info] CachedL2DistanceBenchmark.cached    thrpt    6  2698700.860 ±  13063.707  ops/s

It seems that on the M1 we actually benefit from pre-computing the chunks and fetching them from memory, but not so much on the standard x86 architecture of the EC2 instance.
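
For reference, a baseline-vs-cached comparison like this has roughly the following JMH shape. This is a sketch reusing the hypothetical L2Sketch and CachedQueryVector classes from above, with an arbitrary vector dimension; it is not the actual CachedL2DistanceBenchmark:

```java
import java.util.Random;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

// Sketch of the benchmark shape. The dimension (1024) is arbitrary;
// the real benchmark lives in PR #605.
@State(Scope.Benchmark)
@BenchmarkMode(Mode.Throughput)
public class CachedL2DistanceBenchmarkSketch {

    float[] query;
    float[] stored;
    CachedQueryVector cachedQuery;

    @Setup
    public void setup() {
        Random rng = new Random(0);
        query = new float[1024];
        stored = new float[1024];
        for (int i = 0; i < query.length; i++) {
            query[i] = rng.nextFloat();
            stored[i] = rng.nextFloat();
        }
        cachedQuery = new CachedQueryVector(query);
    }

    @Benchmark
    public double baseline() {
        return L2Sketch.l2Distance(query, stored); // re-slices query on every call
    }

    @Benchmark
    public double cached() {
        return cachedQuery.l2Distance(stored); // reuses the pre-built segments
    }
}
```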

alexklibisz commented 9 months ago

Closing, as I'm skeptical that this would be a win on the x86 hardware where Elasticsearch typically runs.