alexklibisz / elastiknn

Elasticsearch plugin for nearest neighbor search. Store vectors and run similarity search using exact and approximate algorithms.
https://alexklibisz.github.io/elastiknn
Apache License 2.0

Get Fashion Mnist 96% recall up to 200 queries/second #611

Open alexklibisz opened 1 year ago

alexklibisz commented 1 year ago

I'd like to optimize Elastiknn so that the Fashion Mnist benchmark exceeds 200 queries/second at 96% recall. It's currently at 180 qps, so this would be about an 11% improvement. There are already several issues under the performance label with ideas toward this goal, and I've already merged a few PRs. I'm opening this issue to formalize the effort and to aggregate PRs that don't otherwise have an associated issue.

alexklibisz commented 12 months ago

We're now above 200 qps. The last remaining step is to submit a PR to ann-benchmarks with the updated versions.

alexklibisz commented 8 months ago

Status update here, after releasing 8.12.2.1.

The non-containerized benchmark is reliably over 200qps @ 96% recall, around 210 qps. Latest update here: https://github.com/alexklibisz/elastiknn/commit/ddf637ae7053cf8f6dc038b4876520f3e41c0673

The containerized benchmark (running ann-benchmarks and elastiknn in the same container) has improved from ~160qps to ~180qps.

Here are the results using 8.6.2.0:

| Model | Parameters | Recall | Queries per Second |
|---|---|---|---|
| eknn-l2lsh | L=100 k=4 w=1024 candidates=500 probes=0 | 0.378 | 304.111 |
| eknn-l2lsh | L=100 k=4 w=1024 candidates=1000 probes=0 | 0.445 | 246.319 |
| eknn-l2lsh | L=100 k=4 w=1024 candidates=500 probes=3 | 0.635 | 245.977 |
| eknn-l2lsh | L=100 k=4 w=1024 candidates=1000 probes=3 | 0.716 | 201.608 |
| eknn-l2lsh | L=100 k=4 w=2048 candidates=500 probes=0 | 0.767 | 265.545 |
| eknn-l2lsh | L=100 k=4 w=2048 candidates=1000 probes=0 | 0.846 | 218.673 |
| eknn-l2lsh | L=100 k=4 w=2048 candidates=500 probes=3 | 0.921 | 184.178 |
| eknn-l2lsh | L=100 k=4 w=2048 candidates=1000 probes=3 | 0.960 | 160.437 |

Here are the results using 8.12.2.1:

| Model | Parameters | Recall | Queries per Second |
|---|---|---|---|
| eknn-l2lsh | L=100 k=4 w=1024 candidates=500 probes=0 | 0.378 | 314.650 |
| eknn-l2lsh | L=100 k=4 w=1024 candidates=1000 probes=0 | 0.446 | 247.659 |
| eknn-l2lsh | L=100 k=4 w=1024 candidates=500 probes=3 | 0.634 | 258.834 |
| eknn-l2lsh | L=100 k=4 w=1024 candidates=1000 probes=3 | 0.716 | 210.380 |
| eknn-l2lsh | L=100 k=4 w=2048 candidates=500 probes=0 | 0.767 | 271.442 |
| eknn-l2lsh | L=100 k=4 w=2048 candidates=1000 probes=0 | 0.846 | 221.127 |
| eknn-l2lsh | L=100 k=4 w=2048 candidates=500 probes=3 | 0.921 | 199.353 |
| eknn-l2lsh | L=100 k=4 w=2048 candidates=1000 probes=3 | 0.960 | 171.614 |

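As a quick sanity check on the comparison above, here is a short script (values copied row by row from the two tables, same parameter order) that computes the per-configuration speedup from 8.6.2.0 to 8.12.2.1:

```python
# QPS per configuration, copied from the 8.6.2.0 and 8.12.2.1 tables above,
# in the same row order.
qps_8_6 = [304.111, 246.319, 245.977, 201.608, 265.545, 218.673, 184.178, 160.437]
qps_8_12 = [314.650, 247.659, 258.834, 210.380, 271.442, 221.127, 199.353, 171.614]

for old, new in zip(qps_8_6, qps_8_12):
    # Relative change in queries/second for this parameter set.
    print(f"{old:8.3f} -> {new:8.3f}  ({100 * (new - old) / old:+.1f}%)")
```

Every row improves, with the high-recall (0.960) configuration gaining roughly 7%.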
alexklibisz commented 3 months ago

Latest update here: the non-containerized benchmark is hovering around ~195 qps. It dropped below 200 qps when I re-ran the benchmark for Elasticsearch 8.15.0: https://github.com/alexklibisz/elastiknn/commit/bbbaeea67aeaf0b8ab1dd50c1f3d900d1e17232f

I've tried several other ideas to accelerate the ArrayHitCounter. Some examples: https://github.com/alexklibisz/elastiknn/pull/721, https://github.com/alexklibisz/elastiknn/pull/615, https://github.com/alexklibisz/elastiknn/pull/598. None of it really makes a dent.
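For context on what that hot path is doing, here is a minimal Python sketch of array-based hit counting (the names and shape here are mine; the real implementation is the Java ArrayHitCounter touched by the PRs above): keep one counter per doc, increment it for every matching hash term, then keep the docs with the highest counts for exact re-scoring.

```python
import heapq

def count_hits(matched_doc_ids, num_docs, candidates):
    """Sketch of array-based hit counting: one counter per doc,
    incremented once per (term, doc) match, then the top-`candidates`
    docs by count are kept for exact re-scoring."""
    counts = [0] * num_docs
    for doc_id in matched_doc_ids:  # one entry per (term, doc) match
        counts[doc_id] += 1
    # Keep the `candidates` docs with the highest match counts.
    return heapq.nlargest(candidates, range(num_docs), key=counts.__getitem__)

# Toy example: doc 2 matches 3 hash terms, doc 0 matches 2, doc 1 matches 1.
top = count_hits([0, 2, 1, 2, 0, 2], num_docs=4, candidates=2)
```

The cost is linear in the total number of (term, doc) matches, which is why the parameter choice below matters so much.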

I'm thinking a major issue might be that the current LSH parameters end up matching the vast majority of documents in the index. When I sample the 0.96 benchmark in VisualVM, it's spending ~30% of its time in countHits: https://github.com/alexklibisz/elastiknn/blob/923fb22d7957238069b078b52a530b48f4705d11/elastiknn-lucene/src/main/java/org/apache/lucene/search/MatchHashesAndScoreQuery.java#L52-L85

A good chunk of that is spent in seekExact.
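A back-of-the-envelope sketch of that suspicion (the per-term match rates here are assumed illustrative figures, not measurements): with L hash tables, a doc escapes the candidate set only if it misses the query's term in every table, so even a modest per-term match rate touches almost the whole index.

```python
# Rough model: a query emits one term per hash table (L terms total).
# If a random doc matches any single term with probability p, the chance
# it matches at least one of the L terms is 1 - (1 - p)^L.
def fraction_matched(L, p):
    return 1 - (1 - p) ** L

for p in (0.01, 0.05, 0.10):
    print(f"p={p:.2f}: {fraction_matched(100, p):.3f} of docs match >= 1 term")
```

At L=100, even a 5% per-term match rate means over 99% of docs are touched at least once, consistent with countHits dominating the profile.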

So I think I see two possible paths for the next speedup:

alexklibisz commented 3 months ago

I ran a grid search and found some promising parameters. Verified these on AWS:

| Model | Parameters | Recall | Queries per Second |
|---|---|---|---|
| eknn-l2lsh | L=96 k=8 w=4096 candidates=1024 probes=0 | 0.905 | 250.333 |
| eknn-l2lsh | L=150 k=8 w=4000 candidates=800 probes=0 | 0.929 | 264.922 |
| eknn-l2lsh | L=128 k=8 w=4096 candidates=1024 probes=0 | 0.935 | 255.407 |
| eknn-l2lsh | L=150 k=8 w=4000 candidates=1000 probes=0 | 0.942 | 249.797 |

Some other parameters that were promising but I haven't verified on AWS:

| Model Parameters | Recall | Queries per Second |
|---|---|---|
| eknn-l2lsh-L=96-k=4-w=4096_candidates=1024_probes=0 | 0.954 | 68.276 |
| eknn-l2lsh-L=64-k=4-w=2048_candidates=1024_probes=4 | 0.942 | 66.007 |
| eknn-l2lsh-L=64-k=4-w=4096_candidates=1024_probes=0 | 0.913 | 85.683 |
| eknn-l2lsh-L=96-k=4-w=2048_candidates=1024_probes=2 | 0.944 | 66.841 |
| eknn-l2lsh-L=96-k=2-w=2048_candidates=1024_probes=0 | 0.906 | 62.746 |
| eknn-l2lsh-L=64-k=8-w=4096_candidates=1024_probes=2 | 0.936 | 97.321 |
| eknn-l2lsh-L=96-k=8-w=4096_candidates=1024_probes=0 | 0.905 | 138.512 |
| eknn-l2lsh-L=128-k=8-w=4096_candidates=512_probes=2 | 0.949 | 58.870 |
| eknn-l2lsh-L=96-k=2-w=1024_candidates=1024_probes=2 | 0.910 | 62.992 |
| eknn-l2lsh-L=128-k=4-w=2048_candidates=512_probes=2 | 0.925 | 61.073 |
| eknn-l2lsh-L=128-k=4-w=4096_candidates=1024_probes=0 | 0.971 | 60.615 |
| eknn-l2lsh-L=128-k=8-w=4096_candidates=1024_probes=0 | 0.935 | 119.305 |
| eknn-l2lsh-L=96-k=8-w=4096_candidates=1024_probes=2 | 0.964 | 68.093 |
| eknn-l2lsh-L=64-k=4-w=2048_candidates=1024_probes=2 | 0.906 | 112.086 |
| eknn-l2lsh-L=128-k=4-w=4096_candidates=512_probes=0 | 0.925 | 63.552 |
| eknn-l2lsh-L=128-k=8-w=4096_candidates=1024_probes=2 | 0.978 | 53.779 |
| eknn-l2lsh-L=96-k=8-w=4096_candidates=512_probes=2 | 0.926 | 84.319 |
| eknn-l2lsh-L=96-k=8-w=4096_candidates=512_probes=4 | 0.950 | 57.320 |
| eknn-l2lsh-L=96-k=4-w=2048_candidates=512_probes=4 | 0.934 | 61.600 |
| eknn-l2lsh-L=64-k=8-w=4096_candidates=1024_probes=4 | 0.959 | 69.827 |
| eknn-l2lsh-L=64-k=8-w=4096_candidates=512_probes=4 | 0.915 | 92.171 |

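One way to sift grid-search output like this is to keep only the Pareto-optimal configurations, i.e. those not beaten on both recall and qps by some other configuration. A small sketch over a handful of the unverified rows above (parameter labels abbreviated; the selection logic is mine, not part of Elastiknn):

```python
def pareto_front(results):
    """Keep configs not dominated by another config that has both
    recall and queries/second at least as high."""
    front = []
    for name, recall, qps in results:
        dominated = any(r2 >= recall and q2 >= qps and (r2, q2) != (recall, qps)
                        for _, r2, q2 in results)
        if not dominated:
            front.append((name, recall, qps))
    return front

# A few rows copied from the unverified grid-search output above.
results = [
    ("L=96-k=4-w=4096_c=1024_p=0", 0.954, 68.276),
    ("L=64-k=8-w=4096_c=1024_p=2", 0.936, 97.321),
    ("L=96-k=8-w=4096_c=1024_p=0", 0.905, 138.512),
    ("L=128-k=4-w=4096_c=1024_p=0", 0.971, 60.615),
    ("L=128-k=8-w=4096_c=1024_p=2", 0.978, 53.779),
    ("L=96-k=2-w=2048_c=1024_p=0", 0.906, 62.746),
]
front = pareto_front(results)
```

Here the L=96-k=2 row drops out (beaten by L=64-k=8 on both axes), leaving the recall/throughput trade-off curve to pick from.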
alexklibisz commented 3 months ago

I managed to find some parameters that get 239 QPS at 0.96 recall. There are a ton of results in this commit: https://github.com/alexklibisz/elastiknn/commit/c7efcf14a59c4cdcd15636042aeb6f4e381634ac

alexklibisz commented 3 months ago

The fully-dockerized ann-benchmarks results are still quite pitiful:

| Model | Parameters | Recall | Queries per Second |
|---|---|---|---|
| eknn-l2lsh | L=175 k=7 w=3900 candidates=100 probes=0 | 0.607 | 233.326 |
| eknn-l2lsh | L=175 k=7 w=3900 candidates=500 probes=0 | 0.921 | 200.319 |
| eknn-l2lsh | L=175 k=7 w=3900 candidates=1000 probes=0 | 0.962 | 169.238 |

I went ahead and opened a PR to get the latest parameters and Elastiknn version into ann-benchmarks: https://github.com/erikbern/ann-benchmarks/pull/544