alexklibisz / elastiknn

Elasticsearch plugin for nearest neighbor search. Store vectors and run similarity search using exact and approximate algorithms.
https://alexklibisz.github.io/elastiknn
Apache License 2.0

Get Fashion Mnist 96% recall up to 200 queries/second #611

Open alexklibisz opened 9 months ago

alexklibisz commented 9 months ago

I'd like to optimize Elastiknn so that the Fashion-MNIST benchmark exceeds 200 queries/second at 96% recall. It's currently at 180 qps, so this would be roughly an 11% improvement. Several issues under the performance label already collect ideas towards this goal, and I've merged a few PRs. I'm opening this issue to formalize the effort and to aggregate PRs that don't otherwise have an issue.

alexklibisz commented 8 months ago

We're now above 200 qps. The last remaining step is to submit a PR to ann-benchmarks with the updated versions.

alexklibisz commented 5 months ago

Status update here, after releasing 8.12.2.1.

The non-containerized benchmark is reliably over 200 qps @ 96% recall, around 210 qps. Latest update here: https://github.com/alexklibisz/elastiknn/commit/ddf637ae7053cf8f6dc038b4876520f3e41c0673

The containerized benchmark (running ann-benchmarks and elastiknn in the same container) has improved from ~160 qps to ~180 qps.

Here are the results using 8.6.2.0:

| Model | Parameters | Recall | Queries per Second |
|---|---|---|---|
| eknn-l2lsh | L=100 k=4 w=1024 candidates=500 probes=0 | 0.378 | 304.111 |
| eknn-l2lsh | L=100 k=4 w=1024 candidates=1000 probes=0 | 0.445 | 246.319 |
| eknn-l2lsh | L=100 k=4 w=1024 candidates=500 probes=3 | 0.635 | 245.977 |
| eknn-l2lsh | L=100 k=4 w=1024 candidates=1000 probes=3 | 0.716 | 201.608 |
| eknn-l2lsh | L=100 k=4 w=2048 candidates=500 probes=0 | 0.767 | 265.545 |
| eknn-l2lsh | L=100 k=4 w=2048 candidates=1000 probes=0 | 0.846 | 218.673 |
| eknn-l2lsh | L=100 k=4 w=2048 candidates=500 probes=3 | 0.921 | 184.178 |
| eknn-l2lsh | L=100 k=4 w=2048 candidates=1000 probes=3 | 0.960 | 160.437 |

Here are the results using 8.12.2.1:

| Model | Parameters | Recall | Queries per Second |
|---|---|---|---|
| eknn-l2lsh | L=100 k=4 w=1024 candidates=500 probes=0 | 0.378 | 314.650 |
| eknn-l2lsh | L=100 k=4 w=1024 candidates=1000 probes=0 | 0.446 | 247.659 |
| eknn-l2lsh | L=100 k=4 w=1024 candidates=500 probes=3 | 0.634 | 258.834 |
| eknn-l2lsh | L=100 k=4 w=1024 candidates=1000 probes=3 | 0.716 | 210.380 |
| eknn-l2lsh | L=100 k=4 w=2048 candidates=500 probes=0 | 0.767 | 271.442 |
| eknn-l2lsh | L=100 k=4 w=2048 candidates=1000 probes=0 | 0.846 | 221.127 |
| eknn-l2lsh | L=100 k=4 w=2048 candidates=500 probes=3 | 0.921 | 199.353 |
| eknn-l2lsh | L=100 k=4 w=2048 candidates=1000 probes=3 | 0.960 | 171.614 |
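For context on what the parameters in these tables mean: below is a minimal sketch of the standard L2 LSH construction that the eknn-l2lsh model's L/k/w parameters correspond to (L hash tables, each concatenating k random-projection hashes of bucket width w; candidates and probes are query-time knobs: how many matched docs get exact rescoring, and how many extra nearby buckets multiprobe checks). This is an illustrative reimplementation, not the plugin's actual code:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_l2_lsh(dims, L, k, w):
    """Build L hash tables; each hashes x to a k-tuple of bucket ids
    using h(x) = floor((a . x + b) / w) with a ~ N(0, 1), b ~ U[0, w)."""
    A = rng.normal(size=(L, k, dims))     # random projection directions
    B = rng.uniform(0.0, w, size=(L, k))  # random offsets
    def hash_vec(x):
        return [tuple(np.floor((A[t] @ x + B[t]) / w).astype(int)) for t in range(L)]
    return hash_vec

# Fashion-MNIST-like dimensionality with the L=100 k=4 w=1024 model above.
hash_vec = make_l2_lsh(dims=784, L=100, k=4, w=1024)
buckets = hash_vec(rng.normal(size=784))
print(len(buckets), len(buckets[0]))  # 100 tables, 4 hashes per table
```

Larger w and more tables match more docs per query, which pushes recall up and qps down, consistent with the rows above.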
alexklibisz commented 3 days ago

Latest update here: the non-containerized benchmark is hovering around 195 qps. It dropped below 200 qps when I re-ran the benchmark for Elasticsearch 8.15.0: https://github.com/alexklibisz/elastiknn/commit/bbbaeea67aeaf0b8ab1dd50c1f3d900d1e17232f

I've tried several other ideas to accelerate the ArrayHitCounter. Some examples: https://github.com/alexklibisz/elastiknn/pull/721, https://github.com/alexklibisz/elastiknn/pull/615, https://github.com/alexklibisz/elastiknn/pull/598. None of it really makes a dent.

I'm thinking a major issue might be that the current LSH parameters end up matching the vast majority of documents in the index. When I sample the 0.96 benchmark in VisualVM, it's spending ~30% of its time in countHits: https://github.com/alexklibisz/elastiknn/blob/923fb22d7957238069b078b52a530b48f4705d11/elastiknn-lucene/src/main/java/org/apache/lucene/search/MatchHashesAndScoreQuery.java#L52-L85

A good chunk of that time is spent in seekExact.
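For illustration, here is the shape of that counting step in Python — a schematic analogue of the Java code linked above, not the actual implementation, and `count_hits` / `top_candidates` are hypothetical names. Each of the query's hash terms is looked up in the inverted index (the seekExact step), every matching doc's counter is incremented, and the `candidates` docs with the most hits move on to exact rescoring. If the LSH parameters match most of the index, nearly every doc's counter gets touched, which is where the time goes:

```python
from collections import Counter

def count_hits(query_hashes, inverted_index):
    """Increment a per-doc counter for every query hash term that matches.
    Each inverted_index lookup plays the role of TermsEnum.seekExact."""
    counts = Counter()
    for h in query_hashes:
        for doc in inverted_index.get(h, ()):
            counts[doc] += 1
    return counts

def top_candidates(counts, candidates):
    """Keep the `candidates` docs with the highest hit counts for exact rescoring."""
    return [doc for doc, _ in counts.most_common(candidates)]

# Tiny worked example: 4 query terms, 6 docs; doc 2 matches three terms.
index = {"h1": [0, 2, 4], "h2": [2, 3], "h3": [2, 5]}
counts = count_hits(["h1", "h2", "h3", "h9"], index)
print(top_candidates(counts, 2))  # doc 2 ranks first
```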

So I think I see two possible paths for the next speedup:

alexklibisz commented 3 days ago

I ran a grid search and found some promising parameters. Verified these on AWS:

| Model | Parameters | Recall | Queries per Second |
|---|---|---|---|
| eknn-l2lsh | L=96 k=8 w=4096 candidates=1024 probes=0 | 0.905 | 250.333 |
| eknn-l2lsh | L=150 k=8 w=4000 candidates=800 probes=0 | 0.929 | 264.922 |
| eknn-l2lsh | L=128 k=8 w=4096 candidates=1024 probes=0 | 0.935 | 255.407 |
| eknn-l2lsh | L=150 k=8 w=4000 candidates=1000 probes=0 | 0.942 | 249.797 |

Some other parameters looked promising, but I haven't verified them on AWS:

| Model | Parameters | Recall | Queries per Second |
|---|---|---|---|
| eknn-l2lsh | L=96 k=4 w=4096 candidates=1024 probes=0 | 0.954 | 68.276 |
| eknn-l2lsh | L=64 k=4 w=2048 candidates=1024 probes=4 | 0.942 | 66.007 |
| eknn-l2lsh | L=64 k=4 w=4096 candidates=1024 probes=0 | 0.913 | 85.683 |
| eknn-l2lsh | L=96 k=4 w=2048 candidates=1024 probes=2 | 0.944 | 66.841 |
| eknn-l2lsh | L=96 k=2 w=2048 candidates=1024 probes=0 | 0.906 | 62.746 |
| eknn-l2lsh | L=64 k=8 w=4096 candidates=1024 probes=2 | 0.936 | 97.321 |
| eknn-l2lsh | L=96 k=8 w=4096 candidates=1024 probes=0 | 0.905 | 138.512 |
| eknn-l2lsh | L=128 k=8 w=4096 candidates=512 probes=2 | 0.949 | 58.870 |
| eknn-l2lsh | L=96 k=2 w=1024 candidates=1024 probes=2 | 0.910 | 62.992 |
| eknn-l2lsh | L=128 k=4 w=2048 candidates=512 probes=2 | 0.925 | 61.073 |
| eknn-l2lsh | L=128 k=4 w=4096 candidates=1024 probes=0 | 0.971 | 60.615 |
| eknn-l2lsh | L=128 k=8 w=4096 candidates=1024 probes=0 | 0.935 | 119.305 |
| eknn-l2lsh | L=96 k=8 w=4096 candidates=1024 probes=2 | 0.964 | 68.093 |
| eknn-l2lsh | L=64 k=4 w=2048 candidates=1024 probes=2 | 0.906 | 112.086 |
| eknn-l2lsh | L=128 k=4 w=4096 candidates=512 probes=0 | 0.925 | 63.552 |
| eknn-l2lsh | L=128 k=8 w=4096 candidates=1024 probes=2 | 0.978 | 53.779 |
| eknn-l2lsh | L=96 k=8 w=4096 candidates=512 probes=2 | 0.926 | 84.319 |
| eknn-l2lsh | L=96 k=8 w=4096 candidates=512 probes=4 | 0.950 | 57.320 |
| eknn-l2lsh | L=96 k=4 w=2048 candidates=512 probes=4 | 0.934 | 61.600 |
| eknn-l2lsh | L=64 k=8 w=4096 candidates=1024 probes=4 | 0.959 | 69.827 |
| eknn-l2lsh | L=64 k=8 w=4096 candidates=512 probes=4 | 0.915 | 92.171 |
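The grid search driving these tables can be sketched roughly like this. `run_benchmark` / `fake_benchmark` are hypothetical stand-ins for actually indexing Fashion-MNIST and timing queries (the real runs used ann-benchmarks); the point is just the filter-by-recall, sort-by-qps loop:

```python
from itertools import product

def grid_search(run_benchmark, grid, min_recall=0.90):
    """Try every parameter combination, keep those clearing the recall floor,
    and sort the survivors by throughput (highest qps first)."""
    results = []
    for combo in product(*grid.values()):
        params = dict(zip(grid.keys(), combo))
        recall, qps = run_benchmark(**params)
        if recall >= min_recall:
            results.append((params, recall, qps))
    return sorted(results, key=lambda r: -r[2])

# Toy stand-in: recall rises with tables/candidates/probes, qps falls with them.
def fake_benchmark(L, k, w, candidates, probes):
    recall = min(1.0, 0.5 + L / 400 + candidates / 4000 + probes * 0.05)
    qps = 400.0 - L - candidates / 10 - probes * 20
    return recall, qps

grid = {"L": [96, 128, 150], "k": [8], "w": [4096],
        "candidates": [512, 1024], "probes": [0, 2]}
best = grid_search(fake_benchmark, grid)
print(best[0])  # fastest combination that clears 0.90 recall
```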
alexklibisz commented 1 day ago

I managed to find some parameters that get 239 QPS at 0.96 recall. There are a ton of results in this commit: https://github.com/alexklibisz/elastiknn/commit/c7efcf14a59c4cdcd15636042aeb6f4e381634ac

alexklibisz commented 1 day ago

The fully-dockerized ann-benchmarks results are still quite pitiful:

| Model | Parameters | Recall | Queries per Second |
|---|---|---|---|
| eknn-l2lsh | L=175 k=7 w=3900 candidates=100 probes=0 | 0.607 | 233.326 |
| eknn-l2lsh | L=175 k=7 w=3900 candidates=500 probes=0 | 0.921 | 200.319 |
| eknn-l2lsh | L=175 k=7 w=3900 candidates=1000 probes=0 | 0.962 | 169.238 |
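For reference, the Recall column follows the usual ann-benchmarks convention, roughly: the fraction of each query's true k nearest neighbors that appear in the k returned results, averaged over queries. A minimal sketch:

```python
def recall_at_k(true_neighbors, returned, k):
    """Mean per-query overlap between the true and returned top-k id lists."""
    per_query = [
        len(set(truth[:k]) & set(got[:k])) / k
        for truth, got in zip(true_neighbors, returned)
    ]
    return sum(per_query) / len(per_query)

# Two queries, k=2: the first recovers both true neighbors, the second one of two.
print(recall_at_k([[1, 2], [3, 4]], [[2, 1], [3, 9]], k=2))  # 0.75
```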

I went ahead and opened a PR to get the latest parameters and Elastiknn version into ann-benchmarks: https://github.com/erikbern/ann-benchmarks/pull/544