alexklibisz opened this issue 1 year ago
We're now above 200 qps. Last thing left is to just submit a PR to ann-benchmarks w/ the updated versions.
Status update here, after releasing 8.12.2.1.
The non-containerized benchmark is reliably over 200 qps at 96% recall, around 210 qps. Latest update here: https://github.com/alexklibisz/elastiknn/commit/ddf637ae7053cf8f6dc038b4876520f3e41c0673
The containerized benchmark (running ann-benchmarks and Elastiknn in the same container) has improved from ~160 qps to ~180 qps.
Here are the results using 8.6.2.0:
Model | Parameters | Recall | Queries per Second |
---|---|---|---|
eknn-l2lsh | L=100 k=4 w=1024 candidates=500 probes=0 | 0.378 | 304.111 |
eknn-l2lsh | L=100 k=4 w=1024 candidates=1000 probes=0 | 0.445 | 246.319 |
eknn-l2lsh | L=100 k=4 w=1024 candidates=500 probes=3 | 0.635 | 245.977 |
eknn-l2lsh | L=100 k=4 w=1024 candidates=1000 probes=3 | 0.716 | 201.608 |
eknn-l2lsh | L=100 k=4 w=2048 candidates=500 probes=0 | 0.767 | 265.545 |
eknn-l2lsh | L=100 k=4 w=2048 candidates=1000 probes=0 | 0.846 | 218.673 |
eknn-l2lsh | L=100 k=4 w=2048 candidates=500 probes=3 | 0.921 | 184.178 |
eknn-l2lsh | L=100 k=4 w=2048 candidates=1000 probes=3 | 0.960 | 160.437 |
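For readers unfamiliar with the parameters in these tables, here is a minimal sketch of the textbook L2 LSH construction (Datar et al.) that they map onto. This is an illustrative toy, not Elastiknn's actual implementation: `L` is the number of hash tables, `k` the number of hashes concatenated per table, and `w` the bucket width of each random projection.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical sketch of the standard L2 LSH scheme, illustrating how the
// benchmark parameters interact. NOT Elastiknn's actual code.
//   L = number of hash tables (more tables -> higher recall, more work)
//   k = hashes concatenated per table (higher k -> more selective buckets)
//   w = bucket width of each projection (higher w -> wider, denser buckets)
public final class L2LshSketch {
    private final int L, k;
    private final double w;
    private final double[][][] a; // a[l][j] = random Gaussian projection vector
    private final double[][] b;   // b[l][j] = random offset in [0, w)

    public L2LshSketch(int L, int k, double w, int dims, long seed) {
        this.L = L; this.k = k; this.w = w;
        Random rng = new Random(seed);
        a = new double[L][k][dims];
        b = new double[L][k];
        for (int l = 0; l < L; l++)
            for (int j = 0; j < k; j++) {
                for (int d = 0; d < dims; d++) a[l][j][d] = rng.nextGaussian();
                b[l][j] = rng.nextDouble() * w;
            }
    }

    /** Returns L bucket keys; a stored vector is a candidate if any key collides. */
    public int[][] hash(double[] v) {
        int[][] keys = new int[L][k];
        for (int l = 0; l < L; l++)
            for (int j = 0; j < k; j++) {
                double dot = 0;
                for (int d = 0; d < v.length; d++) dot += a[l][j][d] * v[d];
                keys[l][j] = (int) Math.floor((dot + b[l][j]) / w);
            }
        return keys;
    }

    public static void main(String[] args) {
        L2LshSketch lsh = new L2LshSketch(100, 4, 1024, 8, 42);
        double[] v = {1, 2, 3, 4, 5, 6, 7, 8};
        double[] vNear = {1.1, 2, 3, 4, 5, 6, 7, 8};
        int[][] kv = lsh.hash(v), kn = lsh.hash(vNear);
        int collisions = 0;
        for (int l = 0; l < 100; l++)
            if (Arrays.equals(kv[l], kn[l])) collisions++;
        // Nearby vectors should collide in most of the L tables.
        System.out.println("collisions in " + collisions + " of 100 tables");
    }
}
```

The `candidates` and `probes` parameters sit on top of this: `candidates` caps how many colliding docs get exact re-scoring, and `probes` additionally checks neighboring buckets per table (multiprobe).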
Here are the results using 8.12.2.1:
Model | Parameters | Recall | Queries per Second |
---|---|---|---|
eknn-l2lsh | L=100 k=4 w=1024 candidates=500 probes=0 | 0.378 | 314.650 |
eknn-l2lsh | L=100 k=4 w=1024 candidates=1000 probes=0 | 0.446 | 247.659 |
eknn-l2lsh | L=100 k=4 w=1024 candidates=500 probes=3 | 0.634 | 258.834 |
eknn-l2lsh | L=100 k=4 w=1024 candidates=1000 probes=3 | 0.716 | 210.380 |
eknn-l2lsh | L=100 k=4 w=2048 candidates=500 probes=0 | 0.767 | 271.442 |
eknn-l2lsh | L=100 k=4 w=2048 candidates=1000 probes=0 | 0.846 | 221.127 |
eknn-l2lsh | L=100 k=4 w=2048 candidates=500 probes=3 | 0.921 | 199.353 |
eknn-l2lsh | L=100 k=4 w=2048 candidates=1000 probes=3 | 0.960 | 171.614 |
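The recall column in these tables is the usual ann-benchmarks-style metric: the fraction of the true k nearest neighbors that the approximate search actually returned, averaged over queries. A minimal sketch (my own illustration, not the benchmark harness's code):

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of how recall is computed for the tables above:
// per query, |approximate results ∩ true top-k| / k, then averaged.
public final class RecallSketch {
    public static double recall(int[][] approx, int[][] exact) {
        double total = 0;
        for (int q = 0; q < exact.length; q++) {
            Set<Integer> truth = new HashSet<>();
            for (int id : exact[q]) truth.add(id);
            int hits = 0;
            for (int id : approx[q]) if (truth.contains(id)) hits++;
            total += (double) hits / exact[q].length;
        }
        return total / exact.length;
    }

    public static void main(String[] args) {
        int[][] exact  = {{1, 2, 3, 4}, {5, 6, 7, 8}}; // true neighbors
        int[][] approx = {{1, 2, 3, 9}, {5, 6, 7, 8}}; // one miss in query 0
        System.out.println(recall(approx, exact)); // (0.75 + 1.0) / 2 = 0.875
    }
}
```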
Latest update here: the non-containerized benchmark is hovering around ~195 qps. It dropped below 200 qps when I re-ran the benchmark for Elasticsearch 8.15.0: https://github.com/alexklibisz/elastiknn/commit/bbbaeea67aeaf0b8ab1dd50c1f3d900d1e17232f
I've tried several other ideas to accelerate the ArrayHitCounter. Some examples: https://github.com/alexklibisz/elastiknn/pull/721, https://github.com/alexklibisz/elastiknn/pull/615, https://github.com/alexklibisz/elastiknn/pull/598. None of it really makes a dent.
I'm thinking a major issue might be that the current LSH parameters end up matching the vast majority of documents in the index. When I sample the 0.96 benchmark in VisualVM, it's spending ~30% of its time in countHits: https://github.com/alexklibisz/elastiknn/blob/923fb22d7957238069b078b52a530b48f4705d11/elastiknn-lucene/src/main/java/org/apache/lucene/search/MatchHashesAndScoreQuery.java#L52-L85
A good chunk of that is spent in seekExact.
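To make the profiling result concrete: countHits is essentially tallying, per doc, how many of the query's hash terms matched, then thresholding to pick candidates. Here is a minimal sketch of that idea. The names and structure are hypothetical, mirroring what an array-backed counter like ArrayHitCounter does conceptually, not the actual class:

```java
// Hypothetical sketch of an array-backed hit counter: one counter slot per
// doc in the segment, incremented once per matching hash term. If the LSH
// parameters match most docs, nearly every slot gets touched, which is why
// this loop dominates the profile.
public final class HitCounterSketch {
    private final short[] counts;
    private int maxCount = 0;

    public HitCounterSketch(int numDocs) {
        counts = new short[numDocs]; // O(numDocs) memory per query
    }

    public void increment(int docId) {
        short c = ++counts[docId];
        if (c > maxCount) maxCount = c;
    }

    /** Smallest count a doc needs to be among (roughly) the top `candidates`. */
    public int minCandidateCount(int candidates) {
        // Histogram the counts, then walk down from the max until at least
        // `candidates` docs are covered.
        int[] hist = new int[maxCount + 1];
        for (short c : counts) if (c > 0) hist[c]++;
        int seen = 0;
        for (int c = maxCount; c >= 1; c--) {
            seen += hist[c];
            if (seen >= candidates) return c;
        }
        return 1;
    }

    public static void main(String[] args) {
        HitCounterSketch h = new HitCounterSketch(5);
        h.increment(0); h.increment(0); h.increment(0); // doc 0: 3 term matches
        h.increment(1); h.increment(1);                 // doc 1: 2
        h.increment(2);                                 // doc 2: 1
        System.out.println(h.minCandidateCount(2)); // prints 2
    }
}
```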
So I think I see two possible paths for the next speedup.
I ran a grid search and found some promising parameters. Verified these on AWS:
Model | Parameters | Recall | Queries per Second |
---|---|---|---|
eknn-l2lsh | L=96 k=8 w=4096 candidates=1024 probes=0 | 0.905 | 250.333 |
eknn-l2lsh | L=150 k=8 w=4000 candidates=800 probes=0 | 0.929 | 264.922 |
eknn-l2lsh | L=128 k=8 w=4096 candidates=1024 probes=0 | 0.935 | 255.407 |
eknn-l2lsh | L=150 k=8 w=4000 candidates=1000 probes=0 | 0.942 | 249.797 |
Some other parameters that were promising but I haven't verified on AWS:
Model | Parameters | Recall | Queries per Second |
---|---|---|---|
eknn-l2lsh | L=96 k=4 w=4096 candidates=1024 probes=0 | 0.954 | 68.276 |
eknn-l2lsh | L=64 k=4 w=2048 candidates=1024 probes=4 | 0.942 | 66.007 |
eknn-l2lsh | L=64 k=4 w=4096 candidates=1024 probes=0 | 0.913 | 85.683 |
eknn-l2lsh | L=96 k=4 w=2048 candidates=1024 probes=2 | 0.944 | 66.841 |
eknn-l2lsh | L=96 k=2 w=2048 candidates=1024 probes=0 | 0.906 | 62.746 |
eknn-l2lsh | L=64 k=8 w=4096 candidates=1024 probes=2 | 0.936 | 97.321 |
eknn-l2lsh | L=96 k=8 w=4096 candidates=1024 probes=0 | 0.905 | 138.512 |
eknn-l2lsh | L=128 k=8 w=4096 candidates=512 probes=2 | 0.949 | 58.870 |
eknn-l2lsh | L=96 k=2 w=1024 candidates=1024 probes=2 | 0.910 | 62.992 |
eknn-l2lsh | L=128 k=4 w=2048 candidates=512 probes=2 | 0.925 | 61.073 |
eknn-l2lsh | L=128 k=4 w=4096 candidates=1024 probes=0 | 0.971 | 60.615 |
eknn-l2lsh | L=128 k=8 w=4096 candidates=1024 probes=0 | 0.935 | 119.305 |
eknn-l2lsh | L=96 k=8 w=4096 candidates=1024 probes=2 | 0.964 | 68.093 |
eknn-l2lsh | L=64 k=4 w=2048 candidates=1024 probes=2 | 0.906 | 112.086 |
eknn-l2lsh | L=128 k=4 w=4096 candidates=512 probes=0 | 0.925 | 63.552 |
eknn-l2lsh | L=128 k=8 w=4096 candidates=1024 probes=2 | 0.978 | 53.779 |
eknn-l2lsh | L=96 k=8 w=4096 candidates=512 probes=2 | 0.926 | 84.319 |
eknn-l2lsh | L=96 k=8 w=4096 candidates=512 probes=4 | 0.950 | 57.320 |
eknn-l2lsh | L=96 k=4 w=2048 candidates=512 probes=4 | 0.934 | 61.600 |
eknn-l2lsh | L=64 k=8 w=4096 candidates=1024 probes=4 | 0.959 | 69.827 |
eknn-l2lsh | L=64 k=8 w=4096 candidates=512 probes=4 | 0.915 | 92.171 |
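The grid behind these unverified runs can be sketched as a simple enumeration over the parameter values that appear above. This is my own illustration of how the combinations multiply (the actual runs went through ann-benchmarks), which is why it pays to keep the grid coarse:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the parameter grid seen in the table above.
// The real search ran via ann-benchmarks; this only shows the enumeration.
public final class GridSearchSketch {
    record Params(int L, int k, int w, int candidates, int probes) {}

    public static List<Params> grid() {
        int[] Ls = {64, 96, 128}, ks = {2, 4, 8}, ws = {1024, 2048, 4096};
        int[] cands = {512, 1024}, probes = {0, 2, 4};
        List<Params> out = new ArrayList<>();
        for (int L : Ls) for (int k : ks) for (int w : ws)
            for (int c : cands) for (int p : probes)
                out.add(new Params(L, k, w, c, p));
        return out; // 3 * 3 * 3 * 2 * 3 = 162 combinations
    }

    public static void main(String[] args) {
        System.out.println(grid().size()); // prints 162
    }
}
```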
I managed to find some parameters that get 239 QPS at 0.96 recall. There are a ton of results in this commit: https://github.com/alexklibisz/elastiknn/commit/c7efcf14a59c4cdcd15636042aeb6f4e381634ac
The fully-dockerized ann-benchmarks results are still quite pitiful:
Model | Parameters | Recall | Queries per Second |
---|---|---|---|
eknn-l2lsh | L=175 k=7 w=3900 candidates=100 probes=0 | 0.607 | 233.326 |
eknn-l2lsh | L=175 k=7 w=3900 candidates=500 probes=0 | 0.921 | 200.319 |
eknn-l2lsh | L=175 k=7 w=3900 candidates=1000 probes=0 | 0.962 | 169.238 |
I went ahead and opened a PR to get the latest parameters and Elastiknn version into ann-benchmarks: https://github.com/erikbern/ann-benchmarks/pull/544
I'd like to optimize Elastiknn such that the Fashion-MNIST benchmark performance exceeds 200 qps at 96% recall. Currently it's at 180 qps, so this would be about an 11% improvement. There are already several issues under the performance label with ideas towards this goal, and I've already merged a few PRs. I'm opening this issue to formalize the effort and to aggregate PRs that don't otherwise have an issue.