alexklibisz / elastiknn

Elasticsearch plugin for nearest neighbor search. Store vectors and run similarity search using exact and approximate algorithms.
https://alexklibisz.github.io/elastiknn
Apache License 2.0

Get Fashion Mnist 96% recall up to 200 queries/second #611

Open alexklibisz opened 1 year ago

alexklibisz commented 1 year ago

I'd like to optimize Elastiknn so that the Fashion Mnist benchmark exceeds 200 queries/second at 96% recall. It's currently at 180 qps, so this would be about an 11% improvement. There are already several issues under the performance label with ideas toward this goal, and I've already merged a few PRs. I'm opening this issue to formalize the effort and to aggregate PRs that don't otherwise have an associated issue.

alexklibisz commented 12 months ago

We're now above 200 qps. The last remaining step is to submit a PR to ann-benchmarks with the updated versions.

alexklibisz commented 8 months ago

Status update here, after releasing 8.12.2.1.

The non-containerized benchmark is reliably over 200qps @ 96% recall, around 210 qps. Latest update here: https://github.com/alexklibisz/elastiknn/commit/ddf637ae7053cf8f6dc038b4876520f3e41c0673

The containerized benchmark (running ann-benchmarks and elastiknn in the same container) has improved from ~160qps to ~180qps.

Here are the results using 8.6.2.0:

| Model | Parameters | Recall | Queries per Second |
|---|---|---|---|
| eknn-l2lsh | L=100 k=4 w=1024 candidates=500 probes=0 | 0.378 | 304.111 |
| eknn-l2lsh | L=100 k=4 w=1024 candidates=1000 probes=0 | 0.445 | 246.319 |
| eknn-l2lsh | L=100 k=4 w=1024 candidates=500 probes=3 | 0.635 | 245.977 |
| eknn-l2lsh | L=100 k=4 w=1024 candidates=1000 probes=3 | 0.716 | 201.608 |
| eknn-l2lsh | L=100 k=4 w=2048 candidates=500 probes=0 | 0.767 | 265.545 |
| eknn-l2lsh | L=100 k=4 w=2048 candidates=1000 probes=0 | 0.846 | 218.673 |
| eknn-l2lsh | L=100 k=4 w=2048 candidates=500 probes=3 | 0.921 | 184.178 |
| eknn-l2lsh | L=100 k=4 w=2048 candidates=1000 probes=3 | 0.960 | 160.437 |

Here are the results using 8.12.2.1:

| Model | Parameters | Recall | Queries per Second |
|---|---|---|---|
| eknn-l2lsh | L=100 k=4 w=1024 candidates=500 probes=0 | 0.378 | 314.650 |
| eknn-l2lsh | L=100 k=4 w=1024 candidates=1000 probes=0 | 0.446 | 247.659 |
| eknn-l2lsh | L=100 k=4 w=1024 candidates=500 probes=3 | 0.634 | 258.834 |
| eknn-l2lsh | L=100 k=4 w=1024 candidates=1000 probes=3 | 0.716 | 210.380 |
| eknn-l2lsh | L=100 k=4 w=2048 candidates=500 probes=0 | 0.767 | 271.442 |
| eknn-l2lsh | L=100 k=4 w=2048 candidates=1000 probes=0 | 0.846 | 221.127 |
| eknn-l2lsh | L=100 k=4 w=2048 candidates=500 probes=3 | 0.921 | 199.353 |
| eknn-l2lsh | L=100 k=4 w=2048 candidates=1000 probes=3 | 0.960 | 171.614 |

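As a quick sanity check on the comparison above, here is a short script (values copied row by row from the two tables, same parameter order) that computes the per-configuration speedup from 8.6.2.0 to 8.12.2.1:

```python
# QPS per configuration, copied from the 8.6.2.0 and 8.12.2.1 tables above,
# in the same row order.
qps_8_6 = [304.111, 246.319, 245.977, 201.608, 265.545, 218.673, 184.178, 160.437]
qps_8_12 = [314.650, 247.659, 258.834, 210.380, 271.442, 221.127, 199.353, 171.614]

for old, new in zip(qps_8_6, qps_8_12):
    # Relative change in queries/second for this parameter set.
    print(f"{old:8.3f} -> {new:8.3f}  ({100 * (new - old) / old:+.1f}%)")
```

Every row improves, with the high-recall (0.960) configuration gaining roughly 7%.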
alexklibisz commented 3 months ago

Latest update here: the non-containerized benchmark is hovering around ~195 qps. It dropped below 200 qps when I re-ran the benchmark for Elasticsearch 8.15.0: https://github.com/alexklibisz/elastiknn/commit/bbbaeea67aeaf0b8ab1dd50c1f3d900d1e17232f

I've tried several other ideas to accelerate the ArrayHitCounter. Some examples: https://github.com/alexklibisz/elastiknn/pull/721, https://github.com/alexklibisz/elastiknn/pull/615, https://github.com/alexklibisz/elastiknn/pull/598. None of it really makes a dent.
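For context on what that hot path is doing, here is a minimal Python sketch of array-based hit counting (the names and shape here are mine; the real implementation is the Java ArrayHitCounter touched by the PRs above): keep one counter per doc, increment it for every matching hash term, then keep the docs with the highest counts for exact re-scoring.

```python
import heapq

def count_hits(matched_doc_ids, num_docs, candidates):
    """Sketch of array-based hit counting: one counter per doc,
    incremented once per (term, doc) match, then the top-`candidates`
    docs by count are kept for exact re-scoring."""
    counts = [0] * num_docs
    for doc_id in matched_doc_ids:  # one entry per (term, doc) match
        counts[doc_id] += 1
    # Keep the `candidates` docs with the highest match counts.
    return heapq.nlargest(candidates, range(num_docs), key=counts.__getitem__)

# Toy example: doc 2 matches 3 hash terms, doc 0 matches 2, doc 1 matches 1.
top = count_hits([0, 2, 1, 2, 0, 2], num_docs=4, candidates=2)
```

The cost is linear in the total number of (term, doc) matches, which is why the parameter choice below matters so much.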

I'm thinking a major issue might be that the current LSH parameters end up matching the vast majority of documents in the index. When I sample the 0.96 benchmark in VisualVM, it's spending ~30% of its time in countHits: https://github.com/alexklibisz/elastiknn/blob/923fb22d7957238069b078b52a530b48f4705d11/elastiknn-lucene/src/main/java/org/apache/lucene/search/MatchHashesAndScoreQuery.java#L52-L85

A good chunk of that is spent in seekExact.
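A back-of-the-envelope sketch of that suspicion (the per-term match rates here are assumed illustrative figures, not measurements): with L hash tables, a doc escapes the candidate set only if it misses the query's term in every table, so even a modest per-term match rate touches almost the whole index.

```python
# Rough model: a query emits one term per hash table (L terms total).
# If a random doc matches any single term with probability p, the chance
# it matches at least one of the L terms is 1 - (1 - p)^L.
def fraction_matched(L, p):
    return 1 - (1 - p) ** L

for p in (0.01, 0.05, 0.10):
    print(f"p={p:.2f}: {fraction_matched(100, p):.3f} of docs match >= 1 term")
```

At L=100, even a 5% per-term match rate means over 99% of docs are touched at least once, consistent with countHits dominating the profile.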

So I think I see two possible paths for the next speedup:

alexklibisz commented 3 months ago

I ran a grid search and found some promising parameters. Verified these on AWS:

| Model | Parameters | Recall | Queries per Second |
|---|---|---|---|
| eknn-l2lsh | L=96 k=8 w=4096 candidates=1024 probes=0 | 0.905 | 250.333 |
| eknn-l2lsh | L=150 k=8 w=4000 candidates=800 probes=0 | 0.929 | 264.922 |
| eknn-l2lsh | L=128 k=8 w=4096 candidates=1024 probes=0 | 0.935 | 255.407 |
| eknn-l2lsh | L=150 k=8 w=4000 candidates=1000 probes=0 | 0.942 | 249.797 |

Some other parameters that were promising but I haven't verified on AWS:

| Model Parameters | Recall | Queries per Second |
|---|---|---|
| eknn-l2lsh-L=96-k=4-w=4096_candidates=1024_probes=0 | 0.954 | 68.276 |
| eknn-l2lsh-L=64-k=4-w=2048_candidates=1024_probes=4 | 0.942 | 66.007 |
| eknn-l2lsh-L=64-k=4-w=4096_candidates=1024_probes=0 | 0.913 | 85.683 |
| eknn-l2lsh-L=96-k=4-w=2048_candidates=1024_probes=2 | 0.944 | 66.841 |
| eknn-l2lsh-L=96-k=2-w=2048_candidates=1024_probes=0 | 0.906 | 62.746 |
| eknn-l2lsh-L=64-k=8-w=4096_candidates=1024_probes=2 | 0.936 | 97.321 |
| eknn-l2lsh-L=96-k=8-w=4096_candidates=1024_probes=0 | 0.905 | 138.512 |
| eknn-l2lsh-L=128-k=8-w=4096_candidates=512_probes=2 | 0.949 | 58.870 |
| eknn-l2lsh-L=96-k=2-w=1024_candidates=1024_probes=2 | 0.910 | 62.992 |
| eknn-l2lsh-L=128-k=4-w=2048_candidates=512_probes=2 | 0.925 | 61.073 |
| eknn-l2lsh-L=128-k=4-w=4096_candidates=1024_probes=0 | 0.971 | 60.615 |
| eknn-l2lsh-L=128-k=8-w=4096_candidates=1024_probes=0 | 0.935 | 119.305 |
| eknn-l2lsh-L=96-k=8-w=4096_candidates=1024_probes=2 | 0.964 | 68.093 |
| eknn-l2lsh-L=64-k=4-w=2048_candidates=1024_probes=2 | 0.906 | 112.086 |
| eknn-l2lsh-L=128-k=4-w=4096_candidates=512_probes=0 | 0.925 | 63.552 |
| eknn-l2lsh-L=128-k=8-w=4096_candidates=1024_probes=2 | 0.978 | 53.779 |
| eknn-l2lsh-L=96-k=8-w=4096_candidates=512_probes=2 | 0.926 | 84.319 |
| eknn-l2lsh-L=96-k=8-w=4096_candidates=512_probes=4 | 0.950 | 57.320 |
| eknn-l2lsh-L=96-k=4-w=2048_candidates=512_probes=4 | 0.934 | 61.600 |
| eknn-l2lsh-L=64-k=8-w=4096_candidates=1024_probes=4 | 0.959 | 69.827 |
| eknn-l2lsh-L=64-k=8-w=4096_candidates=512_probes=4 | 0.915 | 92.171 |

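One way to sift grid-search output like this is to keep only the Pareto-optimal configurations, i.e. those not beaten on both recall and qps by some other configuration. A small sketch over a handful of the unverified rows above (parameter labels abbreviated; the selection logic is mine, not part of Elastiknn):

```python
def pareto_front(results):
    """Keep configs not dominated by another config that has both
    recall and queries/second at least as high."""
    front = []
    for name, recall, qps in results:
        dominated = any(r2 >= recall and q2 >= qps and (r2, q2) != (recall, qps)
                        for _, r2, q2 in results)
        if not dominated:
            front.append((name, recall, qps))
    return front

# A few rows copied from the unverified grid-search output above.
results = [
    ("L=96-k=4-w=4096_c=1024_p=0", 0.954, 68.276),
    ("L=64-k=8-w=4096_c=1024_p=2", 0.936, 97.321),
    ("L=96-k=8-w=4096_c=1024_p=0", 0.905, 138.512),
    ("L=128-k=4-w=4096_c=1024_p=0", 0.971, 60.615),
    ("L=128-k=8-w=4096_c=1024_p=2", 0.978, 53.779),
    ("L=96-k=2-w=2048_c=1024_p=0", 0.906, 62.746),
]
front = pareto_front(results)
```

Here the L=96-k=2 row drops out (beaten by L=64-k=8 on both axes), leaving the recall/throughput trade-off curve to pick from.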
alexklibisz commented 3 months ago

I managed to find some parameters that get 239 QPS at 0.96 recall. There are a ton of results in this commit: https://github.com/alexklibisz/elastiknn/commit/c7efcf14a59c4cdcd15636042aeb6f4e381634ac

alexklibisz commented 3 months ago

The fully-dockerized ann-benchmarks results are still quite pitiful:

| Model | Parameters | Recall | Queries per Second |
|---|---|---|---|
| eknn-l2lsh | L=175 k=7 w=3900 candidates=100 probes=0 | 0.607 | 233.326 |
| eknn-l2lsh | L=175 k=7 w=3900 candidates=500 probes=0 | 0.921 | 200.319 |
| eknn-l2lsh | L=175 k=7 w=3900 candidates=1000 probes=0 | 0.962 | 169.238 |

I went ahead and opened a PR to get the latest parameters and Elastiknn version into ann-benchmarks: https://github.com/erikbern/ann-benchmarks/pull/544