erikbern / ann-benchmarks

Benchmarks of approximate nearest neighbor libraries in Python
http://ann-benchmarks.com
MIT License
4.86k stars 729 forks

New run, April 2023 #363

Closed · erikbern closed this issue 1 year ago

erikbern commented 1 year ago

I just ran sift-128-euclidean and here are the results:

(edit: this chart is superseded by a later chart below)

I also ran glove-100-angular, but had a minor bug – see #362 – and I'm rerunning everything just to be on the safe side. Will post it later today.

Ran it on an r6i.16xlarge with 31 runs in parallel – was able to run everything in just about 8 hours.

Note that these aren't final – I want to give library maintainers a chance to fix any issues.

This plot is extremely crowded btw – any thoughts on algos we can remove? Or alternatively we could break it up into maybe "libraries" vs "servers" or something

erikbern commented 1 year ago

~glove-100-angular, 80% complete:~

~I'm killing this for now since it's $100/day+ to run. But will rerun soon!~

EDIT: this one was incorrect due to #372 – I'll re-post new benchmarks in a day or two

erikbern commented 1 year ago

Had a 1TB EBS volume, but in retrospect that was kind of ridiculous – only needed 34GB up to this point!

WPJiang commented 1 year ago

> glove-100-angular, 80% complete:
>
> I'm killing this for now since it's $100/day+ to run. But will rerun soon!

Hi @erikbern, in this test result the recall of glove-100-angular is close to zero, as mentioned in #367. We checked the code and found that the calculation method of the angular distance has changed. The normalized Euclidean distance currently used is order-preserving with respect to the angular distance, but their values are not equal. In the current code, the distance threshold used to determine the top-K is read directly from the groundtruth (which is an angular distance), while the distances of the top-K vectors are normalized Euclidean distances. We think this is the cause of the near-zero recall. To solve the problem, we could take the top-K IDs from the groundtruth and use the original vectors to compute the threshold, or go back to the original ann-benchmarks angular distance calculation.
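To make the mismatch concrete, here is a minimal numpy sketch (illustrative vectors and function names, not the actual benchmark code): both distances rank neighbors identically, but their values differ, so a top-K threshold read from the groundtruth (cosine/angular distance) ends up rejecting results measured with normalized Euclidean distance.

```python
import numpy as np

def groundtruth_distance(u, v):
    # "angular" distance as stored in the groundtruth: 1 - cosine similarity
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def query_time_distance(u, v):
    # Euclidean distance on unit-normalized vectors
    u, v = u / np.linalg.norm(u), v / np.linalg.norm(v)
    return np.linalg.norm(u - v)

q = np.array([1.0, 0.0])
for x in [np.array([1.0, 0.1]), np.array([1.0, 1.0]), np.array([0.0, 1.0])]:
    print(groundtruth_distance(q, x), query_time_distance(q, x))

# The two columns increase together (order-preserving), but since
# ||u/|u| - v/|v|||^2 = 2 * (1 - cos(u, v)), the normalized Euclidean distance
# equals sqrt(2 * groundtruth_distance) and is the larger of the two here, so
# almost every returned neighbor falls outside the groundtruth threshold.
```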

erikbern commented 1 year ago

Ohhh nice catch. I didn't realize changing the distance would matter as long as the order was preserved. I see now that we are putting the distances in the vector index and using it as a threshold.

I think the angular distance could be computed from dot products, is that right? Since

dot(u, v) = |u| * |v| * cos(theta)

then what we want is I think theta = arccos( dot(u, v) / (|u| * |v|) )

is that right?

And just to sanity check some values:

  • angular distance of [0, 1] and [1, 0] is pi/2
  • angular distance of [-1, 0] and [1, 0] is pi
  • angular distance of [1, 0] and [1, 1] is pi/4
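A quick numpy sketch of that formula, just to check it against the values above (a throwaway snippet, not benchmark code):

```python
import numpy as np

def angular_distance(u, v):
    # theta = arccos(dot(u, v) / (|u| * |v|)), clipped to guard against
    # floating-point round-off pushing the cosine slightly outside [-1, 1]
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.arccos(np.clip(cos, -1.0, 1.0))

print(angular_distance([0, 1], [1, 0]) / np.pi)   # 0.5  -> pi/2
print(angular_distance([-1, 0], [1, 0]) / np.pi)  # 1.0  -> pi
print(angular_distance([1, 0], [1, 1]) / np.pi)   # 0.25 -> pi/4
```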

benbenwang1999 commented 1 year ago

Good work! Btw, why is the result of Milvus better than the original result?

erikbern commented 1 year ago

I think Milvus made a lot of improvements since the last run which was multiple years ago

benbenwang1999 commented 1 year ago

> I think Milvus made a lot of improvements since the last run which was multiple years ago

Thanks a lot, the answer is very helpful. I have another question: why is recall-QPS chosen as the indicator instead of precision-QPS?

WPJiang commented 1 year ago

> Ohhh nice catch. I didn't realize changing the distance would matter as long as the order was preserved. I see now that we are putting the distances in the vector index and using it as a threshold.
>
> I think the angular distance could be computed from dot products, is that right? Since
>
> dot(u, v) = |u| * |v| * cos(theta)
>
> then what we want is I think theta = arccos( dot(u, v) / (|u| * |v|) )
>
> is that right?
>
> And just to sanity check some values:
>
> • angular distance of [0, 1] and [1, 0] is pi/2
> • angular distance of [-1, 0] and [1, 0] is pi
> • angular distance of [1, 0] and [1, 1] is pi/4

@erikbern, the angular distance in the current ann-benchmarks datasets is actually the cosine distance, which is calculated as 1 minus the cosine similarity of the vectors: d = 1 - cos(u, v) = 1 - dot(u, v) / (|u| * |v|). We have committed a new PR #372 to fix it.
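For concreteness, a small sketch of that definition and how it relates back to the arccos form discussed earlier (illustrative only, not the actual code in #372):

```python
import numpy as np

def cosine_distance(u, v):
    # the "angular" distance actually stored in the ann-benchmarks datasets:
    # 1 - cosine similarity
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

u, v = np.array([1.0, 0.0]), np.array([1.0, 1.0])
d = cosine_distance(u, v)      # 1 - 1/sqrt(2) ~= 0.293
theta = np.arccos(1.0 - d)     # recovers the angle itself: pi/4 here
print(d, theta)
```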

Besides, would you please consider relaxing the timeout setting to 24 hours? We found that for some datasets, some algorithms (such as NGTqg and qsgNGT) cannot complete index construction within 2 hours, but with the timeout set to 24 hours they get good QPS-recall performance. Of course, these algorithms' disadvantage in construction time will still be reflected in the recall vs. build time results. 24 hours of construction time is indeed a bit long, but for some offline construction applications it is acceptable to trade construction time for QPS-recall performance.

erikbern commented 1 year ago

> @erikbern, the angular distance in the current ann-benchmarks datasets is actually the cosine distance, which is calculated as 1 minus the cosine similarity of the vectors: d = 1 - cos(u, v) = 1 - dot(u, v) / (|u| * |v|). We have committed a new PR #372 to fix it.

Thanks – I also fixed the test in #373 and added an extra test to make sure it's consistent with the dataset

> Besides, would you please consider relaxing the timeout setting to 24 hours? We found that for some datasets, some algorithms (such as NGTqg and qsgNGT) cannot complete index construction within 2 hours, but with the timeout set to 24 hours they get good QPS-recall performance. Of course, these algorithms' disadvantage in construction time will still be reflected in the recall vs. build time results. 24 hours of construction time is indeed a bit long, but for some offline construction applications it is acceptable to trade construction time for QPS-recall performance.

Responded in #374 but the tl;dr is I don't think we should do that

erikbern commented 1 year ago

> I have another question: why is recall-QPS chosen as the indicator instead of precision-QPS?

I think precision and recall are the same thing given the current definition? We ask each algorithm to return the nearest 10 neighbors and check how many of those are among the true 10 neighbors. Given that those two sets have the same size, recall and precision will have the same value.
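A toy illustration of why the two coincide here (made-up ID sets, both of size 10):

```python
true_neighbors = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}      # groundtruth top-10
returned       = {1, 2, 3, 5, 8, 13, 21, 34, 55, 89}  # algorithm's top-10
hits = len(true_neighbors & returned)                  # 5 correct neighbors

recall    = hits / len(true_neighbors)  # 5 / 10 = 0.5
precision = hits / len(returned)        # 5 / 10 = 0.5
# Because |returned| == |true_neighbors| == 10, the denominators are the same,
# so recall and precision are always equal in this setup.
```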

thomasahle commented 1 year ago

Have you tried running some of the algorithms more than once? I wonder how much noise there is. Maybe that benefits algorithms with more hyperparameter options...

erikbern commented 1 year ago

We run every algorithm 5 times and pick the best run.

erikbern commented 1 year ago

Ran glove-100-angular overnight and here are the results:

edit: no longer latest results – see below

Something seems broken with Redisearch – will look into it (it got zero recall on every query). Weird since it seems to work on smaller tests. Similarly Weaviate's recall isn't great – will run it by them to see if I can improve the configuration. And Pynndescent seems to have degraded since the last run (high QPS, but not very high recall)

thomasahle commented 1 year ago

Looks like Pynndescent has a bug...

It also looks like it might be interesting to add some more argument options in the "high recall" regime for most algorithms. Only 5 of them currently go above 0.99.

erikbern commented 1 year ago

Fixed Redisearch and changed parameters for Weaviate – current results for glove-100-angular

image

@thomasahle agree it would be nice to have more high-recall points. Maybe that's the only place where Annoy currently seems to have utility :)

erikbern commented 1 year ago

Latest results for sift-128-euclidean:

image

thomasahle commented 1 year ago

Did fast_pq not work for sift-128-euclidean?

lmcinnes commented 1 year ago

Clearly something is going astray with pynndescent for glove-100. I'll try to take a look and see if I can reproduce the issue locally.

erikbern commented 1 year ago

> Did fast_pq not work for sift-128-euclidean?

Not sure what happened. It's possible I messed it up somehow. I'm going to re-run it anyway, so hopefully it will work next time (if not then let's fix it)

erikbern commented 1 year ago

> Clearly something is going astray with pynndescent for glove-100. I'll try to take a look and see if I can reproduce the issue locally.

This line seems suspect! But maybe it's intentional? https://github.com/erikbern/ann-benchmarks/blob/main/ann_benchmarks/algorithms/pynndescent.py#L33

lmcinnes commented 1 year ago

I believe it was deliberate, but it seems like some handling of that in pynndescent has gone astray. That does look like the source of the issue, though. I'll try to get something fixed imminently. In the meantime, swapping back to cosine should work and just be a little slower.

EDIT: Found and fixed the problem (needed to ensure normalized versions of the data showed up in the right place in the copy_on_normalize scenario). Pynndescent 0.5.9 on PyPI should contain the fix. In principle, rebuilding the docker container for pynndescent should be enough to fix it for the benchmarking.

erikbern commented 1 year ago

Ok, nice, sounds like the next run should work then!

thomasahle commented 1 year ago

Just realized r6i supports AVX-512, which has float16 support. Seems DiskANN takes advantage of that, and ScaNN does too. Seems very useful for this kind of workload. I guess all libraries need to support it to be competitive.

lmcinnes commented 1 year ago

Out of curiosity, are the Jaccard benchmarks going to make it into this run? Since NMSlib Jaccard was added, there should be a reasonable number of implementations for the Jaccard benchmark at last.

erikbern commented 1 year ago

Yeah I'll re-run all the benchmarks. Haven't verified the Jaccard ones but hopefully they still work :)

erikbern commented 1 year ago

Closing this in favor of #399