Closed erikbern closed 1 year ago
~glove-100-angular, 80% complete:~
~I'm killing this for now since it's $100/day+ to run. But will rerun soon!~
EDIT: this one was incorrect due to #372 – I'll re-post new benchmarks in a day or two
Had a 1 TB EBS volume, but in retrospect that was kind of ridiculous – only needed 34 GB up to this point!
Hi @erikbern, in this test result the recall of glove-100-angular is close to zero, as mentioned in #367. We checked the code and found that the calculation method of the angular distance has changed. The normalized Euclidean distance currently used is order-preserving with respect to angular distance, but their values are not equal. In the current code, the distance threshold used to determine the top-K is read directly from the ground truth (which is an angular distance), while the distances of the top-K vectors are normalized Euclidean distances. We think this is the cause of the nearly-zero recall. To solve the problem, we could take the top-K IDs from the ground truth and use the original vectors to compute the threshold, or go back to the original ann-benchmarks angular distance calculation.
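A small NumPy sketch (not the benchmark code itself, just an illustration of the mismatch described above): the two metrics agree on ordering but not on values, so a threshold computed in one metric cannot be compared against distances computed in the other.

```python
import numpy as np

def angular_distance(u, v):
    # arccos of the cosine similarity, in radians
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.arccos(np.clip(cos, -1.0, 1.0))

def normalized_euclidean(u, v):
    # Euclidean distance between the unit-normalized vectors
    u = u / np.linalg.norm(u)
    v = v / np.linalg.norm(v)
    return np.linalg.norm(u - v)

q = np.array([1.0, 0.0])
a = np.array([1.0, 1.0])
b = np.array([0.0, 1.0])

# Same ordering: a is closer to q than b under both metrics...
assert angular_distance(q, a) < angular_distance(q, b)
assert normalized_euclidean(q, a) < normalized_euclidean(q, b)

# ...but the values differ, so a ground-truth angular threshold
# is meaningless when compared against normalized Euclidean distances.
assert np.isclose(angular_distance(q, b), np.pi / 2)       # ≈ 1.5708
assert np.isclose(normalized_euclidean(q, b), np.sqrt(2))  # ≈ 1.4142
```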
Ohhh nice catch. I didn't realize changing the distance would matter as long as the order was preserved. I see now that we are putting the distances in the vector index and using them as a threshold.
I think the angular distance could be computed from dot products, is that right? Since
dot(u, v) = |u| * |v| * cos(theta)
then what we want is, I think,
theta = arccos( dot(u, v) / (|u| * |v|) )
is that right?
And just to sanity check some values:
- the angular distance of [0, 1] and [1, 0] is pi/2
- the angular distance of [1, 0] and [-1, 0] is pi
- the angular distance of [1, 0] and [1, 1] is pi/4
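The formula and sanity checks above can be verified in a few lines of plain Python (the function name is just for illustration):

```python
import math

def angular_distance(u, v):
    # theta = arccos( dot(u, v) / (|u| * |v|) )
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return math.acos(dot / (norm_u * norm_v))

# Orthogonal vectors are pi/2 apart, opposite vectors pi apart,
# and a 45-degree rotation is pi/4 apart.
assert math.isclose(angular_distance([0, 1], [1, 0]), math.pi / 2)
assert math.isclose(angular_distance([1, 0], [-1, 0]), math.pi)
assert math.isclose(angular_distance([1, 0], [1, 1]), math.pi / 4)
```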
Good work! Btw, why is the result of Milvus better than the original result?
I think Milvus made a lot of improvements since the last run which was multiple years ago
> I think Milvus made a lot of improvements since the last run which was multiple years ago
Thanks a lot, the answer is very helpful. I have another question: why does the benchmark use recall-QPS as the indicator instead of precision-QPS?
@erikbern, the angular distance in the current ann-benchmarks datasets is actually the cosine distance, which is calculated as 1 minus the cosine similarity of the vectors: d = 1 - cos(u, v) = 1 - dot(u, v) / (|u| * |v|). We have committed a new PR #372 to fix it.
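As a quick illustration of that formula (a standalone sketch, not the ann-benchmarks code): cosine distance is 0 for identical directions, 1 for orthogonal vectors, and 2 for opposite vectors.

```python
import numpy as np

def cosine_distance(u, v):
    # d = 1 - dot(u, v) / (|u| * |v|)
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

x = np.array([1.0, 0.0])
assert np.isclose(cosine_distance(x, np.array([1.0, 0.0])), 0.0)   # same direction
assert np.isclose(cosine_distance(x, np.array([0.0, 1.0])), 1.0)   # orthogonal
assert np.isclose(cosine_distance(x, np.array([-1.0, 0.0])), 2.0)  # opposite
```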
Besides, would you please consider relaxing the timeout setting to 24 hours? We found that for some datasets, some algorithms (such as NGTqg and qsgNGT) cannot finish index construction within 2 hours, but when the timeout is set to 24 hours they get good QPS-recall performance. Of course, these algorithms' disadvantage in construction time will show up in the recall/build-time plot. 24 hours of construction time is indeed a bit long, but for some offline-construction applications it is acceptable to trade construction time for QPS-recall performance.
> @erikbern, the angular distance in the current ann-benchmarks datasets is actually the cosine distance, which is calculated as 1 minus the cosine similarity of the vectors: d = 1 - cos(u, v) = 1 - dot(u, v) / (|u| * |v|). We have committed a new PR #372 to fix it.
Thanks – I also fixed the test in #373 and added an extra test to make sure it's consistent with the dataset
> Besides, would you please consider relaxing the timeout setting to 24 hours? We found that for some datasets, some algorithms (such as NGTqg and qsgNGT) cannot complete the construction within 2 hours, but when the timeout is set to 24 hours, they get good QPS-recall performance.
Responded in #374 but the tl;dr is I don't think we should do that
> I have another question, why the indicator chooses recall-QPS instead of precision-QPS
I think precision and recall are the same thing given the current definition? We ask each algorithm to return the nearest 10 neighbors and check how many of those are among the true 10 neighbors. Given that those two sets have the same size, recall and precision will have the same value.
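A toy example with hypothetical helper functions, just to illustrate the argument: when the retrieved set and the ground-truth set both have size 10, precision and recall divide the same intersection count by the same denominator, so they are equal.

```python
def recall(retrieved, truth):
    # fraction of the true neighbors that were found
    return len(set(retrieved) & set(truth)) / len(truth)

def precision(retrieved, truth):
    # fraction of the returned neighbors that are true neighbors
    return len(set(retrieved) & set(truth)) / len(retrieved)

# The algorithm returns 10 ids; the ground truth also has 10 ids.
retrieved = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
truth     = [1, 2, 3, 4, 5, 6, 7, 8, 11, 12]

# 8 of 10 overlap, so both metrics come out to 0.8.
assert recall(retrieved, truth) == precision(retrieved, truth) == 0.8
```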
Have you tried running some of the algorithms more than once? I wonder how much noise there is. Maybe that benefits algorithms with more hyperparameter options...
We run every algorithm 5 times and pick the best run.
Ran glove-100-angular overnight and here are the results:
edit: no longer latest results – see below
Something seems broken with Redisearch – will look into it (it got zero recall on every query). Weird since it seems to work on smaller tests. Similarly Weaviate's recall isn't great – will run it by them to see if I can improve the configuration. And Pynndescent seems to have degraded since the last run (high QPS, but not very high recall)
Looks like Pynndescent has a bug...
It also looks like it might be interesting to add some more argument options in the "high recall" regime for most algorithms. Only 5 of them currently go above 0.99.
Fixed Redisearch and changed parameters for Weaviate – current results for glove-100-angular
@thomasahle agree it would be nice to have more high-recall points. Maybe that's the only place Annoy currently seems to have utility :)
Latest results for sift-128-euclidean:
Did fast_pq not work for sift-128-euclidean?
Clearly something is going astray with pynndescent for Glove-100. I'll try to take a look and see if I reproduce the issue locally.
> Did fast_pq not work for sift-128-euclidean?
Not sure what happened. It's possible I messed it up somehow. I'm going to re-run it anyway, so hopefully it will work next time (if not then let's fix it)
> Clearly something is going astray with pynndescent for Glove-100. I'll try to take a look and see if I reproduce the issue locally.
This line seems suspect! But maybe it's intentional? https://github.com/erikbern/ann-benchmarks/blob/main/ann_benchmarks/algorithms/pynndescent.py#L33
I believe it was deliberate, but it seems like some handling of that in pynndescent has gone astray. That does look like the source of the issue though. I'll try to get something fixed imminently. In the meantime, swapping back to cosine should work and just be a little slower.
EDIT: Found and fixed the problem (needed to ensure normalized versions of the data showed up in the right place in the copy_on_normalize scenario). pynndescent 0.5.9 on PyPI should contain the fix. In principle, rebuilding the docker container for pynndescent should be all that's needed to fix it for the benchmarking.
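The general trick at play here (a sketch of the idea, not pynndescent's actual implementation): if the data is unit-normalized once up front, ranking by Euclidean distance, or by dot product, reproduces the cosine/angular ordering, which is usually cheaper than computing cosine per query.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(100, 10))
query = rng.normal(size=10)

# Unit-normalize both sides once, up front.
data_n = data / np.linalg.norm(data, axis=1, keepdims=True)
query_n = query / np.linalg.norm(query)

# On unit vectors, |u - v|^2 = 2 - 2 * dot(u, v), so ranking by
# Euclidean distance gives the same order as cosine distance.
order_euclidean = np.argsort(np.linalg.norm(data_n - query_n, axis=1))
order_cosine = np.argsort(1.0 - data_n @ query_n)
assert (order_euclidean == order_cosine).all()
```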
Ok, nice, sounds like the next run should work then!
Just realized r6i supports AVX-512, which has float16 support. Seems DiskANN takes advantage of that, and ScaNN does too. Seems very useful for this kind of workload. I guess all libraries need to support it to be competitive.
Out of curiosity are the Jaccard benchmarks going to make it into this run? Since NMSlib Jaccard was added there should be a reasonable number of implementations for the Jaccard benchmark at last.
Yeah I'll re-run all the benchmarks. Haven't verified the Jaccard ones but hopefully they still work :)
Closing this in favor of #399
I just ran sift-128-euclidean and here are the results:
(edit: this chart is superseded by a later chart below)
I also ran glove-100-angular, but had a minor bug – see #362 – and I'm rerunning everything just to be on the safe side. Will post it later today.
Ran it on a r6i.16xlarge with 31 runs in parallel – was able to run everything in just about 8 hours.
Note that these aren't final – I want to give library maintainers a chance to fix any issues.
This plot is extremely crowded btw – any thoughts on algos we can remove? Or alternatively we could break it up into maybe "libraries" vs "servers" or something