erikbern / ann-benchmarks

Benchmarks of approximate nearest neighbor libraries in Python
http://ann-benchmarks.com
MIT License
4.74k stars 718 forks

Why is `lastfm-64-dot` called `-dot`? #400

Open thomasahle opened 1 year ago

thomasahle commented 1 year ago

From the path http://ann-benchmarks.com/lastfm-64-dot_10_angular.html it seems that this dataset is actually angular. But the name indicates dot-product, which many of the algorithms don't natively support.
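To illustrate why the distinction matters, here is a minimal NumPy sketch with made-up toy vectors (nothing here comes from the actual lastfm data, and the helper names are hypothetical). It shows that the top hit under dot product and under angular (cosine) similarity can differ when item norms vary:

```python
import numpy as np

def top1_by_dot(items, query):
    """Index of the item with the largest inner product with the query."""
    return int(np.argmax(items @ query))

def top1_by_angular(items, query):
    """Index of the item with the largest cosine similarity to the query."""
    unit_items = items / np.linalg.norm(items, axis=1, keepdims=True)
    unit_query = query / np.linalg.norm(query)
    return int(np.argmax(unit_items @ unit_query))

# Toy data (an assumption for illustration): item 0 points exactly along
# the query, item 1 is off-axis but has a much larger norm.
items = np.array([[1.0, 0.0],
                  [3.0, 3.0]])
query = np.array([1.0, 0.0])

dot_winner = top1_by_dot(items, query)          # dot products: 1 vs 3
angular_winner = top1_by_angular(items, query)  # cosines: 1 vs ~0.707
print(dot_winner, angular_winner)
```

So a dataset labelled `-dot` but evaluated as angular would have different ground-truth neighbors than the name suggests, unless all vectors happen to have the same norm.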

erikbern commented 1 year ago

yeah this is very confusing – I think it's a mistake. https://github.com/erikbern/ann-benchmarks/blob/main/ann_benchmarks/datasets.py#L427 indicates it's angular (cosine) distance too.

Maybe let's remove this dataset from the benchmarks for now.

maumueller commented 1 year ago

@benfred should be able to shed some light on this.

benfred commented 1 year ago

The original intent was to test out inner-product distance (dot), not angular distance: https://github.com/erikbern/ann-benchmarks/pull/91 .

IIRC, the rationale was that certain algorithms either didn't support inner-product (IP) distance, or didn't perform well when applying transforms like https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/XboxInnerProduct.pdf to convert an IP search into a search in cosine space.
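For reference, a rough sketch of that kind of reduction as I understand it: each item gets one extra coordinate so that all items end up with the same norm, and the query gets a zero appended. Inner products are unchanged, so ranking by cosine on the augmented vectors matches ranking by inner product on the originals. The function names and the random data below are assumptions for illustration, not code from ann-benchmarks or the paper:

```python
import numpy as np

def augment_items(items):
    """Append sqrt(M^2 - ||x||^2) to each item, where M is the largest item norm.

    After this, every augmented item has norm M.
    """
    norms = np.linalg.norm(items, axis=1)
    max_norm = norms.max()
    # np.maximum guards against tiny negative values from rounding.
    extra = np.sqrt(np.maximum(max_norm**2 - norms**2, 0.0))
    return np.hstack([items, extra[:, None]])

def augment_query(query):
    """Append a zero so the extra item coordinate does not affect the dot product."""
    return np.append(query, 0.0)

# Synthetic data with widely varying norms (an assumption for illustration).
rng = np.random.default_rng(0)
items = rng.normal(size=(1000, 64)) * rng.uniform(0.1, 5.0, size=(1000, 1))
query = rng.normal(size=64)

aug_items = augment_items(items)
aug_query = augment_query(query)

# Best item by raw inner product on the original vectors.
ip_best = int(np.argmax(items @ query))

# Best item by cosine similarity on the augmented vectors.
cos = (aug_items @ aug_query) / (
    np.linalg.norm(aug_items, axis=1) * np.linalg.norm(aug_query)
)
cos_best = int(np.argmax(cos))

print(ip_best == cos_best)
```

The catch is that the extra coordinate can dominate for small-norm items, which squeezes all the angles together; that may be part of why some indexes performed poorly on the transformed data.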

erikbern commented 1 year ago

I think it's nice to have a dataset for dot products. But I'll fix that after I'm done with this run.