erikbern / ann-benchmarks

Benchmarks of approximate nearest neighbor libraries in Python
http://ann-benchmarks.com
MIT License

Fix logic in _get_algorithm_definitions to avoid skipping algorithm definitions #498

Closed alexklibisz closed 3 months ago

alexklibisz commented 3 months ago

Maybe I'm missing something, but it seems like the logic in _get_algorithm_definitions leads to incorrectly skipping algorithm definitions, which I've attempted to fix here.

For example, elastiknn has definitions for the "point types" any and euclidean: https://github.com/erikbern/ann-benchmarks/blob/main/ann_benchmarks/algorithms/elastiknn/config.yml

But if I run the following command, I get the "Nothing to run" exception:

    python run.py --algorithm elastiknn-l2lsh --dataset random-xs-20-euclidean --run-disabled --timeout 30 --local --force --runs 1

That doesn't make sense IMO. Elastiknn has definitions for the euclidean point type, so there is not "nothing to run".

It seems that the non-any point type is skipped because of the logic in _get_algorithm_definitions. If an algorithm has definitions for any, they take precedence over the definitions for a specific point type (euclidean). We can fix this by changing the logic so that it accumulates all matching point types, rather than just taking the any type and skipping the rest. In other words, we change the elif to a second if.
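To make the difference concrete, here is a minimal sketch of the two behaviours. It is not the actual body of _get_algorithm_definitions. The function names, the dictionary layout, and the algorithm names below are hypothetical and only illustrate the elif versus second-if change described above.

```python
# Simplified illustration only: the real _get_algorithm_definitions in
# ann-benchmarks works on parsed config.yml files and may differ in detail.

def select_definitions_elif(definitions_by_point_type, point_type):
    """Behaviour described as buggy: `any` wins and the specific type is skipped."""
    selected = []
    if "any" in definitions_by_point_type:
        selected.extend(definitions_by_point_type["any"])
    elif point_type in definitions_by_point_type:
        selected.extend(definitions_by_point_type[point_type])
    return selected


def select_definitions_two_ifs(definitions_by_point_type, point_type):
    """Behaviour after the fix: accumulate every matching point type."""
    selected = []
    if "any" in definitions_by_point_type:
        selected.extend(definitions_by_point_type["any"])
    if point_type in definitions_by_point_type:
        selected.extend(definitions_by_point_type[point_type])
    return selected


# Hypothetical config with definitions under both `any` and `euclidean`.
example_config = {
    "any": ["example-any-algo"],
    "euclidean": ["example-l2-algo"],
}

skipped = select_definitions_elif(example_config, "euclidean")
merged = select_definitions_two_ifs(example_config, "euclidean")
print(skipped)
print(merged)
```

With the elif version, the euclidean-only definition never reaches the result, so asking for it by name finds nothing to run. With two independent checks, both groups are returned.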

maumueller commented 3 months ago

Interesting, this seems to have been broken for a long time (and meant that we excluded many of the implementations of the nmslib library). Thanks for the fix, @alexklibisz!

I sampled a few implementations and only pynndescent has a somewhat strange structure for euclidean/angular/any. @lmcinnes Could you check if the any entry of https://github.com/erikbern/ann-benchmarks/blob/c4155055ee45a0dc46ee5bf1a90f6fbde927c50d/ann_benchmarks/algorithms/pynndescent/config.yml is useful?

lmcinnes commented 3 months ago

I think the any case is a "fallback" option in case the other matches didn't work out, so it works as an "I don't know what else to do; try this" approach. But if that is not how any is being used, then perhaps we just remove the any option for pynndescent? How is any intended to work?

maumueller commented 3 months ago

Thanks for the quick reply! It seems that any took precedence over all other configurations, so all your pynndescent runs may have been using these parameter settings.

With the fix from @alexklibisz, it will now merge the any configuration with the euclidean or the angular configuration, depending on the dataset.

lmcinnes commented 3 months ago

I think removing the any option for pynndescent is probably the best option then. Those aren't really optimal parameters for anything, just a reasonable in-between choice to cover possibilities. Best to rely on the specific values for the individual metric types.