erikbern / ann-benchmarks

Benchmarks of approximate nearest neighbor libraries in Python
http://ann-benchmarks.com
MIT License
4.74k stars 718 forks source link

Added a skew factor to the number of clusters #391

Closed thomasahle closed 1 year ago

thomasahle commented 1 year ago

I talked to Matthijs Douze, and he recommended using a number of clusters that's slightly larger than sqrt(N). This commit just adds three build arguments to do that.

thomasahle commented 1 year ago

@erikbern I want to make sure building the different parametrized indices don't take too long. Basically my fit/build consist of two tasks: (1) Running k-means on the dataset. (2) Adding each point to the m nearest clusters for some m.

The first task takes a long time 10-60 min depending on the number of cluster points. The second task is quite fast. 1-3min.

The issue is, that with the current args vs query-args system, there's no way for me to cache the k-means clustering. This means that when trying different values of m, the entire k-means process gets run from scratch. Is there a way I can set up the configuration to prevent this problem? That would allow me to try more different m values.

thomasahle commented 1 year ago

The ideal setting seems to be skew in [1, 4] and build_probes in [1, 9]. I could rewrite this pull-request using arg-groups, to just try one value for build-probes per skew. But if there's a way to cache the k-means, so I only have to do it 4 times, then I could try the whole product [1, 4] x [1, 9].

Figure_1

erikbern commented 1 year ago

Seems a bit complex to refactor the benchmarks to support what you want to do – hopefully you can fit it into the current system!

thomasahle commented 1 year ago

Sure, I've updated it with some more fine grained arg-groups. Hopefully they don't time out.