Closed thomasahle closed 1 year ago
@erikbern I want to make sure building the different parametrized indices don't take too long.
Basically my fit/build consist of two tasks: (1) Running k-means on the dataset. (2) Adding each point to the m
nearest clusters for some m
.
The first task takes a long time 10-60 min depending on the number of cluster points. The second task is quite fast. 1-3min.
The issue is, that with the current args vs query-args system, there's no way for me to cache the k-means clustering.
This means that when trying different values of m
, the entire k-means process gets run from scratch.
Is there a way I can set up the configuration to prevent this problem? That would allow me to try more different m
values.
The ideal setting seems to be skew in [1, 4] and build_probes in [1, 9]. I could rewrite this pull-request using arg-groups, to just try one value for build-probes per skew. But if there's a way to cache the k-means, so I only have to do it 4 times, then I could try the whole product [1, 4] x [1, 9].
Seems a bit complex to refactor the benchmarks to support what you want to do – hopefully you can fit it into the current system!
Sure, I've updated it with some more fine grained arg-groups. Hopefully they don't time out.
I talked to Matthijs Douze, and he recommended using a number of clusters that's slightly larger than sqrt(N). This commit just adds three build arguments to do that.