Closed: madisonmay closed this issue 7 years ago.
@madisonmay : most indexing algorithms first learn the structure (for those that require training); based on these parameters, which essentially capture information about the data distribution, you can then populate and search the index. There are some exceptions, but few indexing algorithms actually support online training in batches as you mention, in particular because the partitioning of the feature space used by the index should stay fixed to avoid drift (new queries must be assigned to the same regions of the space as the stored vectors). For online search, the typical strategy is to periodically re-train and re-build the index, or to create new indexes over time.
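The train/add/search phases described above can be sketched with a toy inverted-file index. This `TinyIVF` class is a hypothetical illustration in plain numpy, not Faiss's implementation; Faiss's `IndexIVF*` classes follow the same three phases (`train`, `add`, `search`), with the key point being that the partition learned in `train` is frozen before any vectors are added.

```python
import numpy as np

rng = np.random.default_rng(0)

class TinyIVF:
    """Toy inverted-file index (illustration only, not Faiss's code):
    train() fixes a coarse partition of the space, add() assigns stored
    vectors to cells, search() probes only the nearest cells."""

    def __init__(self, nlist, nprobe=2):
        self.nlist, self.nprobe = nlist, nprobe
        self.centroids = None
        self.lists = None  # per-cell (ids, vectors)

    def train(self, xt, iters=10):
        # Plain k-means on the training set. After this, the partition is
        # frozen: moving the cell boundaries later would invalidate the
        # cell assignments of vectors already added (the "drift" problem).
        c = xt[rng.choice(len(xt), self.nlist, replace=False)]
        for _ in range(iters):
            assign = np.argmin(((xt[:, None] - c) ** 2).sum(-1), axis=1)
            for j in range(self.nlist):
                pts = xt[assign == j]
                if len(pts):
                    c[j] = pts.mean(axis=0)
        self.centroids = c
        self.lists = [([], []) for _ in range(self.nlist)]

    def add(self, xb):
        # Assign each stored vector to its nearest (fixed) centroid.
        base = sum(len(ids) for ids, _ in self.lists)
        assign = np.argmin(((xb[:, None] - self.centroids) ** 2).sum(-1), axis=1)
        for i, j in enumerate(assign):
            self.lists[j][0].append(base + i)
            self.lists[j][1].append(xb[i])

    def search(self, q, k):
        # Probe the nprobe closest cells, then rank their contents exactly.
        order = np.argsort(((q - self.centroids) ** 2).sum(-1))[: self.nprobe]
        ids, vecs = [], []
        for j in order:
            ids += self.lists[j][0]
            vecs += self.lists[j][1]
        d = ((np.asarray(vecs) - q) ** 2).sum(-1)
        return [ids[t] for t in np.argsort(d)[:k]]

d = 8
xt = rng.standard_normal((500, d)).astype("float32")   # training set
xb = rng.standard_normal((1000, d)).astype("float32")  # stored vectors
index = TinyIVF(nlist=16)
index.train(xt)                    # 1. learn the partition
index.add(xb)                      # 2. populate
hits = index.search(xb[0], k=5)    # 3. query
print(hits[0])                     # xb[0]'s nearest neighbour is itself -> 0
```

Retraining on fresh data would move the centroids, which is why the periodic re-train strategy above rebuilds the whole index rather than updating it in place.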
Do you have a particular algorithm in mind?
Additional comment: some algorithms require that train = test, i.e., you must use the stored vectors themselves for training. Training on the vectors to be indexed gives better performance for all algorithms, including those for which it is not a requirement (as in Faiss). Algorithms for which it is a requirement are less flexible; in particular, they are not suited to situations where you need to add vectors on the fly.
Thanks @jegou, your response was quite helpful; the flow makes much more sense now. Reading over https://github.com/facebookresearch/faiss/blob/master/benchs/bench_polysemous_1bn.py was also useful. I suppose I was thinking of the add operation as stashing away vectors for later inclusion in the training step, rather than as adding vectors to the set of vectors that a search operation can return.
> There are some exceptions, yet few indexing algorithms actually support online training in batches as you mention
What are the exceptions to this rule?
I'm looking for an index type that's a decent tradeoff between scalability and precision/recall. We're looking to index roughly 10M vectors. We'll be receiving new vectors incrementally (a few thousand per day).
Am I correct in saying that the training vector set (xt) must fit in memory, while the set of searchable vectors (xb, added by the add operation) need not?
My current plan (given my current understanding of functionality) is to do the following:
When retraining, grab the largest possible random subset of the data that I can train on in memory, and then re-index all of the existing vectors.
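That retrain-and-reindex loop could be sketched as follows. This is a minimal numpy sketch under stated assumptions: the sizes are scaled-down placeholders, and the 32-centroid "train" step is a stand-in for a real index's training, not Faiss's actual training procedure. The point is the pattern: sample a training subset that fits in memory, train fresh, then re-add every stored vector.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical, scaled-down sizes: in the real setting this would be
# ~10M stored vectors with a training subset capped by available RAM.
n_total, n_train_max, d = 10_000, 1_000, 16

all_vectors = rng.standard_normal((n_total, d)).astype("float32")

def rebuild(all_vectors, n_train_max):
    # 1. Sample the largest random training subset that fits in memory.
    idx = rng.choice(len(all_vectors),
                     size=min(n_train_max, len(all_vectors)),
                     replace=False)
    xt = all_vectors[idx]
    # 2. Train a fresh index on the sample. Here "training" is just
    #    picking coarse centroids -- a placeholder for the real step.
    centroids = xt[rng.choice(len(xt), 32, replace=False)]
    # 3. Re-add *all* stored vectors to the newly trained index.
    assignments = np.argmin(
        ((all_vectors[:, None] - centroids) ** 2).sum(-1), axis=1)
    return centroids, assignments

centroids, assignments = rebuild(all_vectors, n_train_max)
print(len(assignments) == len(all_vectors))  # every vector re-indexed
```

New vectors arriving between rebuilds would go straight into the current index via add; only the periodic rebuild touches training.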
Does this sound like a reasonable approach, or are there still flaws in my mental model of how Faiss is intended to work?
@jegou, when you get a chance. Thanks again for all your help, I appreciate you all putting in the time.
@madisonmay , yes, the training set must fit in memory, which is not a requirement for the indexed vectors. 10M vectors is a relatively small dataset. Brute-force search (IndexFlat, or its GPU counterpart) obviously doesn't need any training, nor do any other data-agnostic algorithms.
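For context, brute-force search is simple enough to sketch directly: the numpy equivalent of what `faiss.IndexFlatL2` does, with no train step at all and add amounting to storage. Sizes here are illustrative stand-ins for the 10M-vector case.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 32
xb = rng.standard_normal((10_000, d)).astype("float32")  # stand-in for 10M
# Queries: small perturbations of the first three stored vectors.
xq = xb[:3] + 0.01 * rng.standard_normal((3, d)).astype("float32")

# Exact (brute-force) L2 search: no training, no partitioning,
# just all pairwise squared distances followed by a top-k sort.
dists = ((xq[:, None, :] - xb[None, :, :]) ** 2).sum(-1)  # shape (nq, nb)
k = 5
top_ids = np.argsort(dists, axis=1)[:, :k]
print(top_ids[:, 0])  # each query's nearest neighbour is its source vector
```

This scales linearly in the database size per query, which is exactly the cost that trained, partition-based indexes avoid.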
Your pipeline sounds reasonable, although at this small scale you may want to consider the following questions, since they affect the choice of index:
Which of the Index types support incremental addition of training data (i.e., in batches of ~1k vectors)? The distinction between train and add is an additional source of confusion for me, especially because the tutorial makes it seem like the training operation should come first for Index types that support it. I'm used to a flow where data is added incrementally, followed by a single training operation.