facebookresearch / faiss

A library for efficient similarity search and clustering of dense vectors.
https://faiss.ai
MIT License

Support for incremental training #44

Closed madisonmay closed 7 years ago

madisonmay commented 7 years ago

Which of the Index types support incremental addition of training data (i.e., in batches of 1k vectors)? The distinction between train and add is an additional source of confusion for me, especially because the tutorial makes it seem like the training operation should come first for Index types that support it. I'm used to a flow where data is added incrementally, followed by a single training operation.

jegou commented 7 years ago

@madisonmay: most indexing algorithms (those that require training) first learn the structure of the data. The learned parameters essentially capture information about the distribution of the vectors; based on them, you can then populate and search the index. There are some exceptions, but few indexing algorithms actually support online training in batches as you describe, in particular because the partitioning of the feature space should remain fixed to avoid drift: the space assignment of new queries must stay consistent with that of the stored vectors. For online settings, the typical strategy is to periodically re-train and re-construct the index, or to create new indexes over time.
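To make this concrete, here is a minimal sketch of the train-then-populate flow and of the periodic-rebuild strategy, using an IVF index as an example (the dimension, cell count, and data below are hypothetical placeholders):

```python
import numpy as np
import faiss

d = 128        # vector dimension (hypothetical)
nlist = 1024   # number of IVF cells (hypothetical)

def build_index(stored_vectors):
    # Re-learn the space partition from scratch, then repopulate.
    quantizer = faiss.IndexFlatL2(d)
    index = faiss.IndexIVFFlat(quantizer, d, nlist)
    index.train(stored_vectors)  # fixes the partition of the space
    index.add(stored_vectors)    # populate with the searchable vectors
    return index

# Keep the raw vectors around so the index can be rebuilt later.
stored = np.random.rand(100_000, d).astype('float32')
index = build_index(stored)

# New vectors can be added without retraining; the partition stays fixed.
new_batch = np.random.rand(1_000, d).astype('float32')
stored = np.vstack([stored, new_batch])
index.add(new_batch)

# Periodically, once the distribution may have drifted, rebuild:
index = build_index(stored)
```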

Do you have a particular algorithm in mind?

jegou commented 7 years ago

Additional comment: some algorithms require that the training set and the indexed set be identical, i.e., you must train on the very vectors you store. Training on the vectors to be indexed gives better performance for all algorithms, including those for which it is not a requirement (as in Faiss). Algorithms for which it is a requirement are less flexible; in particular, they are not suited to situations where you need to add vectors on the fly.
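For illustration, a small sketch of this flexibility in Faiss (all sizes hypothetical): the training sample only needs to be representative of the distribution, and the indexed set can be a different set entirely:

```python
import numpy as np
import faiss

d = 64
xt = np.random.rand(50_000, d).astype('float32')   # training sample
xb = np.random.rand(500_000, d).astype('float32')  # vectors to index (a different set)

quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFFlat(quantizer, d, 256)
index.train(xt)  # xt only needs to follow the same distribution as xb
index.add(xb)    # the indexed set may differ from the training set
```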

madisonmay commented 7 years ago

Thanks @jegou, your response was quite helpful; the flow makes quite a bit more sense now. Reading over https://github.com/facebookresearch/faiss/blob/master/benchs/bench_polysemous_1bn.py also helped. I suppose I was thinking of the add operation as stashing away vectors for later inclusion in the training step, rather than as adding vectors to the set of vectors a search operation can return.

madisonmay commented 7 years ago

> There are some exceptions, but few indexing algorithms actually support online training in batches as you describe

What are the exceptions to this rule?

I'm looking for an index type that offers a decent tradeoff between scalability and precision/recall. We're looking to index roughly 10M vectors, and we'll be receiving new vectors incrementally (a few thousand per day).

Am I correct in saying that the training vector set (xt) must fit in memory, while the set of searchable vectors (xb, added via the add operation) need not?
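If so, I imagine the population step would look roughly like this sketch (the file layout, sizes, and index parameters are all hypothetical placeholders; with PQ compression the index stores compact codes rather than the raw vectors):

```python
import numpy as np
import faiss

d = 128
n_total = 10_000_000  # full database, too large to keep comfortably in RAM

# Hypothetical on-disk store of the raw database vectors.
xb = np.memmap('vectors.f32', dtype='float32', mode='r', shape=(n_total, d))

# The training sample must fit in memory.
sample_ids = np.random.choice(n_total, 200_000, replace=False)
xt = np.ascontiguousarray(xb[sample_ids])

quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFPQ(quantizer, d, 4096, 16, 8)  # 16-byte PQ codes per vector
index.train(xt)

# The database itself is streamed in chunks via add(); only the
# compressed codes end up stored in the index.
chunk = 100_000
for i0 in range(0, n_total, chunk):
    index.add(np.ascontiguousarray(xb[i0:i0 + chunk]))
```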

My current plan (given my current understanding of functionality) is to do the following:

madisonmay commented 7 years ago

@jegou, when you get a chance. Thanks again for all your help, I appreciate you all putting in the time.

jegou commented 7 years ago

@madisonmay, yes, the training set must fit in memory; that is not a requirement for the indexed vectors. 10M vectors is a relatively small dataset. Brute force (IndexFlat, or its GPU counterpart) obviously doesn't need any training, and neither do other data-agnostic algorithms.
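For example, a minimal sketch with a flat index (random data as a stand-in): add can be called incrementally at any time, and there is no train step:

```python
import numpy as np
import faiss

d = 128
index = faiss.IndexFlatL2(d)  # exact (brute-force) L2 search, no training step

# Vectors can be added incrementally, e.g., as daily batches arrive.
for _ in range(10):
    batch = np.random.rand(1_000, d).astype('float32')
    index.add(batch)

xq = np.random.rand(5, d).astype('float32')
D, I = index.search(xq, 10)  # distances and ids of the 10 nearest neighbors
```

The GPU counterpart can be obtained by moving the same index with faiss.index_cpu_to_gpu and a faiss.StandardGpuResources object.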

Your pipeline sounds reasonable, although at this small scale you may want to consider the following questions, since they would affect the choice of index: