Closed: madisonmay closed this issue 7 years ago.
@madisonmay : most indexing algorithms first learn the structure (for those that require training); based on these parameters, which essentially capture information about the data distribution, you can then populate and search the index. There are some exceptions, but few indexing algorithms actually support online training in batches as you mention, in particular because the partitioning of the feature space used by the index should stay fixed to avoid drift (new queries must be assigned to the same regions of the space as the stored vectors). For online search, the typical strategy is to periodically re-train and re-build the index, or to create new indexes over time.
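The train/add/search phases described above can be sketched with a toy inverted-file index. This `TinyIVF` class is a hypothetical illustration in plain numpy, not Faiss's implementation; Faiss's `IndexIVF*` classes follow the same three phases (`train`, `add`, `search`), with the key point being that the partition learned in `train` is frozen before any vectors are added.

```python
import numpy as np

rng = np.random.default_rng(0)

class TinyIVF:
    """Toy inverted-file index (illustration only, not Faiss's code):
    train() fixes a coarse partition of the space, add() assigns stored
    vectors to cells, search() probes only the nearest cells."""

    def __init__(self, nlist, nprobe=2):
        self.nlist, self.nprobe = nlist, nprobe
        self.centroids = None
        self.lists = None  # per-cell (ids, vectors)

    def train(self, xt, iters=10):
        # Plain k-means on the training set. After this, the partition is
        # frozen: moving the cell boundaries later would invalidate the
        # cell assignments of vectors already added (the "drift" problem).
        c = xt[rng.choice(len(xt), self.nlist, replace=False)]
        for _ in range(iters):
            assign = np.argmin(((xt[:, None] - c) ** 2).sum(-1), axis=1)
            for j in range(self.nlist):
                pts = xt[assign == j]
                if len(pts):
                    c[j] = pts.mean(axis=0)
        self.centroids = c
        self.lists = [([], []) for _ in range(self.nlist)]

    def add(self, xb):
        # Assign each stored vector to its nearest (fixed) centroid.
        base = sum(len(ids) for ids, _ in self.lists)
        assign = np.argmin(((xb[:, None] - self.centroids) ** 2).sum(-1), axis=1)
        for i, j in enumerate(assign):
            self.lists[j][0].append(base + i)
            self.lists[j][1].append(xb[i])

    def search(self, q, k):
        # Probe the nprobe closest cells, then rank their contents exactly.
        order = np.argsort(((q - self.centroids) ** 2).sum(-1))[: self.nprobe]
        ids, vecs = [], []
        for j in order:
            ids += self.lists[j][0]
            vecs += self.lists[j][1]
        d = ((np.asarray(vecs) - q) ** 2).sum(-1)
        return [ids[t] for t in np.argsort(d)[:k]]

d = 8
xt = rng.standard_normal((500, d)).astype("float32")   # training set
xb = rng.standard_normal((1000, d)).astype("float32")  # stored vectors
index = TinyIVF(nlist=16)
index.train(xt)                    # 1. learn the partition
index.add(xb)                      # 2. populate
hits = index.search(xb[0], k=5)    # 3. query
print(hits[0])                     # xb[0]'s nearest neighbour is itself -> 0
```

Retraining on fresh data would move the centroids, which is why the periodic re-train strategy above rebuilds the whole index rather than updating it in place.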
Do you have a particular algorithm in mind?
Additional comment: some algorithms require that train = test, i.e., you must use the stored vectors themselves for training. Training on the vectors to be indexed gives better performance for all algorithms, including those for which it is not a requirement (as in Faiss). Algorithms for which it is a requirement are less flexible; in particular, they are not suited to situations where you need to add vectors on the fly.
Thanks @jegou, your response was quite helpful; the flow makes much more sense now. Reading over https://github.com/facebookresearch/faiss/blob/master/benchs/bench_polysemous_1bn.py was also useful. I suppose I was thinking of the add operation as stashing away vectors for later inclusion in the training step, rather than as adding vectors to the set of vectors that a search operation can return.
> There are some exceptions, yet few indexing algorithms actually support online training in batches as you mention
What are the exceptions to this rule?
I'm looking for an index type that's a decent tradeoff between scalability and precision/recall. We're looking to index roughly 10M vectors. We'll be receiving new vectors incrementally (a few thousand per day).
Am I correct in saying that the training vector set (xt) must fit in memory, while the set of searchable vectors (xb, added by the add operation) need not?
My current plan (given my current understanding of functionality) is to do the following:
When retraining, grab the largest possible random subset of the data that I can train on in memory, and then re-index all of the existing vectors.
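That retrain-and-reindex loop could be sketched as follows. This is a minimal numpy sketch under stated assumptions: the sizes are scaled-down placeholders, and the 32-centroid "train" step is a stand-in for a real index's training, not Faiss's actual training procedure. The point is the pattern: sample a training subset that fits in memory, train fresh, then re-add every stored vector.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical, scaled-down sizes: in the real setting this would be
# ~10M stored vectors with a training subset capped by available RAM.
n_total, n_train_max, d = 10_000, 1_000, 16

all_vectors = rng.standard_normal((n_total, d)).astype("float32")

def rebuild(all_vectors, n_train_max):
    # 1. Sample the largest random training subset that fits in memory.
    idx = rng.choice(len(all_vectors),
                     size=min(n_train_max, len(all_vectors)),
                     replace=False)
    xt = all_vectors[idx]
    # 2. Train a fresh index on the sample. Here "training" is just
    #    picking coarse centroids -- a placeholder for the real step.
    centroids = xt[rng.choice(len(xt), 32, replace=False)]
    # 3. Re-add *all* stored vectors to the newly trained index.
    assignments = np.argmin(
        ((all_vectors[:, None] - centroids) ** 2).sum(-1), axis=1)
    return centroids, assignments

centroids, assignments = rebuild(all_vectors, n_train_max)
print(len(assignments) == len(all_vectors))  # every vector re-indexed
```

New vectors arriving between rebuilds would go straight into the current index via add; only the periodic rebuild touches training.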
Does this sound like a reasonable approach, or are there still flaws in my mental model of how Faiss is intended to work?
@jegou, when you get a chance. Thanks again for all your help, I appreciate you all putting in the time.
@madisonmay , yes, the training set must fit in memory, which is not a requirement for the indexed vectors. 10M vectors is a relatively small dataset. Brute-force search (IndexFlat, or its GPU counterpart) obviously doesn't need any training, nor do any other data-agnostic algorithms.
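For context, brute-force search is simple enough to sketch directly: the numpy equivalent of what `faiss.IndexFlatL2` does, with no train step at all and add amounting to storage. Sizes here are illustrative stand-ins for the 10M-vector case.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 32
xb = rng.standard_normal((10_000, d)).astype("float32")  # stand-in for 10M
# Queries: small perturbations of the first three stored vectors.
xq = xb[:3] + 0.01 * rng.standard_normal((3, d)).astype("float32")

# Exact (brute-force) L2 search: no training, no partitioning,
# just all pairwise squared distances followed by a top-k sort.
dists = ((xq[:, None, :] - xb[None, :, :]) ** 2).sum(-1)  # shape (nq, nb)
k = 5
top_ids = np.argsort(dists, axis=1)[:, :k]
print(top_ids[:, 0])  # each query's nearest neighbour is its source vector
```

This scales linearly in the database size per query, which is exactly the cost that trained, partition-based indexes avoid.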
Your pipeline sounds reasonable, although at this small scale you may want to consider the following questions, since they affect the choice of index:
Which of the Index types support incremental addition of training data (i.e., in batches of ~1k vectors)? The distinction between train and add is an additional source of confusion for me, especially because the tutorial makes it seem like the training operation should come first for Index types that support it. I'm used to a flow where data is added incrementally, followed by a single training operation.