paulthemagno opened 8 months ago
Hey!
Thank you for opening the issue, there's definitely been a lot of interest in quantised embeddings.
There's potentially future work on ColBERT models with better compression, or even better indexing methods like the very recently released EMVB paper (which would be really nice to get into RAGatouille!), whose memory footprint is 1.8x smaller than the current PLAID indexing method's.
However, for the general idea of quantisation -- it's actually already used by the PLAID index! Right now, we default to this compression logic to choose how aggressive the quantisation will be:
```python
nbits = 2
if len(collection) < 5000:
    nbits = 8
elif len(collection) < 10000:
    nbits = 4
```
but I think it'd make sense to expose this parameter to users, so you could even choose `nbits=1` to quantise down as much as possible!
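The size-based defaults above could be exposed as an explicit parameter; here's a minimal sketch of what that might look like (the `choose_nbits` helper and its signature are hypothetical, not RAGatouille's actual API):

```python
def choose_nbits(collection, nbits=None):
    """Pick the quantisation width for the PLAID index.

    If the user passes nbits explicitly (e.g. 1 for maximum
    compression), honour it; otherwise fall back to the current
    size-based defaults. NOTE: `choose_nbits` is a hypothetical
    helper for illustration, not part of RAGatouille's public API.
    """
    if nbits is not None:
        return nbits
    if len(collection) < 5000:
        return 8      # small collections: quantise gently
    elif len(collection) < 10000:
        return 4
    return 2          # large collections: compress aggressively

# Defaults mirror the snippet above; the override lets users go lower:
print(choose_nbits(["doc"] * 1_000))             # -> 8
print(choose_nbits(["doc"] * 20_000))            # -> 2
print(choose_nbits(["doc"] * 20_000, nbits=1))   # -> 1
```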
I'm looking to quantize the embeddings to speed up indexing, searching, etc. For example, `sentence_transformers` has `quantize_embeddings`: https://sbert.net/examples/applications/embedding-quantization/README.html
Is anything similar already available? Or is it possible to add it for the usage of ColBERT in RAGatouille?
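For reference, the general idea behind embedding quantization can be sketched in plain NumPy. This is an illustrative re-implementation of scalar int8 and binary quantization under simple min/max calibration assumptions, not `sentence_transformers`' actual code:

```python
import numpy as np

def quantize_int8(embeddings):
    """Sketch of scalar int8 quantization: map each dimension's
    observed float range onto [-128, 127] via per-dimension
    min/max calibration (4x smaller than float32)."""
    lo = embeddings.min(axis=0)
    scale = (embeddings.max(axis=0) - lo) / 255.0
    scale[scale == 0] = 1.0  # avoid division by zero on constant dims
    q = np.round((embeddings - lo) / scale) - 128
    return q.astype(np.int8), lo, scale

def quantize_binary(embeddings):
    """Sketch of binary quantization: keep only the sign bit of
    each dimension, packing 8 dimensions per byte (32x smaller
    than float32)."""
    bits = (embeddings > 0).astype(np.uint8)
    return np.packbits(bits, axis=-1)

rng = np.random.default_rng(0)
emb = rng.standard_normal((4, 128)).astype(np.float32)  # 2048 bytes

q8, lo, scale = quantize_int8(emb)
qb = quantize_binary(emb)
print(q8.dtype, q8.nbytes)  # int8, 512 bytes (4x smaller)
print(qb.dtype, qb.nbytes)  # uint8, 64 bytes (32x smaller)
```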