AnswerDotAI / RAGatouille

Easily use and train state of the art late-interaction retrieval methods (ColBERT) in any RAG pipeline. Designed for modularity and ease-of-use, backed by research.
Apache License 2.0
2.97k stars 202 forks

Inconsistent search results length for high top-k values #135

Open ABCbum opened 8 months ago

ABCbum commented 8 months ago

Hi, I'm getting an issue similar to #130: the number of returned results isn't always the requested top-k (e.g. with k=500, len(res) comes back as 4xx or 3xx). This happens more often with the fine-tuned version of colbert-v2.

Though it's uncommon to request such a high top-k, a predictable result count would help me when benchmarking and make the function more predictable in general use.

Code:

from ragatouille import RAGPretrainedModel

# Indexing
RAG = RAGPretrainedModel.from_pretrained("/path/to/finetuned_model")
index_path = RAG.index(index_name="my_index", collection=docs, document_ids=doc_ids)

# Retrieving
RAG = RAGPretrainedModel.from_index('.ragatouille/colbert/indexes/my_index')
results = RAG.search(query, k=500)
print(len(results))
# -> 500, 491, 413, 3xx, ....

Thanks for the help!

bclavie commented 8 months ago

Hey! This isn't a full solution to your problem (which comes down to how the optimised retrieval engine works: its defaults and dynamic hyperparameters aren't well tuned for small collections), but for just ~800 documents used for benchmarking, I think you could sidestep the issue by using in-memory encoding rather than building an index. (Until I build a proper HNSW-style index, I'm also planning to let users create an "index" by persisting their in-memory encodings, which will work really well for relatively small numbers of documents!)

E.g. in your situation, replace

RAG = RAGPretrainedModel.from_pretrained("/path/to/finetuned_model")
index_path = RAG.index(index_name="my_index", collection=docs, document_ids=doc_ids)

# Retrieving
RAG = RAGPretrainedModel.from_index('.ragatouille/colbert/indexes/my_index')
results = RAG.search(query, k=500)

with

RAG = RAGPretrainedModel.from_pretrained("/path/to/finetuned_model")
RAG.encode(docs)
results = RAG.search_encoded_docs(query, k=500)

This will exhaustively score every single document rather than using PLAID-style approximation, which for small datasets + high k values guarantees you get the number of results you asked for. The computational overhead is minimal at your data scale: on my machine, it takes ~45ms to query the index and ~55ms to query the in-memory encoded docs.
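To make the distinction concrete: exhaustive search scores every document with ColBERT's MaxSim operator and then sorts, so the result count is always min(k, number of documents). Here's a minimal NumPy sketch of that scoring (purely illustrative; the function names and shapes are assumptions, not RAGatouille's internals):

```python
import numpy as np

def maxsim_score(query_emb, doc_emb):
    # query_emb: (num_query_tokens, dim), doc_emb: (num_doc_tokens, dim).
    # Late interaction: for each query token, take its max similarity over
    # all document tokens, then sum across query tokens.
    sim = query_emb @ doc_emb.T  # (q_tokens, d_tokens)
    return sim.max(axis=1).sum()

def exhaustive_search(query_emb, doc_embs, k):
    # Score every document (no candidate pruning), so we always return
    # exactly min(k, len(doc_embs)) results, sorted by score descending.
    scores = np.array([maxsim_score(query_emb, d) for d in doc_embs])
    order = np.argsort(scores)[::-1][:k]
    return [(int(i), float(scores[i])) for i in order]

rng = np.random.default_rng(0)
query_emb = rng.standard_normal((8, 16))
doc_embs = [rng.standard_normal((int(rng.integers(5, 20)), 16)) for _ in range(800)]
results = exhaustive_search(query_emb, doc_embs, k=500)
print(len(results))  # -> 500, every time
```

PLAID-style retrieval, by contrast, first selects candidates via quantized centroids, and that candidate pool can end up smaller than k on small collections, which is exactly the shortfall reported above.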

ABCbum commented 8 months ago

Thanks, that works well! One small detail: it'd be nice for RAG.encode to accept document_ids, the way RAG.index does, so that both return results in the same format.
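In the meantime, a stopgap is to keep your own content-to-id lookup and attach the ids after searching. This is a hypothetical workaround: it assumes the results are dicts with a "content" field holding the original passage text, and it breaks down if two documents share identical text.

```python
# Hypothetical workaround: map each passage's text back to your own id,
# then stamp that id onto the in-memory search results after the fact.
doc_ids = ["doc-0", "doc-1", "doc-2"]
docs = ["first passage", "second passage", "third passage"]

# Note: duplicate passages would collide in this lookup.
content_to_id = {content: doc_id for content, doc_id in zip(docs, doc_ids)}

def attach_ids(results):
    # results: list of dicts with a "content" key (assumed result format)
    return [{**r, "document_id": content_to_id.get(r["content"])} for r in results]

# Example with a mocked result list:
mock_results = [{"content": "second passage", "score": 12.3}]
print(attach_ids(mock_results))
# -> [{'content': 'second passage', 'score': 12.3, 'document_id': 'doc-1'}]
```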

bclavie commented 8 months ago

Hey, this will come along with https://github.com/bclavie/RAGatouille/pull/137 (as well as making full-vectors indexing the default index for small collections)!

littletree-Y commented 5 months ago

> Hey, this will come along with #137 (as well as making full-vectors indexing the default index for small collections)!

Hi, I've hit the same problem. Does RAG.encode support document_ids now?