AnswerDotAI / RAGatouille

Easily use and train state of the art late-interaction retrieval methods (ColBERT) in any RAG pipeline. Designed for modularity and ease-of-use, backed by research.
Apache License 2.0

How to check the centroids and the data in the clusters? #207

Closed ravirajag closed 2 months ago

ravirajag commented 6 months ago

I have indexed around 11k sentences, which produced about 4,000 centroids. I am able to load the centroids file with the following code:

from colbert.indexing.codecs.residual import ResidualCodec

# index_path is the directory of the built index (defined elsewhere)
res_codec = ResidualCodec.load(index_path)

I want to see what these 4,000 centroids correspond to (which sentences). How can I get that? I want to see which data falls under each cluster.

bclavie commented 2 months ago

Hey, this is out of scope for RAGatouille, as it's very much a ColBERT-related research undertaking. I'd advise looking at the main ColBERT repo. If you're interested, there are also some research papers that try to better understand how PLAID indexing works!
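
For orientation, here is a rough sketch of what a centroid is in this kind of index. It is not ColBERT's or RAGatouille's actual code. In ColBERTv2/PLAID-style indexing the centroids are k-means centres over token-level embeddings, so a centroid is a vector, not a sentence; the closest thing to "the data in a cluster" is the set of tokens assigned to it. The toy vectors, the `(doc_id, token)` labels, and both helper functions below are hypothetical, and assignment by highest dot product is an assumption made for illustration.

```python
from collections import defaultdict


def nearest_centroid(vec, centroids):
    """Return the index of the centroid with the highest dot product with vec."""
    scores = [sum(v * c for v, c in zip(vec, centroid)) for centroid in centroids]
    return max(range(len(centroids)), key=scores.__getitem__)


def group_tokens_by_centroid(token_embeddings, centroids):
    """Map each centroid index to the (doc_id, token) labels assigned to it."""
    clusters = defaultdict(list)
    for label, vec in token_embeddings:
        clusters[nearest_centroid(vec, centroids)].append(label)
    return dict(clusters)


# Hypothetical toy data: 2 centroids and 3 token embeddings in 2 dimensions.
# In a real index these would be high-dimensional tensors loaded from disk.
centroids = [[1.0, 0.0], [0.0, 1.0]]
token_embeddings = [
    ((0, "cat"), [0.9, 0.1]),
    ((0, "sat"), [0.2, 0.8]),
    ((1, "dog"), [0.8, 0.3]),
]

clusters = group_tokens_by_centroid(token_embeddings, centroids)
for centroid_id, members in sorted(clusters.items()):
    print(centroid_id, members)
```

To trace a cluster back to sentences, you would additionally need a mapping from each token embedding to its source passage, which this sketch fakes with the `doc_id` in each label.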