[ColBERT] idea: return top_n best scoring tokens

datastax / ragstack-ai

RAGStack is an out of the box solution simplifying Retrieval Augmented Generation (RAG) in AI apps.

https://www.datastax.com/products/ragstack

[ColBERT] idea: return top_n best scoring tokens #488

Open fdurant opened 1 month ago

fdurant commented 1 month ago

I'm experimenting with RAGStack ColBERT and have a feature request.

To produce a query-passage scoring interpretability visualization like this, it would be handy if the result of ColbertVectorStore.add_texts also included the top-n list of most contributing tokens, each with a normalized score that would be trivial to color-code in a UI. This could be achieved via an extra parameter, include_token_scores: int = 0.
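As a rough illustration of the requested output shape, here is a minimal sketch in plain Python. It assumes per-token scores are already available as (token, score) pairs, which is an assumption rather than something RAGStack exposes; the function name top_token_scores is hypothetical, and only the parameter name include_token_scores comes from the request above.

```python
from typing import List, Tuple


def top_token_scores(
    token_scores: List[Tuple[str, float]],
    include_token_scores: int = 0,
) -> List[Tuple[str, float]]:
    """Return the n highest-scoring tokens with scores rescaled to [0, 1].

    Hypothetical helper: `token_scores` is assumed to be supplied by the
    caller; the default of 0 returns nothing, mirroring the proposed default.
    """
    if include_token_scores <= 0 or not token_scores:
        return []
    # Highest raw score first, then keep only the requested number of tokens.
    ranked = sorted(token_scores, key=lambda pair: pair[1], reverse=True)
    top = ranked[:include_token_scores]
    best = top[0][1]
    if best <= 0:
        # Nothing contributed positively, so every normalized score is 0.
        return [(token, 0.0) for token, _ in top]
    # Divide by the best score so the top token maps to 1.0; clamp negatives.
    return [(token, max(score, 0.0) / best) for token, score in top]


example = [("paris", 0.8), ("the", 0.1), ("capital", 0.4)]
print(top_token_scores(example, include_token_scores=2))
```

Dividing by the best score keeps the strongest token at 1.0, so a UI could map the value straight to color intensity.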

cbornet commented 4 weeks ago

No scoring happens when adding documents; it only happens at retrieval. Chunk scores are part of ColbertRetriever::text_search. Does that answer your need?

cbornet commented 4 weeks ago

Also, I don't think we keep track of the tokens, only their embeddings. And the chunk score is the max of the embedding scores, which are not exposed either. @zzzming, can you confirm?
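For reference, the late-interaction (MaxSim) scoring described for ColBERT takes, for each query token, the maximum similarity over the chunk's token embeddings and sums those maxima. The sketch below is plain Python for illustration only, not RAGStack's code; the names dot and maxsim are hypothetical. It also returns the index of the best-matching chunk embedding per query token, which is the kind of intermediate score the thread says is not exposed.

```python
from typing import List, Tuple

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim(
    query_embs: List[Vector], doc_embs: List[Vector]
) -> Tuple[float, List[Tuple[int, float]]]:
    """Return (chunk score, per-query-token (best doc index, similarity))."""
    if not doc_embs:
        raise ValueError("doc_embs must contain at least one embedding")
    matches: List[Tuple[int, float]] = []
    for q in query_embs:
        # Similarity of this query token against every chunk token embedding.
        sims = [dot(q, d) for d in doc_embs]
        best_idx = max(range(len(sims)), key=sims.__getitem__)
        matches.append((best_idx, sims[best_idx]))
    # The chunk score is the sum of the per-query-token maxima.
    score = sum(sim for _, sim in matches)
    return score, matches


query = [[1.0, 0.0], [0.0, 1.0]]
chunk = [[0.5, 0.0], [0.0, 0.25], [1.0, 0.0]]
print(maxsim(query, chunk))
```

Because only indices come back, turning them into readable tokens would still require keeping or recomputing the chunk's token strings in the same order as its embeddings.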

epinzur commented 4 weeks ago

@cbornet you are correct. We don't store the tokens for the text... only the embeddings of the tokens.