As a user, I want to understand why certain documents are returned so that I can construct the LLM context more effectively.
One of the benefits of late interaction is that per-token scores are available, which can help explain why a given document was returned. These scores also let us calculate a "highlight" span: the part of the document most similar to the query.
See https://blog.vespa.ai/announcing-colbert-embedder-in-vespa/ for Vespa's explainer.
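The per-token scoring and highlight-span idea above can be sketched as follows. This is a minimal illustration with made-up toy vectors, not the actual index or model code: in practice the embeddings would come from a ColBERT-style model, and the similarity would typically be a cosine/dot product over much higher-dimensional vectors.

```python
# Sketch only: toy 2-d "embeddings" stand in for real model output.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def per_token_scores(query_vecs, doc_vecs):
    """For each document token, keep its best similarity to any query token."""
    return [max(dot(q, d) for q in query_vecs) for d in doc_vecs]

def best_highlight_span(scores, window=3):
    """Return (start, end) of the contiguous window with the highest summed score."""
    best_start, best_sum = 0, float("-inf")
    for start in range(max(1, len(scores) - window + 1)):
        s = sum(scores[start:start + window])
        if s > best_sum:
            best_start, best_sum = start, s
    return best_start, min(best_start + window, len(scores))

# Toy example: the middle document tokens align best with the query.
query = [[1.0, 0.0], [0.0, 1.0]]
doc = [[0.1, 0.1], [0.9, 0.2], [0.2, 0.9], [1.0, 0.1], [0.0, 0.1]]

scores = per_token_scores(query, doc)       # [0.1, 0.9, 0.9, 1.0, 0.1]
start, end = best_highlight_span(scores)    # (1, 4)
print(scores, (start, end))
```

With the per-token scores hydrated back to their token strings (see the note below on storage), the `(start, end)` span maps directly to a highlighted snippet we can feed into the LLM context.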
Acceptance Criteria
Note: This requires knowing which token is stored at each position in the index and hydrating it for the results. We could couple this to the model's vocabulary and store the vocab ID, or we can store the raw token.
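The two storage options in the note can be sketched as below. The structures here are hypothetical stand-ins, not a real index format; the vocabulary mapping mimics what a model tokenizer would provide.

```python
# Sketch only: a toy vocabulary stands in for the model's real one.
VOCAB = {101: "late", 102: "interaction", 103: "scores"}

# Option A: the index stores vocab IDs (compact, but coupled to the
# model's vocabulary; changing models invalidates the stored IDs).
indexed_ids = [101, 102, 103]
hydrated = [VOCAB[i] for i in indexed_ids]

# Option B: the index stores raw token strings (larger, but
# model-independent and directly usable in highlights).
indexed_raw = ["late", "interaction", "scores"]

print(hydrated == indexed_raw)  # both options hydrate to the same tokens
```

Either way, hydration is what turns per-token scores into human-readable highlights for the results.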