kaivalnp opened this issue 3 months ago
What I wonder is: how can Lucene help with this? I feel like we have all the primitives available to enable SPLADE-style search and retrieval, but maybe something is missing? IIRC there needs to be a per-term score, but we do have the ability to store custom term frequencies and to override the similarity in order to combine those term scores appropriately, so I think those are the ingredients needed for this. Maybe it's a case of trying it and seeing if there are features it would be helpful to embed in the Lucene layer, or if indeed we can build this "on top"?
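For concreteness, a minimal sketch of those ingredients using `FeatureField`, which already stores a per-term weight in the term frequency and applies its own similarity per clause. The field name, weights, and in-memory directory are all made up for illustration:

```java
import java.io.IOException;
import java.util.Map;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.FeatureField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class SpladeFeatureFieldSketch {
  public static void main(String[] args) throws IOException {
    Directory dir = new ByteBuffersDirectory();
    try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig())) {
      // Hypothetical per-term weights produced by a SPLADE-like model
      Map<String, Float> docVector = Map.of("lucene", 2.3f, "search", 1.1f, "index", 0.7f);
      Document doc = new Document();
      for (Map.Entry<String, Float> e : docVector.entrySet()) {
        // FeatureField encodes the weight (lossily) into the term frequency
        doc.add(new FeatureField("splade", e.getKey(), e.getValue()));
      }
      writer.addDocument(doc);
    }

    // Query-side weights from the same hypothetical model
    Map<String, Float> queryVector = Map.of("lucene", 1.8f, "search", 0.4f);
    BooleanQuery.Builder builder = new BooleanQuery.Builder();
    for (Map.Entry<String, Float> e : queryVector.entrySet()) {
      // newLinearQuery scores featureValue * weight, so the SHOULD-disjunction
      // sums to the sparse dot product of query and document vectors
      builder.add(FeatureField.newLinearQuery("splade", e.getKey(), e.getValue()),
          BooleanClause.Occur.SHOULD);
    }
    try (DirectoryReader reader = DirectoryReader.open(dir)) {
      TopDocs top = new IndexSearcher(reader).search(builder.build(), 10);
      System.out.println(top.totalHits);
    }
  }
}
```

Scoring each clause with `newLinearQuery` makes the disjunction sum to the sparse dot product, which is the combination these models expect; the main caveat is that `FeatureField` stores weights with limited precision.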
There might be a better format than just terms. But I would assume the bipartite graph stuff would help here.
Additionally, I would expect the biggest gains to come at query time. Looking at the newer research out of the SPLADE folks, they are optimizing query time by adjusting the scoring, not really the storage.
Maybe a better query would be a good first step.
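One query-side adjustment in that spirit (a sketch of a generic pruning idea, not the specific optimizations from the SPLADE papers): drop low-weight query terms up front and let Lucene's existing block-max WAND pruning of top-k disjunctions do the rest. The `splade` field and `FeatureField` usage follow the sketch above, and `maxTerms` is an assumed tunable:

```java
import java.util.Map;
import org.apache.lucene.document.FeatureField;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;

public class PrunedSparseQuery {
  /** Keep only the highest-weight query terms; field name and cutoff are assumptions. */
  public static Query build(Map<String, Float> queryVector, int maxTerms) {
    BooleanQuery.Builder builder = new BooleanQuery.Builder();
    queryVector.entrySet().stream()
        .sorted(Map.Entry.<String, Float>comparingByValue().reversed())
        .limit(maxTerms) // static pruning: low-weight terms rarely change the top-k
        .forEach(e -> builder.add(
            FeatureField.newLinearQuery("splade", e.getKey(), e.getValue()),
            BooleanClause.Occur.SHOULD));
    return builder.build();
  }
}
```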
I found this recent paper by well-known people in the IR efficiency space quite interesting: https://arxiv.org/pdf/2405.01117. It builds on inverted indexes and simple, intuitive ideas.
I have recently been interested in this direction and plan on spending a non-trivial amount of time on it over the next few weeks. Assuming we haven't started development on this, I am assigning it to myself.
What I have been hacking on is impact-sorted prioritisation of docIDs during the ranking phase (especially for custom rankers), so that would probably be the first thing to come out of this ticket.
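Lucene already exposes the block-level metadata such a prioritisation could key on. A rough sketch, based only on a reading of the `ImpactsEnum` API and not on the commenter's actual work, that walks one term's postings block by block and reads the per-block maximum `(freq, norm)` impacts:

```java
import java.io.IOException;
import java.util.List;
import org.apache.lucene.index.Impact;
import org.apache.lucene.index.Impacts;
import org.apache.lucene.index.ImpactsEnum;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.PostingsEnum;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;

public class ImpactBlockScan {
  static void scan(LeafReader reader, String field, BytesRef term) throws IOException {
    Terms terms = reader.terms(field);
    if (terms == null) {
      return;
    }
    TermsEnum te = terms.iterator();
    if (!te.seekExact(term)) {
      return;
    }
    ImpactsEnum postings = te.impacts(PostingsEnum.FREQS);
    int doc = postings.nextDoc();
    while (doc != PostingsEnum.NO_MORE_DOCS) {
      postings.advanceShallow(doc); // position block-level metadata at doc
      Impacts impacts = postings.getImpacts();
      int blockEnd = impacts.getDocIdUpTo(0); // last docID covered by level 0
      List<Impact> blockImpacts = impacts.getImpacts(0); // max (freq, norm) pairs
      for (Impact impact : blockImpacts) {
        // A real implementation would enqueue [doc, blockEnd] keyed by the best
        // achievable score here instead of printing it
        System.out.println("block up to doc " + blockEnd
            + ": freq=" + impact.freq + " norm=" + impact.norm);
      }
      if (blockEnd == PostingsEnum.NO_MORE_DOCS) {
        break;
      }
      doc = postings.advance(blockEnd + 1); // skip to the next block
    }
  }
}
```

These are the same per-block upper bounds Lucene's own WAND scorer uses to skip non-competitive documents, so reusing them for custom rankers seems plausible.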
Description
Learned Sparse Vectors claim to combine the benefits of sparse (i.e. lexical) and dense (i.e. vector) representations.
From https://en.wikipedia.org/wiki/Learned_sparse_retrieval:
From https://zilliz.com/learn/enhancing-information-retrieval-learned-sparse-embeddings:
A well-known model for producing such sparse document representations is SPLADE (https://github.com/naver/splade).
Paper
I came across "Efficient Inverted Indexes for Approximate Retrieval over Learned Sparse Representations" (https://arxiv.org/pdf/2404.18812), which shows promising benchmarks for KNN search over learned sparse vectors.
Figure 3 in the linked paper summarizes the design of the algorithm.
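Roughly, the design combines statically pruned posting lists, grouped into blocks, with a per-block "summary" vector whose dot product with the query upper-bounds the scores inside the block. Below is a loose, illustrative sketch of that skip-by-summary loop; the names are invented and it makes no attempt to match the paper's clustering or quantization:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

public class BlockSummarySketch {
  /** A block of pruned postings plus a summary (~coordinate-wise max of its docs). */
  record Block(Map<String, Float> summary, List<Map<String, Float>> docs) {}

  static float dot(Map<String, Float> q, Map<String, Float> d) {
    float s = 0;
    for (Map.Entry<String, Float> e : q.entrySet()) {
      s += e.getValue() * d.getOrDefault(e.getKey(), 0f);
    }
    return s;
  }

  /** Returns the top-k scores (doc IDs omitted for brevity). */
  static List<Float> search(Map<String, Float> query, List<Block> blocks, int k) {
    PriorityQueue<Float> topK = new PriorityQueue<>(); // min-heap of best scores so far
    for (Block block : blocks) {
      float upperBound = dot(query, block.summary());
      if (topK.size() == k && upperBound <= topK.peek()) {
        continue; // no doc in this block can beat the current k-th best score
      }
      for (Map<String, Float> doc : block.docs()) {
        float score = dot(query, doc); // exact scoring only inside surviving blocks
        if (topK.size() < k) {
          topK.add(score);
        } else if (score > topK.peek()) {
          topK.poll();
          topK.add(score);
        }
      }
    }
    return new ArrayList<>(topK);
  }
}
```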
Learned Sparse Vectors seem to be naturally compatible with inverted indexes, and many aspects of the algorithm are already implemented in Lucene. Could we use this for faster KNN search when sparse vectors are used?