apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

[DISCUSS] Could we have a different ANN algorithm for Learned Sparse Vectors? #13675

Open kaivalnp opened 3 weeks ago

kaivalnp commented 3 weeks ago

Description

Learned sparse vectors are claimed to combine the benefits of sparse (i.e. lexical) and dense (i.e. vector) representations.

From https://en.wikipedia.org/wiki/Learned_sparse_retrieval:

Learned sparse retrieval or sparse neural search is an approach to text search which uses a sparse vector representation of queries and documents.[1] It borrows techniques both from lexical bag-of-words and vector embedding algorithms, and is claimed to perform better than either alone. The best-known sparse neural search systems are SPLADE[2] and its successor SPLADE v2

From https://zilliz.com/learn/enhancing-information-retrieval-learned-sparse-embeddings:

Learned sparse embeddings denote sparse vector representations of data crafted through sophisticated machine learning models such as SPLADE and BGE-M3. Unlike traditional sparse vectors, which rely solely on statistical methods like BM25, learned sparse embeddings enrich the sparse representation with contextual information while retaining keyword search capabilities. They can discern the significance of adjacent or correlated tokens, even if not explicitly present in the text, resulting in a "learned" sparse representation adept at capturing relevant keywords and classes.

A famous model for such sparse representations of documents is SPLADE (https://github.com/naver/splade)
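To make the representation concrete, a learned sparse vector is essentially a small map of vocabulary terms to model-assigned weights, with zero-weight terms simply omitted. The terms and weights below are made up for illustration and are not output from any real model:

```java
// Minimal sketch of what a SPLADE-style learned sparse vector looks like:
// weighted vocabulary terms, including "expansion" terms that never appear
// in the original text. All values here are invented for illustration.
import java.util.Map;

public class SparseVectorExample {
    public static void main(String[] args) {
        // Hypothetical expansion of the text "lucene vector search"
        Map<String, Float> learnedSparseVector = Map.of(
            "lucene", 2.1f,
            "search", 1.8f,
            "vector", 1.6f,
            "index", 0.9f,      // expansion term, not in the original text
            "retrieval", 0.7f,  // expansion term, not in the original text
            "embedding", 0.4f);

        learnedSparseVector.forEach((term, weight) ->
            System.out.println(term + " -> " + weight));
    }
}
```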

Paper

I came across Efficient Inverted Indexes for Approximate Retrieval over Learned Sparse Representations (https://arxiv.org/pdf/2404.18812), which shows promising benchmarks for KNN search over learned sparse vectors:

Experimental results show average per-query latency in microsecond territory on various sparse embeddings of MS MARCO. Impressively, Seismic outperforms the graph-based winning solutions of the BigANN Challenge by a factor of at least 3.4 at 95% accuracy on Splade and 12 on Efficient Splade, with the margin widening substantially as accuracy increases. Other baselines, including state-of-the-art inverted index-based algorithms, are consistently one to two orders of magnitude slower than Seismic.

Figure 3 in the linked paper summarizes the design of the algorithm:

The design of Seismic. Inverted lists are independently partitioned into geometrically-cohesive blocks. Each block is a set of document identifiers with a summary vector. The inner product of a query with the summary approximates the inner product attainable with the documents in that block. The forward index stores the complete vectors (including values).
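For illustration, here is a rough, non-Lucene sketch of the query-time idea described in the figure. All class and field names are hypothetical, the vectors are plain term-id-to-weight maps, and the real algorithm's block construction, ordering, and pruning heuristics are omitted:

```java
// Simplified sketch of Seismic-style block skipping: each posting list is split
// into blocks, each block carries a summary vector, and a block's documents are
// only scored exactly (via the forward index) when the query's inner product
// with the summary can still beat the current top-k threshold.
import java.util.*;

class SeismicSketch {

    /** A block of an inverted list: some document ids plus a summary vector. */
    record Block(int[] docIds, Map<Integer, Float> summary) {}

    Map<Integer, List<Block>> invertedIndex = new HashMap<>();        // term id -> blocks
    Map<Integer, Map<Integer, Float>> forwardIndex = new HashMap<>(); // doc id -> full sparse vector

    /** Sparse inner product between two term-id -> weight maps. */
    static float dot(Map<Integer, Float> a, Map<Integer, Float> b) {
        float sum = 0f;
        for (var e : a.entrySet()) {
            Float w = b.get(e.getKey());
            if (w != null) sum += e.getValue() * w;
        }
        return sum;
    }

    /** Approximate top-k maximum inner product search for a sparse query. */
    int[] search(Map<Integer, Float> query, int k) {
        // min-heap of {docId, score}, ordered by score
        PriorityQueue<float[]> topK = new PriorityQueue<>(Comparator.comparingDouble((float[] e) -> e[1]));
        for (int term : query.keySet()) {
            for (Block block : invertedIndex.getOrDefault(term, List.of())) {
                // Optimistic upper bound: skip the whole block if its summary
                // cannot beat the current k-th best score.
                if (topK.size() == k && dot(query, block.summary()) <= topK.peek()[1]) {
                    continue;
                }
                for (int doc : block.docIds()) {
                    float score = dot(query, forwardIndex.get(doc));
                    if (topK.size() < k) {
                        topK.add(new float[] {doc, score});
                    } else if (score > topK.peek()[1]) {
                        topK.poll();
                        topK.add(new float[] {doc, score});
                    }
                }
            }
        }
        return topK.stream().mapToInt(e -> (int) e[0]).toArray();
    }
}
```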

Learned sparse vectors seem to be naturally compatible with inverted indexes, and many aspects of the algorithm are already implemented in Lucene. Could we use this for faster KNN search when sparse vectors are used?

msokolov commented 3 weeks ago

What I wonder is: how can Lucene help with this? I feel like we have all the primitives available to enable Splade-style search and retrieval, but maybe there is something missing? IIRC there needs to be a per-term score, but I think we do have the ability to store custom term frequencies and to override similarity in order to do the appropriate combination of these term scores, so I think those are the ingredients needed for this. Maybe it's a case of trying it and seeing if there are some features it would be helpful to embed in the Lucene layer, or if indeed we can build this "on top"?
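To make the "build it on top" route concrete, here is a minimal sketch using one existing primitive, FeatureField, which stores a per-term weight in the term frequency and scores it through feature queries. The field name, terms, and weights are illustrative only; a SPLADE-style model would supply the actual term/weight pairs, and custom term frequencies plus an overridden Similarity would be an alternative way to get the same per-term scores:

```java
// Hedged sketch: index per-term weights of a learned sparse vector with
// FeatureField, then score a query as a disjunction of per-term feature
// queries. Each linear clause contributes roughly (query weight) * (indexed
// weight), i.e. a dot product over the terms shared by query and document.
import org.apache.lucene.document.Document;
import org.apache.lucene.document.FeatureField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class SpladeOnTopSketch {
    public static void main(String[] args) throws Exception {
        try (Directory dir = new ByteBuffersDirectory();
             IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig())) {

            // Index: one FeatureField per non-zero term of the document's sparse vector.
            Document doc = new Document();
            doc.add(new FeatureField("splade", "lucene", 2.1f));
            doc.add(new FeatureField("splade", "search", 1.8f));
            doc.add(new FeatureField("splade", "index", 0.9f));
            writer.addDocument(doc);
            writer.commit();

            // Query: a disjunction of per-term feature queries, weighted by the
            // query-side sparse vector.
            BooleanQuery.Builder query = new BooleanQuery.Builder();
            query.add(FeatureField.newLinearQuery("splade", "lucene", 1.5f), BooleanClause.Occur.SHOULD);
            query.add(FeatureField.newLinearQuery("splade", "retrieval", 0.6f), BooleanClause.Occur.SHOULD);

            try (DirectoryReader reader = DirectoryReader.open(dir)) {
                TopDocs top = new IndexSearcher(reader).search(query.build(), 10);
                System.out.println("hits: " + top.totalHits);
            }
        }
    }
}
```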

benwtrent commented 3 weeks ago

There might be a better format than just terms. But I would assume the bipartite graph stuff would help here.

Additionally, I would expect the most benefit to be gained at query time. Looking at the newer research from the SPLADE folks, they are optimizing query time by adjusting the scoring, not really the storage.

Maybe a better query would be a good first step.
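As one hedged interpretation of a "better query", the query-side expansion could be pruned to its heaviest terms before the Lucene query is built, since much of the latency comes from low-weight expansion terms. The pruning-by-top-N approach and the use of FeatureField.newLinearQuery below are assumptions for illustration, not an established recipe:

```java
// Sketch: keep only the top-N highest-weight terms of the query's learned
// sparse vector and build a disjunction of per-term feature queries over them.
import java.util.Comparator;
import java.util.Map;
import org.apache.lucene.document.FeatureField;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;

public class PrunedSparseQuery {

    /** Build a disjunction over only the top-n highest-weight query terms. */
    static Query build(Map<String, Float> queryVector, String field, int topN) {
        BooleanQuery.Builder builder = new BooleanQuery.Builder();
        queryVector.entrySet().stream()
            .sorted(Map.Entry.<String, Float>comparingByValue(Comparator.reverseOrder()))
            .limit(topN)
            .forEach(e -> builder.add(
                FeatureField.newLinearQuery(field, e.getKey(), e.getValue()),
                BooleanClause.Occur.SHOULD));
        return builder.build();
    }
}
```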

jpountz commented 3 weeks ago

I found this recent paper by well-known people in the IR efficiency space quite interesting: https://arxiv.org/pdf/2405.01117. It builds on inverted indexes and simple/intuitive ideas: