facebookresearch / faiss

A library for efficient similarity search and clustering of dense vectors.
https://faiss.ai
MIT License

Support for hybrid sparse-dense vector spaces #1922

Closed amin3141 closed 2 years ago

amin3141 commented 3 years ago

Summary

In certain domains, such as text retrieval, a hybrid vector space consisting of both dense and sparse components is very desirable. Ideally, a system should efficiently perform a nearest-neighbor search across the entire vector, including both its dense and sparse parts.

Sparse here means on the order of 10k-100k dimensions.

I understand that FAISS is meant for dense lookups, but, on the other hand, recent research has focused on unifying sparse and dense components "under a single roof". For example, see Efficient Inner Product Approximation in Hybrid Spaces by Wu et al.

I'm filing this issue to ask whether the FAISS authors have considered this topic. If so, what was the conclusion? Is supporting hybrid vectors out of scope for FAISS? If it is, can you recommend an architecture that uses FAISS for the dense lookup while efficiently incorporating the sparse components of the vector?

mdouze commented 3 years ago

Yes, it is out of scope for the foreseeable future.

amin3141 commented 3 years ago

Thanks for clarifying. If you have thoughts on how to efficiently implement a sparse-dense vector space using FAISS as a building block, I would be very interested to hear that, but otherwise, I have nothing else to add.

Jbollenbacher commented 6 months ago

Any interest in revisiting this? With retrieval-augmented generation (RAG) systems exploding in popularity, hybrid search is becoming more in demand. I used FAISS in a RAG system previously, but now I may need to migrate to something else because FAISS lacks hybrid search. I'd love to see hybrid search in FAISS.

mdouze commented 6 months ago

The work on RAG that I am aware of uses only dense search, but I am not an expert. @pemazare ? There is one example of sparse-dense manipulation with Faiss, for clustering, but it is hand-made: https://github.com/facebookresearch/faiss/wiki/Case-studies#mixed-sparse-dense-clustering

amin3141 commented 6 months ago

I'd like to add some context. Research into neural retrieval systems has split into three categories: dense embeddings, sparse embeddings, and late interaction.

Representative dense embedding encoders: USE-QA, Boomerang, Microsoft E5.

Sparse encoders, such as SPLADE v2, produce vectors similar to BM25's, but with better retrieval performance and tunable sparsity.

Late interaction systems, which model the query and document as matrices, not vectors: ColBERT v2.

It's understood now that sparse and dense vectors capture different aspects of a document's information content. Ideally, the retrieval engine should perform an inner product similarity search in the hybrid space, consisting of the concatenation of the sparse and dense vector representations.
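To make the decomposition concrete: because the similarity over the concatenation is a plain inner product, the hybrid score splits into the sum of a dense inner product and a sparse inner product. A minimal numpy sketch (not part of the Faiss API; sparse vectors represented as `{term_id: weight}` dicts for illustration) might look like:

```python
import numpy as np

def hybrid_score(dense_q, sparse_q, dense_d, sparse_d):
    """Inner product in the concatenated hybrid space.

    <[dense_q; sparse_q], [dense_d; sparse_d]> decomposes into
    dot(dense_q, dense_d) + dot(sparse_q, sparse_d), so the two
    parts can be scored separately and summed.
    """
    dense_part = float(np.dot(dense_q, dense_d))
    # Iterate over the smaller dict; absent term ids contribute 0.
    if len(sparse_q) > len(sparse_d):
        sparse_q, sparse_d = sparse_d, sparse_q
    sparse_part = sum(w * sparse_d.get(t, 0.0) for t, w in sparse_q.items())
    return dense_part + sparse_part

# Example: equivalent to scoring the explicit concatenated vectors.
dq, dd = np.array([1.0, 2.0]), np.array([0.5, -1.0])
sq, sd = {3: 2.0, 7: 1.0}, {3: 0.5, 9: 4.0}
score = hybrid_score(dq, sq, dd, sd)  # -1.5 (dense) + 1.0 (sparse) = -0.5
```

The decomposition is what makes exact hybrid scoring easy for a handful of candidates; the hard part, as noted below, is doing the *search* over millions of documents in that concatenated space.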

The problem is that this doesn't scale well. Thus, most systems implementing hybrid search work by separately retrieving the top-N results from a sparse index and a dense index, and then combining them using a variety of techniques (e.g. reciprocal rank fusion, linear interpolation). For more detail, An Analysis of Fusion Functions for Hybrid Retrieval is quite thorough.
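As a sketch of the fusion step (pure Python, not a Faiss feature): reciprocal rank fusion scores each document by summing 1/(k + rank) over the ranked lists it appears in, with k = 60 being the constant from the original RRF paper. The two input lists below would in practice come from, say, a dense Faiss index and a sparse/BM25 index.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc ids via reciprocal rank fusion.

    Each document's fused score is the sum of 1/(k + rank) over the
    lists it appears in (rank is 1-based); k damps the dominance of
    the very top positions. Returns doc ids sorted by fused score.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical top-3 lists from a dense and a sparse retriever.
dense_top = ["d3", "d1", "d7"]
sparse_top = ["d1", "d9", "d3"]
fused = reciprocal_rank_fusion([dense_top, sparse_top])
```

Here `d1` wins because it ranks well in both lists, even though neither retriever put it first; that robustness to disagreement between the retrievers is the usual argument for RRF over linear interpolation of raw scores.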

Beyond the work by Wu et al. that I linked to previously, I'm not aware of any other work that attempts inner product search in the hybrid space. If the community had a performant solution, it would be a boon to retrieval systems (both in the context of RAG and independently).