facebookresearch / faiss

A library for efficient similarity search and clustering of dense vectors.
https://faiss.ai
MIT License
29.42k stars 3.48k forks source link

[Feature Request]Vector deduplication #3087

Open heemin32 opened 9 months ago

heemin32 commented 9 months ago

Summary

I would like to see vector deduplication support in Faiss.

There will be a set of vectors which are under a same group. During KNN search, I would like to get k nearest vectors for each group but not more than one from the same group.

For example, let's say a user have a set of large documents and it wants to search the document using vector representation of the document. Because the document is big, user might need to split the original document(parent document) into smaller documents(child document) and generate vector for each smaller document using ml model. The vector data of child document will be indexed and searched using faiss library.

The problem is that, because search is happening in the vector date of child documents, it does not guarantee to return k vector of child documents with distinct parent document. Ideally, it should dedupe the search result per parent document and return k top vector of child document of which parent document is all distinct.

We could extend https://github.com/facebookresearch/faiss/blob/main/faiss/impl/IDSelector.h or add new IDDeduper to deduplicate the results.

mdouze commented 9 months ago

That's an interesting database functionality. I don't see immediately how this could be supported in Faiss, but let's keep it in mind.

navneet1v commented 8 months ago

@mdouze I see @heemin32 has updated the proposal, I think this will be a good feature to go in side by side of IDSelector Interface or even extending that interface.

This will provide the a better way to get top K similar results, if vectors are related in some fashion. Please provide your thoughts.

heemin32 commented 7 months ago

@mdouze Any further thought on this? As @navneet1v mentioned, we can add something like IDDeduper(or IDGroup) and pass it through search parameter. During the result collection, we can dedupe the ID and only store the nearest ID/Distance.

mdouze commented 7 months ago

AFAICS this is of broad enough interest to put into main Faiss. You have to realize that adding a new result filtering criterion means touching ~20 index implementations, at least to raise an error signaling it is not supported. Workarounds are:

heemin32 commented 7 months ago

I think we can support the feature incrementally starting hnsw index. For all other index, we can throw an error message like what we do for search parameter. FAISS_THROW_IF_NOT_MSG(!params, "search params not supported for this index");

navneet1v commented 7 months ago

@mdouze The workarounds provided won't work well.

Thanks for pointing this, I agree that we might need to touch more than ~20 index implementation, but as @heemin32 pointed out we will start with HNSW and for all other we will make sure that we are sending errors. Apart from this will there be anything that we are missing?

mdouze commented 7 months ago

Sorry for my message above, I meant this is not of broad enough interest to put in Faiss.

navneet1v commented 7 months ago

@mdouze is there any faiss mailing list on which we can send out this feature and see if there are users interested in this feature?

Reason why I saying this is because the use case mentioned in the description is very popular use case in today's world with RAG/Semantic Search use cases. Given that embeddings models cannot create embedding for large documents in 1 call, so we need to create M embeddings for 1 document by splitting the larger document in smaller chunks. Similarly while doing search we want to make sure that chunks of same documents are not returned back as K results.

Please let us know your thoughts.

heemin32 commented 7 months ago

@mdouze I raised a draft PR for vector deduplication in HNSW. https://github.com/facebookresearch/faiss/pull/3140 I am going to continue to work to make the PR to meet the bar to be merged in the main branch. Could you take a look and provide a feedback? Any objection on continuing the work?

heemin32 commented 7 months ago

@mdouze I added a new PR which introduce a result collector so that caller can implement deduplication logic outside of faiss repo. Could you take a look and provide a comment? https://github.com/facebookresearch/faiss/pull/3161

heemin32 commented 5 months ago

After the refactoring around ResultHandler, https://github.com/facebookresearch/faiss/pull/3190, I came up with new plan to support grouping of result.

  1. Create IDGrouper similar to IDSelector. Add two grouper to start with: Bitmap, SparseBitmap
  2. Add IDGrouper to SearchParameter
  3. Create GroupedHeapBlockResultHandler similar to HeapBlockResultHandler. This class will have two additional data: an array of group id corresponding to an array of ids, a map from group id to index in an array of ids
  4. Implement a method to handle heap with an array of group id to be used inside GroupedHeapBlockResultHandler
  5. Modify search method to switch result handler if search parameter has IDGrouper