Open heemin32 opened 9 months ago
That's an interesting database functionality. I don't see immediately how this could be supported in Faiss, but let's keep it in mind.
@mdouze I see @heemin32 has updated the proposal, I think this will be a good feature to go in side by side of IDSelector Interface or even extending that interface.
This will provide the a better way to get top K similar results, if vectors are related in some fashion. Please provide your thoughts.
@mdouze Any further thought on this? As @navneet1v mentioned, we can add something like IDDeduper(or IDGroup) and pass it through search parameter. During the result collection, we can dedupe the ID and only store the nearest ID/Distance.
AFAICS this is of broad enough interest to put into main Faiss. You have to realize that adding a new result filtering criterion means touching ~20 index implementations, at least to raise an error signaling it is not supported. Workarounds are:
I think we can support the feature incrementally starting hnsw index. For all other index, we can throw an error message like what we do for search parameter. FAISS_THROW_IF_NOT_MSG(!params, "search params not supported for this index");
@mdouze The workarounds provided won't work well.
Thanks for pointing this, I agree that we might need to touch more than ~20 index implementation, but as @heemin32 pointed out we will start with HNSW and for all other we will make sure that we are sending errors. Apart from this will there be anything that we are missing?
Sorry for my message above, I meant this is not of broad enough interest to put in Faiss.
@mdouze is there any faiss mailing list on which we can send out this feature and see if there are users interested in this feature?
Reason why I saying this is because the use case mentioned in the description is very popular use case in today's world with RAG/Semantic Search use cases. Given that embeddings models cannot create embedding for large documents in 1 call, so we need to create M embeddings for 1 document by splitting the larger document in smaller chunks. Similarly while doing search we want to make sure that chunks of same documents are not returned back as K
results.
Please let us know your thoughts.
@mdouze I raised a draft PR for vector deduplication in HNSW. https://github.com/facebookresearch/faiss/pull/3140 I am going to continue to work to make the PR to meet the bar to be merged in the main branch. Could you take a look and provide a feedback? Any objection on continuing the work?
@mdouze I added a new PR which introduce a result collector so that caller can implement deduplication logic outside of faiss repo. Could you take a look and provide a comment? https://github.com/facebookresearch/faiss/pull/3161
After the refactoring around ResultHandler, https://github.com/facebookresearch/faiss/pull/3190, I came up with new plan to support grouping of result.
Summary
I would like to see vector deduplication support in Faiss.
There will be a set of vectors which are under a same group. During KNN search, I would like to get k nearest vectors for each group but not more than one from the same group.
For example, let's say a user have a set of large documents and it wants to search the document using vector representation of the document. Because the document is big, user might need to split the original document(parent document) into smaller documents(child document) and generate vector for each smaller document using ml model. The vector data of child document will be indexed and searched using faiss library.
The problem is that, because search is happening in the vector date of child documents, it does not guarantee to return k vector of child documents with distinct parent document. Ideally, it should dedupe the search result per parent document and return k top vector of child document of which parent document is all distinct.
We could extend https://github.com/facebookresearch/faiss/blob/main/faiss/impl/IDSelector.h or add new IDDeduper to deduplicate the results.