Open rom1504 opened 2 years ago
Trying to find a function f such that for an embedding e of size let's say 1024 floats i have for most e1, e2 in the space of my embeddings : h1 = f(e1) h2 = f(e2) Such that if sim(e1, e2) > threshold then h1 = h2 (and if sim(e1,e2) < threshold then h1 != h2) With h preferably encoded as a small amount of bytes
One use case of such a function f would be to perform efficient deduplication of items represented by embeddings.
I think it would be possible to directly train a neural net to be f. But I'm wondering if using the quantization techniques implemented in faiss could be also a good technique.
Maybe the encodings produced by IndexLSH could work. Maybe ones produced by PQ index could be helpful too.
On the trained network path:
We could probably generate a bunch of positive and negative by using the existing faiss index. Then use that to train f with triple loss/ a loss like clip.
f would need to quantize the embedding into a small amount of = comparable bytes Maybe needed to use something like Gumbel softmax, not sure
But it seems to me the quantization performed in faiss are quite similar to what we want here. But might not be fully optimizable towards the right task Maybe https://github.com/lucidrains/vector-quantize-pytorch
@rom1504 related to what you say, on the trained network path, we might want to look at what people do in deep hashing. e.g. DistillHash (https://openaccess.thecvf.com/content_CVPR_2019/papers/Yang_DistillHash_Unsupervised_Deep_Hashing_by_Distilling_Data_Pairs_CVPR_2019_paper.pdf) seems to be relevant, it specifically deals with the case where we can sample positive/negative pairs following some pre-defined criterion, and it learns a hash function preserving the pairs relationship.
https://www.algolia.com/blog/ai/vectors-vs-hashes/ learn binary hashes with a nn Only a speed trick and doesn't change things compared to embeddings here but interesting
https://github.com/facebookresearch/faiss/issues/2531#issuecomment-1280695975 some thoughts here
https://docs.google.com/document/d/1AryWpV0dD_r9x82I_quUzBuRyzDotL_HHnKuNB9H3Zc/edit?usp=drivesdk more thoughts there
Also this https://github.com/LAION-AI/project-menu/issues/28