LAION-AI / laion-dedup


Computing hashes from embeddings #1

Open rom1504 opened 1 year ago

rom1504 commented 1 year ago

https://github.com/facebookresearch/faiss/issues/2531#issuecomment-1280695975 some thoughts here

https://docs.google.com/document/d/1AryWpV0dD_r9x82I_quUzBuRyzDotL_HHnKuNB9H3Zc/edit?usp=drivesdk more thoughts there

Also this https://github.com/LAION-AI/project-menu/issues/28

rom1504 commented 1 year ago

Trying to find a function f such that, for embeddings e of size (say) 1024 floats, for most e1, e2 in the space of my embeddings, with h1 = f(e1) and h2 = f(e2): if sim(e1, e2) > threshold then h1 = h2 (and if sim(e1, e2) < threshold then h1 != h2), with h preferably encoded as a small number of bytes.
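Spelled out as code (a hypothetical sketch; sim is taken to be cosine similarity and the hash type is arbitrary), the property we want from f is:

```python
import numpy as np

def sim(e1, e2):
    # cosine similarity between two embeddings
    return float(np.dot(e1, e2) / (np.linalg.norm(e1) * np.linalg.norm(e2)))

def property_holds(f, e1, e2, threshold):
    # f maps an embedding to a small hash (e.g. a bytes object)
    h1, h2 = f(e1), f(e2)
    if sim(e1, e2) > threshold:
        return h1 == h2
    return h1 != h2  # only required for "most" pairs, not all
```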

One use case of such a function f would be to perform efficient deduplication of items represented by embeddings.

I think it would be possible to directly train a neural net to be f. But I'm wondering if the quantization techniques implemented in faiss could also be a good approach.

Maybe the binary codes produced by IndexLSH could work. Maybe the codes produced by a PQ index could be helpful too.
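A minimal sketch of the LSH variant, assuming a recent faiss build (the data is random and the dimension/bit counts are arbitrary choices):

```python
import numpy as np
import faiss  # assumes faiss-cpu is installed

d, nbits = 1024, 256                      # 1024-d embeddings -> 32-byte hashes
xb = np.random.rand(10000, d).astype("float32")
faiss.normalize_L2(xb)                    # so inner product ~ cosine similarity

lsh = faiss.IndexLSH(d, nbits)
lsh.train(xb)
codes = lsh.sa_encode(xb)                 # uint8 array of shape (10000, nbits // 8)
print(codes.shape, codes.dtype)
```

Bucketing items by exact code equality is the strictest reading of h1 == h2; grouping by Hamming distance on the codes would be the softer variant.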

rom1504 commented 1 year ago

On the trained network path:

We could probably generate a bunch of positive and negative pairs using the existing faiss index, then use those to train f with a triplet loss / a loss like CLIP's.

f would need to quantize the embedding into a small number of comparable bytes. We might need something like Gumbel softmax to keep that quantization step differentiable, not sure.
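A minimal sketch of that idea in PyTorch, using a straight-through estimator as one alternative to Gumbel softmax (all names, shapes and the pair-mining step are assumptions):

```python
import torch
import torch.nn as nn

# Hypothetical f: a linear head whose outputs are binarized with a
# straight-through estimator.
class HashNet(nn.Module):
    def __init__(self, dim=1024, nbits=256):
        super().__init__()
        self.proj = nn.Linear(dim, nbits)

    def forward(self, e):
        logits = self.proj(e)
        soft = torch.tanh(logits)
        hard = torch.sign(logits)            # effectively binary {-1, +1} codes
        return hard + soft - soft.detach()   # forward: hard codes, backward: tanh grads

f = HashNet()
triplet = nn.TripletMarginLoss(margin=1.0)

# anchor / positive / negative would be mined from the existing faiss index;
# random tensors stand in for them here
a, p, n = (torch.randn(32, 1024) for _ in range(3))
loss = triplet(f(a), f(p), f(n))
loss.backward()
```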

rom1504 commented 1 year ago

But it seems to me the quantizations performed in faiss are quite similar to what we want here, though they might not be fully optimizable towards the right task. Maybe https://github.com/lucidrains/vector-quantize-pytorch could help.
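For reference, a small sketch of what plugging that library in could look like (the parameters are illustrative assumptions, and its API may differ between versions):

```python
import torch
from vector_quantize_pytorch import VectorQuantize  # pip install vector-quantize-pytorch

# A trainable VQ layer as the quantization step of f; the returned
# codebook indices would serve as the hash bytes.
vq = VectorQuantize(dim=1024, codebook_size=256, commitment_weight=1.0)

e = torch.randn(32, 1, 1024)               # 32 embeddings, sequence length 1
quantized, indices, commit_loss = vq(e)    # indices: (32, 1) codebook ids
```

A single 256-entry codebook only yields one byte per embedding; stacking quantizers (e.g. the library's ResidualVQ) would be one way to get longer hashes.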

mehdidc commented 1 year ago

@rom1504 related to what you say on the trained network path, we might want to look at what people do in deep hashing. E.g. DistillHash (https://openaccess.thecvf.com/content_CVPR_2019/papers/Yang_DistillHash_Unsupervised_Deep_Hashing_by_Distilling_Data_Pairs_CVPR_2019_paper.pdf) seems relevant: it specifically deals with the case where we can sample positive/negative pairs following some pre-defined criterion, and it learns a hash function that preserves the pair relationships.

rom1504 commented 1 year ago

https://www.algolia.com/blog/ai/vectors-vs-hashes/ learns binary hashes with a NN. It's only a speed trick and doesn't change things compared to embeddings here, but interesting.