zxyonaroll opened this issue 1 year ago
Thanks for your question. Unlike other modalities, Vision logits are not scaled by a temperature: https://github.com/facebookresearch/ImageBind/blob/0f8620b6678fd24c35f172721ea6046ab5780890/models/imagebind_model.py#L432
If we look at the cosine similarity for Vision x Vision (so dropping the softmax), you can see the diagonal is exactly 1.0, which matches the expected behaviour.
tensor([[1.0000, 0.3682, 0.4185],
[0.3682, 1.0000, 0.3172],
[0.4185, 0.3172, 1.0000]], device='cuda:0')
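For reference, here is a minimal sketch of how that matrix can be reproduced, based on the usage example in the repo's README (exact import paths may differ depending on the repo version):

```python
import torch
import torch.nn.functional as F

import data
from models import imagebind_model
from models.imagebind_model import ModalityType

device = "cuda:0" if torch.cuda.is_available() else "cpu"

# Load the pretrained model, as in the repo's usage example.
model = imagebind_model.imagebind_huge(pretrained=True)
model.eval()
model.to(device)

image_paths = [".assets/dog_image.jpg", ".assets/car_image.jpg", ".assets/bird_image.jpg"]
inputs = {ModalityType.VISION: data.load_and_transform_vision_data(image_paths, device)}

with torch.no_grad():
    embeddings = model(inputs)

# Cosine similarity = dot product of L2-normalized embeddings. The model
# already returns unit-norm vision embeddings, so normalizing again is a
# no-op, but it makes the cosine computation explicit. The diagonal
# (each image against itself) is exactly 1.0.
vision = F.normalize(embeddings[ModalityType.VISION], dim=-1)
print(vision @ vision.T)
```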
Please let us know if you have any questions.
So when should one use softmax and when cosine similarity? Is there a uniform measurement standard? I think that was the original intent of this model: "One Embedding Space To Bind Them All", so I would expect one uniform output standard. What do you think? Thank you very much.
> So when should one use softmax and when cosine similarity?
I am also trying to understand the above discussion.
If I want to find the most similar image to a given image, what should I use and how?
Thanks.
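Based on the maintainer's note above, one straightforward approach is cosine similarity over the vision embeddings. A minimal sketch, assuming `gallery_embs` holds the embeddings of the candidate images and `query_emb` the embedding of the query image (the names are illustrative, not part of any API):

```python
import torch
import torch.nn.functional as F

def most_similar(query_emb: torch.Tensor, gallery_embs: torch.Tensor) -> int:
    """Index of the gallery embedding closest to the query under cosine
    similarity. query_emb: (D,), gallery_embs: (N, D)."""
    query = F.normalize(query_emb, dim=-1)
    gallery = F.normalize(gallery_embs, dim=-1)
    sims = gallery @ query  # (N,) cosine similarities in [-1, 1]
    return int(torch.argmax(sims))
```

Both tensors would come from `model({ModalityType.VISION: ...})[ModalityType.VISION]`. Exclude the query image from the gallery, or it will trivially match itself at similarity 1.0. A softmax over the similarities is only useful if you want the scores to read as a probability distribution over candidates; since softmax is monotonic, it does not change the ranking.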
As shown above, I used the original assets (text, image, audio) from the main branch, and found that Vision x Vision is not correct: dog_image x dog_image is not 1, while the other two diagonal entries are 1.