Vision x Vision NOT what we want

zxyonaroll commented 1 year ago

As you can see above, I used the original assets (text, image, audio) from the main branch and found that the Vision x Vision result looks incorrect: dog_image x dog_image is not 1, while the other two are 1.

aelnouby commented 1 year ago

Thanks for your question. Unlike other modalities, Vision logits are not scaled by a temperature: https://github.com/facebookresearch/ImageBind/blob/0f8620b6678fd24c35f172721ea6046ab5780890/models/imagebind_model.py#L432

If we look at the cosine similarity for Vision x Vision (so dropping the softmax), you can see the diagonal is exactly 1.0, which matches the expected behaviour.

tensor([[1.0000, 0.3682, 0.4185],
        [0.3682, 1.0000, 0.3172],
        [0.4185, 0.3172, 1.0000]], device='cuda:0')
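
To make the effect of the missing temperature concrete, here is a minimal sketch in plain Python (no PyTorch needed). It applies a row-wise softmax to the cosine-similarity matrix above, once as-is and once after multiplying by a scale factor. The helper names and the scale of 20.0 are assumptions chosen for illustration, not values taken from the ImageBind code.

```python
import math

# Cosine-similarity matrix for Vision x Vision, copied from the output above.
cosine = [
    [1.0000, 0.3682, 0.4185],
    [0.3682, 1.0000, 0.3172],
    [0.4185, 0.3172, 1.0000],
]


def softmax(row):
    """Numerically stable softmax over one row of logits."""
    row_max = max(row)
    exps = [math.exp(x - row_max) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_rows(matrix, scale=1.0):
    """Multiply every logit by `scale`, then apply softmax to each row."""
    return [softmax([scale * x for x in row]) for row in matrix]


# No temperature: logits are raw cosine similarities in [-1, 1].
unscaled = softmax_rows(cosine)

# Hypothetical temperature scaling; 20.0 is an illustrative value only.
scaled = softmax_rows(cosine, scale=20.0)

for name, probs in (("unscaled", unscaled), ("scaled", scaled)):
    diagonal = [round(probs[i][i], 4) for i in range(len(probs))]
    print(name, "diagonal:", diagonal)
```

Because cosine similarities lie in [-1, 1], the unscaled logits are too close together for the softmax to saturate, so the diagonal stays well below 1 even though each image matches itself perfectly. Once the logits are scaled up, the same softmax puts almost all of its mass on the diagonal.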

Please let us know if you have any questions.

zxyonaroll commented 1 year ago

So when should softmax be used, and when cosine similarity? Is there a uniform measurement standard? I think that was the original idea behind this large model: One Embedding Space To Bind Them All. So I would expect one uniform output standard. What do you think? Thank you very much.

bakachan19 commented 1 year ago

I have the same question: when to use softmax and when to use cosine? I am also trying to understand the above discussion. If I want to find the image most similar to a given image, which should I use, and how? Thanks.
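
One possible reading of the reply above is that, for image-to-image retrieval, ranking by cosine similarity is enough. The sketch below shows that idea in plain Python. The function names and the toy three-dimensional embeddings are hypothetical and only stand in for real image embeddings; this is not code from the ImageBind repository.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine similarity: dot product of the two unit-length vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def most_similar(query, gallery):
    """Return (index, score) of the gallery embedding closest to the query."""
    scores = [cosine_similarity(query, g) for g in gallery]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


# Toy embeddings, made up for illustration only.
query = [0.9, 0.1, 0.2]
gallery = [
    [0.1, 0.9, 0.3],
    [0.8, 0.2, 0.1],
    [0.2, 0.1, 0.9],
]

best_index, best_score = most_similar(query, gallery)
print("best match:", best_index, "score:", round(best_score, 4))
```

Softmax is monotonic within a row, so applying it would not change which gallery image ranks first. It only turns the scores into a distribution over the candidates, which is why the raw cosine value is the easier number to interpret for a single image pair.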

facebookresearch / ImageBind

Vision x Vision NOT what we want #19