In this function, the similarity between image_features and text_features is computed here:
similarity = (100.0 * (image_features/image_nor) @ (text_features/nor).T).softmax(dim=-1)
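To make the expression concrete, here is a minimal, dependency-free sketch of what I understand it to compute for a single image. The helper names (`normalize`, `softmax`, `similarity_probs`) and the toy vectors are my own assumptions, not code from the repository, and the cross-entropy at the end is only one common way such probabilities might be turned into a loss.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def similarity_probs(image_feature, text_features, scale=100.0):
    """Plain-Python analogue of (scale * img_n @ txt_n.T).softmax(dim=-1) for one image."""
    img = normalize(image_feature)
    logits = []
    for t in text_features:
        txt = normalize(t)
        cosine = sum(a * b for a, b in zip(img, txt))  # cosine similarity
        logits.append(scale * cosine)
    return softmax(logits)

# Hypothetical toy features: the image is nearly parallel to the first text
# and nearly orthogonal to the second.
image = [1.0, 0.1]
texts = [[1.0, 0.0], [0.0, 1.0]]
probs = similarity_probs(image, texts)

# Assumed cross-entropy loss, taking each text in turn as the "correct" label.
loss_match = -math.log(probs[0])     # label = the nearly parallel text
loss_mismatch = -math.log(probs[1])  # label = the nearly orthogonal text
```

With these toy vectors the probability mass falls almost entirely on the first text, whose direction is closest to the image, so `loss_match` is close to zero while `loss_mismatch` is large.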
If this is used as a loss, wouldn't minimizing it push image_features and text_features to be as orthogonal as possible? Shouldn't the goal instead be for image_features and text_features to be as similar as possible?
I look forward to your answer. Thank you very much!