A question about the get_clip_score_from_feature function

In this function, the similarity between image_features and text_features is computed as: similarity = (100.0 * (image_features/image_nor) @ (text_features/nor).T).softmax(dim=-1). If this value is used as a loss, doesn't minimizing it push image_features and text_features to be as orthogonal as possible? Shouldn't the expectation instead be that image_features and text_features are as similar as possible? I hope to get your answer, thank you very much!
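To make the question concrete, here is a minimal sketch of what the quoted expression computes, written in plain Python rather than PyTorch so it runs without dependencies. The helper names (l2_normalize, softmax, similarity_probs) and the toy vectors are hypothetical, and it assumes text_features may hold more than one text embedding, which the question does not state.

```python
import math

def l2_normalize(vec):
    # Divide by the L2 norm, mirroring image_features / image_nor.
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def softmax(values):
    # Numerically stable softmax over a list of logits.
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def similarity_probs(image_feature, text_features, scale=100.0):
    # Scaled cosine similarity between one image vector and each text
    # vector, followed by a softmax across the text vectors (dim=-1).
    img = l2_normalize(image_feature)
    logits = []
    for text in text_features:
        txt = l2_normalize(text)
        cosine = sum(a * b for a, b in zip(img, txt))
        logits.append(scale * cosine)
    return softmax(logits)

# Hypothetical toy features: one image vector, two text vectors.
image_feature = [1.0, 0.2]
text_features = [[1.0, 0.0], [0.0, 1.0]]

probs = similarity_probs(image_feature, text_features)
# The probabilities sum to 1 and favor the text vector that is
# closer in direction to the image vector.
print(probs)

# With a single text vector the softmax has only one entry.
single = similarity_probs(image_feature, [[0.0, 1.0]])
print(single)
```

In this sketch the softmax runs across the text vectors, so each output is a relative probability among the texts rather than a raw cosine similarity. With only one text vector the output is always 1.0, whatever the angle between the two features.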

ZhexinLiang / CLIP-LIT

A question about the get_clip_score_from_feature function #20