Closed JiuqingDong closed 5 months ago
Thanks for your question. The original CLIP code multiplies the cosine similarity by 100 as a softmax temperature. In the field of OOD detection, it is important to remove this factor of 100 and set the temperature to 1, as shown in the MCM paper. So I follow MCM.
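To illustrate why the temperature matters, here is a minimal sketch (with hypothetical similarity values, not taken from either repo) of how scaling cosine similarities by 100 saturates the softmax, while temperature 1 keeps the maximum probability soft enough to separate ID from OOD samples:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# hypothetical cosine similarities between one image and 3 class prompts
sims = np.array([0.30, 0.25, 0.20])

# CLIP-style logits: multiplied by 100 (a very low softmax temperature)
p_scaled = softmax(100 * sims)
# temperature 1, as MCM uses for OOD detection
p_temp1 = softmax(sims)

print(p_scaled.max())  # near 1.0: the max softmax score saturates
print(p_temp1.max())   # much softer maximum, preserving score separation
```

With the scaled logits, almost every input gets a near-1 maximum softmax score, so the MCM-style score can no longer distinguish in-distribution from OOD inputs.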
I understand, but I didn't see such a hyperparameter in MCM, so it confused me.
Yes. In my code, we compute logits_per_image = logit_scale * image_features @ text_features.t()
in https://github.com/AtsuMiyai/LoCoOp/blob/master/clip_w_local/model.py#L406,
so we need to divide by logit_scale (which is 100).
For MCM, they directly calculate the logits without logit_scale: output = image_features @ text_features.T
in https://github.com/deeplearning-wisc/MCM/blob/640657ea67cb961045e0999301a6b8101dad65ba/utils/detection_util.py#L232C17-L232C58,
so they don't need to divide.
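The two code paths end up with the same scores. A small sketch (with random, hypothetical normalized features; logit_scale fixed at 100 for illustration) showing that dividing the LoCoOp-style scaled logits by logit_scale recovers exactly the unscaled cosine similarities that MCM computes directly:

```python
import numpy as np

rng = np.random.default_rng(0)
# hypothetical L2-normalized features: 2 images, 3 text prompts, dim 8
image_features = rng.normal(size=(2, 8))
image_features /= np.linalg.norm(image_features, axis=-1, keepdims=True)
text_features = rng.normal(size=(3, 8))
text_features /= np.linalg.norm(text_features, axis=-1, keepdims=True)

logit_scale = 100.0  # CLIP's learned scale, roughly 100 after training

# LoCoOp path: scaled logits, then divide back before the softmax
logits_per_image = logit_scale * image_features @ text_features.T
output_locoop = logits_per_image / logit_scale

# MCM path: raw cosine similarities, no division needed
output_mcm = image_features @ text_features.T

assert np.allclose(output_locoop, output_mcm)
```

So the division by 100 is not an extra hyperparameter; it just undoes CLIP's built-in scaling so both implementations feed temperature-1 similarities into the softmax.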
Dear Author,
I have a question: why do you divide the output by 100?
Is the '100' a hyperparameter? Why do different scaling values yield different OOD performance?