deeplearning-wisc / MCM

PyTorch implementation of MCM (Delving into Out-of-Distribution Detection with Vision-Language Representations), NeurIPS 2022

About the Mahalanobis distance method #11

Closed jimo17 closed 1 year ago

jimo17 commented 1 year ago

Hi, thank you for open-sourcing this. I want to reproduce the Mahalanobis distance method from the paper. What is the default value of `args.feat_dim` used in the paper?

alvinmingsf commented 1 year ago

Hi! Thanks for your interest in our work. `feat_dim` is the feature dimension of the visual encoder (512 for ViT-B and 768 for ViT-L; see L39 in the latest version of eval_ood_detection.py).
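If it helps, here is a minimal sketch for checking the feature dimension of a given CLIP visual encoder empirically (this uses the openai/CLIP package; the model name and dummy input are just illustrative):

```python
import torch
import clip  # openai/CLIP package

# "ViT-B/16" yields 512-dim features; "ViT-L/14" yields 768-dim features.
model, preprocess = clip.load("ViT-B/16", device="cpu")

# Encode a dummy image just to inspect the embedding size
# (no normalization needed for a shape check); args.feat_dim should match this.
dummy = torch.zeros(1, 3, 224, 224)
with torch.no_grad():
    feat = model.encode_image(dummy)
print(feat.shape[-1])  # 512 for ViT-B/16
```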

jimo17 commented 1 year ago

Hi, thank you very much for your answer. I have two more questions.

The first question: the parameter `args.max_count` of the Mahalanobis distance method does not seem to be used in your code, which appears to use the entire training set to estimate the classwise means and the precision matrix.

The second question is about the Energy-based method in the paper. Is the cosine similarity output by the CLIP model fed directly into this line of code: `_score.append(-to_np((args.T*torch.logsumexp(output / args.T, dim=1)))) # energy score is expected to be smaller for ID`? The Energy-based approach I ran with this code did not work out very well, and I am still looking for the reason. I am not sure whether I chose the wrong temperature or whether something else is off.

alvinmingsf commented 1 year ago

Maha score: Yes, `max_count` is optional. It can be used to select a subset of the training set for estimating the mean and covariance, for better computational efficiency when the training set is large (e.g. ImageNet-1k; see L56-65 in utils/train_eval_util.py). It is disabled by default. As our focus is on the MCM score, we keep the implementation of the Maha score simple, just for reference. There are different implementations of the Maha score: e.g., whether to use the classwise covariance or the population covariance matrix, whether the feature embeddings are l2-normalized (https://github.com/inspire-group/SSD), etc.
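For concreteness, here is a minimal sketch of one common variant (classwise means with a shared population precision matrix, no l2 normalization; the function names and NumPy formulation are illustrative, not the repo's exact implementation):

```python
import numpy as np

def fit_mahalanobis(train_feats, train_labels, num_classes):
    """Estimate classwise means and a shared (population) precision matrix.

    train_feats: (N, D) visual features; train_labels: (N,) integer class ids.
    """
    means = np.stack([train_feats[train_labels == c].mean(axis=0)
                      for c in range(num_classes)])     # (C, D)
    centered = train_feats - means[train_labels]        # subtract each sample's class mean
    cov = centered.T @ centered / len(train_feats)      # shared covariance, (D, D)
    precision = np.linalg.pinv(cov)                     # pseudo-inverse for numerical stability
    return means, precision

def maha_score(feats, means, precision):
    """Negative of the minimum classwise Mahalanobis distance (higher = more ID-like)."""
    diffs = feats[:, None, :] - means[None, :, :]       # (N, C, D)
    dists = np.einsum("ncd,de,nce->nc", diffs, precision, diffs)
    return -dists.min(axis=1)
```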

Energy score: "Is the cosine similarity output by the CLIP model fed directly into this line of code?" Yes. However, when the energy score is applied to the multi-modal cosine similarities, it does not work well in our prior experiments (see Appendix F, Alternative Scoring Functions). I also tried different temperatures; the energy score still does not work well in the zero-shot setting.
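To make the contrast concrete, here is a hedged sketch of both zero-shot scores computed from the same cosine similarities (variable and function names are illustrative):

```python
import torch
import torch.nn.functional as F

def zero_shot_scores(image_feats, text_feats, T=1.0):
    """image_feats: (N, D) and text_feats: (C, D), both l2-normalized."""
    cos_sim = image_feats @ text_feats.t()                 # (N, C) cosine similarities
    mcm = F.softmax(cos_sim / T, dim=1).max(dim=1).values  # MCM: max softmax score
    # The energy variant plugs the same cosine similarities into logsumexp;
    # the repo appends the negative of this quantity, so it is smaller for ID.
    lse = T * torch.logsumexp(cos_sim / T, dim=1)
    return mcm, -lse
```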

jimo17 commented 1 year ago

Hi, thank you very much for your reply. In Table 2 of your paper, the Energy-based approach works quite well, but the AUROC I get with your code is only around 0.60, and I am not sure whether I am doing something wrong. What temperature setting was used for the Energy method in Table 2?

alvinmingsf commented 1 year ago

In Table 2, the Energy baseline means the standard energy method, where the energy score is applied to a logit layer added on top of the frozen visual features of CLIP after a linear probe (https://github.com/openai/CLIP). The temperature is 1 by default. All implementations in this codebase are zero-shot methods (i.e., no additional layers are added) to avoid confusion. All scoring functions (energy, variance, entropy, etc.) are applied directly to the cosine similarities.
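A minimal sketch of that baseline, under the assumptions stated above (the features, labels, optimizer, and training schedule here are stand-ins for illustration, not the exact setup used in the paper):

```python
import torch
import torch.nn as nn

feat_dim, num_classes = 512, 10                    # e.g. ViT-B/16 features; 10 classes for demo
# Stand-ins for precomputed frozen CLIP visual features and labels (assumed inputs).
train_feats = torch.randn(256, feat_dim)
train_labels = torch.randint(0, num_classes, (256,))
test_feats = torch.randn(32, feat_dim)

probe = nn.Linear(feat_dim, num_classes)           # the only trainable component
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# Fit the linear probe; the CLIP encoders themselves stay frozen throughout.
for _ in range(100):
    loss = criterion(probe(train_feats), train_labels)
    optimizer.zero_grad(); loss.backward(); optimizer.step()

# Standard energy score on the probe's logits with T = 1.
T = 1.0
with torch.no_grad():
    neg_energy = T * torch.logsumexp(probe(test_feats) / T, dim=1)
```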

jimo17 commented 1 year ago

Hello, thank you very much for your reply. For the energy-based method in Table 2 of your paper, has the CLIP model been fine-tuned?

alvinmingsf commented 1 year ago

Yes, for the energy-based baseline in Table 2, as mentioned above, we use a simple form of fine-tuning (a linear probe, i.e., the CLIP encoders are frozen, but an additional fully connected layer added on top of the CLIP visual encoder is trained).