Jingkang50 / OpenOOD

Benchmarking Generalized Out-of-Distribution Detection
MIT License

A bit confused about the `Gram` implementation #173

Open SauceCat opened 1 year ago

SauceCat commented 1 year ago

https://github.com/Jingkang50/OpenOOD/blob/main/openood/postprocessors/gram_postprocessor.py#L115 I wonder why `dev` is used as `conf` directly. Isn't it the case that the larger the deviations, the more likely the sample is OOD?

I checked the original implementation here: https://github.com/VectorInstitute/gram-ood-detection/blob/master/ResNet_Cifar10.ipynb and found that it actually uses the *negative* of the deviations when computing the metrics.

But the reported metrics look fine, so I am quite confused. Am I missing something?
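To make the sign convention concrete, here is a toy illustration (not the actual OpenOOD code): if larger deviations mean "more OOD", then using deviations directly as confidence ranks OOD samples *above* ID samples, and negating them fixes the ranking.

```python
# Toy illustration of the score sign convention (not the OpenOOD code).
# Convention: higher confidence score should mean "more in-distribution".

def auroc(id_scores, ood_scores):
    """Fraction of (ID, OOD) pairs where the ID sample scores higher
    (ties count half) -- a tiny pairwise AUROC for illustration."""
    wins = 0.0
    for s_id in id_scores:
        for s_ood in ood_scores:
            if s_id > s_ood:
                wins += 1.0
            elif s_id == s_ood:
                wins += 0.5
    return wins / (len(id_scores) * len(ood_scores))

# Hypothetical per-sample deviations: small for ID, large for OOD.
id_dev = [0.5, 0.8, 1.1]
ood_dev = [3.0, 4.2, 5.5]

# Using deviations directly as confidence inverts the ranking.
print(auroc(id_dev, ood_dev))  # -> 0.0

# Negating gives the expected ranking: larger deviation => lower confidence.
print(auroc([-d for d in id_dev], [-d for d in ood_dev]))  # -> 1.0
```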

zjysteven commented 1 year ago

It's indeed confusing. As you said, we are getting >50% AUROC for Gram in all our experiments. Applying the negation seems like the correct thing to do, yet it leads to <50% AUROC. I will try to do some investigation when I'm available.

chandramouli-sastry commented 6 months ago

Hi, I worked on the Gram-matrix method, and I believe I have fixed the implementation here.

The results on CIFAR10 obtained with the current gram-matrix implementation are as follows:

                 FPR@95         AUROC       AUPR_IN       AUPR_OUT           ACC
cifar100    91.68 ±  2.24  58.33 ± 4.49  56.74 ± 3.87  59.24 ±  4.62  95.06 ± 0.30
tin         90.06 ±  1.59  58.98 ± 5.19  61.65 ± 3.75  55.89 ±  5.56  95.06 ± 0.30
nearood     90.87 ±  1.91  58.66 ± 4.83  59.19 ± 3.79  57.57 ±  5.09  95.06 ± 0.30
mnist       70.30 ±  8.96  72.64 ± 2.34  36.92 ± 8.23  93.36 ±  1.21  95.06 ± 0.30
svhn        33.91 ± 17.35  91.52 ± 4.45  82.40 ± 8.85  96.62 ±  1.81  95.06 ± 0.30
texture     94.64 ±  2.71  62.34 ± 8.27  67.93 ± 5.60  55.93 ± 10.76  95.06 ± 0.30
places365   90.49 ±  1.93  60.44 ± 3.41  26.94 ± 2.62  85.64 ±  1.31  95.06 ± 0.30
farood      72.34 ±  6.73  71.74 ± 3.20  53.55 ± 4.74  82.89 ±  3.14  95.06 ± 0.30

With the corrected implementation, I was able to get:

                 FPR@95         AUROC       AUPR_IN      AUPR_OUT           ACC
cifar100   61.61 ± 0.82  84.61 ± 0.20  84.21 ± 0.20  83.75 ± 0.32  95.06 ± 0.30
tin        51.99 ± 1.16  87.16 ± 0.52  88.46 ± 0.41  84.34 ± 0.83  95.06 ± 0.30
nearood    56.80 ± 0.62  85.88 ± 0.35  86.33 ± 0.28  84.04 ± 0.56  95.06 ± 0.30
mnist       7.31 ± 1.02  97.57 ± 0.49  94.37 ± 0.85  99.48 ± 0.14  95.06 ± 0.30
svhn        6.67 ± 0.29  98.64 ± 0.02  96.73 ± 0.06  99.48 ± 0.04  95.06 ± 0.30
texture    14.86 ± 0.71  96.95 ± 0.11  97.99 ± 0.12  95.53 ± 0.09  95.06 ± 0.30
places365  42.81 ± 2.19  89.56 ± 0.80  73.32 ± 1.45  96.53 ± 0.33  95.06 ± 0.30
farood     17.91 ± 0.70  95.68 ± 0.25  90.60 ± 0.35  97.75 ± 0.09  95.06 ± 0.30

When I used the same checkpoints with the code referred to by @SauceCat, I got marginally higher results on SVHN, but I did not test the other datasets. The new code is not fully polished but seems to work as expected. I also did not run experiments with datasets other than CIFAR10 as InD.
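For anyone following along, here is a simplified sketch of the deviation statistic at the heart of the Gram-matrix method. This is my own distillation for illustration; the actual implementation additionally tracks bounds per class, per layer, and per matrix power, and normalizes deviations against a validation set.

```python
# Simplified sketch of the Gram-matrix deviation statistic (illustrative
# only; the real implementation does per-class / per-layer bookkeeping).
import numpy as np

def gram_features(feat, powers=(1, 2)):
    """Per-channel Gram features of one layer's activation map.
    feat: (C, H*W) array. For each power p, compute the element-wise
    p-th-root Gram matrix (|feat|^p @ |feat|^p.T)^(1/p) and summarize
    it by row sums."""
    out = []
    for p in powers:
        fp = np.abs(feat) ** p
        g = (fp @ fp.T) ** (1.0 / p)   # (C, C) Gram matrix
        out.append(g.sum(axis=1))      # (C,) summary vector
    return np.concatenate(out)

def fit_bounds(train_feats):
    """Record element-wise min/max of the Gram features over training data."""
    feats = np.stack([gram_features(f) for f in train_feats])
    return feats.min(axis=0), feats.max(axis=0)

def deviation(feat, mins, maxs, eps=1e-8):
    """How far a test sample's Gram features fall outside the training
    min/max bounds. Larger deviation => more OOD-like, so the confidence
    score should be its negation."""
    v = gram_features(feat)
    below = np.maximum(mins - v, 0) / (np.abs(mins) + eps)
    above = np.maximum(v - maxs, 0) / (np.abs(maxs) + eps)
    return float((below + above).sum())

rng = np.random.default_rng(0)
train = [rng.normal(size=(8, 16)) for _ in range(50)]
mins, maxs = fit_bounds(train)

id_dev = deviation(rng.normal(size=(8, 16)), mins, maxs)       # in-distribution
ood_dev = deviation(5 * rng.normal(size=(8, 16)), mins, maxs)  # shifted sample
# conf = -deviation, so the OOD sample gets the lower confidence.
```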

Thank you for the OpenOOD benchmark and for considering the Gram-matrix method for inclusion!

zjysteven commented 6 months ago

@chandramouli-sastry Thanks for sharing the results, and glad to see the much-improved numbers with the updated implementation. Would you mind opening a pull request for this? Meanwhile, we will update the Gram-matrix results in both the paper and the leaderboard.

chandramouli-sastry commented 6 months ago

Thank you! I just created a pull request for your review.