Jingkang50 / OpenOOD

Benchmarking Generalized Out-of-Distribution Detection
MIT License

OOD detection with Mahalanobis distance method #154

Closed: 2454511550Lin closed this issue 1 year ago

2454511550Lin commented 1 year ago

Dear authors, I was running the script to perform OOD detection with the NeurIPS'18 MDS method on CIFAR-10 (ID) and its corresponding OOD datasets. I found that in the first step, when loading the model pre-trained on CIFAR-10, the test accuracy on CIFAR-10 is not correct. I am using a ResNet-18 checkpoint with 94.3% test accuracy, yet the log shows an accuracy of only 40.77%. The detection results are correspondingly bad:

dataset     FPR@95   AUROC   AUPR_IN   AUPR_OUT   CCR_4   CCR_3   CCR_2   CCR_1   ACC
---------   ------   -----   -------   --------   -----   -----   -----   -----   -----
cifar100    85.87    61.48   58.65     62.44      0.00    0.04    0.71    7.26    40.77
tin         86.49    59.69   57.08     60.70      0.00    0.10    0.62    6.13    40.77
nearood     86.18    60.58   57.87     61.57      0.00    0.07    0.67    6.69    40.77
mnist       0.00     98.71   98.51     99.19      39.36   39.68   39.87   40.04   40.77
svhn        94.62    65.20   50.56     80.11      0.47    2.47    6.89    15.17   40.77
texture     64.41    78.05   83.66     71.90      0.22    0.92    4.59    16.86   40.77
place365    93.60    52.33   21.88     80.93      0.00    0.10    0.59    5.36    40.77
farood      63.16    73.57   63.65     83.03      10.01   10.79   12.98   19.36   40.77

I tested other OOD methods such as KNN (ICML'22), ODIN (ICLR'18), and EBO (NeurIPS'20), all of which show the correct test accuracy (94.3%). Any suggestions and help would be greatly appreciated!

zjysteven commented 1 year ago

The reason for this is that currently the ID accuracy is computed from the predictions derived by the corresponding post-processor, which may differ from the predictions of the underlying classifier. For the Mahalanobis detector, you can confirm this by looking here. We are currently working on an updated version where the ID accuracy is decoupled from the post-processor. Stay tuned!
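
To make the discrepancy concrete, here is a minimal sketch (illustrative tensor names, not the actual OpenOOD code) of how a post-processor's predictions can diverge from the classifier's own predictions:

import torch

def classifier_predict(logits):
    # MSP/ODIN/EBO-style detectors keep the underlying classifier's prediction,
    # so the reported ID accuracy matches the checkpoint's test accuracy.
    return logits.argmax(dim=1)

def mahalanobis_predict(features, class_means, shared_precision):
    # The Mahalanobis detector instead predicts the class whose Gaussian is
    # closest in feature space, which need not agree with the softmax argmax.
    diffs = features.unsqueeze(1) - class_means.unsqueeze(0)          # (N, C, D)
    dists = torch.einsum('ncd,de,nce->nc', diffs, shared_precision, diffs)
    return dists.argmin(dim=1)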

That said, I don't think the bad performance of MDS has anything to do with the ID accuracy. The detector itself should be functioning correctly; for example, the AUROC of CIFAR-10 vs. MNIST is pretty high (>98%). In fact, MDS (the variant which uses intermediate layers' features) is known to excel at "easy" or far-OOD detection (e.g., CIFAR-10 vs. MNIST), but may perform poorly on more difficult OOD detection tasks.

2454511550Lin commented 1 year ago
  1. That resolves my confusion. Thank you!
  2. I see. I will try some hyperparameter tuning to see if I can get a better result. In the ICML'22 KNN paper [Link], Table 1 shows very high AUROC (>=90%) across benchmarks for some older methods such as ODIN, the energy-based score, and MDS. [screenshot] Thank you again for your quick response and the clarification!

zjysteven commented 1 year ago

One subtle detail about MDS is whether you compute the score using 1) only the features from the penultimate layer, or 2) the features from both the intermediate and penultimate layers. These two variants can lead to very different results. Currently OpenOOD implements the second version, and it is often unclear which version is being used when papers report results, which can make the reported numbers inconsistent and confusing.

Again, we are working on integrating the first version of MDS. In fact, the first version's results are pretty close to those in the screenshot. We should be able to provide a more comprehensive list of results in the future.
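
For anyone comparing numbers across papers, a rough sketch of the two scoring variants (illustrative names and shapes, not OpenOOD's exact implementation) looks like this:

import torch

def mds_score_penultimate(feat, class_means, precision):
    # Version 1: confidence from the penultimate-layer features only.
    diffs = feat.unsqueeze(1) - class_means.unsqueeze(0)              # (N, C, D)
    dists = torch.einsum('ncd,de,nce->nc', diffs, precision, diffs)
    return -dists.min(dim=1).values           # higher score = more ID-like

def mds_score_multilayer(feats, means, precisions, alphas):
    # Version 2: per-layer confidences from intermediate and penultimate layers,
    # combined with layer weights alpha_l (fit with a small logistic regression
    # in the original Lee et al., NeurIPS'18 paper).
    score = 0.0
    for f, m, p, a in zip(feats, means, precisions, alphas):
        diffs = f.unsqueeze(1) - m.unsqueeze(0)
        dists = torch.einsum('ncd,de,nce->nc', diffs, p, diffs)
        score = score + a * (-dists.min(dim=1).values)
    return score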

2454511550Lin commented 1 year ago

Thank you for pointing this out. I will check their implementation to see if there is an inconsistency.

2454511550Lin commented 1 year ago

Hi there! I have a quick follow-up on hyperparameter tuning. For the MDS method, I simply set the noise to 0.3, and it gives very good OOD detection results. The config I am using in configs/postprocessors/mds.yml is as follows:

postprocessor:
  name: mds
  APS_mode: True
  postprocessor_args:
    noise: 0.0014
    feature_type_list: [mean]     # flat/mean/stat
    alpha_list: [1]
    reduce_dim_list: [none]   # none/capca/pca_50/lda
  postprocessor_sweep:
    noise_list: [0.3]
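
(For context, my understanding is that noise here is the magnitude of the gradient-based input perturbation from the original MDS paper, and the values in noise_list are presumably what gets searched when APS_mode is on. A rough sketch of that perturbation step, with made-up helper names rather than OpenOOD's exact code:)

import torch

def input_preprocessing(x, mahalanobis_confidence, noise):
    # Lee et al., NeurIPS'18: take a small L-infinity step on the input in the
    # direction that increases the Mahalanobis confidence of the closest class,
    # then score the perturbed input instead of the original one.
    x = x.clone().requires_grad_(True)
    conf = mahalanobis_confidence(x).sum()
    conf.backward()
    return (x + noise * x.grad.sign()).detach()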

For CIFAR-10, the result is as below: [screenshot]

For CIFAR-100, the result is as below: [screenshot]

This seems too good to be true, and on some benchmarks it even beats the reported SOTA results. I think this also calls for standardizing what the SOTA performance of different OOD detection methods on these benchmarks actually is, for when someone claims to have come up with a better method.

Thank you again for organizing the great repository for the community!

zjysteven commented 1 year ago

Interesting. To me, yes, they are too good to be true, especially in the case of CIFAR-10 vs. CIFAR-100 or the other way around. Also, a noise magnitude of 0.3 (in L-infinity space) should be large enough to completely distort the image semantics. Maybe you can look more closely into this. If it's not a bug, then this is something that people definitely don't know yet :)
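
As a rough sanity check (assuming the perturbation is applied to images scaled to [0, 1]; the exact preprocessing may differ):

default_noise = 0.0014   # value in the mds.yml shown above
swept_noise = 0.3        # value that produced the surprisingly good results

# Relative to a [0, 1] pixel range, the swept step is roughly 200x larger than
# the default and moves each pixel by up to 30% of the full dynamic range.
print(f"default: {default_noise:.2%} of the range, swept: {swept_noise:.2%}")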

2454511550Lin commented 1 year ago

Thank you for the reply. I will look into it and make sure it is not a bug. I also want to point out that in the official OpenOOD paper, on page 8, Table 1, you provide "an Excel table to show the full experiment results" (link) on CIFAR-100. If you look at row 5, where the result of MDS is reported, the AUROCs on the different benchmarks are even higher than what I obtained: [screenshot] (the three metrics are FPR@95 / AUROC / AUPR). I am not sure if I am missing something here.

zjysteven commented 1 year ago

I hadn't joined the team yet when the paper was released, so I have no idea. I do think something is wrong with the results there: at least I am pretty sure that MSP cannot achieve 98% AUROC on CIFAR-100 vs. CIFAR-10.

2454511550Lin commented 1 year ago

I see. It would be greatly appreciated if you could let me know when there is any update on the results. Thank you for the work!

zjysteven commented 1 year ago

No problem. Will keep you updated.

Jingkang50 commented 1 year ago

> [...] If you look at row 5, where the result of MDS is reported, the AUROCs on the different benchmarks are even higher than what I obtained: [screenshot] [...]

I guess the screenshot is for the benchmark when MNIST is the ID?

Jingkang50 commented 1 year ago

> Hi there! I have a quick follow-up on hyperparameter tuning. For the MDS method, I simply set the noise to 0.3, and it gives very good OOD detection results. [...] This seems too good to be true, and on some benchmarks it even beats the reported SOTA results. [...]

That sounds pretty interesting! Thank you for your exploration, and looking forward to the update!

2454511550Lin commented 1 year ago

> I guess the screenshot is for the benchmark when MNIST is the ID?

Thank you for pointing this out. I did not notice that there are different sheets at the bottom left of the spreadsheet. Yes, that result is for when MNIST is the ID. I looked at the results for CIFAR-10/100, and they are "reasonable".

I will dig into the MDS method and see what was happening there. Thanks!

matchyc commented 9 months ago

> Hi there! I have a quick follow-up on hyperparameter tuning. For the MDS method, I simply set the noise to 0.3, and it gives very good OOD detection results. [...] This seems too good to be true, and on some benchmarks it even beats the reported SOTA results. [...]

@2454511550Lin hi, did you make any new progress on this phenomenon?