Closed · 2454511550Lin closed this issue 1 year ago
The reason for this is that the ID accuracy is currently computed from the predictions produced by the corresponding post-processor, which may differ from the predictions of the underlying classifier. For the Mahalanobis detector, you can confirm this by looking here. We are working on an updated version in which the ID accuracy is decoupled from the post-processor. Stay tuned!
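To make the distinction concrete, here is a minimal, hypothetical sketch (not OpenOOD's actual code) of why the two accuracies can differ: the Mahalanobis post-processor's "prediction" is the class whose Mahalanobis distance is smallest, which need not agree with the classifier's softmax argmax on the same sample.

```python
import numpy as np

rng = np.random.default_rng(0)
num_classes, feat_dim = 3, 4

class_means = rng.normal(size=(num_classes, feat_dim))  # per-class feature means
precision = np.eye(feat_dim)                            # shared (tied) precision matrix

feature = rng.normal(size=feat_dim)                     # penultimate feature of one sample
logits = rng.normal(size=num_classes)                   # classifier head output for it

# Classifier prediction: plain argmax over logits.
clf_pred = int(np.argmax(logits))

# Post-processor "prediction": closest class mean under the precision metric.
diffs = feature - class_means                           # (C, D)
maha = np.einsum("cd,de,ce->c", diffs, precision, diffs)  # squared Mahalanobis distances
mds_pred = int(np.argmin(maha))

print(clf_pred, mds_pred)  # these two need not agree
```

If ID accuracy is computed from `mds_pred` rather than `clf_pred`, it can be far below the checkpoint's true test accuracy even though the classifier itself is fine.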
That said, I don't think the poor performance of MDS has anything to do with the ID accuracy. The detector itself should be functioning correctly; for example, the AUROC of CIFAR-10 vs. MNIST is quite high (>98%). In fact, MDS (the variant which uses intermediate layers' features) is known to excel at "easy" or far-OOD detection (e.g., CIFAR-10 vs. MNIST), but may perform poorly on harder, near-OOD tasks.
One subtle detail about MDS is whether the score is computed using 1) only the features from the penultimate layer, or 2) the features from both the intermediate and penultimate layers. These two choices can lead to very different results. OpenOOD currently implements the second version, and papers often do not state which version they used when reporting results, which can make the reported numbers inconsistent and confusing.
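A small sketch of the two scoring variants described above (hypothetical shapes and an identity precision for simplicity, not OpenOOD's exact API): each layer contributes the negative squared Mahalanobis distance to the closest class mean, and variant 2 sums the layer scores with weights `alpha`.

```python
import numpy as np

rng = np.random.default_rng(1)
num_classes, num_layers, feat_dim = 5, 4, 8

means = rng.normal(size=(num_layers, num_classes, feat_dim))  # class means per layer
feats = rng.normal(size=(num_layers, feat_dim))               # one sample's features per layer
alpha = np.ones(num_layers)                                   # layer weights (often fit by logistic regression)

def layer_score(l):
    # identity precision here; real MDS uses a tied covariance estimate per layer
    d = feats[l] - means[l]               # (C, D)
    return -np.min(np.sum(d * d, axis=1))  # score = -min squared distance over classes

# Variant 1: penultimate layer only.
score_penultimate = layer_score(num_layers - 1)

# Variant 2: weighted sum over all intermediate + penultimate layers.
score_all_layers = float(np.dot(alpha, [layer_score(l) for l in range(num_layers)]))

print(score_penultimate, score_all_layers)
```

Because the intermediate-layer terms dominate the sum, the two variants can rank samples quite differently, which is one plausible source of the inconsistencies across papers.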
Again, we are working on integrating the first version of MDS. In fact the first version's results are pretty close to those in the screenshot. We should be able to provide a more comprehensive list of results in the future.
Thank you for pointing this out. I will check their implementation to see if there is an inconsistency.
Hi there! I have a quick follow-up on hyperparameter tuning. For the MDS method, I simply set the noise to 0.3 and it gives very good OOD detection results. The config I am using is below, in `configs/postprocessors/mds.yml`:

```yaml
postprocessor:
  name: mds
  APS_mode: True
  postprocessor_args:
    noise: 0.0014
    feature_type_list: [mean] # flat/mean/stat
    alpha_list: [1]
    reduce_dim_list: [none] # none/capca/pca_50/lda
  postprocessor_sweep:
    noise_list: [0.3]
```
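For readers unfamiliar with the config, a minimal sketch of the assumed semantics of `APS_mode` with `postprocessor_sweep` (not OpenOOD's exact code): each candidate noise magnitude is evaluated on a validation split and the best-scoring one is kept. With `noise_list: [0.3]` the sweep is trivial, so 0.3 is always selected regardless of the `noise: 0.0014` default.

```python
# Hypothetical sweep: pick the noise value with the highest validation score.
def sweep(noise_list, evaluate):
    results = {eps: evaluate(eps) for eps in noise_list}
    best = max(results, key=results.get)  # noise with the highest val metric (e.g. AUROC)
    return best, results

# toy evaluation function: pretend the metric peaks near a small noise value
best, results = sweep([0.0014, 0.01, 0.3], evaluate=lambda e: 1.0 - abs(e - 0.01))
print(best, results)
```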
For CIFAR-10, the result is as below:
For CIFAR-100, the result is as below:
This seems too good to be true, and on some benchmarks it even beats the results reported as SOTA. I think this also calls for standardizing what the SOTA performance is for different OOD detection methods on these benchmarks, so that claims of a better method can be evaluated against a common reference.
Thank you again for organizing the great repository for the community!
Interesting. To me, yes, they are too good to be true, especially in the case of CIFAR-10 vs. CIFAR-100 (or the other way around). Also, a noise magnitude of 0.3 (in the L-infinity sense) should be large enough to completely distort the image semantics. Maybe you can look more closely into this. If it is not a bug, then this is something people definitely don't know yet :)
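A back-of-the-envelope check of the point above: an L-infinity perturbation of magnitude 0.3 on images normalized to [0, 1] moves every pixel by up to 30% of the full dynamic range, versus about 0.14% for the default noise of 0.0014 (shapes and values below are illustrative, not taken from the repo).

```python
import numpy as np

image = np.full((32, 32, 3), 0.5)  # a mid-gray "CIFAR-sized" image in [0, 1]
grad_sign = np.sign(np.random.default_rng(2).normal(size=image.shape))

for eps in (0.0014, 0.3):
    # MDS-style input pre-processing: shift the input along the sign of the gradient
    perturbed = np.clip(image - eps * grad_sign, 0.0, 1.0)
    max_shift = np.abs(perturbed - image).max()
    print(f"noise={eps}: max pixel shift = {max_shift:.4f}")
```

A shift of 0.3 per pixel is on the order of a strong adversarial attack, so it is plausible that the detector is separating "heavily perturbed" from "clean" statistics rather than ID from OOD semantics.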
Thank you for the reply. I will look into it and make sure it is not a bug. I also want to point out that in the official OpenOOD paper, on page 8, Table 1, you provide "an Excel table to show the full experiment results" (link) on CIFAR-100. If you look at line 5, where the result of MDS is reported, the AUROCs on different benchmarks are even higher than what I obtained. The three metrics are "FPR@95 / AUROC / AUPR". I am not sure if I am missing something here.
I hadn't joined the team when the paper was released, so I have no idea. I do think something is wrong with the results there: at the very least, I am pretty sure that MSP cannot achieve 98% AUROC on CIFAR-100 vs. CIFAR-10.
I see. It would be greatly appreciated if you could let me know when there is any update on the results. Thank you for the work!
No problem. Will keep you updated.
I guess the screenshot is for the benchmark when MNIST is the ID?
That sounds pretty interesting! Thank you for your exploration, and looking forward to the update!
Thank you for pointing this out. I did not notice that there are different tabs at the bottom left. Yes, the result is for when MNIST is the ID. I looked at the results for CIFAR-10/100, and they are "reasonable".
I will dig into the MDS method and see what was happening there. Thanks!
@2454511550Lin hi, did you make any new progress on this phenomenon?
Dear authors, I was running the script to perform OOD detection with the NeurIPS'18 MDS method on CIFAR-10 (ID) and its corresponding OOD datasets. I found that, at the first step when loading the pre-trained model on CIFAR-10, the test accuracy on CIFAR-10 is not correct: I was using a ResNet-18 with 94.3% test accuracy, but the log shows an accuracy of only 40.77%. The result is bad accordingly:
I tested other OOD methods, such as ICML'22 KNN, ICLR'18 ODIN, and NeurIPS'20 EBO, which all show the correct test accuracy (94.3%). Any suggestions and help would be greatly appreciated!