Jingkang50 / OpenOOD

Benchmarking Generalized Out-of-Distribution Detection

Fixing OE training #138

Closed zjysteven closed 1 year ago

zjysteven commented 1 year ago

Hi,

In the paper the reported results of OE (both OOD performance and ID accuracy) are significantly WORSE than those of other methods, which goes against both the observations in the OE paper and my own experience. After looking at the code, I identified two issues in OpenOOD's current implementation.

First, the outlier data and ID data are passed through the network in separate forward runs, whereas in OE's official implementation they are concatenated into a single batch and passed through the network together. The current implementation is likely to cause unstable estimation of BN statistics due to the distribution difference between ID and outlier data. https://github.com/Jingkang50/OpenOOD/blob/539cf436757b1778ae7baf285b859ba6ca771ed2/openood/trainers/oe_trainer.py#L57
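For illustration, here is a minimal sketch of what the concatenated forward pass could look like (the function and variable names are my own, not OpenOOD's):

```python
import torch

def forward_joint(net, id_images, oe_images):
    # One forward pass over the concatenated batch, so that BatchNorm statistics
    # are estimated over ID and outlier samples jointly (as in the official OE code),
    # rather than in two separate passes with different batch statistics.
    logits = net(torch.cat([id_images, oe_images], dim=0))
    id_logits = logits[: id_images.size(0)]
    oe_logits = logits[id_images.size(0):]
    return id_logits, oe_logits
```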

Second, there seems to be something wrong with the SoftCrossEntropy loss defined here. With this loss the model accuracy ends up rather low, while simply replacing it with the loss from the official implementation fixes the issue.
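For reference, the outlier term in the official OE implementation is the cross-entropy between the outlier softmax outputs and the uniform distribution, which simplifies to a logsumexp expression. A minimal sketch (function and argument names are mine; the 0.5 weight follows the OE paper's default for classification):

```python
import torch
import torch.nn.functional as F

def oe_loss(id_logits, id_labels, oe_logits, lambda_oe=0.5):
    # Standard cross-entropy on the ID portion of the batch.
    loss = F.cross_entropy(id_logits, id_labels)
    # Outlier term: cross-entropy between softmax(oe_logits) and the uniform
    # distribution, i.e. -(1/K) * sum_k log_softmax(oe_logits)_k, which
    # simplifies to -(mean(logits) - logsumexp(logits)).
    loss = loss + lambda_oe * -(oe_logits.mean(dim=1) - torch.logsumexp(oe_logits, dim=1)).mean()
    return loss
```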

Overall, after fixing the above two issues by following OE's official implementation, I get these numbers on CIFAR-10:

|       | acc          | near-OOD | far-OOD |
|-------|--------------|----------|---------|
| paper | not reported | 76.4     | 75.2    |
| fixed | 94.89        | 93.51    | 95.40   |

The results after fixing the bugs make much more sense to me.

Jingkang50 commented 1 year ago

Dear Jingyang, Thank you for all your comments, including your previous issues! I am looking into your feedback this week.

zjysteven commented 1 year ago

@Jingkang50 No worries. Hope that my comments can be helpful. And just to be clear, the "fixed" numbers in the table above were obtained with a 200-epoch budget and a batch size of 256 for the unlabeled outlier data, while the default configuration uses 100 epochs and a batch size of 200.

zjysteven commented 1 year ago

Another issue I identified is that while certain Tiny ImageNet test images are included in the near-OOD split for CIFAR-10/100, the entire Tiny ImageNet training set is used as the outlier data for OE. This makes the training OOD distribution completely overlap with the test OOD distribution. In fact, in the table in my first comment, the 93.51% near-OOD AUROC for CIFAR-10 is the average of 87.02% (against CIFAR-100) and 99.99% (against Tiny ImageNet).
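To make the arithmetic explicit (a small illustration, not code from the repo):

```python
# Near-OOD AUROC for CIFAR-10 is averaged over the two near-OOD datasets.
auroc_cifar100 = 87.02
auroc_tin = 99.99  # inflated: TIN training images were seen as outliers during OE training
near_ood_auroc = (auroc_cifar100 + auroc_tin) / 2  # 93.505 -> reported as 93.51
```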

To me the simplest fix is to remove Tiny ImageNet from the near-OOD split, which has the following advantages compared with changing the source of outlier data:

zjysteven commented 1 year ago

Fixed in #150