DequanWang / tent

Tent: Fully Test-Time Adaptation by Entropy Minimization (ICLR 2021)
https://arxiv.org/abs/2006.10726
MIT License

Performance on CIFAR-10-C deteriorates with more epochs #7

Closed. 1ho0jin1 closed this issue 3 years ago.

1ho0jin1 commented 3 years ago

Hello! First of all, thank you for sharing your great work :) I've tested your code on CIFAR-10-C (severity 5) using ResNet-18/34/50 models pretrained on CIFAR-10, following the configuration in /master/cfgs/tent.yaml. While Tent does improve classification accuracy after the first epoch, performance deteriorates with more epochs (even dropping below the baseline model). These are my results:

| Model | Baseline (%) | Tent (%) (epoch 1/2/3) |
| --- | --- | --- |
| ResNet-18 | 62.23 | 78.85 / 77.02 / 74.33 |
| ResNet-34 | 62.43 | 74.31 / 64.24 / 53.74 |
| ResNet-50 | 61.84 | 67.58 / 47.56 / 34.82 |

I expected the performance to gradually increase, but it didn't. Is this result reasonable, or did I do something wrong? Thank you :)

DequanWang commented 3 years ago

Thanks for pointing out such an interesting phenomenon. The most sensitive hyperparameters here are the optimization method and its initial learning rate. My suggestion would be to turn down the learning rate and use SGD instead of the default Adam when you want to run adaptation for longer.
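For reference, here is a minimal sketch of that swap, assuming the `tent.configure_model` / `tent.collect_params` / `tent.Tent` API from this repo's README; the SGD learning rate below is just a hypothetical starting point to tune, not a recommended value:

```python
import torch
from torchvision.models import resnet18
import tent  # this repo

model = resnet18(num_classes=10)  # stand-in for your CIFAR-10 pretrained ResNet
model = tent.configure_model(model)     # enable grads only where tent adapts
params, _ = tent.collect_params(model)  # batch norm scale/shift parameters

# SGD with a turned-down learning rate instead of the default Adam
optimizer = torch.optim.SGD(params, lr=1e-4, momentum=0.9)  # lr is a guess to tune
tented_model = tent.Tent(model, optimizer)
```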

shelhamer commented 3 years ago

Hey @1ho0jin1, thanks for giving tent a try and sharing your results!

> While Tent does improve classification accuracy after the first epoch, performance deteriorates with more epochs (even dropping below the baseline model).

This result is reasonable, but not necessarily the best tent can do. In the paper we emphasize making one update per batch (== online optimization) because we found it gives a reliable improvement. It is also the "minimal" amount of updating, which keeps the computational cost low. With more updates, such as offline optimization for >1 epoch, the results may stay the same, improve, or degrade depending on the data and optimization settings. As @DequanWang advised, when optimizing for multiple epochs it can help to lower the learning rate and switch to a non-adaptive optimizer like SGD. That said, there is no guarantee that tent updates will always improve the error, since tent updates by unsupervised optimization.
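To make "one update per batch" concrete, here is a sketch of the two regimes; tent's forward pass both predicts and adapts, so the loop structure is the only difference (`tented_model`, `test_loader`, and `num_epochs` are placeholders):

```python
# online (paper default): a single adaptation step per test batch,
# in one pass over the test stream
for x, _ in test_loader:
    outputs = tented_model(x)  # infers and adapts on this batch

# offline: keep cycling over the test set for multiple epochs; more updates,
# but no guarantee the unsupervised objective keeps reducing the error
for epoch in range(num_epochs):
    for x, _ in test_loader:
        outputs = tented_model(x)
```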

Please consider this discussion about the stability of tent updates during review:

"Does performance degrade after a certain number of epochs, or does it remain stable?"

Tent largely remains stable, but it depends on the task. For domain adaptation on SVHN-to-MNIST, error keeps improving: to 8.2% at 10 epochs (Table 3) and to 6.5% at 100 epochs (evaluated for the rebuttal). For corruption on CIFAR-100, error worsens slightly from 37.3% at 1 epoch to 38.0% at 10 epochs. This stability is due to our parameterization by feature modulation: tent optimizes only the channel-wise scale and shift of the features, so there are few free parameters (sketched below). With more free parameters, for instance optimizing only the last layer of the network, error first improves and then degrades. If all parameters are optimized, then optimization fails, and the error after just 1 epoch is worse than that of the unadapted source model.

In summary, we found that more epochs helps for digits DA but hurts for CIFAR-100-C.
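For context on the parameterization point above, here is a rough sketch of the feature modulation parameters tent collects, approximating what `tent.collect_params` does in this repo for batch norm layers:

```python
import torch.nn as nn

def collect_bn_affine_params(model: nn.Module):
    """Collect only the channel-wise scale (weight) and shift (bias) of
    batch norm layers: the feature modulation parameters tent optimizes."""
    params = []
    for module in model.modules():
        if isinstance(module, nn.BatchNorm2d):
            params += [module.weight, module.bias]
    return params
```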

I should point out that this issue raises a research question: how can test-time adaptation methods decide when to update or not, and in particular when to stop? Tent is simple in this respect, and always updates on every batch, but perhaps a more sophisticated rule could halt optimization before results deteriorate.
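As a purely hypothetical illustration of such a rule (not part of tent), one could monitor the batch entropy during adaptation and halt once it plateaus; `tented_model` and `test_loader` are the same placeholders as above, and the patience and threshold values are arbitrary:

```python
import torch

@torch.no_grad()
def mean_entropy(logits: torch.Tensor) -> torch.Tensor:
    # average softmax entropy over the batch: H(p) = -sum_c p_c log p_c
    probs = logits.softmax(dim=1)
    return -(probs * probs.clamp_min(1e-12).log()).sum(dim=1).mean()

best, patience, bad = float("inf"), 3, 0
for x, _ in test_loader:
    outputs = tented_model(x)            # adapt as usual
    ent = mean_entropy(outputs).item()
    if ent < best - 1e-3:                # meaningful improvement
        best, bad = ent, 0
    else:
        bad += 1
    if bad >= patience:                  # entropy plateaued: stop adapting
        break
```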

wangtingwei1993 commented 2 years ago

@1ho0jin1 Hi, I don't think ResNet-18/34/50 are officially supported in the RobustBench project. Did you train the standard models on the CIFAR-10 dataset yourself? I mean the models behind the baseline numbers. Thanks!

1ho0jin1 commented 2 years ago

@wangtingwei1993 Yes, I trained them myself. They achieved 90%+ accuracy on the CIFAR-10 test set, so I figured they were good enough baselines :)

wangtingwei1993 commented 2 years ago

@1ho0jin1 Thanks for your kind reply!