csdongxian / AWP

Codes for NeurIPS 2020 paper "Adversarial Weight Perturbation Helps Robust Generalization"

checkpoint is the best or the last epoch's weights? #2

Closed. CNOCycle closed this issue 3 years ago.

CNOCycle commented 3 years ago

Hi authors,

I'm trying to reproduce the experimental results.

I followed the instructions in README.md and trained the model with the command python trades_AWP/train_trades_cifar.py

Then I evaluated the last epoch's weights with AutoAttack, and the robust accuracy is about 55.50~55.80%, which is worse than the 56.17% on the leaderboard.
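(For reference, a minimal sketch of such an AutoAttack evaluation; the model class, import path, and checkpoint filename below are assumptions, not the repository's exact names:)

```python
# Rough sketch of evaluating a saved checkpoint with AutoAttack on CIFAR-10.
# Model class / import path / checkpoint name are placeholders.
import torch
from torchvision import datasets, transforms
from autoattack import AutoAttack               # AutoAttack library (github.com/fra31/auto-attack)

from models.wideresnet import WideResNet        # assumed WRN-34-10 definition, as in trades_AWP

device = 'cuda' if torch.cuda.is_available() else 'cpu'

model = WideResNet(depth=34, widen_factor=10).to(device)
model.load_state_dict(torch.load('model-epoch-last.pt', map_location=device))
model.eval()

# CIFAR-10 test set as plain tensors in [0, 1]
test_set = datasets.CIFAR10('./data', train=False, download=True,
                            transform=transforms.ToTensor())
x_test = torch.stack([x for x, _ in test_set])
y_test = torch.tensor([y for _, y in test_set])

# Standard AutoAttack suite (APGD-CE, APGD-T, FAB-T, Square) at eps = 8/255
adversary = AutoAttack(model, norm='Linf', eps=8 / 255, version='standard')
adversary.run_standard_evaluation(x_test, y_test, bs=128)
```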

Could you explain in detail which checkpoint was selected?

The following are the packages used, from conda:

cudatoolkit               10.1.243             h6bb024c_0    nvidia
cudnn                     7.6.5                cuda10.1_0
pytorch                   1.7.0           py3.6_cuda10.1.243_cudnn7.6.3_0    pytorch
torchvision               0.8.1                py36_cu101    pytorch
csdongxian commented 3 years ago

Hello,

We report the results on the checkpoint with the best PGD-20 robustness. Specifically, we save all checkpoints during training, test their robustness under PGD-20 on the test set, and select the best one. This setting originally comes from TRADES [1] and is inherited by subsequent papers [2, 3, 4]. We apply this setting to all defenses in Table 2 for a fair comparison.
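A minimal sketch of this selection loop (the PGD-20 helper, checkpoint naming pattern, and hyperparameters below are illustrative assumptions, not the repository's exact code):

```python
# Sketch: evaluate every saved checkpoint under PGD-20 and keep the most robust one.
import glob
import torch
import torch.nn.functional as F

def pgd_20_accuracy(model, loader, device, eps=8 / 255, alpha=2 / 255, steps=20):
    """Robust accuracy under a 20-step L-inf PGD attack of radius eps."""
    model.eval()
    correct, total = 0, 0
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        # random start inside the eps-ball
        x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0.0, 1.0)
        for _ in range(steps):
            x_adv.requires_grad_(True)
            loss = F.cross_entropy(model(x_adv), y)
            grad = torch.autograd.grad(loss, x_adv)[0]
            x_adv = x_adv.detach() + alpha * grad.sign()
            x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0.0, 1.0)
        with torch.no_grad():
            correct += (model(x_adv).argmax(dim=1) == y).sum().item()
        total += y.size(0)
    return correct / total

def select_best_checkpoint(model, test_loader, device, pattern='checkpoints/epoch-*.pt'):
    """Load each saved checkpoint, measure PGD-20 robustness, and return the best one."""
    best_path, best_acc = None, 0.0
    for path in sorted(glob.glob(pattern)):
        model.load_state_dict(torch.load(path, map_location=device))
        acc = pgd_20_accuracy(model, test_loader, device)
        if acc > best_acc:
            best_path, best_acc = path, acc
    return best_path, best_acc
```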

Hope this addresses your question :)

[1] Theoretically Principled Trade-off between Robustness and Accuracy. In ICML, 2019.
[2] Improving Adversarial Robustness Requires Revisiting Misclassified Examples. In ICLR, 2020.
[3] Boosting Adversarial Training with Hypersphere Embedding. In NeurIPS, 2020.
[4] Bag of Tricks for Adversarial Training. In ICLR, 2021.

CNOCycle commented 3 years ago

Thank you for sharing the details of the selection strategy, but I still have concerns about it. I understand that previous papers selected the best checkpoint based on the test set, but such a checkpoint may potentially overfit to the test set. As far as I know, a more proper strategy is to split a validation set off from the training set and select the best checkpoint on that validation set; the test set should stay hidden until the final evaluation. Am I making the problem too complicated?
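For illustration, a minimal sketch of that alternative protocol (the 45,000/5,000 split size and the seed are assumptions, not values from the paper or repository):

```python
# Sketch: hold out a validation split from the CIFAR-10 training data for
# checkpoint selection, keeping the test set hidden until the final evaluation.
import torch
from torch.utils.data import random_split, DataLoader
from torchvision import datasets, transforms

full_train = datasets.CIFAR10('./data', train=True, download=True,
                              transform=transforms.ToTensor())

# e.g. 45,000 images for training, 5,000 held out for checkpoint selection
train_set, val_set = random_split(
    full_train, [45000, 5000],
    generator=torch.Generator().manual_seed(0))

train_loader = DataLoader(train_set, batch_size=128, shuffle=True, num_workers=2)
val_loader = DataLoader(val_set, batch_size=128, shuffle=False, num_workers=2)

# Training proceeds on train_loader; after each epoch, PGD-20 robustness is
# measured on val_loader and the best checkpoint is kept. The test set is
# evaluated only once, on that selected checkpoint.
```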