Thanks for your interest! We noted in Sec 6.1 that our training system may differ from ALP in many aspects. Details like hyper-parameters, training sample generation, the optimizer, and the distributed SGD strategy may all have a big impact on the results. At this stage we do not have a definite answer as to why our baseline is much better, but we are also interested in investigating this in future research.
Thanks for your answer @ppwwyyxx. Assuming your results are correct, shouldn't this be the main result of your paper (getting adversarial training to work at ImageNet scale)? The improvement of denoising over the baseline is relatively small, and it is currently unclear where the large gain in what you call the baseline actually comes from.
We do agree that getting adversarial training to work on ImageNet is a big success. However, we mainly attribute this success to a faithful implementation of adversarial training.
For the adversarial training part, we exactly follow the training procedure in Madry's paper and the ImageNet in 1 hour paper. In fact, we did not encounter much trouble getting this strong baseline by simply following these two papers. In other words, the baseline "just works", and we have not found anything there that needs to be highlighted in the paper. We guess the heavy computational requirements of adversarial training on ImageNet are the main reason the community has not seen this strong baseline earlier.
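For concreteness, here is a minimal single-GPU PyTorch-style sketch of the kind of PGD adversarial training we mean. This is not our actual distributed implementation; `model`, `loader`, and all hyper-parameters below are illustrative placeholders only.

```python
# Minimal sketch of PGD adversarial training in the style of Madry et al.;
# not the actual distributed implementation.  All names and hyper-parameters
# below are illustrative placeholders.
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=16/255, step=2/255, n_iter=30, targeted=True):
    """PGD inside the L-inf ball of radius eps around x (inputs in [0, 1])."""
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1).detach()
    for _ in range(n_iter):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            # move toward the target class if targeted, away from the label otherwise
            direction = -grad.sign() if targeted else grad.sign()
            x_adv = x + torch.clamp(x_adv + step * direction - x, -eps, eps)
            x_adv = x_adv.clamp(0, 1)
    return x_adv.detach()

def adv_train_one_epoch(model, loader, optimizer, num_classes=1000):
    model.train()
    for x, y in loader:
        # targeted PGD toward a uniformly drawn class (real code would exclude the true label)
        target = torch.randint(0, num_classes, y.shape)
        x_adv = pgd_attack(model, x, target)
        optimizer.zero_grad()
        F.cross_entropy(model(x_adv), y).backward()  # 100% adversarial images, no clean batch
        optimizer.step()
```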
In terms of absolute numbers, adversarial training indeed provides the larger share of the benefit. But we do argue that improving robustness over this strong baseline is a difficult and challenging task, especially on ImageNet (e.g., compared to Madry's baseline, it is non-trivial to further improve robustness on MNIST and CIFAR). Our paper shows that feature denoising can give you such an additional benefit. We believe this is an important research result to share with the community.
We hope the released models can be verified by the community, and can help other researchers to develop stronger models.
First of all, thanks for the interesting paper!
It is indeed very interesting to understand what the main contribution is: proper adversarial training or the proposed feature denoising. We did some independent evaluation of your models and think that it is rather the adversarial training, which is also implied directly by the results shown in your paper (although the text emphasizes the denoising blocks more).
In our recent paper, where we studied the robustness of logit pairing methods (Adversarial Logit Pairing, Clean Logit Pairing, Logit Squeezing), we observed that only increasing the number of PGD iterations may not always be sufficient to break a model. Thus, we decided to evaluate your models with a PGD attack with many (100) random restarts. The settings are eps=16, step_size=2, number_iter=100, evaluated on 4000 random images from the ImageNet validation set. Here are our numbers (thanks to Yue Fan for these experiments):
Model | Clean acc. | Adv. acc. reported | Adv. acc. ours |
---|---|---|---|
ResNet152-baseline, 100% AT RND | 62.32% | 39.20% | 34.38% |
ResNet152-denoise, 100% AT RND + feature denoising | 65.30% | 42.60% | 37.25% |
I.e., running multiple random restarts reduces the adversarial accuracy by ~5%. This suggests that investing computational resources in random restarts rather than in more iterations pays off. Most likely it is possible to reduce it a bit further with a different attack or more random restarts, but note that the drop is not as dramatic as it was for most of the logit pairing methods.
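For clarity, here is a minimal sketch of the evaluation protocol we mean: an image only counts toward adversarial accuracy if the model keeps the correct label under every restart. `attack_fn` stands for any targeted PGD routine (e.g. the illustrative `pgd_attack` sketch earlier in this thread, run with 100 iterations); the names are placeholders, not our actual evaluation code.

```python
# Sketch of robust-accuracy evaluation with random restarts and random targets.
# An image is counted as robust only if the model still predicts the true
# label after *every* restart.
import torch

def robust_accuracy(model, loader, attack_fn, n_restarts=100, num_classes=1000):
    model.eval()
    correct, total = 0, 0
    for x, y in loader:
        survives = torch.ones_like(y, dtype=torch.bool)
        for _ in range(n_restarts):
            target = torch.randint(0, num_classes, y.shape)  # fresh random target per restart
            x_adv = attack_fn(model, x, target)              # fresh random start inside attack_fn
            with torch.no_grad():
                survives &= model(x_adv).argmax(dim=1) == y
        correct += survives.sum().item()
        total += y.numel()
    return correct / total
```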
Obviously, it's hard to make any definite statements unless one also shows strong results on certified robustness, which are hard to get. But the empirical robustness presented in this paper does seem plausible, and proper adversarial training on ImageNet can work quite well under eps=16 and a random-target attack.
We hypothesize that the problem is that the previous literature (to our knowledge, only the ALP paper) simply applied multi-step adversarial training on ImageNet incorrectly (an interesting question: what exactly led to the lack of robustness?). Obviously, it is very challenging to reproduce all these results, since training such models requires hundreds of GPUs (424 GPUs for the ALP paper and 128 GPUs for this paper). The only feasible alternative for most research groups is Tiny ImageNet. Therefore, we trained some Tiny ImageNet models from scratch in our recent paper. Here is one of the models, trained following the adversarial training of Madry et al. with the least-likely target class, while the evaluation was done with a random target class (a sketch of both target-choice schemes is given after the table):
Model | Clean acc. | Adv. acc. |
---|---|---|
ResNet50 100% AT LL (Table 3) | 41.2% | 16.3% |
The main observation is that we also couldn't break this model completely. Note that the clean accuracy is not very high (41.2%), but even in this setting we couldn't reduce the adversarial accuracy below 16.3%. This is in contrast to the Plain / CLP / LSQ models, which have adversarial accuracy close to 0%. So it seems that adversarial training with a targeted attack can indeed work well on datasets larger than CIFAR-10.
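For reference, here is a small sketch of the two target-class choices referred to above ("LL" for training, "RND" for evaluation). The function names are illustrative and not taken from any released code.

```python
# Illustrative target-class choices: "LL" (least-likely, used for training here)
# vs "RND" (uniformly random non-true class, used for evaluation).
import torch

def least_likely_target(model, x):
    """Target = class the model currently considers least likely."""
    with torch.no_grad():
        return model(x).argmin(dim=1)

def random_target(y, num_classes):
    """Target = uniformly random class, shifted if it hits the true label."""
    target = torch.randint(0, num_classes, y.shape)
    clash = target == y
    target[clash] = (target[clash] + 1) % num_classes
    return target
```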
We also note that, according to our Tiny ImageNet results, 50% adversarial + 50% clean training can also lead to robust models (e.g., see Table 4; the most robust model is actually 50% AT + ALP). So I wouldn't be so sure about this statement:
> One simple example is that 50% adversarial + 50% clean will not result a robust model on ImageNet
So probably there was some other problem in the implementation of adv. training in the ALP paper.
Also, we think that ImageNet is a rather special dataset for measuring adversarial robustness. As was pointed out in the Obfuscated Gradients paper, one shouldn't perform an untargeted attack, since there are always classes that are extremely close to each other (e.g. different dog breeds). Thus, one has to use a targeted attack, which is an easier attack to be robust against. Therefore, CIFAR-10 with eps=16 and any target class may be an even more challenging task than ImageNet (implied by the numbers in Table 2 vs Table 3 of our paper). Thus, we think having results only on ImageNet may not give the full picture, and also showing results on CIFAR-10 could shed more light on the importance of adversarial training vs feature denoising.
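To make the distinction concrete, here is a tiny illustrative sketch of the two success criteria (not from any paper's code): an untargeted attack already "succeeds" once the prediction leaves the true class, which on ImageNet can amount to swapping two nearly identical classes, whereas a targeted attack has to reach one specific randomly drawn class.

```python
# Illustrative success criteria for the two attack modes discussed above.
import torch

@torch.no_grad()
def untargeted_success(model, x_adv, y_true):
    # succeeds as soon as the prediction leaves the true class
    # (on ImageNet this can just mean confusing two near-identical classes)
    return model(x_adv).argmax(dim=1) != y_true

@torch.no_grad()
def targeted_success(model, x_adv, y_target):
    # must push the prediction all the way to a specific random class
    return model(x_adv).argmax(dim=1) == y_target
```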
To summarize: adversarial training made right seems to be pretty powerful :-) We hope these thoughts may clarify things a little bit more.
> So probably there was some other problem in the implementation of adv. training in the ALP paper.
There was not necessarily a bug in the ALP paper. This paper uses a larger batch size (4096), and the PGD attacker during training takes 30 steps, not just 10.
> There was not necessarily a bug in the ALP paper.
In my opinion it's unlikely to be a bug; rather, I think some crucial step was missing (or indeed a bad hyperparameter was used) in their implementation of adversarial training, which led to a non-robust model. And it would be interesting to understand which one exactly.
> This paper uses a larger batch size (4096)
But the ALP paper reports using a batch size of 1600: "Large batch training helped to scale up adversarial training as well: each replica had a batch size of 32, for an effective batch size of 1600 images." That is roughly on the same order, so I don't think that 1600 vs 4096 makes a substantial difference.
> and the PGD attacker during training takes 30 steps not just 10.
But Madry et al., 2018 use only 7 steps for their adversarial training on CIFAR-10, and it leads to an empirically robust model. Do you have any experiments confirming that 10 vs 30 iterations of PGD makes a significant difference in robustness for some model/dataset?
> which is roughly on the same order
On a logarithmic scale the difference is somewhat noticeable, especially given that in ALP the adversarial gradient update is mixed with gradients for clean classification. In large-scale out-of-distribution detection, I've found that increasing the batch size from 64 to 256 is critical for techniques such as Outlier Exposure (IIRC), even though that is only a factor of 4. However, I suspect the real cause is the increased number of attacker steps.
> Do you have some experiments that would confirm that 10 vs 30 iterations of PGD make a significant difference in terms of robustness for some model/dataset?
No, but given that ALP worked well against a 10-step adversary yet not against adversaries that take more steps, training against adversaries with more steps sounds like it could lead to improvements. Figure 6 of this paper shows that the attacker experiences diminishing returns in error-rate increases around 30 steps.
I have a quick question: your baseline almost reaches the results of the denoising model. However, as far as I can see, the baseline is a fairly standard adversarial training procedure that was also tested in the ALP paper (M-PGD). There the baseline reached only single-digit accuracy against simple PGD attacks. It would be great if you could clarify what I am missing.