evaluating-adversarial-robustness / adv-eval-paper

LaTeX source for the paper "On Evaluating Adversarial Robustness"
https://arxiv.org/abs/1902.06705

Amending papers after attack #26

Open · ftramer opened this issue 3 years ago

ftramer commented 3 years ago

Aleksander, Wieland, Nicholas and I have had some discussions about the lack of "self-correction" among (broken) defense papers, and about how this can make it hard for newcomers to navigate the field (i.e., after reading about a defense, you have to sift through the literature to find out whether or not an attack on it exists). I think a discussion of the "aftermath" of a defense evaluation would fit nicely in this report.

If this would make sense, some points worth discussing include:

Essentially every robustness claim in the literature is "false", since you can always reduce accuracy by another 0.1%-1% by fiddling with attack hyper-parameters. If a paper claims 60% robust accuracy and a later attack reduces this to 59%, I doubt it's worth amending the paper. But what about 55%, 50%, or 10%? Maybe the only good solution here is to set up a public leaderboard, but I doubt that many authors would maintain one.

I know of very few examples (each of which involves one or more authors of this report):

max-andr commented 3 years ago

Since you mentioned public leaderboards, I feel like a pointer to our project is relevant to this discussion: https://robustbench.github.io/. It is a standardized leaderboard that uses AutoAttack and only accepts models satisfying certain restrictions (no randomness, no non-differentiable components, and no optimization at inference time), since such properties mostly tend to make gradient-based attacks ineffective without substantially improving robustness.
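
For concreteness, a leaderboard-style evaluation with the public `robustbench` and `autoattack` packages boils down to roughly the sketch below (the model name, the number of test examples, and ε = 8/255 are only illustrative, and the exact `robustbench` loading API may differ slightly across versions):

```python
import torch

from autoattack import AutoAttack
from robustbench.data import load_cifar10
from robustbench.utils import load_model

# Load a small CIFAR-10 test subset and one leaderboard entry
# ('Carmon2019Unlabeled' is just an illustrative model name).
x_test, y_test = load_cifar10(n_examples=1000)
model = load_model(model_name='Carmon2019Unlabeled',
                   dataset='cifar10', threat_model='Linf').eval()

# Standard AutoAttack evaluation under the Linf, eps = 8/255 threat model.
adversary = AutoAttack(model, norm='Linf', eps=8 / 255, version='standard')
x_adv = adversary.run_standard_evaluation(x_test, y_test, bs=250)

# Robust accuracy = fraction of test points still classified correctly
# on the adversarial examples found by AutoAttack.
with torch.no_grad():
    robust_acc = (model(x_adv).argmax(1) == y_test).float().mean().item()
print(f'robust accuracy: {robust_acc:.2%}')
```

An adaptive evaluation would then replace or complement the AutoAttack step with attacks tailored to the specific defense.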

One of the main ideas behind our project is that we do need to be able to systematically distinguish fine-grained differences between models. This seems very related to what @ftramer mentions: perhaps a 1% reduction in adversarial accuracy from using a different attack is not very interesting, but 5% may be quite important, since that is roughly the improvement one gets from using extra unlabeled data (e.g., as in Carmon et al., 2019). And when stacking several such 5% improvements together, one can get quite substantial overall gains. E.g., the top entry on the Linf CIFAR-10 leaderboard, from DeepMind, has 65.87% adversarial accuracy (evaluated with AutoAttack), compared to 44.04% for standard adversarial training from Madry et al., 2018.

If you find it useful, we would be happy to include adaptive evaluations in our leaderboards (we also mention this point in our whitepaper) for models that satisfy the restrictions mentioned above. So far I'm aware of only one case, among the datasets / threat models in our leaderboards, where adaptive attacks noticeably reduce the adversarial accuracy reported by AutoAttack: from 18.50% to 0.16% for the model from "Enhancing Adversarial Defense by k-Winners-Take-All", which you report in "On Adaptive Attacks to Adversarial Example Defenses". If there are more cases like this, it would be great to know so that we can provide a more complete picture with our leaderboards.

ftramer commented 3 years ago

Yes, a discussion of common attack benchmarks is definitely warranted here. The issue remains, though, that if the evaluation is performed by a third party, the original defense paper rarely (if ever) acknowledges the re-evaluation.

carlini commented 3 years ago

For what it's worth, "Rethinking Softmax Cross-Entropy Loss for Adversarial Robustness" is also fully broken by our adaptive attacks paper and is not robust, yet AutoAttack completely fails to find any good attack here.

ftramer commented 3 years ago

You might be thinking of a different paper? From what I see in the AutoAttack paper, it goes down to 0% accuracy (row 25 in Table 2, Pang et al. 2020).

carlini commented 3 years ago

Hm, you're right. I was looking at the leaderboard page:

"30 | Rethinking Softmax Cross-Entropy Loss for Adversarial Robustness | 80.89% | 43.48% | × | ResNet-32 | ICLR 2020"

davidwagner commented 3 years ago

One possible point of comparison might be the cryptography literature, where similar challenges arise. My sense is that the situation in crypto is similar: when a scheme is broken, the break usually appears publicly only as a follow-up paper (and in no other way), so to tell whether a scheme has resisted scrutiny, one has to check the more recent papers that cite it and see whether any of them reports a stronger attack. I wonder if this is harder in adversarial ML simply because ML publishes far more defense papers than crypto does?

ftramer commented 3 years ago

My understanding was that in crypto it is considered good practice to amend papers (on eprint, not in proceedings) after a break. I've definitely seen this a few times in the past but I don't know if the process is widespread.

I've also heard many stories in TCS or crypto where there is "folklore" knowledge that some scheme is broken or some Lemma/Theorem is incorrect, without this even being written down anywhere. So maybe a smaller community with fewer papers does help in propagating this knowledge.

This kind of informal propagation becomes problematic in a much bigger field such as ML, but it may be hard to set incentives differently to avoid this.