evaluating-adversarial-robustness / adv-eval-paper

LaTeX source for the paper "On Evaluating Adversarial Robustness"
https://arxiv.org/abs/1902.06705

Impact and remaining misconceptions #22

ftramer opened this issue 4 years ago

ftramer commented 4 years ago

It's been roughly 6 months since these guidelines were published, and many people have probably read or reviewed their share of the hundreds of adversarial-example papers published since.

What's your sense of how robustness evaluations have been improving overall?

I optimistically think that authors are increasingly trying to do more thorough evaluations. Nevertheless, there are a few misconceptions (I think) that I've seen around and that could be worth discussing / clarifying:

1) Misconception: Evaluating against prior adaptive attacks is a reasonable adaptive analysis.

It goes without saying that this is flawed, but it's quite easy to fall into this trap by focusing on well-known attacks from prior work. For instance, I've seen a few papers that apply BPDA as the sole adaptive attack, which seems insufficient to me. This connects to https://github.com/evaluating-adversarial-robustness/adv-eval-paper/issues/11, in that there aren't really any principled guidelines for how to perform a good adaptive evaluation.
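To make concrete what BPDA even does (and why reusing it blindly is not adaptive), here's a minimal sketch, assuming (hypothetically) that the defense's non-differentiable component is 8-bit quantization:

```python
import torch

class QuantizeBPDA(torch.autograd.Function):
    """Hypothetical defense component: non-differentiable 8-bit quantization.
    BPDA keeps the true forward pass but backpropagates as if the component
    were the identity (a straight-through estimator)."""

    @staticmethod
    def forward(ctx, x):
        return (x * 255.0).round() / 255.0  # round() shatters gradients

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output  # approximate dg(x)/dx by the identity

# A gradient attack would call QuantizeBPDA.apply(x) in place of the
# defense's preprocessing, so that gradients flow back to the input x.
```

The substitution above is tailored to one specific defense; applying it verbatim to a different defense says nothing about whether that defense was evaluated adaptively.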

2) Misconception: Gradient-free attacks (e.g., SPSA, ZOO, NES) are a good way to detect gradient masking.

While this may be true for some forms of gradient masking, I wonder what people's experience is with this? I often see SPSA mentioned as an attack people should try, and this report says that "[SPSA] has broken many [defenses]". In the original SPSA paper, the authors break PixelDefend using this attack but use other strategies for other defenses. What are other examples of defenses that people have successfully attacked using gradient-free attacks?

Intuitively, I would think that approximating gradients via finite differences won't help much against many forms of gradient masking (e.g., distillation, randomization, or discretization); a sketch of the estimator these attacks build on follows below.
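For concreteness, here's roughly the estimator that SPSA/NES-style attacks rely on (a sketch; `loss` and the hyperparameters are placeholders):

```python
import numpy as np

def fd_gradient(loss, x, sigma=1e-3, n=50):
    """Antithetic finite-difference (NES/SPSA-style) estimate of the gradient
    of loss at x, using 2*n function queries and no gradient access."""
    g = np.zeros_like(x)
    for _ in range(n):
        u = np.random.randn(*x.shape)
        g += (loss(x + sigma * u) - loss(x - sigma * u)) * u
    return g / (2.0 * sigma * n)
```

If the defense makes the loss surface flat, discrete, or noisy at scale `sigma`, this estimate is roughly as uninformative as the true gradient.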

Instead, I think that hard-label (decision-based) attacks should really be applied more often for this purpose, but still relatively few papers use them. It might be worth emphasizing these attacks further, especially since there's been some recent work on making these attacks more effective (e.g., https://arxiv.org/abs/1904.02144).
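The appeal is that these attacks consume only hard labels, so masked or shattered gradients are irrelevant to them. Here's a minimal sketch of the boundary-search step most of them share, with a hypothetical `is_adversarial` oracle:

```python
def boundary_bisect(is_adversarial, x_clean, x_adv, tol=1e-3):
    """Binary-search the segment between a clean input and a known adversarial
    one for the decision boundary, querying only hard labels."""
    lo, hi = 0.0, 1.0  # mixing weights: lo end is clean, hi end is adversarial
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if is_adversarial((1.0 - mid) * x_clean + mid * x_adv):
            hi = mid  # move the adversarial endpoint closer to x_clean
        else:
            lo = mid
    return (1.0 - hi) * x_clean + hi * x_adv
```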

Any thoughts on these two points, or on other things you've noticed in recent papers?

anishathalye commented 4 years ago

I second your point (1); I've seen this in many papers over the last 6 months. I thought this manuscript did a reasonably good job of explaining what an adaptive attack is (Section 2.5), but perhaps we need to improve the explanation? I don't know what needs to be improved, though. Perhaps it is worth spelling out that being adaptive is a property of an attack and a defense together; it doesn't make sense to say, e.g., "BPDA is an adaptive attack" independently of a defense. I've seen several papers do this.

carlini commented 4 years ago

Regarding how evaluations are going: it's hard, I think, to judge on the whole. Clearly some evaluations are done better, but there are still plenty that are not.

I would also agree with your sentiment; adding a longer discussion of hard-label attacks would be good. There are quite a few out now; one thing I feel uneasy about is that I don't know which to recommend. However, it would definitely be worth pointing to the class of attacks as a whole.

For "how to do adaptive attacks" I agree there's not much in the way of how to do it. Mostly because I don't think I know a way to write down steps for what it actually means, other than "try hard to break your thing". Do you have anything that you think could be actionable? If so that would be great.

ftramer commented 4 years ago

I agree that recommending a specific hard-label attack seems tricky right now. A more in-depth discussion of hard-label attacks would also be useful because they currently remain quite brittle (at least in my experience). Sometimes they work extremely well (especially on MNIST), but for larger datasets or adversarially trained models they can require a lot of parameter tuning and many random restarts to get good results. One worry (and I've already seen this happen) is that authors use the fact that hard-label attacks fail as an indication that their defense is robust. As you note in https://github.com/evaluating-adversarial-robustness/adv-eval-paper/issues/23, this can be achieved by a trivial randomized defense.
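To be explicit about how cheap this failure mode is, here's a sketch of the kind of trivial randomized "defense" meant above (names and numbers are hypothetical):

```python
import numpy as np

def noisy_predict(model, x, p_flip=0.05, num_classes=10):
    """With small probability, answer with a uniformly random label. Clean
    accuracy drops by at most ~p_flip, but the inconsistent answers derail
    hard-label attacks whose boundary search assumes a deterministic
    decision oracle, while gaining no real robustness."""
    if np.random.rand() < p_flip:
        return np.random.randint(num_classes)
    return int(model(x))
```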

For adaptive attacks I unfortunately don't have anything particularly concrete at the moment either.

The following idea is unrelated to these guidelines, but reproducing published defenses and exploring adaptive attacks might make for a good hackathon-style challenge. The ICLR 2019 reproducibility challenge had a few entries that reproduced defenses but, as far as I know, did not explicitly try to break them.

wielandbrendel commented 4 years ago

I agree that hard-label (or decision-based) attacks can be brittle in some scenarios. In my personal experience with the boundary attack, I found that it typically works quite well in untargeted scenarios, even on large datasets or adversarially trained models, but it tends to fail in targeted scenarios (unless you tune the parameters really well), and it definitely fails on stochastic models. For this reason, I agree that authors should definitely not take a failure of hard-label attacks as an indication that their model is robust. We are currently comparing different hard-label attacks, so I hope to have an answer soon on which one to recommend.

@ftramer I pitched a similar competition idea to @carlini and Aleksander Madry a few months ago, but the discussion somehow died over the summer. I am happy to revive it and include you in the discussion.

ftramer commented 4 years ago

@wielandbrendel sounds good! ICLR might well be a good playground for this again, given that all submissions are public. It also shouldn't be too hard to come up with a list of 10-20 papers from the past few years that proposed interesting defense ideas that no one has ever looked at too carefully.