evaluating-adversarial-robustness / adv-eval-paper

LaTeX source for the paper "On Evaluating Adversarial Robustness"
https://arxiv.org/abs/1902.06705

Examples of or recommendations for good adaptive evaluations #11

Open ftramer opened 5 years ago

ftramer commented 5 years ago

First, thanks for the great work in setting up this document!

The checklist and detailed explanations in Sections 3-5 seem to mostly cover recommendations for how to evaluate defenses using currently known (presumably non-adaptive) attacks. These are of course extremely valuable, as they are rarely followed rigorously in papers today.

Yet even if they were followed, I think many defenses (especially ones that merely detect adversarial examples) could pass such a stringent evaluation if the attack is not properly adapted to the defense. The current paper does touch on adaptive attacks, but doesn't give much more advice beyond "use adaptive attacks".

I wonder whether we could start a discussion on some general principles that make up a good adaptive attack. In my personal experience, creating adaptive attacks has often been quite an ad-hoc process, and I've encountered a few papers that claim an adaptive evaluation that is nonetheless unconvincing. So if anyone has principles or guidelines they've found useful in the past for creating good adaptive attacks, it would be great to hear about them.

At the very least, I think it would be worthwhile for the paper to explicitly point to and discuss works that are believed to have performed a good adaptive analysis (there's of course a bunch of attack papers in this list, but identifying some recent defense papers that seem to do the right thing would be very useful for readers).

earlenceferns commented 5 years ago

Great comment. I agree that trying to see what might be common to adaptive attacks could be useful. One thing the paper already mentions is to use approximations for non-differentiable operations (i.e., the attacker adapts to the non-differentiable part of the net). Maybe it makes sense to pull this and other ideas out into a more well-defined section on common strategies for adaptive attacks.
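As an illustration of that non-differentiable case (just a minimal sketch, assuming PyTorch, with a hypothetical quantization step standing in for whatever non-differentiable operation a defense uses), the attacker can keep the exact operation in the forward pass and approximate it by the identity in the backward pass, so that gradients still reach the input:

```python
import torch

class StraightThroughQuantize(torch.autograd.Function):
    """Hypothetical non-differentiable preprocessing step (8-level quantization).

    Forward: apply the exact operation the defense uses.
    Backward: approximate it with the identity, so that gradient-based
    attacks can still propagate gradients through the defense to the input.
    """

    @staticmethod
    def forward(ctx, x):
        return torch.round(x * 7.0) / 7.0  # non-differentiable on its own

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output  # treat the step as the identity for gradients


def defended_forward(model, x):
    # Attack the defended pipeline end-to-end through the approximation.
    return model(StraightThroughQuantize.apply(x))
```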

carlini commented 5 years ago

Including how to do a good, thorough adaptive analysis would definitely be wonderful. I personally don't know if I have anything concrete to say on the topic, unfortunately. I worry that if we give specific advice, some people will just follow that approach without thinking more carefully. That's why we focus mainly on giving concrete advice about things not to do, and only give rather broad discussions of how to do good adaptive evaluations.

If you can think of anything generally helpful here, that would be great. Including pointers to papers with good evaluations sounds like an excellent idea as case studies for how people have done good evaluations in the past.

ftramer commented 5 years ago

The problem I could see with pointing to "good" papers is a form of gatekeeping (e.g., most of the papers I can think of are co-written by some of the authors of this paper).

carlini commented 5 years ago

Hm. That's true, I don't want it to seem like we are pushing our own work preferentially. It also might set the stage for a future debate where people ask for their papers to be included in the "defense papers with good evaluations" list.

ftramer commented 5 years ago

Maybe the point worth expanding on is the recommendation "The loss function is changed as appropriate to cause misclassification".

I think the specific point to make here is that you should think hard about why you believe your defense works, and then build a new loss function that explicitly attacks this assumption. I think it would be worth giving a generic example for adversarial example detection, where this problem seems most prevalent. E.g., if your defense is based on the assumption that adversarial examples satisfy property P (and regular examples don't), then you have to build a continuous, differentiable loss L such that minimizing L is a proxy for changing P (and prove, or at least argue, that this is the case). Then you should apply existing attacks (with all the other guidelines) using L, even if you never used L to train your network.
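As a rough illustration (just a sketch, not anything from the paper: `model` is any differentiable classifier, and `detector_score` is a hypothetical differentiable proxy for property P, assumed to be large for inputs the defense would flag), the substituted loss L can simply be dropped into a standard PGD loop:

```python
import torch
import torch.nn.functional as F

def adaptive_pgd(model, detector_score, x, y, eps=8/255, alpha=2/255, steps=100):
    """PGD with a substituted loss L (a sketch, not a prescribed recipe).

    The cross-entropy term is the usual misclassification objective; the
    detector_score term is a hypothetical differentiable proxy for property P,
    assumed to be large when the defense would flag the input. Ascending the
    combined loss pushes toward misclassification and toward the "natural"
    regime of P at the same time.
    """
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1).detach()

    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y) - detector_score(x_adv).mean()
        grad = torch.autograd.grad(loss, x_adv)[0]

        # Gradient ascent on L, then project back into the eps-ball around x.
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0, 1)

    return x_adv
```

The point is only that the attack machinery stays the same; what changes is the loss, which now explicitly targets the assumption the defense relies on.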

This suggestion may seem obvious (maybe it is), but at least I think there's relatively little that could go wrong with someone doing this.

earlenceferns commented 5 years ago

Another idea might be to say: if you use another "sentry" ML model as a "detector" or "robustifier", then an attacker might try to adapt and attack that sentry as well, thus simultaneously fooling both the original model and the sentry model.

ftramer commented 5 years ago

Agreed. What's nice about the above generic formulation is that it can cover your example as a special case: the property P that distinguishes adversarial and natural examples is simply the output of your detector or robustifier.
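For instance (again only a sketch, with `detector` a hypothetical model whose output is large when it believes the input is adversarial), the proxy for P is just the detector's output, and the combined objective fools both models at once:

```python
import torch.nn.functional as F

def sentry_loss(model, detector, x, y):
    """Combined objective: maximize this to misclassify on the original
    model while driving the (hypothetical) sentry's adversarial score down."""
    return F.cross_entropy(model(x), y) - detector(x).mean()
```

This is exactly the `detector_score` term in the PGD sketch above, instantiated with the sentry model itself.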