evaluating-adversarial-robustness / adv-eval-paper

LaTeX source for the paper "On Evaluating Adversarial Robustness"
https://arxiv.org/abs/1902.06705

Studying robustness wrt. other attacks, distal adversarial examples, details of success rate computation, evaluation of detection methods #15

Open davidstutz opened 5 years ago

davidstutz commented 5 years ago

Hi,

First of all, I want to say that I enjoyed reading the paper and I think it's a useful collection of best practices. I also like the "open" character of the paper, so I thought I would leave some comments and thoughts regarding the current arXiv version of the paper.

  1. Section 2.1: The motivation for defending against an adversary is tailored towards adversarial examples, i.e., evasion attacks. I believe that robustness also extends beyond evasion attacks, in the sense that robustness against evasion attacks might also be beneficial against other attacks (model stealing, backdooring, etc.). Although there is, to the best of my knowledge, no work on this relationship yet, I think it might be an additional important motivation for studying robustness (although I noticed that Section 2.2.2 states that such attacks may be out of scope).
  2. Section 2.2.2: I think it is important to stress that the goal of not changing the true label is to remain undetected. While being undetected implies not changing the label of a reasonable image from the test set, the reverse might not hold. Distal adversarial examples, for example, might be valid attacks; however, they are potentially easier to detect, at least for us humans, and it is unclear what their true label should be.
  3. Footnote 4: I find that comment too important to hide in a footnote. In fact, throughout the paper I did not find any clear recommendation on how to compute success rates or robust test errors, although I believe such a recommendation would benefit the community by making reported numbers more comparable. For example, some works count test errors as "trivial" adversarial examples, so the reported numbers mix up generalization and robustness; other works report success rates only on correctly classified test images (see the first sketch after this list). Personally, I prefer to treat test errors and adversarial examples separately, which makes it easier to assess the impact of a defense on generalization and robustness individually. Independent of my opinion, however, I think a discussion would be beneficial; it could, for example, be included in Section 5.6.
  4. Section 2.2.3: It is stated that robustness against white-box attacks implies robustness in the black-box case. In theory, I agree with that statement; in practice, however, this is not the case, as is discussed in several later sections (e.g., 4.5 and 5.7). In that sense, later recommendations contradict this statement, so a more nuanced statement might make things clearer from the beginning.
  5. Section 3.3: I am not sure whether I understand the idea of the per-example success rate correctly. As I understand it, it is about computing the mean over the worst attack per example, right? In that case, a couple more sentences would make things clearer, especially since I find the questions more confusing than helpful: the success rate is never really defined, and f is supposed to be the model's output (as defined in the beginning).
  6. Section 4.1: In the last paragraph, I think it is also important to note that statements might not only be restricted to a set of examples, but are sometimes also probabilistic, in the sense that bounds hold only with some probability. While such statements are common in machine learning, I think they make a difference from a security perspective, as in the worst case such bounds do not hold at all.
  7. Section 5.2 (paragraphs 4/5): It is unclear whether accuracy is considered with respect to random perturbations or with respect to adversarial perturbations. In the former case, the model will perform close to random guessing; in the latter case, the model can perform worse than random guessing (e.g., a model with 100% accuracy on clean images facing an attack with a 100% success rate has 0% adversarial accuracy).
  8. Section 4.2: This is the only section that considers the specific case of rejecting test/adversarial images, which also means that many other sections do not consider defense-by-detection. Although I am aware that many detection schemes have been shown to be ineffective, I think that evaluating detection is slightly different from evaluating other defenses. For example, beyond the ROC curve mentioned in Section 4.2, plain success rates and accuracies are not meaningful anymore; instead, it might be more meaningful to compute success rates and accuracies at a specific threshold that leads to the rejection of only X% of clean test images (see the second sketch after this list). I think a more detailed discussion of these specifics might be beneficial, although I am not sure whether Section 4.2 is the right place for it.
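
To make the ambiguity raised in item 3 concrete, here is a minimal sketch (Python/NumPy, with hypothetical boolean arrays that are not part of the paper) of the two conventions: counting misclassified clean images as "trivial" adversarial examples versus reporting the success rate only over correctly classified test images.

```python
import numpy as np

# Hypothetical boolean arrays over the test set (illustration only):
#   clean_correct[i]  -- the model classifies clean test image i correctly
#   attack_success[i] -- the attack finds an adversarial example for image i
clean_correct = np.array([True, True, False, True, False, True])
attack_success = np.array([True, False, False, True, False, False])

# Convention A: misclassified clean images count as "trivial" adversarial
# examples, so the reported number mixes up generalization and robustness.
success_rate_mixed = np.mean(attack_success | ~clean_correct)

# Convention B: success rate only over correctly classified test images,
# keeping test error and robustness separate.
success_rate_conditional = np.mean(attack_success[clean_correct])

print(f"mixed: {success_rate_mixed:.2f}, conditional: {success_rate_conditional:.2f}")
```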
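
And for item 8, a rough sketch of what evaluating a detection defense at a fixed clean rejection rate could look like; the detector scores and the 5% rejection rate below are made-up placeholders, not anything prescribed by the paper.

```python
import numpy as np

def success_rate_at_clean_rejection(clean_scores, adv_scores,
                                    adv_fools_classifier, clean_rejection=0.05):
    """Attack success rate at the threshold that rejects only a fixed
    fraction of clean test images (higher score = more suspicious)."""
    # Pick the threshold so that `clean_rejection` of clean images are rejected.
    threshold = np.quantile(clean_scores, 1.0 - clean_rejection)
    # An adversarial example only counts as a success if it fools the
    # classifier AND slips past the detector.
    not_detected = adv_scores <= threshold
    return np.mean(adv_fools_classifier & not_detected)

# Made-up detector scores and attack outcomes for illustration:
rng = np.random.default_rng(0)
clean_scores = rng.random(1000)              # detector scores on clean test images
adv_scores = rng.random(1000) * 0.7 + 0.3    # detector scores on adversarial examples
adv_fools_classifier = np.ones(1000, dtype=bool)

print(success_rate_at_clean_rejection(clean_scores, adv_scores, adv_fools_classifier))
```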
mpcasey2718 commented 4 years ago

For item 5, checking Section 5.6, I believe $f$ here is supposed to be the binary success/fail indicator for an attack on an image, while $X$ is the test set and $A$ is the set of attacks. The mean-min formulation then takes into account whether an image was misclassified by any of the attacks.
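
If that reading is right, a minimal sketch of the per-example worst-case aggregation might look as follows (hypothetical outcome matrix, not the paper's notation; the indicator is flipped to "still correctly classified" so that the min over attacks captures the worst case per example, and any single successful attack counts against the example).

```python
import numpy as np

# Hypothetical outcome matrix: still_correct[i, j] is True when the model
# classifies example i correctly under attack j (rows: examples in X,
# columns: attacks in A).
still_correct = np.array([
    [True,  True,  False],   # example 0: attack 2 succeeds
    [True,  True,  True],    # example 1: robust to all attacks
    [False, True,  True],    # example 2: attack 0 succeeds
])

# Per-example worst case: an example is robust only if it survives all attacks.
per_example_robust = still_correct.min(axis=1)   # min over attacks
robust_accuracy = per_example_robust.mean()      # mean over examples

# Equivalent per-example success rate: any attack succeeding counts.
per_example_success_rate = 1.0 - robust_accuracy

print(robust_accuracy, per_example_success_rate)  # 0.333..., 0.666...
```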