bethgelab / foolbox

A Python toolbox to create adversarial examples that fool neural networks in PyTorch, TensorFlow, and JAX
https://foolbox.jonasrauber.de
MIT License
2.75k stars 426 forks

How to evaluate for the Robust Vision Benchmark? #263

Closed Spenhouet closed 5 years ago

Spenhouet commented 5 years ago

It is not clear to me how to run the same evaluation as is done for the Robust Vision Benchmark.

In this benchmark you show scores for every model and multiple attacks. Is there any usable implementation to run the exact same evaluation?

What I would like is to just run something similar to the following:

import foolbox

model = foolbox.models.TensorFlowModel(features, logits, labels) 
evaluation_results = foolbox.benchmark.evaluate(model)

and then get the same evaluation results as on the benchmark page, in a Python object:

(screenshot of the benchmark results)

jonasrauber commented 5 years ago

Unfortunately, this does not yet exist. For the Robust Vision Benchmark, we had a special setup customized to our GPU cluster to run the evaluations on many GPUs in parallel. In the end, what you need to do is loop over the attacks and the images.
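For illustration, a minimal sketch of such a loop with the foolbox 1.x API (the tensor names, the attack list, and the numpy arrays images/labels are placeholders, not code from the benchmark):

import foolbox
import numpy as np

# images_ph, logits: TensorFlow tensors of your model; images, labels: numpy test data (all assumed given)
model = foolbox.models.TensorFlowModel(images_ph, logits, bounds=(0, 1))
criterion = foolbox.criteria.Misclassification()
attacks = [foolbox.attacks.GradientAttack, foolbox.attacks.DeepFoolAttack]

distances = np.zeros((len(images), len(attacks)))
for i, (image, label) in enumerate(zip(images, labels)):
    for j, Attack in enumerate(attacks):
        adversarial = Attack(model, criterion)(image, label, unpack=False)
        distances[i, j] = adversarial.distance.value  # MSE distance, np.inf if no adversarial was found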

Spenhouet commented 5 years ago

I have a first implementation but I'm not sure if I'm doing everything right. The documentation is somewhat confusing.

If I run an attack with unpack=False, I get the adversarial object.

  1. Do I understand correctly that the distance in the adversarial object is the MSE between the input and the smallest-perturbation adversarial image?

  2. I don't see where the L2 norm fits in. Where do I have to apply it?

  3. The total model score is the average of all attack scores? (If yes, then the Q&A section is very misleading.)

  4. If the model's classification is wrong (without any perturbation), then the distance is 0? (I could find this in the FAQ, but it is badly worded.)

  5. The FAQ also mentions that if no adversarial was found, then "the attack will return None". This looks incorrect to me. From my testing the attack still returns an adversarial object with a distance of np.inf. What object should be None?

  6. How should misclassifications (wrong classification without any perturbation) be handled for the benchmark? I opted to ignore all of them for the median calculation. Is this correct?

  7. The attack success rate is np.mean(1 - np.isinf(distances)) for a single attack on all test images?

  8. The benchmark is performed on all test images of the respective dataset (for example on MNIST the 10k test images)?

Sorry for the overlap of these questions with the benchmark project.

Once the above questions are answered, I will create an issue on the benchmark repo for improvements to the Q&A section of the benchmark website and, if needed, one on this repo for the FAQ section of the documentation.

I will also share my local evaluation implementation. It might also be a possible addition to the examples section of the documentation.

jonasrauber commented 5 years ago
  1. yes
  2. L2 and MSE can be converted into each other: l2 = sqrt(N * mse), where N = h * w * c, i.e. the size of the image (since MSE = ||x - x'||^2 / N)
  3. No! The Q&A is correct, it's not the average.
  4. That's how we always treat this case in Foolbox
  5. None is returned with unpack=True (the default); with unpack=False you always get the Adversarial object, with distance np.inf if no adversarial was found (see the snippet at the end of this comment)
  6. see 4., they are 0 and of course must be taken into account (otherwise a bad model could just misclassify all the difficult images)
  7. sounds correct
  8. no, we use a subset for computational reasons (some attacks are too slow)

I agree that some formulations, e.g. in the FAQ, could be improved. If you want to improve them, feel free to open a PR ;-)
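To make point 5 concrete, a small sketch of the unpack behaviour as I understand the foolbox 1.x API (model, image, and label as in the snippets above):

attack = foolbox.attacks.GradientAttack(model, foolbox.criteria.Misclassification())

perturbed = attack(image, label)                  # unpack=True (default): perturbed image as numpy array, or None if no adversarial was found
adversarial = attack(image, label, unpack=False)  # always an Adversarial object
print(adversarial.distance.value)                 # np.inf if the attack did not succeed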

Spenhouet commented 5 years ago

Thank you for your answers. I have some follow-up questions.

  1. Ah okay, now I get that part.

  3. Just to confirm: the total model score is calculated like this (see the sketch at the end of this comment):

    1. collect the distances of all attacks on all images
    2. for every image, take the smallest distance (i.e. reduce to the strongest attack per image)
    3. convert these MSE distances to L2 norms
    4. take the median of these L2 norms

6. I'm not sure how that makes sense for your benchmark. It means that, in addition to robustness, the benchmark also measures general performance. A model that sets a new state of the art in robustness but doesn't compete with the state of the art in accuracy / error would score badly in your benchmark. In my opinion, the general model performance needs to be factored out, otherwise robustness is not comparable across models.

8. By which rules is this subset created? The first x images of the test set, or something else? How can I reproduce / recreate it?

9. If you are not ignoring 0 distances for the model score, you are also not ignoring np.inf, correct?
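The sketch referenced in point 3, in plain numpy (the array name, its shape, and the image size N are assumptions; as the next comment notes, the MSE-to-L2 conversion is optional when comparing against the plotted values):

import numpy as np

# mse_distances: shape (n_images, n_attacks), np.inf where an attack found no adversarial
N = 28 * 28 * 1                        # h * w * c, e.g. for MNIST
l2_norms = np.sqrt(N * mse_distances)  # convert MSE distances to L2 norms
per_image = l2_norms.min(axis=1)       # smallest distance over attacks for each image
total_score = np.median(per_image)     # median over images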

jonasrauber commented 5 years ago
  3. basically yes, except that you don't really need the conversion from MSE to L2 if you want to compare with the values in the plot
  6. no, I disagree… otherwise it's easy to make a model look more robust by making it worse (by changing it from barely recognizing something to not recognizing it at all)
  8. it's basically a random subset, but it is not public at the moment
  9. correct, but ideally strong attacks should always find an adversarial (of some possibly large size)

Not sure what you are actually trying to achieve. Maybe it helps if I point you to one of our more recent papers [1] that is more representative of the way we think model robustness should be evaluated nowadays. The Robust Vision Benchmark is not exactly the most up-to-date resource.

[1] https://arxiv.org/abs/1805.09190

jonasrauber commented 5 years ago

Sorry, didn't mean to close, though I guess it's probably all answered?!

Spenhouet commented 5 years ago

3. The Q&A states: "How is the score for a model-attack-pair calculated? [...] The score is given by the median of the L2-norms across images." So I expected the numbers shown in the plot to be the L2-norms. Therefore to compare against the plot I would also need the L2-norms, correct?

8. Oh, that is unfortunate. But for my use case it's no problem.

Thank you for the link to your paper. I will probably come back to it later.

I created a model that could hypothetically have nice properties with regard to adversarial attacks. Since this property is just a side effect and not the main goal of the model, I wanted an easy and time-saving way to evaluate its robustness against a baseline. The time saving starts with not having to do a deep dive into the current state of adversarial attacks and robustness measurement. As long as this property is only hypothetical, it could be a waste of time if the robustness against adversarial attacks doesn't turn out to be real. Therefore foolbox and the RVB looked like a good starting point (and foolbox really is!).

Thank you also for pointing me to the more up to date Adversarial Vision Challenge in this issue: https://github.com/bethgelab/robust-vision-benchmark/issues/8

Yes, the issue can be closed. Thank you for your help.

Spenhouet commented 5 years ago

Just to share my current model evaluation implementation (and maybe to get additional verification that what I'm doing is correct):

import foolbox
import numpy as np
import tensorflow as tf
from tqdm import tqdm


def evaluate(dataset, logits, res_dir):
    sess = tf.get_default_session()
    model = foolbox.models.TensorFlowModel(dataset.features, logits, bounds=(0, 1))
    criterion = foolbox.criteria.Misclassification()

    attacks = [
        ('IterativeGradientAttack', foolbox.attacks.IterativeGradientAttack),
        ('GradientAttack', foolbox.attacks.GradientAttack),
        ('DeepFoolAttack', foolbox.attacks.DeepFoolAttack),
        ('LBFGSAttack', foolbox.attacks.LBFGSAttack),
        ('IterativeGradientSignAttack', foolbox.attacks.IterativeGradientSignAttack),
        ('GradientSignAttack', foolbox.attacks.GradientSignAttack),
        ('GaussianBlurAttack', foolbox.attacks.GaussianBlurAttack),
        ('AdditiveGaussianNoiseAttack', foolbox.attacks.AdditiveGaussianNoiseAttack),
        ('AdditiveUniformNoiseAttack', foolbox.attacks.AdditiveUniformNoiseAttack),
        ('SaltAndPepperNoiseAttack', foolbox.attacks.SaltAndPepperNoiseAttack),
        ('ContrastReductionAttack', foolbox.attacks.ContrastReductionAttack),
        ('SinglePixelAttack', foolbox.attacks.SinglePixelAttack)
    ]

    distances = []
    model_performance = []
    for _ in tqdm(range(dataset.test_iterations_per_epoch)):
        images, labels = sess.run([dataset.features, tf.argmax(dataset.labels, axis=1)])

        # clean accuracy on this batch (no perturbation)
        model_performance.extend(np.equal(np.argmax(model.batch_predictions(images), axis=1), labels))

        # run every attack on every image; distance is MSE, np.inf if no adversarial was found
        for image, label in tqdm(zip(images, labels), total=len(labels)):
            adversarials = [attack(model, criterion)(image, label, unpack=False) for _, attack in tqdm(attacks, total=len(attacks))]
            distances.append([adversarial.distance.value for adversarial in adversarials])

    distances = np.asarray(distances)  # shape (n_images, n_attacks)

    # convert MSE distances to L2 norms: l2 = sqrt(N * mse) with N = h * w * c
    N = np.prod(dataset.features.get_shape().as_list()[1:])
    l2_norms = np.sqrt(N * distances)

    attack_scores = np.median(l2_norms, axis=0)
    attack_success_rates = np.mean(1 - np.isinf(distances), axis=0)

    total_score = np.median(np.min(l2_norms, axis=1))
    total_success_rate = np.mean(attack_success_rates)

    accuracy = np.mean(model_performance)

    return attack_scores, attack_success_rates, total_score, total_success_rate, accuracy

As mentioned by @jonasrauber, the number of test samples could be reduced for performance reasons. The list of attacks is of course variable. I wonder what the current top-5 most important attacks are.

jonasrauber commented 5 years ago
  3. I think the RVB website shows MSE, as the legend says; we sometimes use L2 and MSE almost synonymously, because one can easily be transformed into the other, and at least when using the median it doesn't matter whether one takes the median first or converts first (the conversion is monotone, so it commutes with the median).
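A quick check of that claim with hypothetical numbers (the conversion x -> sqrt(N * x) is monotone, so it commutes with the median):

import numpy as np

mse = np.array([0.001, 0.004, np.inf])  # hypothetical per-image MSE distances
N = 784                                 # h * w * c for 28x28x1 images
print(np.median(np.sqrt(N * mse)))      # convert first, then take the median
print(np.sqrt(N * np.median(mse)))      # take the median first, then convert -> same value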

Regarding your code: I'd recommend saving the results of all attacks and calculating the aggregated results in a separate loop, in case something crashes, so you don't need to rerun everything.
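A minimal sketch of that two-pass structure (the file name and the image_shape variable are arbitrary assumptions; the distances layout follows the evaluate() code above):

import numpy as np

# Pass 1: after running the attacks, persist the raw distances so a crash
# does not force a full rerun.
np.save('distances.npy', np.asarray(distances))  # distances filled as in evaluate() above

# Pass 2: load and aggregate in a separate step / script.
distances = np.load('distances.npy')             # shape (n_images, n_attacks)
N = np.prod(image_shape)                         # h * w * c of the dataset (assumed known)
l2_norms = np.sqrt(N * distances)
attack_scores = np.median(l2_norms, axis=0)
total_score = np.median(np.min(l2_norms, axis=1))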