Closed: Spenhouet closed this issue 5 years ago.
Unfortunately, this does not yet exist. For the robust vision benchmark, we had a special setup customized to our GPU cluster to run the evaluations on many GPUs in parallel. In the end, what you need to do is to loop over the attacks and the images.
I have a first implementation, but I'm not sure I'm doing everything right. The documentation is somewhat confusing.

- If I run an attack with `unpack=False`, then I get the `Adversarial` object. Do I understand correctly that the `distance` in the `Adversarial` object is the MSE between the input and the output of the smallest-perturbation image?
- I don't see where the L2 norm fits in. Where do I have to apply it?
- Is the total model score the average of all attack scores? (If yes, then the Q&A section is very misleading.)
- If the model's classification is wrong (without any perturbation), is the `distance` 0? (I could find this in the FAQ, but it is badly worded.)
- The FAQ also mentions that if no adversarial was found, "the attack will return None". This looks incorrect to me. In my testing, the attack still returns an `Adversarial` object with a `distance` of `np.inf`. Which object should be `None`?
- How should misclassifications (wrong classification without any perturbation) be handled for the benchmark? I opted to ignore all of these in the median calculation. Is this correct?
- Is the attack success rate `np.mean(1 - np.isinf(distances))` for a single attack on all test images?
- Is the benchmark performed on all test images of the respective dataset (for example, on MNIST, the 10k test images)?
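For what it's worth, the success-rate formula above can be sanity-checked with a toy example (the distance values here are made up, not real results):

```python
import numpy as np

# np.inf marks an image on which the attack found no adversarial
distances = np.array([0.4, 1.3, np.inf, 0.7])

success_rate = np.mean(1 - np.isinf(distances))
# 3 of the 4 images were successfully attacked, so success_rate is 0.75
```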
Sorry for the overlap between these questions and the benchmark project.
Once the above questions are answered, I will create an issue on the benchmark repo for improvements to the Q&A section of the benchmark website and, if needed, on this repo for the FAQ section of the documentation.
I will also share my local evaluation implementation; it might be a useful addition to the example section of the documentation.
`l2 = sqrt(N * mse)`, where `N = h * w * c`, i.e. the size of the image.

`None` is returned if `unpack=True` (the default) and no adversarial was found.

I agree that some formulations, e.g. in the FAQ, could be improved. If you want to improve it, feel free to open a PR ;-)
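The MSE-to-L2 conversion above can be verified numerically (a quick sketch with a toy MNIST-sized image pair, not data from the thread):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.random((28, 28, 1))  # hypothetical 28x28x1 image
b = rng.random((28, 28, 1))

mse = np.mean((a - b) ** 2)
l2 = np.linalg.norm((a - b).ravel())

N = a.size  # h * w * c = 784
assert np.isclose(l2, np.sqrt(N * mse))
```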
Thank you for your answers. I have some follow up questions.
Ah okay, now I get that part.
Just to confirm: So the total model score is calculated like this:
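The snippet that followed here was lost in formatting; my reading of it (an assumption, based on the evaluation code further down in this thread) is the median over images of the minimum L2 distance across attacks:

```python
import numpy as np

# l2_norms: one row per image, one column per attack (toy values)
l2_norms = np.array([[0.5, 0.8],
                     [1.2, 0.9],
                     [np.inf, 2.0]])  # inf = that attack failed on that image

per_image_best = np.min(l2_norms, axis=1)  # strongest attack per image
total_score = np.median(per_image_best)    # median over images
# per_image_best is [0.5, 0.9, 2.0], so total_score is 0.9
```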
6. I'm not sure how that makes sense with regard to your benchmark. It means that, in addition to robustness, your benchmark also measures general performance. A model that sets a new state of the art in robustness but doesn't compete with the state of the art in accuracy would score badly in your benchmark. In my opinion, the general model performance needs to be factored out, or else robustness scores are not comparable.
8. By which rules is this subset created? The first x images of the test set, or something else? How can I reproduce / recreate it?
9. If you are not ignoring 0 distances for the model score, then you are also not ignoring `np.inf`, correct?
I'm not sure what you are actually trying to achieve. Maybe it helps if I point you to one of our more recent papers [1] that is more representative of the way we think model robustness should be evaluated nowadays. The Robust Vision Benchmark is not exactly the most up-to-date resource.
Sorry, didn't mean to close, though I guess it's probably all answered?!
3. The Q&A states: "How is the score for a model-attack-pair calculated? [...] The score is given by the median of the L2-norms across images." So I expected the numbers shown in the plot to be the L2-norms. Therefore to compare against the plot I would also need the L2-norms, correct?
5. Oh, that is unfortunate. But for my use case no problem.
Thank you for the link to your paper. I will probably come back to it later.
I created a model that hypothetically could have nice properties with regard to adversarial attacks. Since this property is just a side effect and not the main goal of the model, I wanted an easy and time-saving way to evaluate its robustness against a baseline. The time saving starts with not having to do a deep dive into the current state of adversarial attacks and robustness measurement. As long as this property is just hypothetical, it could be a waste of time if the robustness against adversarial attacks doesn't prove to be real. Therefore, foolbox and the RVB looked like a good starting point (and foolbox really is!).
Thank you also for pointing me to the more up to date Adversarial Vision Challenge in this issue: https://github.com/bethgelab/robust-vision-benchmark/issues/8
Yes, the issue can be closed. Thank you for your help.
Just to share my current model evaluation implementation (and maybe to get additional confirmation that what I'm doing is correct):
```python
import foolbox
import numpy as np
import tensorflow as tf
from tqdm import tqdm


def evaluate(dataset, logits, res_dir):
    sess = tf.get_default_session()
    model = foolbox.models.TensorFlowModel(dataset.features, logits, bounds=(0, 1))
    criterion = foolbox.criteria.Misclassification()
    attacks = [
        ('IterativeGradientAttack', foolbox.attacks.IterativeGradientAttack),
        ('GradientAttack', foolbox.attacks.GradientAttack),
        ('DeepFoolAttack', foolbox.attacks.DeepFoolAttack),
        ('LBFGSAttack', foolbox.attacks.LBFGSAttack),
        ('IterativeGradientSignAttack', foolbox.attacks.IterativeGradientSignAttack),
        ('GradientSignAttack', foolbox.attacks.GradientSignAttack),
        ('GaussianBlurAttack', foolbox.attacks.GaussianBlurAttack),
        ('AdditiveGaussianNoiseAttack', foolbox.attacks.AdditiveGaussianNoiseAttack),
        ('AdditiveUniformNoiseAttack', foolbox.attacks.AdditiveUniformNoiseAttack),
        ('SaltAndPepperNoiseAttack', foolbox.attacks.SaltAndPepperNoiseAttack),
        ('ContrastReductionAttack', foolbox.attacks.ContrastReductionAttack),  # comma was missing here
        ('SinglePixelAttack', foolbox.attacks.SinglePixelAttack)
    ]

    distances = []
    model_performance = []
    for _ in tqdm(range(dataset.test_iterations_per_epoch)):
        images, labels = sess.run([dataset.features, tf.argmax(dataset.labels, axis=1)])
        # clean accuracy on this batch
        model_performance.extend(np.equal(np.argmax(model.batch_predictions(images), axis=1), labels))
        for image, label in tqdm(zip(images, labels), total=len(labels)):
            adversarials = [attack(model, criterion)(image, label, unpack=False)
                            for _, attack in tqdm(attacks, total=len(attacks))]
            # MSE distance; np.inf if the attack failed
            distances.append([adversarial.distance.value for adversarial in adversarials])

    distances = np.asarray(distances)  # images x attacks; N * <list> would repeat a plain list N times
    N = np.prod(dataset.features.get_shape().as_list()[1:])  # h * w * c
    l2_norms = np.sqrt(N * distances)  # convert MSE to L2 norm
    attack_scores = np.median(l2_norms, axis=0)
    attack_success_rates = np.mean(1 - np.isinf(distances), axis=0)
    total_score = np.median(np.min(l2_norms, axis=1))  # best attack per image, median over images
    total_success_rate = np.mean(attack_success_rates)
    accuracy = np.mean(model_performance)
```
As mentioned by @jonasrauber, the number of test samples could be reduced for performance reasons. The list of attacks is of course variable; I wonder what the current top-5 most important attacks are.
Regarding your code: I'd recommend you save the results of all attacks and calculate aggregated results in a different loop in case something crashes, so you don't need to rerun everything.
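The suggestion to persist raw results before aggregating could look like the following sketch (the file name and layout are my own choice, not from the thread; the distance values are toy stand-ins):

```python
import os
import tempfile

import numpy as np

# toy stand-in for the raw images-x-attacks distance matrix
distances = np.array([[0.5, np.inf],
                      [1.2, 0.9]])
N = 28 * 28 * 1  # image size, h * w * c

# after the attack loop: checkpoint the raw distances to disk
path = os.path.join(tempfile.mkdtemp(), 'distances.npy')
np.save(path, distances)

# separate aggregation step: cheap to rerun if something crashes later
loaded = np.load(path)
l2_norms = np.sqrt(N * loaded)
total_score = np.median(np.min(l2_norms, axis=1))
```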
It is not clear to me how to run the same evaluation as it is done in the Robust Vision Benchmark.
In this benchmark you show scores for every model and multiple attacks. Is there any usable implementation to run the exact same evaluation?
What I would like is to just run something similar to the following:
To then get the same evaluation results as on the benchmark website, in a Python object:
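For illustration only, a mock of the kind of interface meant here; `run_benchmark` and `FakeAttack` are invented names and do not exist in foolbox:

```python
import numpy as np


class FakeAttack:
    """Stand-in attack that returns a fixed L2 distance for every image."""
    def __init__(self, dist):
        self.dist = dist

    def __call__(self, image, label):
        return self.dist


def run_benchmark(attacks, images, labels):
    """Hypothetical one-call evaluation: per-attack median L2, RVB-style."""
    return {name: float(np.median([attack(im, lb) for im, lb in zip(images, labels)]))
            for name, attack in attacks.items()}


attacks = {'GradientAttack': FakeAttack(0.5),
           'DeepFoolAttack': FakeAttack(0.2)}
scores = run_benchmark(attacks, images=[None, None], labels=[0, 1])
# scores == {'GradientAttack': 0.5, 'DeepFoolAttack': 0.2}
```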