Trusted-AI / adversarial-robustness-toolbox

Adversarial Robustness Toolbox (ART) - Python Library for Machine Learning Security - Evasion, Poisoning, Extraction, Inference - Red and Blue Teams
https://adversarial-robustness-toolbox.readthedocs.io/en/latest/
MIT License

High adversarial accuracy of TensorFlowV2Classifier under the Fast Gradient Sign Method (compared to KerasClassifier) #367

Closed JoshuaCN closed 4 years ago

JoshuaCN commented 4 years ago

Describe the bug
Hi, I recently attacked an MNIST model with FGSM. When KerasClassifier is used, the attack works properly (acc = 16.91%); with TensorFlowV2Classifier, however, I see a high adversarial accuracy (77.6%) and many unchanged images. I have read the related issue #279, but it didn't help.

To Reproduce
Here is the Colab notebook to reproduce the results: https://colab.research.google.com/drive/1ZHEzLy3SRdaZflYOImVFOlWnH8GbWF_v

The Keras classifier is created by the following line:

```python
classifier = KerasClassifier(model, clip_values=(0, 1))
```

The TensorFlow classifier is created by the following line:

```python
classifier = TensorFlowV2Classifier(
    model, 10, input_shape, clip_values=(0, 1),
    loss_object=tf.losses.SparseCategoricalCrossentropy(),
)
```

The attack is initialised by the following lines:

```python
FGM_params = {'eps': 0.3, 'norm': np.inf, 'batch_size': 200, 'num_random_init': 0}
adversary = FastGradientMethod(classifier, **FGM_params)
x_adv = adversary.generate(x_test)
```
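For completeness, the pieces above fit together roughly as follows (a self-contained sketch; the model architecture here is illustrative rather than the notebook's exact one, and import paths follow recent ART releases):

```python
import numpy as np
import tensorflow as tf
from art.attacks.evasion import FastGradientMethod
from art.estimators.classification import TensorFlowV2Classifier

# Load and normalise MNIST to [0, 1].
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train = x_train[..., None].astype("float32") / 255.0
x_test = x_test[..., None].astype("float32") / 255.0

# Illustrative model with a softmax output, as in the original report.
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(28, 28, 1)),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=3, batch_size=128)

classifier = TensorFlowV2Classifier(
    model, nb_classes=10, input_shape=(28, 28, 1), clip_values=(0, 1),
    loss_object=tf.losses.SparseCategoricalCrossentropy(),
)

adversary = FastGradientMethod(classifier, eps=0.3, norm=np.inf,
                               batch_size=200, num_random_init=0)
x_adv = adversary.generate(x_test)
adv_acc = np.mean(np.argmax(classifier.predict(x_adv), axis=1) == y_test)
print(f"adversarial accuracy: {adv_acc:.2%}")
```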

Expected behavior
I expected the attack to perform similarly across the two frameworks.

Screenshots
[attached: attack results for the TF2Classifier and the KerasClassifier]

System information:

mathsinn commented 4 years ago

Hi @JoshuaCN, thank you for using ART and raising this issue! It seems there is a problem with vanishing gradients: when I compute, with the TensorFlowV2Classifier,

```python
clg = classifier.loss_gradient(x_test[k][None, ...], np.eye(1, 10, y_test[k]))
```

for a test input `k` on which the attack fails, the entries of `clg` turn out to be all 0.0. FGSM, which uses the loss gradient to compute the adversarial example, therefore leaves the input unchanged and fails. In fact, the test inputs for which this occurs are all classified with class probabilities very close to 1.0, which leads to the (almost) vanishing gradients.
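A small sketch of this diagnosis, repeating the call above over a slice of the test set (`classifier`, `x_test`, and `y_test` as in the report):

```python
import numpy as np

# Flag inputs whose loss gradient is exactly zero everywhere, then look at
# how confidently the model classifies them.
vanished = []
for k in range(1000):
    clg = classifier.loss_gradient(x_test[k][None, ...], np.eye(1, 10, y_test[k]))
    if not np.any(clg):  # every entry of the gradient is 0.0
        vanished.append(k)

print(f"{len(vanished)} of 1000 inputs have an all-zero loss gradient")
if vanished:
    probs = classifier.predict(x_test[np.array(vanished)])
    print("max class probability on these inputs:", probs.max(axis=1)[:10])
```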

On the other hand, the loss gradients computed via the KerasClassifier are non-zero (albeit very small, < 1e-10).

We will check whether those numerical differences are due to implementation differences on the ART side, or to differences in numerical precision provided by the two frameworks.

beat-buesser commented 4 years ago

Hi @JoshuaCN, thank you very much for using ART and raising this issue! I would be very interested in reproducing it and learning more about it.

Could you please let us know which Keras you used (keras or tensorflow.keras, and which version) and which release candidate of TensorFlow 2.2.0 you used?

I was not able to open the Colab link above; could you check whether it is working?

JoshuaCN commented 4 years ago

Hi @beat-buesser, thank you for your reply and sorry for the delay. I used tensorflow.keras, and the release candidate is rc2.

Directly clicking the link leads to a blank page; copying the Colab link and opening it in a new tab works for me.

Anyway, I have attached the file here in case the link is invalid for others.

Untitled1.zip

beat-buesser commented 4 years ago

Thank you very much! This is very helpful.

beat-buesser commented 4 years ago

I have run experiments with your notebook, and I think the difference in behaviour is caused by TensorFlow rather than by ART. KerasClassifier uses tensorflow.keras.backend.gradients to calculate loss gradients, whereas TensorFlowV2Classifier uses tf.GradientTape. It looks like the softmax activation in the last layer of the model can be treated numerically differently by these two methods of calculating gradients. This agrees with ongoing discussions on GitHub (https://github.com/tensorflow/tensorflow/issues/32895#issuecomment-614813600 and https://github.com/tensorflow/tensorflow/issues/35585#issuecomment-615413203).

To test this, I changed the Colab notebook to run with a model predicting logits instead of probabilities, and the vanishing gradients, which appeared much earlier with TensorFlowV2Classifier, have disappeared. Both classifiers now produce similarly strong adversarial examples and similar adversarial accuracy.
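Concretely, the change amounts to removing the softmax from the last layer and telling the loss (and, on the Keras side, the classifier) that the model outputs logits. A sketch of the modified setup (the architecture is illustrative, matching the sketch earlier in this thread rather than the notebook's exact model):

```python
import tensorflow as tf
from art.estimators.classification import TensorFlowV2Classifier

# The model now outputs logits: no softmax activation on the final layer.
logits_model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(28, 28, 1)),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10),  # logits, no activation
])
logits_model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)
logits_model.fit(x_train, y_train, epochs=3, batch_size=128)

classifier = TensorFlowV2Classifier(
    logits_model, nb_classes=10, input_shape=(28, 28, 1), clip_values=(0, 1),
    loss_object=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)
# For the Keras path, KerasClassifier(logits_model, use_logits=True, ...)
# plays the same role.
```

With `from_logits=True`, the softmax and the logarithm of the cross-entropy are fused into a single numerically stable operation, so the loss gradient no longer underflows to exactly zero when the predicted probability saturates at 1.0.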

JoshuaCN commented 4 years ago

Thank you so much! Now I'm getting the expected results.

ed1d1a8d commented 3 years ago

> To test this, I changed the Colab notebook to run with a model predicting logits instead of probabilities, and the vanishing gradients, which appeared much earlier with TensorFlowV2Classifier, have disappeared. Both classifiers now produce similarly strong adversarial examples and similar adversarial accuracy.

@beat-buesser Do you have an example of the fixed Colab notebook?

thungp commented 2 years ago

@ed1d1a8d, would you have a link to the Colab notebook referenced above?