Harry24k / adversarial-attacks-pytorch

PyTorch implementation of adversarial attacks [torchattacks].
https://adversarial-attacks-pytorch.readthedocs.io/en/latest/index.html
MIT License

[BUG] Applying attacks on 2 instances of the same model gives widely different results #178

Open talrub opened 3 months ago

talrub commented 3 months ago

✨ Short description of the bug [tl;dr]

I have conducted experiments with two instances of the same "4-layer-VanillaRNN-none_activation_between_layers" model, which I'll denote as model1 and model2. Both models were trained on the MNIST dataset, where each input image is treated as a sequence of 28 rows of 28 pixels each.

Model1 was trained with lr=0.001 until reaching train_acc=100% (train_loss=0.0003, val_acc=95.38%). Model2 was trained with lr=0.005 until reaching train_acc=100% (train_loss=0.00018, val_acc=95.25%).

Subsequently, I generated 1000 adversarial images for both models using CW and PGD attacks.

For model1 I got: test_acc_on_1000_real_samples=95.2%, test_acc_on_1000_adversarial_samples_CW=30.9%, test_acc_on_1000_adversarial_samples_PGD=2.2%

For model2 I got: test_acc_on_1000_real_samples=93.3%, test_acc_on_1000_adversarial_samples_CW=60.9%, test_acc_on_1000_adversarial_samples_PGD=24.10%
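For reference, here is a minimal sketch of how such an evaluation is typically set up with torchattacks. The exact data pipeline and hyperparameters are in the linked notebook; the `model` variable, the reshaping to (batch, 28, 28), and the eps/steps values below are placeholders, not the values actually used.

```python
# Minimal evaluation sketch (assumptions: `model` is a trained RNN classifier
# mapping (batch, 28, 28) inputs in [0, 1] to 10 logits; hyperparameters are
# illustrative, not the ones from the notebook).
import torch
import torchattacks
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

device = "cuda" if torch.cuda.is_available() else "cpu"
test_set = datasets.MNIST("data", train=False, download=True,
                          transform=transforms.ToTensor())
loader = DataLoader(test_set, batch_size=100, shuffle=False)

def robust_accuracy(model, attack, loader, n_samples=1000):
    model.eval()
    correct, seen = 0, 0
    for images, labels in loader:
        images = images.squeeze(1).to(device)   # (B, 1, 28, 28) -> (B, 28, 28)
        labels = labels.to(device)
        adv_images = attack(images, labels)     # adversarial counterparts
        with torch.no_grad():
            preds = model(adv_images).argmax(dim=1)
        correct += (preds == labels).sum().item()
        seen += labels.size(0)
        if seen >= n_samples:
            break
    return 100.0 * correct / seen

pgd = torchattacks.PGD(model, eps=0.3, alpha=0.01, steps=40, random_start=True)
cw = torchattacks.CW(model, c=1, kappa=0, steps=100, lr=0.01)
print("PGD robust acc:", robust_accuracy(model, pgd, loader))
print("CW  robust acc:", robust_accuracy(model, cw, loader))
```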

Despite both models sharing the same architecture and achieving comparable training and validation accuracies, the large discrepancy between the results of the same attacks raises concerns.

To delve deeper, I conducted the attacks using five different seeds for each model. The averaged results, along with standard deviations, are as follows:

Model1: avg_CW_test_acc_robustness=30.9%, std_CW_test_acc_robustness=0, avg_PGD_test_acc_robustness=2.6%, std_PGD_test_acc_robustness=0.41

Model2: avg_CW_test_acc_robustness=60.9%, std_CW_test_acc_robustness=0, avg_PGD_test_acc_robustness=24.3%, std_PGD_test_acc_robustness=0.8

We can see that using multiple seeds does not settle the issue, and it also shows that the CW implementation is deterministic while PGD is not.
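A likely source of the PGD randomness is its random initialization inside the eps-ball (`random_start=True` by default in torchattacks), whereas CW has no random component. Below is a minimal sketch, assuming `model`, `images`, and `labels` as in the evaluation sketch above, showing how PGD can be made deterministic:

```python
# Sketch: the determinism gap between CW and PGD most likely comes from PGD's
# random start. Disabling it (or fixing all seeds before each run) should make
# PGD deterministic as well. `model`, `images`, `labels` are assumed defined.
import torch
import torchattacks

def seed_everything(seed=0):
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

# Deterministic PGD: no random initialization inside the eps-ball.
pgd_det = torchattacks.PGD(model, eps=0.3, alpha=0.01, steps=40,
                           random_start=False)

seed_everything(0)
adv_a = pgd_det(images, labels)
seed_everything(0)
adv_b = pgd_det(images, labels)
# Expected True with random_start=False (given deterministic backend ops).
print(torch.equal(adv_a, adv_b))
```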

I am attaching below a link to a Google Drive directory containing my Google Colab notebook and the models' weight files. I would appreciate advice on this issue.

Thanks and sorry for the long description.

💬 Detailed code and results

Link to the Google Drive directory:

https://drive.google.com/drive/folders/1-msQmKOwjEbzHSCRwx7PculSOthxbHLA?usp=sharing

rikonaka commented 3 months ago

Hi @talrub, attack algorithms such as PGD and CW were proposed and benchmarked on CNNs, not RNNs, so applying them to RNNs can lead to some unexpected problems.

Many papers about adversarial machine learning have focused on convolutional neural nets (CNNs) as benchmarked in [4], but relatively few have considered RNNs/LSTMs. Though some similarities exist between attacks on CNNs and RNNs, RNNs' discrete and sequential nature poses added challenges in generating and interpreting adversarial examples. ... This basically considers a linear approximation of the loss function around the inputs. Due to the discrete nature of RNNs, such an approach doesn't directly work. However, we can use the same intuition to find discrete modifications to the inputs that roughly align with the gradient of the loss function. [15] showed that the Jacobian Saliency Map Approach, though initially developed for feed-forward networks, can be generalized to RNNs. [16] extended this approach to generate adversarial examples using Generative Adversarial Networks (GANs).

Paragraphs from the related-work section of https://web.stanford.edu/~bartolo/assets/crafting-rnn-attacks.pdf.

So according to the above paper, you can try the JSMA attack instead of PGD or CW for RNNs 😘.
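For completeness, here is a hedged sketch of what trying JSMA with torchattacks could look like, assuming your installed torchattacks version ships a JSMA implementation (newer releases list one; check `dir(torchattacks)` if unsure). The `theta`/`gamma` values and the `model`/`images`/`labels` variables are placeholders.

```python
# Hedged sketch: try JSMA if the installed torchattacks version provides it.
# `model`, `images`, `labels` are assumed defined as in the earlier sketches;
# theta/gamma values are illustrative only.
import torchattacks

if hasattr(torchattacks, "JSMA"):
    jsma = torchattacks.JSMA(model, theta=1.0, gamma=0.1)
    adv_images = jsma(images, labels)
else:
    print("JSMA not available in this torchattacks version")
```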