PKU-ML / AdvNotRealFeatures

Official code for reproducibility of the NeurIPS 2023 paper: Adversarial Examples Are Not Real Features

Loading CIFAR_ims fails #4

Closed · dohunn closed this issue 4 months ago

dohunn commented 4 months ago

Hello there, I have an issue loading the robust data.

I downloaded "dataset.zip" from the Google Drive link you provided, but I get the error message below when I try to load the robust data using "torch.load('./CIFAR_ims')".

"RuntimeError: unexpected EOF, expected 5647533 more bytes. The file might be corrupted."

However, I was able to load the labels (CIFAR_lab) successfully. I followed your settings: python=3.8 and pytorch=1.8.1. I've looked at various resources, such as the PyTorch forums, but there doesn't seem to be a good solution for this, so I'm asking here.
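For reference, here is a minimal sketch of how I try to load the robust data (the shapes noted in the comments are my assumptions based on CIFAR-10):

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# Sketch of my loading code; shapes/dtypes in the comments are assumptions.
images = torch.load('./CIFAR_ims')   # fails here with the EOF error; expected ~ (50000, 3, 32, 32)
labels = torch.load('./CIFAR_lab')   # loads fine; expected ~ (50000,)

robust_set = TensorDataset(images, labels)
loader = DataLoader(robust_set, batch_size=256, shuffle=True)
```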

Best regards.

Charles20021201 commented 4 months ago

Hi, sorry for the late reply; we did not receive an email notification about your issue.

Q1. Dataset corruption: I have re-uploaded the dataset to Google Drive using a different file compressor. Please check it.

Q2. PyTorch version: the configuration is a little bit out of date. I think Python==3.8.1 & PyTorch==1.13.1 should work well.

If there are any further questions, please let us know.

Best, Charles

dohunn commented 4 months ago

@Charles20021201 Thank you for getting back to me. I have confirmed that the dataset you provided, 'dataset_good.tar', loads fine.

That bug is resolved, but I have one more question. When I try to reproduce Table 2(a) in your paper, my results differ from the values reported there. I applied standard supervised learning with ResNet-18 via train_sl.py on the 'dataset_good.tar' you provided, using the hyperparameters from Table 6. Evaluating the model from the last training epoch, I observed the following classification accuracies.

Clean: 81.34%, PGD: 18.33%, AutoAttack: 18.44%

PGD (steps=1000) and AutoAttack are both based on the L2-norm, and I set eps to 0.5. I used my own implementation for PGD and the official library for AutoAttack. In my observations, robustness against AutoAttack is slightly higher than against PGD, with no significant difference between the two attacks. Isn't 'dataset_good.tar' a dataset created from a robust model trained with the L2-norm? Please let me know if I am missing something here.

Best regards.

Charles20021201 commented 4 months ago

Hi, we are excited to hear your feedback!

There is one bug in evaluate_robustness.py. Change

print(f'robust acc = {correct/5000}')

into

print(f'robust acc = {correct/10000}')

as there are 10000 test images in CIFAR-10 rather than 5000. We forgot to change this line when releasing the code: we used 5000 images to evaluate robustness for efficiency when conducting the experiments and switched to the entire test set in the released code.

For the PGD results, I recommend using the implementation from torchattacks (https://adversarial-attacks-pytorch.readthedocs.io/en/latest/), which should give results similar to our paper when it comes to PGD.
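For illustration, a minimal L2 PGD evaluation with torchattacks might look like the sketch below (this is not our actual script; the model definition and checkpoint path are placeholders, while eps=0.5, alpha=0.1, steps=1000 follow the values discussed in this thread):

```python
import torch
import torchattacks
from torch.utils.data import DataLoader
from torchvision import datasets, transforms, models

# Minimal sketch: L2 PGD evaluation with torchattacks on the CIFAR-10 test set.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model = models.resnet18(num_classes=10)
model.load_state_dict(torch.load('resnet18_robust_dataset.pt'))  # placeholder checkpoint
model = model.to(device).eval()

test_set = datasets.CIFAR10('./data', train=False, download=True,
                            transform=transforms.ToTensor())
test_loader = DataLoader(test_set, batch_size=256, shuffle=False)

attack = torchattacks.PGDL2(model, eps=0.5, alpha=0.1, steps=1000)

correct, total = 0, 0
for images, labels in test_loader:
    images, labels = images.to(device), labels.to(device)
    adv_images = attack(images, labels)     # torchattacks expects inputs in [0, 1]
    preds = model(adv_images).argmax(dim=1)
    correct += (preds == labels).sum().item()
    total += labels.size(0)

print(f'robust acc = {correct / total}')    # total = 10000 for the full test set
```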

For the AutoAttack results, the number does not make much sense to me, since it is very unusual for AutoAttack to be weaker than PGD, a fact thoroughly verified in the original paper (https://arxiv.org/pdf/2003.01690).

I would kindly recommend that you (1) use PGD from torchattacks with alpha=0.1; (2) check whether the AutoAttack package is correctly installed and configured (a minimal setup sketch follows below); (3) convert the images from the robust dataset to torch.uint8 before feeding them into the network; (4) increase the number of attack steps used by AutoAttack to 1000 for a fairer comparison.
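As a sanity check for point (2), an L2 AutoAttack evaluation with the official autoattack package could look like the sketch below (not our actual script; `model` is assumed to be the trained classifier from the previous sketch, already in eval mode):

```python
import torch
from torchvision import datasets, transforms
from autoattack import AutoAttack  # official AutoAttack package (https://github.com/fra31/auto-attack)

# Minimal sanity-check sketch for the AutoAttack setup on the CIFAR-10 test set.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

test_set = datasets.CIFAR10('./data', train=False, download=True,
                            transform=transforms.ToTensor())
x_test = torch.stack([img for img, _ in test_set])    # (10000, 3, 32, 32), float32 in [0, 1]
y_test = torch.tensor([lab for _, lab in test_set])   # (10000,)

adversary = AutoAttack(model, norm='L2', eps=0.5, version='standard')
x_adv = adversary.run_standard_evaluation(x_test.to(device), y_test.to(device), bs=256)
```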

Please let us know if you have further concerns! We will be eagerly waiting for your reply.

All the best, Charles

Charles20021201 commented 4 months ago

(screenshot attached) Here we post a screenshot of running AutoAttack to evaluate ResNet-18 trained on the dataset I just uploaded. APGD-CE alone achieves a robust accuracy of less than 8%.

dohunn commented 4 months ago

Thank you for your quick response and support. Sorry, I found a mistake in my implementation: I evaluated the model without calling "model.eval()". I have fixed it.

However, even after fixing it, there still doesn't seem to be much difference between the robustness against PGD and against AutoAttack. The classification accuracy and robustness results (L2-norm, eps = 0.5) are shown below.

Clean: 82.56%, PGD L2: 17.37%, AutoAttack L2("standard"): 16.77%, AutoAttack L2("plus"): 16.74%

I checked the four points you mentioned, and they don't seem to be the problem: (1) I observed no difference in robustness between my implementation and the torchattacks library; (2) AutoAttack is correctly installed; (3) shouldn't the images fed into the network be torch.float32 (i.e., image ∈ [0,1]^d)?; (4) AutoAttack with 100 iterations also seems sufficient, because in general AutoAttack with 100 iterations has a higher attack success rate than PGD with any number of steps. Instead of increasing the steps, I used the "plus" version, which combines six attacks into a more powerful ensemble than AutoAttack's default settings.

I also measured robustness (eps = 4/255) against PGD and AutoAttack under the Linf-norm; the results are shown below.

PGD Linf: 11.35%, AutoAttack Linf("standard"): 10.57%

In my opinion, my evaluation code does not have any problems, so I'm guessing that the dataset provided is different from the one used in the paper. Also, I don't think you've answered my earlier question yet: is it right that the dataset you gave us was generated from an L2-robust model? I look forward to your answers.

Thank you. Best regards.

Charles20021201 commented 4 months ago

Hi, thanks for spending your valuable time with us and sharing your results.

Apologies for missing the question: the dataset was generated with L2-norm robust classifiers and differs slightly from the one used in our paper.

We think we have reproduced your AutoAttack result (15.55% in my environment) by initializing the adversarial examples from the images in the robust dataset rather than from the images in the CIFAR-10 test set. May I respectfully ask how you initialize the adversarial examples?

By the way, would you be so kind as to run this code, which is essentially train_sl.py plus AutoAttack evaluation (https://drive.google.com/drive/folders/1GnB5cE-F5dSVqGN9i5z1vb0RvJutFjFc?usp=drive_link), and let us know your results? We leave the entire log of running the code in the same folder: https://drive.google.com/drive/folders/1GnB5cE-F5dSVqGN9i5z1vb0RvJutFjFc?usp=drive_link

Looking forward to your early reply!

Best, Charles

dohunn commented 4 months ago

Thank you for answering my question.

I looked at my code, and I did not initialize the adversarial examples from the robust dataset. I trained ResNet-18 with standard training on the robust dataset and measured the model's robustness on the CIFAR-10 test set.

Thank you for sharing the log and the code. I believe the code you shared is the code you used for the paper. I noticed that my code so far uses data augmentation to train the model, while the code you shared does not. In the paper, "Table 6: Hyper-parameter configuration of linear probing" says that you use data augmentation, so I was wondering why you set it differently. I also noticed that the batch size I used was 256, while your code uses 512. Aren't the hyper-parameters in that table the ones used to produce the results in Table 2?

Interestingly, I found that the presence or absence of data augmentation makes a big difference in robustness (L2-norm and eps = 0.5). I noticed the following in the logs:

With data augmentation, I obtained 20.51% robust accuracy against AutoAttack, while without data augmentation, I got 9.73%. The improvement from 16.77% to 20.51% seems to be due to the padding in "transform_train" being set to 1 (previously it was set to 4). In my view, data augmentation seems to mitigate overfitting even on robust datasets. Data augmentation is also used in Table 2 in the appendix of the Ilyas et al. paper.
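For clarity, here is a minimal sketch of the two training-time transform settings I am comparing (a standard CIFAR-10-style pipeline; the exact transforms in train_sl.py may differ):

```python
from torchvision import transforms

# Sketch of the two training-time settings compared above (the exact pipeline in
# train_sl.py may differ; padding=1 here vs. padding=4 in my earlier runs).

# With augmentation: random crop with small padding + horizontal flip, applied
# to the stored robust-image tensors.
transform_train_aug = transforms.Compose([
    transforms.ToPILImage(),               # robust images are stored as float tensors
    transforms.RandomCrop(32, padding=1),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

# Without augmentation: feed the stored tensors to the network unchanged.
transform_train_noaug = transforms.Compose([
    transforms.ToPILImage(),
    transforms.ToTensor(),
])
```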

If you trained the model on the robust dataset without data augmentation, I'd like to know why. I'm sorry if I'm being too pushy and critical, but I'd really appreciate a response. I've attached the code I used in plain-text format: train_sl_and_eval.txt

Ilyas et al. "Adversarial Examples Are Not Bugs, They Are Features"

Thank you. Best regards.

Charles20021201 commented 4 months ago

Hi Dohunn, it has been an inspiring round of discussion with you.

The data augmentation in train_sl.py was defined but not used, which seems to be a bug in our code. Using data augmentation during training also leads to a robust accuracy of 17% in our environment.

Endless appreciation for pointing this out; we will soon have a discussion among the authors and will revise the numbers reported in our paper with further corrections and clarification.

Best, Charles

dohunn commented 4 months ago

Thank you @Charles20021201 for your kind reply. All my concerns have been addressed. I wish you success in your future research.

Best regards.