Hi Mo, Thank you for your interest in our paper.
Hi Dharma,
Thank you very much for your quick response! Your answers are really helpful and I appreciate them a lot. Continuing with my questions:
Thank you very much for your clarification!
I understand that lambda_1 and lambda_2 were selected by grid search according to the performance of each parameter pair, and that we may end up with different optima since the dataset is split randomly on each of our sides. But I was wondering what your best selection of lambda_1 and lambda_2 was when epsilon=0, so that I could compare it with the optimal selection on my end and see how much the optimum varies across different data splits.
For the validation accuracies with different epsilons, I also evaluated the model on the test set, and the results were similar to those on the validation set (around 30% accuracy for the normal model with epsilon=0.175), since the dataset was split randomly into test and validation sets without any special handling. Is there any other possible reason why I could not reproduce the accuracies across epsilons shown by the curve in the paper, please?
Thank you very much for your kind help!!
Hi Mo, I looked at the results and it seems like lambda_1=1 and lambda_2=4.64 gave good values for epsilon=0 [for bbox training]. For a normal model, we should use lambda_1=0 and lambda_2=0 [that's normal CNN training where we don't do any penalization]. I used the train_val set to select the best model during training and then used the val set to select lambda_1 and lambda_2. If you use lambda_1=0 and lambda_2=0 and do normal training, it should produce a test accuracy of around 50% on the test set with epsilon=0.175, which is the value you can see in the graph in the paper. Thank you.
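Roughly, the selection loop looks like the sketch below; the grid values and the two helper functions are illustrative placeholders (assumptions for the sake of the example), not code from this repo:

```python
import itertools
import random
import numpy as np

# Hypothetical stand-ins for the repo's actual training / evaluation scripts;
# here they only simulate an accuracy so the loop below runs as a demo.
def train_bbox_model(lambda_1, lambda_2):
    # Real version: train with the bbox penalty and return the checkpoint
    # that scored best on the train_val split.
    return {"lambda_1": lambda_1, "lambda_2": lambda_2}

def val_accuracy(model):
    # Real version: accuracy of `model` on the val split.
    return random.random()

lambda_grid = np.logspace(0, 1, 4)   # 1, 2.15, 4.64, 10 -- an assumed grid consistent with 4.64 above
best_acc, best_pair = -1.0, None
for lambda_1, lambda_2 in itertools.product(lambda_grid, repeat=2):
    model = train_bbox_model(lambda_1, lambda_2)   # model selection on train_val
    acc = val_accuracy(model)                      # lambda selection on val
    if acc > best_acc:
        best_acc, best_pair = acc, (lambda_1, lambda_2)

print(best_pair)   # e.g. (1.0, 4.64) for epsilon = 0
```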
Thank you so much for your suggestions! I will try them out. For now I will close the issue.
Hi Dharma,
Your work and code look amazing to me, so I am trying to reproduce your experiments. I can basically run the model training end to end, but I have the following questions about the detailed parameter optimization and validation:
In split.py, only a quarter of all the data is saved for training, validation, and testing, which gave me around 1590 images for training and around 530 each for testing and validation. However, when I trained and validated the model on a quarter of the data like this, I could not reach the accuracy you report in the paper. So I changed the code along these lines (sketched below):
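Roughly the following; the data directory and variable names are placeholders rather than the exact lines from split.py, the point being to split all of the shuffled images ~60/20/20 instead of keeping only the first quarter:

```python
import random
from pathlib import Path

# Placeholder path and names -- split.py uses its own layout; the change is
# simply to stop discarding three quarters of the shuffled file list before
# splitting it into train / val / test.
image_paths = sorted(Path("data/images").rglob("*.jpg"))
random.seed(0)
random.shuffle(image_paths)

# previous behaviour (approximately): keep only a quarter of the images
# image_paths = image_paths[: len(image_paths) // 4]

n_train = int(0.6 * len(image_paths))   # ~5914 images when all data are kept
n_val = int(0.2 * len(image_paths))     # ~1971 images
train_paths = image_paths[:n_train]
val_paths = image_paths[n_train:n_train + n_val]
test_paths = image_paths[n_train + n_val:]
```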
This gave me around 5914 training images and around 1971 each for testing and validation, and with that I got validation accuracy similar to what you show in the paper. Did I do the right thing? Please let me know if it was wrong, thank you very much!
What were your optimal choices of lambda_1 and lambda_2 for the different epsilons, please? I tried lambda_1=lambda_2=1 with epsilon=0 using the 5914 training images, and I get accuracy similar to your paper (around 75%). But it would be awesome if you could share more details about the optimal lambda choices for the different epsilon values.
I also validated the pretrained "normal" model (train_method = 'normal') with different epsilons (adversarial perturbation radii). However, I could not get accuracies similar to those in your paper. For example, when I validate the pretrained "normal" model with epsilon=0.175, the validation accuracy I get is only around 30%, while in the paper the validation accuracy at epsilon=0.175 should be around 52% for the "normal" model. The same thing happens with the "bbox" model, where I get 35% validation accuracy using epsilon=0.175 and lambda_1=lambda_2=1, but in the paper the validation accuracy of the "lambda equal" model at epsilon=0.175 should be around 65%. However, when I validate the model with epsilon=0.0025, I get results similar to those reported for epsilon=0.175 in your paper. The following is the code I used for the robust accuracy validation; could you please kindly let me know if there is anything wrong?
```python
import numpy as np
import torch
import torch.nn as nn
import foolbox as fb
from torchvision import datasets, models, transforms

model_path = "/results/resnet50/normal_1_1.pth"
val_dataset_path = '/data/val'
epsilon = 0.175
num_classes = 200
device = torch.device('cuda')

transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
])

val_dataset = datasets.ImageFolder(val_dataset_path, transform)
val_loader = torch.utils.data.DataLoader(val_dataset, batch_size=64,
                                         shuffle=False, num_workers=0)

bounds = (0, 1)
print('Running Attacks...')

# Load the trained checkpoint into a ResNet-50 with a 200-class head.
model = models.resnet50(pretrained=False)
input_features = model.fc.in_features
model.fc = nn.Linear(input_features, num_classes)
model.load_state_dict(torch.load(model_path))
model = model.to(device)
model.eval()

fmodel = fb.PyTorchModel(model, bounds=bounds)
attack = fb.attacks.FGSM()

# Robust accuracy = fraction of samples that FGSM fails to flip at this epsilon.
robust_acc_list = []
for inputs, labels in val_loader:
    inputs, labels = inputs.to(device), labels.to(device)
    _, _, is_adv = attack(fmodel, inputs, labels, epsilons=epsilon)
    robust_acc = 1 - is_adv.float().mean(axis=-1)
    robust_acc_list.append(robust_acc.cpu().numpy())

avg_acc = np.mean(robust_acc_list)
print(avg_acc)
```