facebookresearch / radioactive_data

This technique modifies image data so that any model trained on it will bear an identifiable mark.

End-to-end training does not work with CIFAR-10 and Imagenette #3

Open StellaAthena opened 4 years ago

StellaAthena commented 4 years ago

I've been trying to get the code to work, and it doesn't seem to produce the results claimed in the paper. After correcting the issues I raised in #2, the code runs, but it does not produce the correct results. My code can be found here.

I can get the results shown in Table 1 to work (retraining the classifier on the same features), but only if I don't use augmentations. The rest of the results, from Table 2 onwards, simply don't seem to work. Additionally, when I do use augmentations I can't get the Table 1 results to work either.

For compute reasons I haven't tried ImageNet yet. However, I have used CIFAR-10 and Imagenette, and those two datasets show the same behavior. I don't expect changing to ImageNet will magically fix everything.

alexandresablayrolles commented 4 years ago

I haven't tried CIFAR-10 nor Imagenette myself, but there might be a difference with Imagenet because of the number of classes. The radioactive signal for each class is usually pretty low, so having a large number of classes helps to get a lower p-value. Also, it might be that the parameters optimised for Imagenet do not work for another dataset, and you need to optimise these as well (typically lambda_ft_l2, lambda_l2_img).
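For concreteness, here is a rough sketch of how those two weights typically enter the marking objective. This is only an illustration, not the exact loss in this repo, and the names (`feat_marked`, `feat_orig`, `x_marked`, `x_orig`, `carrier`) are placeholders:

```python
import torch

def marking_loss(feat_marked, feat_orig, x_marked, x_orig, carrier,
                 lambda_ft_l2, lambda_l2_img):
    """Illustrative marking objective (not the repo's exact code)."""
    # Push the marked image's features along the class carrier direction.
    align = -torch.sum(feat_marked * carrier)
    # lambda_ft_l2 keeps the features close to the original image's features.
    ft_l2 = lambda_ft_l2 * torch.norm(feat_marked - feat_orig) ** 2
    # lambda_l2_img keeps the marked image close to the original pixels.
    img_l2 = lambda_l2_img * torch.norm(x_marked - x_orig) ** 2
    return align + ft_l2 + img_l2
```

Tuning these two weights trades off the strength of the radioactive signal against how far the marked images and their features drift from the originals, so the values that work for Imagenet may not transfer to smaller datasets.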

Can you tell me with what percentage of radioactive data you tested your models?

StellaAthena commented 4 years ago

We tried 1%, 2%, 5%, 10%, and 20%. None of those marking levels has a p-value below .01 for the experiments shown in Table 2 on either CIFAR-10 or Imagenette.

alexandresablayrolles commented 4 years ago

Ok. I notice something weird: accuracy seems to decrease (by ~3 points) even though radioactivity is not detected. To continue the investigation, what is the PSNR of the radioactive images? Did you try other architectures (for both the marking and marked networks)? It might be that small datasets like these lead to large variance in the trained networks, so the alignment does not really work.
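For reference, a minimal PSNR computation between an original image and its marked version, assuming tensors with pixel values in [0, 1] (the tensor names are placeholders):

```python
import torch

def psnr(x_orig: torch.Tensor, x_marked: torch.Tensor, max_val: float = 1.0) -> float:
    """Peak signal-to-noise ratio (in dB) between two images of the same shape."""
    mse = torch.mean((x_orig - x_marked) ** 2)
    return float(10.0 * torch.log10(max_val ** 2 / mse))
```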

researcher2 commented 4 years ago

Hello! I've been working with Stella on this. We have run several tests on both the marking and detection code and it looks very reasonable. One conclusion is that the model simply isn't learning the same features when training on marked data, so realigning the basis won't make a difference. My theory is that using full Imagenet saturates the network to the point that it's forced to learn only the most salient features for each class and thus the features end up similar in each run, only with a difference in basis that can be handled through alignment.
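To be explicit about what we mean by realigning the basis: we fit a linear map between the two feature spaces. A minimal least-squares sketch (assuming `phi_marked` and `phi_marking` are (n, d) feature matrices extracted by the two networks from the same images; requires PyTorch >= 1.9 for `torch.linalg.lstsq`):

```python
import torch

def align_features(phi_marked: torch.Tensor, phi_marking: torch.Tensor) -> torch.Tensor:
    """Solve for a (d, d) matrix M such that phi_marked @ M approximates phi_marking."""
    # Least-squares fit of the linear alignment between the two feature bases.
    return torch.linalg.lstsq(phi_marked, phi_marking).solution
```

If the marked network has learned genuinely different features, no such linear map fits well, which is the failure mode we suspect here.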

PSNR is around 38-45 on Imagenette and goes down to about 28 on CIFAR-10. In both cases we get good overall loss curves during marking, and the individual terms of the loss go in their expected directions. Also, in both cases the images don't look great afterwards, which indicates a fairly strong marking. We have not played with the lambdas yet, since visual inspection alone suggests a strong marking and the loss curves look good as well.

We've only tried resnet18 at this stage, and are considering a full ImageNet run to confirm things. If that works as expected, we would probably then try a smaller model that could be saturated by our datasets.

Can you confirm whether you did multi-class marking in your experiments? We initially started with single-class and are now doing multi-class, with a 10% experiment sampling 10% of the entire dataset but marking each class individually with its own carrier vector.

alexandresablayrolles commented 4 years ago

Yes, we did multi-class marking in our experiments, so 10% marking means that for each class, 10% of the instances are marked using the same carrier (but different from one class to the other). This helps a lot to get a low p-value as the evidence compounds across classes.
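As a rough illustration of how the evidence compounds (a sketch, not the exact statistical test from the paper or this repo): each class contributes a cosine similarity between its carrier and the corresponding aligned classifier weights, which gives a per-class p-value under the null hypothesis of a random direction, and one standard way to combine independent per-class p-values is Fisher's method. Here `dim` is the feature dimension and `cosines` the per-class cosine similarities:

```python
import numpy as np
from scipy import stats

def class_p_value(cosine: float, dim: int) -> float:
    """One-sided p-value P(cos >= cosine) for a random unit vector in R^dim."""
    # Under H0, the squared cosine follows a Beta(1/2, (dim - 1)/2) distribution.
    tail = 0.5 * stats.beta.sf(cosine ** 2, 0.5, (dim - 1) / 2)
    return tail if cosine >= 0 else 1.0 - tail

def combined_log10_p(cosines, dim: int) -> float:
    """Fisher's method over the per-class p-values."""
    pvals = [class_p_value(c, dim) for c in cosines]
    stat = -2.0 * np.sum(np.log(pvals))
    return float(np.log10(stats.chi2.sf(stat, df=2 * len(pvals))))
```

With the same per-class cosine, combining over 1,000 classes gives a far lower log10(p) than combining over 10, which is why datasets with few classes are at a disadvantage.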

researcher2 commented 4 years ago

Looks like we have hardware for ImageNet now. Did you use the full fall 2011 release or the smaller ILSVRC versions?

alexandresablayrolles commented 4 years ago

We used the standard ILSVRC2012 release, with ~1.2M images distributed into 1,000 classes (cf. Section 5.1 of the paper).

StellaAthena commented 4 years ago

Great! We will check back with the results :)

bussfromspace commented 3 years ago

Hi, I also had the same issue. When experimenting with CIFAR10 I get high p-values; e.g. for 10% marking, the log(p) was -0.729. But using a dataset with a higher number of classes helps. I also tested the same setup (10%, multi-class marking) with CIFAR100 and got a log(p) of -3.702 (still not as low as on ImageNet). I use the test set for the verification.

I also noticed some unstable results (in the p-value) when I change the PyTorch version. It seems that with an older torch version I get better p-values for verification. What was the exact version you used in the experiments?

Setup 1: Python 3.6 with Numpy 1.17, PyTorch 1.4, and torchvision 0.2.1.
Setup 2: Python 3.8 with Numpy 1.20, PyTorch 1.8, and torchvision 0.9.1.

| Setup | Dataset, marking | Test acc. | log(p) |
| --- | --- | --- | --- |
| 1 | CIFAR10, 10% | 86.18% | -1.946 |
| 2 | CIFAR10, 10% | 87.38% | -0.739 |
| 1 | CIFAR100, 10% | 60.85% | -8.018 |
| 2 | CIFAR100, 10% | 61.57% | -3.702 |

alexandresablayrolles commented 3 years ago

Hi, unfortunately my conda environments were wiped out last year so I cannot give you the precise version, but I believe it was PyTorch 1.4 or earlier, as the experiments were run before January 2020. The fact that CIFAR-100 gets lower p-values than CIFAR-10 definitely makes sense, as does the fact that it is not as low as Imagenet: Imagenet has 10 times more classes. Thanks for reporting these results! If you want to share them with the community, a pull request would be greatly appreciated!

researcher2 commented 2 years ago

Hi Alexandre! We finally ended up doing a center crop (during marking) run on ImageNet + Resnet18 and got a positive detection in the Table 2 setting. Our overall results were not as impressive as the paper's; do you have the full set of hyperparameters for the published result?
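(By 'center crop during marking' we mean a standard ImageNet-style preprocessing for the marking pass; a sketch with the usual 256/224 sizes, which are not necessarily the exact values from our run:)

```python
from torchvision import transforms

# Typical ImageNet-style preprocessing for the marking pass; the 256/224
# sizes are common defaults, not necessarily the exact values we used.
marking_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])
```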

Table 1 (Center Crop)

[image: table1_crop]

Table 2 (Center Crop)

60 epochs: [image: table2_crop]

90 epochs: [image: table2_90epochs_crop]