NoahVl / Explaining-In-Style-Reproducibility-Study

Re-implementation of the StylEx paper by Lang et al. (2021): training a GAN to explain a classifier in StyleSpace.

Classifier acc #9

Open DavidMrd opened 2 years ago

DavidMrd commented 2 years ago

Hello, I am trying to train a ResNet classifier using your code, but the accuracy I am getting is about 0.58. Could you report the accuracy that you got? Furthermore, I think there is a bug in the "classifier_training_celeba" notebook: in the CelebA class you resize the image twice, first to img_size and then to 224.

NoahVl commented 2 years ago

Hey! Thanks for taking an interest in our code 😄 Sorry for my late reply, I didn't get notified for some reason.

It seems like we got an accuracy of 87% on the testing set (this was on the age labels, which we didn't use in the end), but I will look into the notebook more tomorrow. Could you tell me what dataset you're training on and what you're trying to predict?

If I remember correctly, the resizing you're describing was done on purpose. When we train the StyleGAN2 model, we use smaller images than what the pre-trained ResNet/MobileNet models were trained on (224x224). This means that when we want to classify the (StyleGAN) generated images, we have to upscale them to 224x224 (we let StyleGAN generate 64x64 images due to computational constraints). To properly capture the noise/artifacts introduced by the interpolation method, we therefore decided to downscale and then upscale the images during the training of the classifier as well. I believe we do this using bilinear interpolation; not doing so caused worse performance, because the images become pixelated, which these pre-trained classifiers don't seem to appreciate.
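For reference, here's a minimal sketch of what that preprocessing amounts to in torchvision (not the notebook's exact code; img_size = 64 matches the StyleGAN resolution mentioned above):

    import torchvision.transforms as T

    # Sketch of the down/upscale preprocessing described above (assumed
    # parameters, not the notebook's exact code). img_size = 64 matches
    # the StyleGAN2 output resolution, 224 the ResNet/MobileNet input size.
    img_size = 64

    preprocess = T.Compose([
        T.Resize((img_size, img_size), interpolation=T.InterpolationMode.BILINEAR),  # down to GAN resolution
        T.Resize((224, 224), interpolation=T.InterpolationMode.BILINEAR),            # back up to classifier input
        T.ToTensor(),
    ])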

Our reasoning was that this way of downscaling and upscaling would be less out of distribution and therefore a safer bet. However, you'd have to test this for yourself to be sure. I'm not actually sure if we did.

Also feel free to ask more questions, I'm sorry we didn't properly comment this notebook. It was quickly thrown together because the MobileNet classification was causing a lot of frustration.

NoahVl commented 2 years ago

Also, are you using the same pre-trained ResNet classifier we're using? Or are you trying to train it from scratch?

DavidMrd commented 2 years ago

Hi, thank you for your answer. I am using the pre-trained ResNet classifier and trying to predict gender on the CelebA dataset. I tried to fine-tune/retrain it on CelebA, but the accuracy I got was about 0.57.

NoahVl commented 2 years ago

Hey! So, the accuracy I reported previously was for the age classifier that we didn't end up using. For the ResNet CelebA gender classifier (which we used for generating the explanations of both face models) we got a validation accuracy of 97% and a testing accuracy of 97% as well. I made a new notebook for validating and testing the trained models, so you can see it there too.

When testing on the FFHQ dataset, where the labels likely aren't ground-truth labels, we get an accuracy of 88% over the whole dataset. Note that we didn't train on this dataset, because the authors did not and we found the labels to be unreliable. But I suppose it is good to know how the model performs on the data the StyleGAN2 model is trained on, if the labels are at least somewhat reliable.

I also tried re-running the classifier_training_celeba.ipynb notebook (which I updated now to show the results) and got around the same validation accuracy (97%). I did not run it on the testing set, because I saw that unfreezing and training the third to last layer caused the validation performance to drop a bit this time. You could therefore choose to not train that layer and just stop earlier.

All that being said, it shouldn't be the case that you get an accuracy of 57% while fine-tuning the PyTorch-provided pre-trained model with this notebook. Are you using a smaller batch size? That might be the cause. Could you try running the training notebook again and see if you get similar performance to ours? You might have changed something that caused the huge drop in performance.

Do let me know if you try, I'll be happy to help!

DavidMrd commented 2 years ago

Hi, I tested the new notebook and again got a validation and test accuracy of ~57%. Did you preprocess the images or the labels?

NoahVl commented 2 years ago

I'll clone the repository again, download the data from scratch using the notebook we provided, and see if I can replicate your behavior; how strange. I'm certain we didn't change the CelebA images (we did manually filter the plant dataset, because I believe there were some pictures of frogs and houses in there). It could have something to do with the labels, but we'll see.

DavidMrd commented 2 years ago

Ok, thank you a lot!

NoahVl commented 2 years ago

After having downloaded everything from scratch and running the classifier_testing_celeba.ipynb code again, I get the exact same results as before (see the attached screenshot of the results).

Maybe something went wrong during your Kaggle download, causing you to miss some images? I have 202,599 images in that folder after downloading the CelebA dataset.
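If it helps, here's a quick sanity check of what's actually on disk (a sketch; the data path below is an assumption about your local Kaggle download layout):

    import os

    # Count the CelebA JPGs on disk; a complete download has 202,599 images.
    # The path is an assumed Kaggle layout; adjust it to your setup.
    celeba_dir = os.path.join("data", "celeba", "img_align_celeba", "img_align_celeba")
    n_images = sum(1 for f in os.listdir(celeba_dir) if f.endswith(".jpg"))
    print(f"Found {n_images} images (expected 202,599)")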

Maybe you can also try cloning this repo from scratch and downloading everything again too, to see if that changes something. I'm not sure what is causing this sadly, but will gladly help if you have questions still :)

NoahVl commented 2 years ago

Ah, I see. If I understand correctly, you're trying to train your own classifier and want to know what our accuracy was (we should've put this in the paper), rather than testing our model to see if you get the same accuracy on your machine? So you're not re-running our training notebook and getting the 57%, but using your own training script? Sorry for the misunderstanding.

I just quickly looked at your repo and saw that when you train, you immediately unfreeze all the layers of the pre-trained network. From my experience when fine-tuning these image models, it is usually better to gradually unfreeze the top layers, or to leave only a few of the top layers unfrozen (depending on the size of the dataset and how close it is to the data the model was originally trained on). Have you tried this already? Code for the layer freezing of the classifiers is available in the training notebook.
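For illustration, a minimal sketch of that freeze-then-selectively-unfreeze pattern in PyTorch (not the notebook's exact code; which layers to unfreeze is just an example choice):

    import torch.nn as nn
    import torchvision.models as models

    # Start from a pre-trained ResNet and freeze all of its parameters.
    model = models.resnet18(pretrained=True)
    for param in model.parameters():
        param.requires_grad = False

    # Replace the head for binary (gender) classification; the new layer
    # is trainable by default.
    model.fc = nn.Linear(model.fc.in_features, 2)

    # Optionally unfreeze the last residual block too, once the head has
    # converged (an example choice, not a prescription).
    for param in model.layer4.parameters():
        param.requires_grad = True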

DavidMrd commented 2 years ago

Hi, sorry for the late reply. I was trying to use your code, but somehow the labels and the images are not loaded in the same order, so during training and testing the labels do not correspond to the images. This is the issue I am experiencing when using your code on my machine.

NoahVl commented 2 years ago

Hmm how odd, are you using Windows? Were you able to fix it?

DavidMrd commented 2 years ago

You can fix it by sorting the image_paths list in the __init__ methods of the Dataset classes:

    # Sort the directory listing so the images line up with the labels,
    # which are indexed by filename.
    image_path = os.path.join(celeb_dir, "img_align_celeba", "img_align_celeba")
    list_sorted = os.listdir(image_path)
    list_sorted.sort()
    self.images = [os.path.join(image_path, file)
                   for file in list_sorted if file.endswith('.jpg')]
NoahVl commented 2 years ago

And now you do get the same accuracy? If you want you can create a merge request and I'll merge it after testing, then you'll have contributed to this repo!

tmabraham commented 2 years ago

I can confirm that I had this same issue and @DavidMrd's fix (with list_sorted used consistently in the last line, as above) worked for me.


tmabraham commented 2 years ago

A similar change is needed for the FFHQ dataset as well.

NoahVl commented 2 years ago

Hey, thanks guys! @DavidMrd let me know if you want to make that merge request, otherwise I'll do it.

I think the reason for this error is that I'm using Windows and you guys are using Linux: os.listdir makes no guarantee about ordering, so the file order depends on the operating system/filesystem, and sorting makes it deterministic. Thank you both for testing :)