NoahVl / Explaining-In-Style-Reproducibility-Study

Re-implementation of the StylEx paper, training a GAN to explain a classifier in StyleSpace, paper by Lang et al. (2021).

Issue with reconstruction during training? #11

Closed tmabraham closed 11 months ago

tmabraham commented 2 years ago

I was trying to train the StyleGAN and originally with sample_from_encoder=True I got the following results in the middle of training:

595-from_encoder

this at tick 595, with evaluation every 50 steps like you had as default.

I then set sample_from_encoder=False and it looks like it is doing some sort of generation: image

this is at tick 229, with evaluation every 50 steps (top half real images, bottom half fake).

So clearly the generation is fine (apart from maybe some mode collapse, but I can probably solve that with some tuning of the StyleGAN2 parameters).

So does that mean there is something wrong with the encoder training? How can I resolve this issue?

NoahVl commented 2 years ago

First of all, thank you for taking an interest in our code and what a cool dataset you're using!

Are you using our encoder? And if so, do you think it's expressive enough to capture the intricacies of this dataset? Also what accuracy does your classifier get?

In its current state, encoder training should at least produce images that look somewhat similar to the input. You could try a few hundred epochs on the FFHQ dataset with the ResNet classifier we provided (please use the 64x64 image size), where the encoder output should start to form realistic shapes pretty quickly, not these weird histograms like you're experiencing.

Are you using a larger image size than we were? It could be that the encoder is not scaling accordingly. Your batch size is also much larger than ours was, however this should be a good thing I'd think up to a certain degree. We did use gradient accumulation to counteract our memory limits, so it could be that the update steps are the same anyway.

tmabraham commented 2 years ago

@NoahVl I used the default encoder, and I am using a ResNet18 binary classifier that got 92% accuracy on the dataset. I am using a 256x256 image size, a batch size of 64, and no gradient accumulation.

I will check out the training on FFHQ dataset....

Do you think I need a different encoder then?

NoahVl commented 2 years ago

You could also first try using an image size of 64x64 on your dataset. Even though that won't produce anything interpretable at that resolution, at least you'll know if the encoder training is doing its job at that resolution. The only annoying thing is that you'll need to retrain your classifier I guess.

You might also be using the old training code, judging from your earlier merge request. The old code helped us get better results, as you can see in our paper, but it is not exactly what the authors did. You can also try some runs with the new code turned on; the boolean you have to change is in cli.py. However, I doubt it will help with these encoder results.

It seems that we're using the same architecture for the encoder as we are for the discriminator, so my previous comments about the encoder are probably invalid. We debugged with different types of encoders and they sometimes exhibited behavior similar to what you're presenting as far as I recall.

I hope you can get it resolved easily. Sorry I don't know what exactly is causing the problem. Please let me know if the FFHQ training works on your machine (if you're trying it out) and if your results change at lower resolutions on your dataset (if you're trying that too)!

tmabraham commented 2 years ago

Thanks for your suggestions!

I have tried out 64x64 (retrained classifier to about ~76% accuracy) with no success: 1436-from_encoder

I will try FFHQ training as well.

tmabraham commented 2 years ago

Oh and here is the command I used in case that helps:

CUDA_VISIBLE_DEVICES=1 python cli.py --data '/mnt/tmabraham/data/styleex/MSI' --results_dir '/mnt/tmabraham/data/styleex/results' --models_dir '/mnt/tmabraham/data/styleex/models' --name "MSI-styleex-64" --image_size 64 --batch_size 64 --gradient_accumulate_every 1 --num_workers 16 --classifier_path='../models/MSI_classifier_resnet18_64.pth' --sample_from_encoder True --new True

NoahVl commented 2 years ago

After how many epochs was that result? It indeed looks awful, as if we're sampling from a totally different distribution. Your command looks fine to me so I really wonder what's going on. Maybe it just needs some hyperparameter optimization.

Anyway yes please let me know the results of your FFHQ run!

tmabraham commented 2 years ago

Oops forgot to mention that, it was tick 1436 with again evaluation every 50 steps so I guess after 71800 steps.
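The tick-to-step conversion used throughout this thread is a multiplication by the evaluation interval. A minimal sketch, assuming one tick is one evaluation and the interval is the 50 steps mentioned above (the function name is made up for illustration):

```python
def ticks_to_steps(ticks: int, evaluate_every: int = 50) -> int:
    """Convert an evaluation tick count into training steps."""
    return ticks * evaluate_every


print(ticks_to_steps(1436))  # 71800
```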

NoahVl commented 2 years ago

That should've been more than enough to at least get some pink stripes I think. For the faces we already got some vague shapes after 500 steps I believe. Really wonder if it has to do with the difficulty of the data or that something else is wrong

tmabraham commented 2 years ago

@NoahVl I trained a classifier on CelebA and used it to train StyleEx on FFHQ.

Here is tick 573 every 50 steps so after 28650 steps: 573-from_encoder

Here is the command I used:

CUDA_VISIBLE_DEVICES=1 python cli.py --data '../../Explaining-In-Style-Reproducibility-Study/data/Kaggle_FFHQ_Resized_256px/flickrfaceshq-dataset-nvidia-resized-256px/resized' --results_dir '/mnt/tmabraham/data/styleex/results' --model_dir '/mnt/tmabraham/data/styleex/models' --image_size 64 --batch_size 64 --gradient_accumulate_every 1 --num_workers 16 --classifier_path 'resnet-18-64px-gender-classifier.pt' --sample_from_encoder True --new True

Looks like there is something wrong with how I am using your code?

tmabraham commented 2 years ago

Okay I went back and kept the defaults for the batch size (4) and gradient accumulation (8) and I get this instead (at 110 ticks):

110-from_encoder

So looks like the issue is just bad hyperparameters. I didn't realize it would be so sensitive to the batch size and gradient accumulation. Kind of odd...
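For context, the defaults give an effective batch of 4 x 8 = 32 samples per optimizer update, while the failing run used 64 x 1 = 64 at the same learning rate. The sketch below shows the general idea of gradient accumulation on a toy linear model: eight micro-batches of 4, with each loss divided by 8, reproduce the gradient of one batch of 32. The model and data are stand-ins, and whether the repository's training loop normalizes the loss this way is an assumption, not something confirmed in this thread.

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Linear(10, 1)          # toy stand-in, not the StylEx networks
data = torch.randn(32, 10)
target = torch.randn(32, 1)
loss_fn = nn.MSELoss()

# Gradient from one large batch of 32 samples.
model.zero_grad()
loss_fn(model(data), target).backward()
full_grad = model.weight.grad.clone()

# Gradient accumulated over eight micro-batches of 4 samples each.
accumulate_every = 8
model.zero_grad()
for x, y in zip(data.chunk(accumulate_every), target.chunk(accumulate_every)):
    # Divide so the accumulated gradient is an average rather than a sum.
    (loss_fn(model(x), y) / accumulate_every).backward()
accum_grad = model.weight.grad.clone()

grads_match = torch.allclose(full_grad, accum_grad, atol=1e-5)
print(grads_match)
```

The equivalence is exact only for losses that average over samples and models without batch-dependent layers, so a GAN training loop may still behave differently between the two settings.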

I will try with my own dataset as well and see if the issue is resolved.

NoahVl commented 2 years ago

I'm glad you're getting some more sensible results now more quickly. Do you think you'd get the same results when using a batch size of 32? I'm assuming it'd be much more preferable with the speed benefits that come with that.

But yeah I'm very surprised too! Please let me know if the results on your dataset improved.

tmabraham commented 2 years ago

The results on 64x64 images seem to be improved (this is at 851 ticks): 851-from_encoder

The results on 256, not so much: 274-from_encoder

Any thoughts on how to improve this?

NoahVl commented 2 years ago

Oh wow the ones on 64 look great even though they don't really come close to the actual images. We had this with the faces too that the encodings sometimes didn't match the original one. I wonder how it'd change if you train it more.

Also that's unfortunate about the 256 results. I suppose what your previous experiments have shown is that the hyperparameters, in this case batch size and gradient accumulation, are really important. Maybe changing those around a bit could help? You could also try changing the learning rate; maybe a different learning rate works better for differently sized images?

Wish I could be of more help but we also just tuned some parameters in our runs and got okayish results, I'm afraid you'll have to try the same. Now you have an idea of how quickly the encoded images can come closer to the original and not be blobs anymore. It's also expected that at a higher resolution it'll take longer to converge I suppose, but this looks rather extreme again.

NoahVl commented 2 years ago

The authors did mention they used a much higher learning rate than us, they also used larger images, so maybe there's a correlation there. Just guessing though.

NoahVl commented 2 years ago

From the main author of the original paper:

"We trained our models using 8 Nvidia V100 for 250k steps with batch size of 16 (2 per gpu). The dimensionality of the attribute is 2 - we used the final logits of the classifier. We used a learning rate of 0.002 and reconstruction loss both on the image (using LPIPS) and on the W-vector (using L1 loss), both with weight 0.1."

tmabraham commented 2 years ago

"Oh wow the ones on 64 look great even though they don't really come close to the actual images. We had this with the faces too that the encodings sometimes didn't match the original one. I wonder how it'd change if you train it more."

Is this an issue in practice for the application of StyleEx? And for the faces, did you observe the encodings matching more closely after training longer?

Thank you for sharing more details about what you did and what the authors did. I will try playing with the learning rate and batch size further.

NoahVl commented 2 years ago

From the model the authors released on the faces, the encodings are actually really good. Way better than we ever got them. So in principle this shouldn't be a downside of the StylEx method. We ascribed this to our limited compute/lower resolution, it could also be that our code isn't perfect sadly. However, even though the encodings didn't match that well, we were still able to recover some classifier attributes, even though they weren't totally disentangled (we also thought this might be because of our hyperparameters/limited compute, but could again be that our script is imperfect).

Best of luck! Would love to hear if you get some nice interpretable explanations on your dataset.

tmabraham commented 2 years ago

I am trying a few settings but with no luck (no grad accumulation, higher and lower LR).

I will ask, what are the changes between stylex_train.py and stylex_train_new.py? The CLI code says:

Only change this to False if you have read the README! Might cause worse training.

but I don't see any note in the README. Why does using the new version cause worse results? I have stayed away from it for this reason but am thinking about giving it a try.

NoahVl commented 2 years ago

Damn I'm sorry trying to change the hyperparameters didn't work. Maybe you could try training on 128x128 instead, (or maybe even 64 since you seemed to get results there) or do you think that resolution will be too low?

The change between the two training scripts is mentioned in this section of our paper: IMG_20220523_085418

I honestly doubt this change will help you with the encoder, however since the authors did have this change and it worked well for them on bigger images you could give it a try!

Maybe it is indeed a good idea to add it to the README, thanks!

Wish I could be of more help but you're always free to ask questions :)

tmabraham commented 2 years ago

I tried using stylex_train_new.py and it looks like there is actually some improvement!!

860-from_encoder

This is after 860 ticks, so 43000 steps. It's taking a while but at least it's looking more like how I expect (H&E pathology images). Still not completely matching the content of the original image. But I can tell there is some association between the input and reconstructed image. For example, there was this one noisy image: 562-from_encoder

I tried also increasing the network capacity with stylex_train.py but it's not training as well: 505-from_encoder

This is from 505 ticks, so after 25250 steps. I'll note that stylex_train_new.py also gave results like this at the beginning of training, so maybe this run is just even slower.

I will try increased network capacity + stylex_train_new.py and see if that further improves results.

NoahVl commented 2 years ago

I'm surprised but very glad the new training regimen gives you better results! The images look great to an untrained eye like mine, but the correspondence between input and reconstructed images is still pretty weak, as you note. I hope this gets fixed by tuning some parameters. I believe in the plant dataset we used, the encoded images looked pretty similar to the inputs, but not for the faces. So I hope it's not a bug in our code.

I forgot that we had a network capacity hyperparameter as well! It comes from the original StyleGAN repo and we never toyed around with it because of our limited compute.

Glad things are progressing! Curious to see if your new tests bring up something interesting.

Maybe once you have a model you think performs alright you can calculate the StyleGAN features that correspond most to changes in classification (it's in one of our notebooks) and then you can see if you can create some sensible explanations already!

tmabraham commented 2 years ago

Hey sorry I keep bombarding you with questions, I'm really motivated to get this working as best as I can 🙂

In your reproducibility paper, and as evidenced by your code, you mention:

The components of the $\mathcal{L}_{rec}$ loss were scaled according to authors’ suggestion in our correspondence: 0.1 for $\mathcal{L}_{rec}^x$ and $\mathcal{L}_{LPIPS}$, 1 for $\mathcal{L}_{rec}^w$.

Was this what the authors used for their work too? Or was it a suggestion specific to your work? Did they adjust it per dataset by any chance?

NoahVl commented 2 years ago

Hey, no worries!

The main author told us the following:

"We used a learning rate of 0.002 and reconstruction loss both on the image (using LPIPS) and on the W-vector (using L1 loss), both with weight 0.1."

The learning rate didn't work for us, however if I look at your quote now it'd seem that the $\mathcal{L}_{rec}^w$ should be weighted by 0.1. I just looked at the code and it seems like we did do it correctly according to the author's correspondence. So it's likely a mistake in our paper, thanks for spotting it! We'll try to get it changed :)

We had the same question for the main author regarding the scaling of the losses per dataset and he said the following:

"Was the scaling of the individual losses tuned to a specific dataset, or was the training found to be stable with the same scalings? Additionally, is it safe to assume the KL loss is not scaled?"

"We used the same scores for all of the datasets. And yes, we used a scale of 1.0 for the KL loss."
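The settings quoted from the author in this thread can be collected in one place. A minimal sketch: the class and function names are made up for illustration, the loss values in the usage line are arbitrary placeholders, and the adversarial term is left out because no weight for it is quoted here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReportedSettings:
    """Values quoted from the author's correspondence in this thread."""
    learning_rate: float = 0.002
    total_batch_size: int = 16     # 8 GPUs x 2 per GPU
    training_steps: int = 250_000
    lpips_weight: float = 0.1      # image reconstruction (LPIPS)
    w_l1_weight: float = 0.1       # W-vector reconstruction (L1)
    kl_weight: float = 1.0         # classifier KL loss


def weighted_loss(lpips_term: float, w_l1_term: float, kl_term: float,
                  cfg: ReportedSettings = ReportedSettings()) -> float:
    """Combine only the three terms whose weights are quoted above."""
    return (cfg.lpips_weight * lpips_term
            + cfg.w_l1_weight * w_l1_term
            + cfg.kl_weight * kl_term)


# Placeholder loss values, only to show how the weights are applied.
total = weighted_loss(lpips_term=0.5, w_l1_term=2.0, kl_term=0.3)
```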

Also this might be of interest to you:

"Did you assign a separate learning rate to individual modules? E.g. the mapping network’s lr is scaled by 0.01 in the StyleGAN paper. Our preliminary results show that giving the encoder network a lower learning rate leads to stabler training, but exploring all possible combinations is costly."

"No, we haven't tried that - sounds very interesting!"

Maybe at different resolutions this can lead to different results, so maybe you want to play around with putting it back to the same learning rate as the rest of the network. Could explain part of your encoder issues.

tmabraham commented 2 years ago

Thank you for the extra details!

Scaling the LR of the encoder network seems like an interesting idea. By the way, it's not clear to me where the mapping network's LR is being scaled both in your code and in the StyleGAN2-ADA-pytorch codebase, so I am actually not sure if that is happening in the code.

I'll also try scaling the reconstruction loss, maybe that will help in some way...

NoahVl commented 2 years ago

I think with mapping network we might've meant the network that maps the outputs of the encoder into stylespace by doing the affine transformation (linear layer). You can control this learning rate by changing lr_mlp. It scales the weights such that the gradients will be lower for the MLP than for the rest of the network:

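import torch
import torch.nn.functional as F
from torch import nn


class EqualLinear(nn.Module):
    def __init__(self, in_dim, out_dim, lr_mul=1, bias=True):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_dim, in_dim))
        # Keep the attribute defined even when bias=False so forward() does not fail.
        self.bias = nn.Parameter(torch.zeros(out_dim)) if bias else None
        self.lr_mul = lr_mul

    def forward(self, input):
        # Scaling weight and bias by lr_mul also scales their gradients by lr_mul.
        bias = self.bias * self.lr_mul if self.bias is not None else None
        return F.linear(input, self.weight * self.lr_mul, bias=bias)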
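To make the effect of lr_mul concrete, here is a small check using the same weight * lr_mul scaling as the forward pass above. The tensors and the helper name are made up for illustration; the loss is a plain sum of the outputs.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(4, 8)
base_weight = torch.randn(3, 8)


def weight_grad(lr_mul: float) -> torch.Tensor:
    """Gradient of sum(output) with respect to the stored, unscaled weight."""
    weight = base_weight.clone().requires_grad_(True)
    out = F.linear(x, weight * lr_mul)  # same scaling as in EqualLinear.forward
    out.sum().backward()
    return weight.grad


grad_full = weight_grad(1.0)
grad_small = weight_grad(0.1)

# For this linear loss the gradient scales exactly with lr_mul.
ratio_ok = torch.allclose(grad_small, 0.1 * grad_full, atol=1e-6)
print(ratio_ok)
```

Note that Adam largely normalizes away gradient magnitude, so with Adam the slowdown comes mainly from the effective weight (weight * lr_mul) moving lr_mul times less per update of the stored weight.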

There's also ttur_mult in the cli file, which scales the learning rate of the discriminator. I don't think there's an option to scale the lr of the encoder. You could try implementing it by optimizing the parameters separately, but that might lead to instabilities, I'm not sure.

# init optimizers
generator_params = list(self.G.parameters()) + list(self.S.parameters()) + list(self.encoder.parameters())
self.G_opt = Adam(generator_params, lr=self.lr, betas=(0.5, 0.9))
self.D_opt = Adam(self.D.parameters(), lr=self.lr * ttur_mult, betas=(0.5, 0.9))
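One way to give the encoder its own learning rate without a second optimizer is to use Adam's per-parameter-group options. A minimal sketch, assuming stand-in nn.Linear modules in place of G, S and the encoder and a hypothetical base rate of 2e-4; the 1e-5 encoder rate is the value quoted later in this thread for stylex_train_new.py. Whether this trains stably is untested here.

```python
from torch import nn
from torch.optim import Adam

# Stand-in modules; the real G, S and encoder live in the training script.
G = nn.Linear(8, 8)
S = nn.Linear(8, 8)
encoder = nn.Linear(8, 8)

lr = 2e-4          # hypothetical base learning rate
encoder_lr = 1e-5  # encoder rate mentioned later in this thread

G_opt = Adam(
    [
        # This group inherits the default lr passed below.
        {"params": list(G.parameters()) + list(S.parameters())},
        # This group overrides the default with its own lr.
        {"params": list(encoder.parameters()), "lr": encoder_lr},
    ],
    lr=lr,
    betas=(0.5, 0.9),
)

group_lrs = [group["lr"] for group in G_opt.param_groups]
print(group_lrs)
```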

Sorry for my belated response, I'm quite busy with my own projects but feel free to ask me more questions or let me know how your project is progressing!

tmabraham commented 2 years ago

Why is encoder hard-coded to an LR of 1e-5 in stylex_train_new.py? Did you notice stabler training with that? What happened when the LR was higher? https://github.com/NoahVl/Explaining-In-Style-Reproducibility-Study/blob/8222a107ac66a1cf5935b60961119a9e818bc7df/stylex/stylex_train_new.py#L968

rtadijar commented 2 years ago

Higher learning rates on the encoder gave us unstable training, iirc. We had to lower it quite a bit to get training going.