eladrich / pixel2style2pixel

Official Implementation for "Encoding in Style: a StyleGAN Encoder for Image-to-Image Translation" (CVPR 2021) presenting the pixel2style2pixel (pSp) framework
https://eladrich.github.io/pixel2style2pixel/
MIT License

Training a 'cartoonify' model with unpaired data. #105

Closed · snakch closed this 3 years ago

snakch commented 3 years ago

Hi, I wonder if you can help me.

Basically I'd like to train a model similar to the Toonify model, except on a different target domain (I'm going for a more hand-drawn cartoony style) with unpaired data - about 1000 examples.

I've tried training a model starting from the ffhq_cartoon_blended weights for about 12,000 steps (batch size 4) with the recommended toonify hyperparameters. However, the outputs don't look very good: they look like an overlaid version of the 'default' toonify face and the target image (see below). I wonder if you have any advice for more successful training. Thanks!

[image: 0007_12000]

yuval-alaluf commented 3 years ago

Since your domain is very different from the toonify domain, I don't think that using the toonify generator will be useful here. You should consider training a StyleGAN generator on your domain and then training pSp with the resulting generator.

snakch commented 3 years ago

Thank you, that makes sense.

One thing I'm a little unsure about then is: if I train a StyleGAN2 generator on this domain, can I not perform the task of mapping from real faces to cartoon faces just with the vanilla StyleGAN architecture? What benefits does the psp framework bring?

yuval-alaluf commented 3 years ago

In order to map a given image into the StyleGAN latent space you need to invert the image (i.e. convert it from an image into its latent code). pSp allows you to do just that, even when the source and target domain are different (which in this case is true since you're mapping from real images to cartoon images). There are other works that use direct latent vector optimization to invert the image. However, this will typically not work when trying to map images between two different domains.
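To make the difference concrete, here is a rough sketch of what optimization-based inversion looks like, assuming a rosinality-style `Generator` and the `lpips` package (this is only for illustration, not code from our repo). pSp replaces this entire per-image loop with a single encoder forward pass:

```python
import torch
import lpips  # perceptual similarity package (pip install lpips); used here just for illustration

def invert_by_optimization(g, target, steps=500, lr=0.01):
    """Rough sketch of per-image latent optimization with a rosinality-style Generator.
    `target` is a 1x3xHxW image tensor in [-1, 1] on the same device as `g`."""
    with torch.no_grad():
        w = g.mean_latent(10000)                    # 1 x 512 average latent
    w = w.unsqueeze(1).repeat(1, g.n_latent, 1)     # 1 x n_styles x 512 (W+)
    w.requires_grad_(True)

    percept = lpips.LPIPS(net="vgg").to(target.device)
    opt = torch.optim.Adam([w], lr=lr)

    for _ in range(steps):                          # hundreds of forward/backward passes...
        synth, _ = g([w], input_is_latent=True)
        loss = percept(synth, target).mean() + 0.1 * torch.nn.functional.mse_loss(synth, target)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return w                                        # ...versus a single pSp encoder forward pass
```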

snakch commented 3 years ago

Gotcha, my confusion arose from a too-hasty reading of Justin Pinkney's blog post on the original StyleGAN Toonify model. If I understand correctly, they achieve "toonification" by inverting images essentially through gradient descent, finding the latent that generates the target image, which is obviously very expensive. Am I correct in saying, then, that pSp achieves the same (or similar) quality much more cheaply?

Also as an aside, do I take it that the easiest way for training my own generator is to use https://github.com/rosinality/stylegan2-pytorch ?

yuval-alaluf commented 3 years ago

Yea, pSp can be used to perform the translation in about 0.1 seconds with no per-image optimization and no need for paired data.

> Also as an aside, do I take it that the easiest way for training my own generator is to use https://github.com/rosinality/stylegan2-pytorch ?

There are a couple of options for training a generator:

  1. Using the rosinality code directly. The resulting generator should be directly compatible with the pSp repo (a rough loading sketch is shown after this list), so no additional steps should be needed.
  2. Using NVIDIA's TensorFlow StyleGAN2 or StyleGAN-ADA implementation and then using the conversion script in rosinality's repo to get the PyTorch weights, which are then compatible with our repo.
  3. Using NVIDIA's PyTorch StyleGAN-ADA implementation and using Justin Pinkney's conversion script which he linked in this issue: https://github.com/eladrich/pixel2style2pixel/issues/104#issuecomment-803886944
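Whichever option you choose, you end up with a rosinality-format checkpoint that you point pSp at via --stylegan_weights. Roughly speaking, it gets consumed like this (a sketch; the checkpoint is assumed to hold a `g_ema` entry, and `latent_avg` may or may not be present depending on how it was produced):

```python
import torch
from models.stylegan2.model import Generator  # pSp's bundled copy of the rosinality model

# paths and sizes here are placeholders for your own setup
ckpt = torch.load("pretrained_models/my_cartoon_generator.pt", map_location="cpu")

decoder = Generator(1024, 512, 8)                    # (output_size, style_dim, n_mlp)
decoder.load_state_dict(ckpt["g_ema"], strict=True)  # fails loudly on key/shape mismatches

# used when training pSp with --start_from_latent_avg
latent_avg = ckpt.get("latent_avg", None)
if latent_avg is None:
    latent_avg = decoder.mean_latent(10000)
```
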
snakch commented 3 years ago

Thank you so much for taking the time to help :)

snakch commented 3 years ago

Hi again. So I've trained my own StyleGAN generator on my domain, and it creates decent-looking samples. Now I'm using it to train pSp on FFHQ in the hope of getting a real_face -> cartoon_face model. However, the results still look quite bad after many training steps.

Here are some details:

Finally, I trained pSp with this generator, changing channel_multiplier to 1 in the decoder. I trained it on FFHQ resized to 256 px, and the results still look somewhat terrifying after about 20k steps (they look bad all the way through).

[image: 1400]

Can you think of anything I'm definitely doing wrong? Is the above procedure somewhat close to how the Toonify model would have been trained? Thanks!

yuval-alaluf commented 3 years ago

> the results still look somewhat terrifying after about 20k steps (they look bad all the way through)

Those results do indeed look terrifying.
I am wondering whether, other than the change to the channel multiplier, you also changed the output resolution of the Generator. For example:

Generator(256, 512, 8, channel_multiplier=1)

Also, in the pSp encoder definition we have: https://github.com/eladrich/pixel2style2pixel/blob/0c83c42a913adc42d0ba0dabfa7d5b25b8f10ffd/models/encoders/psp_encoders.py#L57-L59
Notice how we hard-coded the number of style vectors to 18. In your case, since you have a 256x256 generator, you should change this to 14. While I don't think this will make a difference, it's always good to verify.
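The number of style vectors follows directly from the output resolution; just for reference, the arithmetic is:

```python
import math

def n_styles(output_size: int) -> int:
    # two style inputs per resolution from 8x8 upward, plus two for the 4x4 block
    return int(math.log2(output_size)) * 2 - 2

assert n_styles(1024) == 18  # the value hard-coded in psp_encoders.py for FFHQ
assert n_styles(256) == 14   # what a 256x256 generator expects
```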

snakch commented 3 years ago

Thanks for the suggestions! I did not change the output resolution, so my decoder looks like

Generator(1024, 512, 8, channel_multiplier=1)

(To be honest, with the different variable names in rosinality's versus NVIDIA's implementation, it was a little hard to figure out how to get the right dimensions to work out...)

When I have a bit more time later I'll try to perform the changes above and see what I get.

snakch commented 3 years ago

Yeah, so just to confirm: trying to change the output dimension of the generator leads to errors (the state dict has keys which aren't expected).

I'll try retraining with a style count of 14 and see if that makes a difference. Do let me know if you have other suggestions!

yuval-alaluf commented 3 years ago

Before you retrain your generator, did you make the change to the self.style_count variable? Which keys aren't expected? This could probably be solved without retraining your generator, which would likely take a long time and may be unnecessary.

snakch commented 3 years ago

I've changed self.style_count to 14 and encountered a few issues: it seemed that there was a mismatched number of self.convs layers in /models/stylegan2/model.py.

I've launched a run where I ignore the later layers (I also had to ignore the later entries in the latent_avg variable); roughly, it looks like the sketch below. Let's see where that goes.
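This is just a sketch of my workaround, not code from the repo, assuming my checkpoint holds `g_ema` and a per-style `latent_avg`:

```python
import torch
from models.stylegan2.model import Generator

N_STYLES = 14  # what a 256x256 decoder expects

decoder = Generator(256, 512, 8, channel_multiplier=1)
decoder_state = decoder.state_dict()

ckpt = torch.load("pretrained_models/my_cartoon_generator.pt", map_location="cpu")

# keep only the generator weights whose names and shapes the smaller decoder recognises
filtered = {k: v for k, v in ckpt["g_ema"].items()
            if k in decoder_state and v.shape == decoder_state[k].shape}
decoder.load_state_dict(filtered, strict=False)

# my latent_avg is stored per style vector, so the extra entries get dropped as well
latent_avg = ckpt["latent_avg"][:N_STYLES]
```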

I feel like that won't be it, however: as far as I can tell, even though I trained on a 256x256 dataset, the generator has 18 style layers...

yuval-alaluf commented 3 years ago

Got it. It's weird that you have a 256x256 generator with 18 style layers.
In any case, once you're able to match all the generator parameters with the pSp parameters, I believe you'll get much better results.
I'm closing this issue for now, but feel free to reopen it if you have any other questions.

snakch commented 3 years ago

Once more I come hat in hand with some questions. I think the parameters are working; my results look more believable now. However, it seems that the model is always learning to map to the same face. It's correctly picking up things like lighting, complexion, hair colour, and maybe jaw shape, but not facial identity or face angle.

My hyperparameters are:

"--workers=8",
"--batch_size=4",
"--max_steps=21000",
"--test_batch_size=8",
"--test_workers=8",
"--val_interval=1000",
"--save_interval=3000",
"--encoder_type=GradualStyleEncoder",
"--start_from_latent_avg",
"--lpips_lambda=0.4",
"--id_lambda=1.0",
"--w_norm_lambda=0.02",
"--l2_lambda=1",
"--board_interval=100",

I've tried messing a little with w_norm_lambda (tried 0.025, 0.02, and 0.01) as well as with lpips_lambda (tried 0.4 and 0.8).
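For completeness, this is roughly how I launch the run; --dataset_type, --exp_dir and --stylegan_weights here are placeholders for my own dataset config, output folder and fine-tuned generator:

```python
import subprocess

args = [
    "python", "scripts/train.py",
    "--dataset_type=ffhq_encode",
    "--exp_dir=experiments/cartoonify",
    "--stylegan_weights=pretrained_models/my_cartoon_generator.pt",
    "--workers=8",
    "--batch_size=4",
    "--max_steps=21000",
    "--test_batch_size=8",
    "--test_workers=8",
    "--val_interval=1000",
    "--save_interval=3000",
    "--encoder_type=GradualStyleEncoder",
    "--start_from_latent_avg",
    "--lpips_lambda=0.4",
    "--id_lambda=1.0",
    "--w_norm_lambda=0.02",
    "--l2_lambda=1",
    "--board_interval=100",
]
subprocess.run(args, check=True)
```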

Can you see if I'm missing something obvious? Maybe the identity loss is struggling to do its job properly? I'll attach an example output here, but I can attach more if useful, e.g. to demonstrate the progress of training. This is after ~17k steps of the above configuration:

PS: Once again, thank you so much for your help, it's really cool to be able to use your awesome codebase for my own little project :)

[image: 0000_17000]

PPS: Hopefully you'll see this; I tried reopening the issue, but I don't think I can since you closed it last.

yuval-alaluf commented 3 years ago

Hi @snakch, the results definitely look better, but not close to what we'd want to see :smile:

I think your configuration seems good. The challenge here is perhaps performing the translation without paired data. For the toonify task we were able to translate the images without paired data, but I wonder if it is more difficult here because of your generator. Correct me if I am wrong, but you took an FFHQ generator and fine-tuned it to your domain. This is a bit different from how Justin Pinkney built his toonify generator: if I remember correctly, they train a generator from scratch on toon images and then perform layer swapping between the toonify and FFHQ generators to obtain the blended generator. Maybe you can perform a similar process? Generally, the more your generator's latent space differs from FFHQ's latent space, the more difficult this translation will be, so you will want to make sure the two generators are relatively well aligned.

Other than that, this is a bit beyond the scope of what we tried to do, but I hope this can be of help.
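Just to illustrate what layer swapping means in practice, here is a rough sketch in the rosinality weight format (not from our repo; the crossover resolution, and which checkpoint supplies the coarse layers, are exactly the knobs Pinkney experiments with):

```python
import torch

def layer_resolution(name):
    """Map a rosinality-format generator parameter name to the resolution it operates at."""
    if name.startswith(("input.", "conv1.", "to_rgb1.")):
        return 4
    if name.startswith("convs."):
        return 2 ** (3 + int(name.split(".")[1]) // 2)
    if name.startswith("to_rgbs."):
        return 2 ** (3 + int(name.split(".")[1]))
    return None  # mapping network ("style.") and noise buffers are left untouched below

def blend(coarse_src, fine_src, swap_resolution=32):
    # layers at or below swap_resolution come from coarse_src, everything else from fine_src
    blended = dict(fine_src)
    for name, weight in coarse_src.items():
        res = layer_resolution(name)
        if res is not None and res <= swap_resolution:
            blended[name] = weight
    return blended

# placeholder paths; both checkpoints are assumed to be in the rosinality format
ffhq = torch.load("stylegan2-ffhq.pt", map_location="cpu")["g_ema"]
toon = torch.load("stylegan2-cartoon.pt", map_location="cpu")["g_ema"]

# e.g. coarse structure from the cartoon model, fine texture from FFHQ
torch.save({"g_ema": blend(toon, ffhq)}, "stylegan2-blended.pt")
```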

snakch commented 3 years ago

Yes, you're exactly right, and that sounds like the missing ingredient indeed. I'll try to perform the same thing.

So just to make sure I understand the process is:

1) Train a StyleGAN2 generator on your domain (using transfer learning if necessary).
2) Perform layer blending with the human domain (let's say FFHQ).
3) Train the pSp encoder on the inversion task using the blended generator, again with a human-domain dataset like FFHQ.

correct?

Thank you!

yuval-alaluf commented 3 years ago

Sorry, I missed your response! You understood the process correctly. Since the two generators will be more aligned, you will hopefully get more realistic results.

snakch commented 3 years ago

No worries, I should also close the issue for now - I think that was the real issue.

I'm realising now that the difference in domains is quite large, probably larger than for the toonify model, so I'll have to tinker with the generator and layer blending, but I should hopefully get there!