snakch closed this issue 3 years ago
Since your domain is very different from the toonify domain, I don't think that using the toonify generator will be useful here. You should consider training a StyleGAN generator on your domain and then training pSp with the resulting generator.
Thank you, that makes sense.
One thing I'm a little unsure about then: if I train a StyleGAN2 generator on this domain, can I not map from real faces to cartoon faces with the vanilla StyleGAN architecture alone? What benefits does the pSp framework bring?
In order to map a given image into the StyleGAN latent space you need to invert the image (i.e. convert it from an image into its latent code). pSp allows you to do just that, even when the source and target domains are different (which in this case is true since you're mapping from real images to cartoon images). There are other works that use direct latent vector optimization to invert the image. However, this will typically not work when trying to map images between two different domains.
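For intuition, direct latent-vector optimization looks roughly like the toy sketch below. A fixed linear map stands in for a frozen, pretrained generator (a real StyleGAN is far more complex); all names and numbers here are illustrative, not from the pSp codebase:

```python
import numpy as np

# Toy sketch of per-image latent optimization. A fixed linear map stands
# in for a frozen, pretrained generator; all names here are hypothetical.
rng = np.random.default_rng(0)
G = rng.standard_normal((64, 512)) * 0.05    # frozen "generator" weights

def generate(w):
    return G @ w

target = generate(rng.standard_normal(512))  # the "image" we want to invert
w = np.zeros(512)                            # latent code to optimize

lr = 0.05
for _ in range(500):                         # hundreds of steps *per image*
    residual = generate(w) - target
    w -= lr * (G.T @ residual)               # gradient of 0.5 * ||G w - x||^2
```

An encoder like pSp replaces this whole per-image loop with a single forward pass, which is why it can run in roughly 0.1 seconds per image.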
Gotcha, my confusion arose from a too-hasty reading of Justin Pinkney's blog post on the original StyleGAN Toonify model. If I understand correctly, rather than training an encoder, they invert images essentially by running gradient descent to find the noise variable that generates the target image, which is obviously very expensive. Am I correct in saying, then, that pSp achieves the same (or similar) quality much more cheaply?
Also as an aside, do I take it that the easiest way for training my own generator is to use https://github.com/rosinality/stylegan2-pytorch ?
Yeah, pSp can perform the translation in about 0.1 seconds, with no per-image optimization and no need for paired data.
> Also as an aside, do I take it that the easiest way for training my own generator is to use https://github.com/rosinality/stylegan2-pytorch ?
There are a couple of options for training a generator:
Thank you so much for taking the time to help :)
Hi again, so I've trained my own StyleGAN generator on my domain, and it creates decent-looking samples. Now I'm using it to train pSp on FFHQ in the hope of getting a real_face -> cartoon_face model. However, the results still look quite bad after many training steps.
Here are some details:
Finally, I trained pSp with this generator, changing channel_multiplier to 1 in the decoder. I train it on FFHQ resized to 256 px, and the results still look somewhat terrifying after about 20k steps (they look bad all the way through).
Can you think of anything I'm definitely doing wrong? Is the above procedure somewhat close to how the Toonify model would have been trained? Thanks!
> the results still look somewhat terrifying after about 20k steps (they look bad all the way through)
Those results do indeed look terrifying.
I am wondering if, other than the change to the channel multiplier, you also changed the output resolution of the generator. For example,
Generator(256, 512, 8, channel_multiplier=1)
Also, in the pSp encoder definition we have: https://github.com/eladrich/pixel2style2pixel/blob/0c83c42a913adc42d0ba0dabfa7d5b25b8f10ffd/models/encoders/psp_encoders.py#L57-L59 Notice how we hard-coded the number of style vectors to 18. In your case, since you have a 256x256 generator, you should change this to 14. While I don't think this will make a difference, it's always good to verify.
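The style count follows directly from the output resolution: StyleGAN2 uses two style vectors per resolution block, so a quick sanity check (assuming a rosinality-style generator) is:

```python
import math

# Number of style vectors a StyleGAN2 generator expects: two per
# resolution block, i.e. 2 * log2(output_resolution) - 2.
def n_styles(resolution):
    return 2 * int(math.log2(resolution)) - 2

print(n_styles(1024))  # 18 -- the value hard-coded in psp_encoders.py
print(n_styles(256))   # 14 -- what a 256x256 generator needs
```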
Thanks for the suggestions! I did not change the output resolution, so my decoder looks like
Generator(1024, 512, 8, channel_multiplier=1)
(To be honest, with the different variable names in rosinality's versus NVIDIA's implementation, it was a little hard to figure out how to get the right dimensions to work out...)
When I have a bit more time later I'll try to perform the changes above and see what I get.
Yeah, so just to confirm: trying to change the output dimension of the generator leads to errors (the state dict has keys which aren't expected).
I'll try retraining with 14 style count and see if that makes a difference. Do let me know if you have other suggestions!
Before you retrain your generator again, did you make the change to the self.style_count variable? What keys aren't expected? This could probably be solved without retraining your generator, which would take a long time and may be unnecessary.
I've changed self.style_count to 14 and encountered a few issues: it seemed that there was a mismatched number of self.convs layers in /models/stylegan2/model.py.
I've launched a run where I ignore the later layers (I also had to ignore the later entries in the latent_avg variable). Let's see where that goes.
I feel like that won't be it, however: as far as I can tell, even though I trained on a 256x256 dataset, the generator has 18 style layers...
Got it. It's weird that you have a 256x256 generator with 18 style layers.
In any case, once you're able to match all the generator parameters with the pSp parameters, I believe you'll be getting much better results.
I'm closing this issue for now, but feel free to open it if you have any other questions.
Once more I come hat in hand with some questions. I think the parameters are working; my results look more believable now. However, it seems that the model is learning to always map to the same face. It correctly learns things like the lighting of the face, complexion, hair colour, and maybe jaw shape, but not facial identity or face angle.
My hyperparameters are:
"--workers=8",
"--batch_size=4",
"--max_steps=21000",
"--test_batch_size=8",
"--test_workers=8",
"--val_interval=1000",
"--save_interval=3000",
"--encoder_type=GradualStyleEncoder",
"--start_from_latent_avg",
"--lpips_lambda=0.4",
"--id_lambda=1.0",
"--w_norm_lambda=0.02",
"--l2_lambda=1",
"--board_interval=100",
I've tried messing a little with w_norm_lambda (tried 0.025, 0.02 and 0.01), as well as lpips_lambda (tried 0.4 and 0.8).
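To make the tuning concrete, the lambda flags above weigh the individual loss terms in a weighted sum. A tiny sketch (the per-term loss values below are dummy numbers, not real model outputs):

```python
# Dummy illustration of how the lambda flags weigh the loss terms.
loss_weights = {
    "l2": 1.0,       # --l2_lambda
    "lpips": 0.4,    # --lpips_lambda
    "id": 1.0,       # --id_lambda
    "w_norm": 0.02,  # --w_norm_lambda
}

def total_loss(losses, weights):
    return sum(weights[name] * value for name, value in losses.items())

dummy_losses = {"l2": 0.05, "lpips": 0.30, "id": 0.60, "w_norm": 2.0}
print(total_loss(dummy_losses, loss_weights))  # roughly 0.81
```

With these weights, the id and l2 terms dominate the objective, so small tweaks to w_norm_lambda alone may not change much.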
Can you see if I'm missing something obvious? Maybe the identity loss is struggling to do its job properly? I'll attach an example output here, but I can attach more if useful, e.g. to demonstrate the progress of training. This is after ~17k steps of the above configuration:
PS: Once again, thank you so much for your help, it's really cool to be able to use your awesome codebase for my own little project :)
PPS: Hopefully you'll see this, I tried reopening the issue but I don't think I can since you closed it last
Hi @snakch ,
The results definitely look better, but not close to what we'd want to see :smile:
I think your configuration seems good. The challenge here is perhaps performing the translation without paired data.
For the toonify task we were able to translate the images without paired data, but I wonder if it is more difficult to do so here because of your generator?
Correct me if I am wrong, but you took an FFHQ generator and fine-tuned it to your domain. This is a bit different from how Justin Pinkney built his toonify generator. If I remember correctly, they train a generator from scratch on toon images and then perform layer swapping between the toonify and FFHQ generators to obtain the blended generator.
Maybe you can perform a similar process?
Generally, the more your generator's latent space differs from FFHQ's latent space, the more difficult this translation will be, so you will want to make sure the two generators are relatively well-aligned.
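For concreteness, layer swapping amounts to building a new state dict that takes the coarse (low-resolution) layers from one generator and the fine (high-resolution) layers from the other. A toy sketch (the "convs.N" key naming mimics rosinality's convention; real checkpoints hold tensors and have many more keys, and which half comes from which model is a choice worth experimenting with):

```python
# Toy layer-swapping sketch. Each entry just records which model it came
# from, so we can see where each layer of the blend originates.
base = {f"convs.{i}.weight": "ffhq" for i in range(14)}
finetuned = {f"convs.{i}.weight": "toon" for i in range(14)}

def blend(base_state, finetuned_state, swap_index):
    """Coarse layers (index < swap_index) from the fine-tuned generator,
    fine layers from the base generator."""
    blended = {}
    for key in base_state:
        idx = int(key.split(".")[1])
        blended[key] = finetuned_state[key] if idx < swap_index else base_state[key]
    return blended

blended = blend(base, finetuned, swap_index=6)
print(blended["convs.0.weight"], blended["convs.13.weight"])  # toon ffhq
```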
Other than that, this is a bit beyond the scope of what we tried to do, but I hope this can be of help.
Yes, you're exactly right, and that sounds like the missing ingredient indeed. I'll try to perform the same thing.
So just to make sure I understand the process is:
1) train a StyleGAN2 generator on your domain (using transfer learning if necessary)
2) perform layer blending with the human domain (let's say FFHQ)
3) train the pSp encoder using the blended generator on the inversion task, again using a human-domain dataset like FFHQ
correct?
Thank you !
Sorry, I missed your response! You understood the process correctly. I hope that, since the two generators will be more aligned, you will get more realistic results.
No worries, I should also close the issue for now - I think that was the real issue.
I'm realising now that the difference between the domains is quite large, probably larger than for the toonify model, so I'll have to tinker with the generator and layer blending, but I should hopefully get there!
Hi, I wonder if you can help me.
Basically I'd like to train a model similar to the Toonify model, except on a different target domain (I'm going for a more hand-drawn cartoony style) with unpaired data - about 1000 examples.
I've tried training a model starting from the ffhq_cartoon_blended weights for about 12,000 steps (batch size 4) with the recommended toonify hyperparameters. However, the outputs don't look very good; they look like an overlaid version of the 'default' toonify face and the target image. (See below.) I wonder if you have advice for more successful training. Thanks!