eladrich / pixel2style2pixel

Official Implementation for "Encoding in Style: a StyleGAN Encoder for Image-to-Image Translation" (CVPR 2021) presenting the pixel2style2pixel (pSp) framework
https://eladrich.github.io/pixel2style2pixel/
MIT License

General clarification of real to cartoon image conversion #295

Closed: kimyanna closed this issue 1 year ago

kimyanna commented 1 year ago

Hello, my goal is to convert an image of any face into its cartoon version. I have a set of ~6k image pairs (a real face and its cartoon version) for training. To be able to convert a never-before-seen face into its cartoon version, I need to do the following:

  1. Use the ~6k image pairs to train a pSp encoder using scripts/train.py (this will likely take several hundred thousand iterations). I'm saving best_model.pt and watching the loss go down during training.
  2. Once best_model.pt starts producing satisfactory results, I can use it as the --checkpoint_path argument of scripts/inference.py and feed a previously unseen, aligned face image into scripts/inference.py (see the sketch after this list).
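Concretely, for step 2 I imagine something like the following minimal sketch, paraphrasing the logic of scripts/inference.py (the checkpoint and image paths are hypothetical placeholders):

```python
from argparse import Namespace

import torch
import torchvision.transforms as transforms
from PIL import Image

from models.psp import pSp  # the pSp model class from this repo

# Hypothetical paths -- substitute your own trained checkpoint and test image.
checkpoint_path = 'experiments/cartoon/checkpoints/best_model.pt'
image_path = 'inference_data/aligned_face.jpg'

# The options used at training time are stored inside the checkpoint; reuse them.
ckpt = torch.load(checkpoint_path, map_location='cpu')
opts = ckpt['opts']
opts['checkpoint_path'] = checkpoint_path
net = pSp(Namespace(**opts)).eval().cuda()

# Standard pSp preprocessing: resize to 256x256 and normalize to [-1, 1].
transform = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.ToTensor(),
    transforms.Normalize([0.5, 0.5, 0.5], [0.5, 0.5, 0.5]),
])

img = transform(Image.open(image_path).convert('RGB')).unsqueeze(0).cuda()
with torch.no_grad():
    cartoon = net(img, randomize_noise=False)  # cartoonized output tensor in [-1, 1]
```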

Is my understanding of the process correct? Would this process produce a model able to turn an image of a face into its cartoon version? Also, do you have a recommendation on how to split train/test data in dataset_psp?

Thank you!!

yuval-alaluf commented 1 year ago

You understood correctly :) When defining the data, you can set your source to the path of the real images and the target to the path of the cartoon images (with corresponding file names). Inference is then performed as you mentioned: you can pass new aligned face images and get their cartoon versions. One small note: training should converge much faster than that. You can examine the images output during training to see when the test results look good. Regarding the data splitting, you can leave, say, 10% of the pairs as a test set just to verify that the model is able to generalize to unseen faces.
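For the split, a minimal sketch along those lines (the directory names are hypothetical; the resulting paths would then be registered in configs/paths_config.py and referenced from a dataset entry in configs/data_configs.py, as described in the README):

```python
import random
import shutil
from pathlib import Path

# Hypothetical layout: real faces and their cartoon counterparts share file names.
src_real = Path('dataset_psp/real')
src_cartoon = Path('dataset_psp/cartoon')

names = sorted(p.name for p in src_real.iterdir())
random.seed(0)
random.shuffle(names)

n_test = len(names) // 10  # hold out ~10% of the pairs as a test set
splits = {'test': names[:n_test], 'train': names[n_test:]}

for split, split_names in splits.items():
    for domain, src_dir in [('real', src_real), ('cartoon', src_cartoon)]:
        out_dir = Path(f'dataset_psp/{split}/{domain}')
        out_dir.mkdir(parents=True, exist_ok=True)
        for name in split_names:
            shutil.copy(src_dir / name, out_dir / name)
```

The four resulting directories map onto the train_source_root / train_target_root and test_source_root / test_target_root fields of a dataset entry in configs/data_configs.py.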

kimyanna commented 1 year ago

Perfect, thank you for the great answer! It seems there are two approaches to converting real faces into cartoons:

  1. The one described above, where we train a pSp encoder on real→cartoon pairs and then feed a new face into the best model
  2. Using the pSp encoder to get a latent representation of a real face (in NPZ format) and then feeding that latent into a fine-tuned (cartoon) StyleGAN model

It seems both of these approaches would achieve a similar outcome; are there pros/cons to each of them?

Thank you!

yuval-alaluf commented 1 year ago

You're correct that those are the two main approaches that I am familiar with. If you have paired data, I think the first approach is almost always preferred, since you're explicitly training the model to achieve a mapping between real and cartoon pairs. If you don't have paired data, then you must use the second approach and train only with real face images. The downside there is that the supervision is very weak, so the output images are not very "cartoon-like"; if you train too much, you may start getting images that look more like real faces than cartoon faces. Since you have paired data, I think the first option is the way to go.
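For reference, a rough sketch of the second approach: invert the real face with a pretrained pSp encoder, then decode the resulting W+ codes with a generator fine-tuned on cartoons. This builds on the inference sketch above (img is a preprocessed, aligned face batch), and it assumes the fine-tuned generator follows the StyleGAN2 architecture bundled with this repo; the toonify.pt checkpoint name is a hypothetical placeholder:

```python
from argparse import Namespace

import torch

from models.psp import pSp
from models.stylegan2.model import Generator  # StyleGAN2 generator bundled with this repo

# Pretrained FFHQ inversion encoder from this repo; the path is a placeholder.
encoder_path = 'pretrained_models/psp_ffhq_encode.pt'
ckpt = torch.load(encoder_path, map_location='cpu')
opts = ckpt['opts']
opts['checkpoint_path'] = encoder_path
net = pSp(Namespace(**opts)).eval().cuda()

# img: an aligned face batch, preprocessed as in the earlier inference sketch.
with torch.no_grad():
    # return_latents=True also yields the W+ codes fed to the decoder.
    # (These codes could equally be saved to / loaded from an .npz file.)
    _, latents = net(img, randomize_noise=False, return_latents=True)

# Hypothetical fine-tuned (cartoon) StyleGAN2 checkpoint saved with a 'g_ema' key.
fine_tuned = Generator(1024, 512, 8)
fine_tuned.load_state_dict(torch.load('toonify.pt')['g_ema'])
fine_tuned.eval().cuda()

with torch.no_grad():
    # Decode the inverted W+ codes with the cartoon generator.
    cartoon, _ = fine_tuned([latents], input_is_latent=True, randomize_noise=False)
```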

kimyanna commented 1 year ago

Perfect, thank you for the great explanation! Closing the issue.