Closed — jscarlson closed this issue 3 years ago
Hi @jscarlson , The problem with the loss converging to an undesirable result could be because of non-optimal hyperparameters. Just to verify that all the parameters are looking good, could you send the command you ran?
Thanks for the quick response!
This run was:
python scripts/train.py --dataset_type=font_inversion --exp_dir=./expr_inv2 --workers=16 --batch_size=16 --test_batch_size=16 --test_workers=16 --val_interval=5000 --save_interval=5000 --encoder_type=GradualStyleEncoder --start_from_latent_avg --lpips_lambda=1 --l2_lambda=1 --id_lambda=0 --w_norm_lambda=0 --output_size=64 --use_wandb --stylegan_weights=/path/to/stylegan2-pytorch/checkpoint/250000.pt
So a lot of the defaults.
I've since tried many runs with e.g. moco lambda set to 0.5, different encoder types, adam vs. ranger, vgg vs. alex for lpips, and all exhibit the same sort of quick convergence.
Another thing to note is that the size of the images I'm working with is 64x64, but the transforms are scaling it up to 256x256... I would set it to the original size, but I believe the repo threw an error when I tried that.
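The upscaling described above is just a standard transform step applied before the image reaches the encoder. A minimal sketch of what happens to a 64x64 input (using PIL as a stand-in for whatever transform pipeline the repo configures — this is illustrative, not the repo's code):

```python
from PIL import Image

# Stand-in for the repo's input transform: a 64x64 source image is
# upscaled to 256x256 before it ever reaches the encoder.
source = Image.new("RGB", (64, 64), color="white")
upscaled = source.resize((256, 256), Image.BICUBIC)
print(upscaled.size)  # (256, 256)
```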
Here are some thoughts that come to mind:
Based on the parameters you sent, I think something that could definitely be useful here is the MoCo-based similarity loss. You can use this by setting --moco_lambda=0.5, for example. This is like an identity loss, but for domains that are not faces.
A note on how we scaled our loss weights: if you noticed, we set --lpips_lambda=0.8 and used a batch size of 8, so that's a weight of 0.1 when you average over the number of samples. If you are using a batch size of 16, you could try setting a weight of 1.6, but I really don't think that this is the core of your problem.
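To make the scaling arithmetic explicit (this is just the per-sample averaging described above, not code from the repo):

```python
# Effective per-sample loss weight = lambda / batch_size.
def effective_weight(lambda_value, batch_size):
    return lambda_value / batch_size

# Paper setting: lpips_lambda=0.8 at batch size 8 -> 0.1 per sample.
print(effective_weight(0.8, 8))   # 0.1
# Matching per-sample weight at batch size 16 -> lambda of 1.6.
print(effective_weight(1.6, 16))  # 0.1
```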
I am curious if resizing to 128x128 rather than 256x256 would help in your case. I think the code should work with that size if you make some small adjustments. Maybe the smaller resize could help?
Finally, I have a follow-up work called ReStyle that builds on pSp and is able to reach much better reconstruction on multiple domains. The codebase works identically to pSp, so if you're still unable to get good results with pSp, trying ReStyle could help. I think this is especially true because ReStyle should be better at reconstructing finer details.
Another bonus with ReStyle is that the code also supports a ResNet encoder that is pre-trained on ImageNet. With pSp, the encoder is pre-trained for facial recognition. If you use the ResNet pre-trained on ImageNet, it could help.
Feel free to let me know if you have any other questions or want to further discuss possible approaches.
Thank you again for the thorough feedback! I'm interested in trying out ReStyle as well.
As it turned out, what helped dramatically was performing no resizing at all, i.e., letting the StyleGAN2 generator generate at 64x64 (the same resolution I pre-trained it on) and computing the losses at 64x64. I was initially deterred from training pSp at this original resolution because of some error messages I received when trying to do so, and didn't think much of scaling up the images (I assumed the model weights would just learn to adapt to the resizing), but it actually made all the difference in the world!
For others facing a similar problem: in particular, you can see in the model's forward pass that resize is silently set to True, which applies a resize to the generator's outputs before the losses are computed. Setting resize to False disables this, and then --output_size can really do its thing!
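A hedged sketch of the behavior being described (modeled on pSp's forward pass, where a 256x256 adaptive pooling layer is applied when resize is on; the exact variable names in the repo may differ):

```python
import torch

# With resize=True, the generator's native-resolution output is pooled to
# 256x256 before losses see it; with resize=False it is left untouched.
face_pool = torch.nn.AdaptiveAvgPool2d((256, 256))
generator_output = torch.randn(1, 3, 64, 64)  # native 64x64 StyleGAN2 output

resize = True
images = face_pool(generator_output) if resize else generator_output
print(images.shape)  # torch.Size([1, 3, 256, 256]) -- losses see resized images

resize = False
images = face_pool(generator_output) if resize else generator_output
print(images.shape)  # torch.Size([1, 3, 64, 64]) -- losses see native resolution
```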
Really cool to see that the simplest solution is usually what works :)
I've actually been planning on doing something similar in something I'm trying and wanted to make sure I understood what you did that worked for you.
So your original images are of size 64x64, which you resized to 256x256 before passing to your encoder. Then, the outputs that you got from StyleGAN are of size 64x64. And when computing the losses, you computed them between the original 64x64 images and the 64x64 outputs you got from the generator?
Hi! Sorry for the delayed response.
Essentially, there was no resizing at all. What I did was:

- keep the original 64x64 images as both inputs and targets (no upscaling transform),
- let the StyleGAN2 generator output at its native 64x64 resolution, and
- compute all the losses directly at 64x64.
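Concretely, as a minimal torch sketch of that setup (all names hypothetical, not code from the pSp repo): everything stays at 64x64, and the reconstruction losses are computed at that resolution.

```python
import torch
import torch.nn.functional as F

# Inputs/targets and generator outputs all stay at the native 64x64
# resolution; no upscaling transform is applied anywhere.
batch = torch.rand(4, 3, 64, 64)            # original 64x64 images
reconstructions = torch.rand(4, 3, 64, 64)  # stand-in for 64x64 StyleGAN2 outputs

l2_loss = F.mse_loss(reconstructions, batch)  # computed directly at 64x64
print(float(l2_loss))  # scalar reconstruction loss
```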
Hi!
Firstly, thank you for developing this model! A StyleGAN encoder seems invaluable for future conditional image synthesis research.
Right now, I'm training a pSp encoder for inversion on non-facial images (images of characters and letters from various fonts), and have been experiencing the problem of the loss quickly converging to unsatisfactory performance.
Two additional things to note: (a) using the StyleGAN2 model I pre-trained, I can use the rosinality projection script to nearly perfectly project given characters into the StyleGAN2 latent space, so I know there exist latent codes in W+ that correspond to the input images; and (b) the pSp encoder is definitely doing something reasonable, as the output images at various global steps do appear to start resembling their sources/targets — just far from sufficiently, and very slowly, and judging from the loss the model looks like it has hit some local minimum.
My questions: is this to be expected, i.e., do I just need to train for much longer, given that the loss is still somewhat decreasing? And, given that the model is doing something reasonable, is getting good inversion for my image domain just a matter of hyperparameter tuning? If so, do you have any suggestions for loss weights, learning rate, etc.?
Thanks!