Closed — jscarlson closed this issue 3 years ago
Hi @jscarlson , The problem with the loss converging to an undesirable result could be because of non-optimal hyperparameters. Just to verify that all the parameters are looking good, could you send the command you ran?
Thanks for the quick response!
This run was:
python scripts/train.py --dataset_type=font_inversion --exp_dir=./expr_inv2 --workers=16 --batch_size=16 --test_batch_size=16 --test_workers=16 --val_interval=5000 --save_interval=5000 --encoder_type=GradualStyleEncoder --start_from_latent_avg --lpips_lambda=1 --l2_lambda=1 --id_lambda=0 --w_norm_lambda=0 --output_size=64 --use_wandb --stylegan_weights=/path/to/stylegan2-pytorch/checkpoint/250000.pt
So a lot of the defaults.
I've since tried many runs with e.g. moco lambda set to 0.5, different encoder types, adam vs. ranger, vgg vs. alex for lpips, and all exhibit the same sort of quick convergence.
Another thing to note is that the size of the images I'm working with is 64x64, but the transforms are scaling it up to 256x256... I would set it to the original size, but I believe the repo threw an error when I tried that.
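The upscaling described above is just a standard transform step applied before the image reaches the encoder. A minimal sketch of what happens to a 64x64 input (using PIL as a stand-in for whatever transform pipeline the repo configures — this is illustrative, not the repo's code):

```python
from PIL import Image

# Stand-in for the repo's input transform: a 64x64 source image is
# upscaled to 256x256 before it ever reaches the encoder.
source = Image.new("RGB", (64, 64), color="white")
upscaled = source.resize((256, 256), Image.BICUBIC)
print(upscaled.size)  # (256, 256)
```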
Here are some thoughts that come to mind:
Based on the parameters you sent, I think something that could definitely be useful here is the MoCo-based similarity loss. You can use this by setting --moco_lambda=0.5, for example. This is like an identity loss, but for domains that are not faces.
A note on how we scaled our loss weights: if you noticed, we set --lpips_lambda=0.8 and used a batch size of 8, so that's a weight of 0.1 when you average over the number of samples. If you are using a batch size of 16, you could try setting a weight of 1.6, but I really don't think that this is the core of your problem.
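To make the scaling arithmetic explicit (this is just the per-sample averaging described above, not code from the repo):

```python
# Effective per-sample loss weight = lambda / batch_size.
def effective_weight(lambda_value, batch_size):
    return lambda_value / batch_size

# Paper setting: lpips_lambda=0.8 at batch size 8 -> 0.1 per sample.
print(effective_weight(0.8, 8))   # 0.1
# Matching per-sample weight at batch size 16 -> lambda of 1.6.
print(effective_weight(1.6, 16))  # 0.1
```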
I am curious if resizing to 128x128 rather than 256x256 would help in your case. I think the code should work with that size if you make some small adjustments. Maybe the smaller resize could help?
Finally, I have a follow-up work called ReStyle that builds on pSp and is able to reach much better reconstruction on multiple domains. The codebase works identically to pSp, so if you're still unable to get good results with pSp, trying ReStyle could help. I think this is especially true because ReStyle should be better at reconstructing finer details.
Another bonus with ReStyle is that the code also supports a ResNet encoder that is pre-trained on ImageNet. With pSp, the encoder is pre-trained for facial recognition. If you use the ResNet pre-trained on ImageNet, it could help.
Feel free to let me know if you have any other questions or want to further discuss possible approaches.
Thank you again for the thorough feedback! I'm interested in trying out ReStyle as well.
As it turned out, what helped dramatically was performing no resizing at all, i.e., letting the StyleGAN2 generator generate at 64x64 (the same resolution I pre-trained it on) and computing the losses at 64x64. I was initially deterred from training pSp at this original resolution because of some error messages I received when trying to do so, and didn't think much of scaling up the images (I assumed the model weights would just learn to adapt to the resizing), but it actually made all the difference in the world!
For others facing a similar problem: in particular, you can see in the model's forward pass that resize is silently set to True, which applies a resize to the generator's outputs before the losses are computed. Setting resize to False disables this, and then --output_size can really do its thing!
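A hedged sketch of the behavior being described (modeled on pSp's forward pass, where a 256x256 adaptive pooling layer is applied when resize is on; the exact variable names in the repo may differ):

```python
import torch

# With resize=True, the generator's native-resolution output is pooled to
# 256x256 before losses see it; with resize=False it is left untouched.
face_pool = torch.nn.AdaptiveAvgPool2d((256, 256))
generator_output = torch.randn(1, 3, 64, 64)  # native 64x64 StyleGAN2 output

resize = True
images = face_pool(generator_output) if resize else generator_output
print(images.shape)  # torch.Size([1, 3, 256, 256]) -- losses see resized images

resize = False
images = face_pool(generator_output) if resize else generator_output
print(images.shape)  # torch.Size([1, 3, 64, 64]) -- losses see native resolution
```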
Really cool to see that the simplest solution is usually what works :)
I've actually been planning on doing something similar in something I'm trying and wanted to make sure I understood what you did that worked for you.
So your original images are of size 64x64, which you resized to 256x256 before passing to your encoder. Then, the outputs that you got from StyleGAN are of size 64x64. And when computing the losses, you computed them between the original 64x64 images and the 64x64 outputs you got from the generator?
Hi! Sorry for the delayed response.
Essentially, there was no resizing at all. What I did was:

- keep the original 64x64 images as both inputs and targets (no upscaling transform),
- let the StyleGAN2 generator output at its native 64x64 resolution, and
- compute all the losses directly at 64x64.
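Concretely, as a minimal torch sketch of that setup (all names hypothetical, not code from the pSp repo): everything stays at 64x64, and the reconstruction losses are computed at that resolution.

```python
import torch
import torch.nn.functional as F

# Inputs/targets and generator outputs all stay at the native 64x64
# resolution; no upscaling transform is applied anywhere.
batch = torch.rand(4, 3, 64, 64)            # original 64x64 images
reconstructions = torch.rand(4, 3, 64, 64)  # stand-in for 64x64 StyleGAN2 outputs

l2_loss = F.mse_loss(reconstructions, batch)  # computed directly at 64x64
print(float(l2_loss))  # scalar reconstruction loss
```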
Hi!
Firstly, thank you for developing this model! A StyleGAN encoder seems invaluable for future conditional image synthesis research.
Right now, I'm training a pSp encoder for inversion on non-facial images (images of characters and letters from various fonts), and have been experiencing the problem of the loss quickly converging to unsatisfactory performance.
Two additional things to note: (a) using the StyleGAN2 model I pre-trained, I can use the rosinality projection script to nearly perfectly project given characters into the StyleGAN2 latent space, so I know there exist latent codes in W+ that correspond to the input images; and (b) the pSp encoder is definitely doing something reasonable, as the output images at various global steps do appear to start resembling their sources/targets — just far from sufficiently, and very slowly, and judging from the loss the model looks like it has hit some local minimum.
My questions: is this to be expected, i.e., do I just need to train for much longer, given that the loss is still somewhat decreasing? And, given that the model is doing something reasonable, is getting good inversion for my image domain just a matter of hyperparameter tuning? If so, do you have any suggestions for loss weights, learning rate, etc.?
Thanks!