eladrich / pixel2style2pixel

Official Implementation for "Encoding in Style: a StyleGAN Encoder for Image-to-Image Translation" (CVPR 2021) presenting the pixel2style2pixel (pSp) framework
https://eladrich.github.io/pixel2style2pixel/
MIT License

Pushing performance #126

Closed. snakch closed this issue 3 years ago.

snakch commented 3 years ago

Hi there!

So I've managed to train a model on my own dataset which is starting to look very good. There are still some details I'd like to improve if possible. For context, I am attempting to perform a task similar to the Toonify model, but on a different domain: I've trained and blended my own StyleGAN2 model, and I'm trying pSp on FFHQ. The main issue I'm seeing is that the outputs tend to come out frontalized, so the pose of the input face is not preserved.

The training parameters I use are:

"--batch_size=4", "--max_steps=21000", "--encoder_type=GradualStyleEncoder", "--start_from_latent_avg", "--lpips_lambda=0.4", "--id_lambda=1.0", "--w_norm_lambda=0.02", "--l2_lambda=1"

My transforms are:

from torchvision import transforms

transforms_dict = {
    'transform_gt_train': transforms.Compose([
        transforms.Resize((256, 256)),
        transforms.RandomHorizontalFlip(0.5),
        transforms.ToTensor(),
        transforms.Normalize([0.5, 0.5, 0.5], [0.5, 0.5, 0.5])]),
    'transform_source': None,
    'transform_test': transforms.Compose([
        transforms.Resize((256, 256)),
        transforms.ToTensor(),
        transforms.Normalize([0.5, 0.5, 0.5], [0.5, 0.5, 0.5])]),
    'transform_inference': transforms.Compose([
        transforms.Resize((256, 256)),
        transforms.ToTensor(),
        transforms.Normalize([0.5, 0.5, 0.5], [0.5, 0.5, 0.5])])
}

I tried removing RandomHorizontalFlip, but to no avail for the face-orientation issue.

All of this may just be a limitation of the encoder + my own StyleGAN. I'm already quite happy with the results, but any suggestions you have would be great!

Here are some training images illustrating what I mean: [image: 13300]

And for reference, some samples of my StyleGAN2 model: [images: 000003, 000015]

yuval-alaluf commented 3 years ago

Hi @snakch, thanks for providing all the details! It is very surprising that your results end up looking frontalized, since your transforms and losses look good (and you even mentioned you removed the flips). I would have expected the encoder to capture the pose of the inputs much better, so it would be interesting to understand why that is. Unfortunately, I can't think of a single change that would obviously help with this.

Generally, we've found that performing the translation when you don't have paired data can be quite challenging. Therefore, in a recent paper of mine called ReStyle we introduced an "encoder bootstrapping" technique that achieves translation results on the toonification task that better capture the toon style. The code is available here and the relevant part for you is here. Basically, if you want, you can re-train your model using ReStyle and then pair the pSp ffhq_encode model with your ReStyle model to perform the translation. I found the technique is able to outperform what you would achieve using vanilla pSp. Even though we only tested it on the toonification task, I think it can also be relevant for your task. You are welcome to ask any questions about it here or in the other repo.
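In pseudocode, the bootstrapping idea is roughly the following (just a sketch with hypothetical names, not the actual ReStyle code; see the ReStyle repo for the real implementation):

import torch

def bootstrap_translate(x, ffhq_encoder, toon_encoder, toon_decoder, n_iters=5):
    # Step 1: invert the real photo with the encoder trained on real faces to
    # get a good initial latent and reconstruction.
    with torch.no_grad():
        latent = ffhq_encoder(x)
        y_hat, _ = toon_decoder([latent], input_is_latent=True)
        # Step 2: iteratively refine with the stylized-domain (ReStyle) encoder,
        # which sees the input together with the current output and predicts a
        # residual with respect to the current latent.
        for _ in range(n_iters):
            latent = latent + toon_encoder(torch.cat([x, y_hat], dim=1))
            y_hat, _ = toon_decoder([latent], input_is_latent=True)
    return y_hat, latent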

snakch commented 3 years ago

Looking forward to digging into ReStyle, and I'll keep scratching my head over the frontalization issue (I'm guessing ReStyle won't help there).

In the meantime I'll close this issue. Thank you!

snakch commented 3 years ago

Actually, sorry for reopening this, but I had a thought/question.

I'm wondering whether part of the issue with unpaired training is that, in order to get back to the original 'real faces' dataset, pSp learns encodings which correspond to the more realistic-looking part of my 'drawn faces' dataset. My current arguments for this are:

1) These 'more realistic' faces exist in my StyleGAN's space, since I used transfer learning from a StyleGAN trained on real faces, so it might have some residual memory of what a real face looks like. (This point is quite hand-wavy and needs verification.)

2) It seems intuitive that by mapping to these realistic faces, pSp will at least minimise the L2 and LPIPS losses.

The downside is that the outputs look less cartoonish than you might want.

I wonder, then, whether this can be fixed by moving the encoding very slightly, maybe only doing so for the early layers so that identity and higher-level features are preserved.

One could try doing this at inference, or maybe even at training time as a form of data augmentation. Have you tried something like this by chance?

yuval-alaluf commented 3 years ago

What you said definitely makes sense. At the end of the day the pSp encoder is trying to reconstruct the original input image, so it makes sense that the "drawn" style isn't preserved too well. In fact, if you train long enough, the "drawn" style will probably disappear completely (I saw this when training the toonify model).

What you said about moving the encoding makes sense. I'm curious what happens if you do the following:

  1. Use the pre-trained pSp encoder to encode a given image into a latent l.
  2. Move l towards the average latent code of your "drawing" StyleGAN. "Towards the average latent code" is not well defined here, since I am not entirely sure how much you need to move, and I am not even sure this makes sense :)

My thinking here is that step 1 will encode the image and preserve key features such as identity, while step 2 will push the latent towards the "drawn" style that you're looking for.
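If it helps, here is a minimal sketch of these two steps (assuming a trained pSp model `net` from this repo, which stores the generator's average latent in `net.latent_avg` when trained with --start_from_latent_avg; the interpolation factor alpha is a guess and would need tuning):

import torch

def encode_and_pull_to_average(net, x, alpha=0.7):
    with torch.no_grad():
        # Step 1: encode the image and get back its W+ latent code.
        _, codes = net(x, return_latents=True, randomize_noise=False)
        # Step 2: interpolate towards the average latent of the "drawing"
        # StyleGAN; alpha=1 keeps the original code, alpha=0 collapses to the
        # average face.
        codes = net.latent_avg + alpha * (codes - net.latent_avg)
        # Decode the shifted code with the "drawn" generator.
        images, _ = net.decoder([codes], input_is_latent=True, randomize_noise=False)
    return images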

P.S. What I described here is kind of similar to the idea I played with in ReStyle --- first encode the image and then learn a residual with respect to the encoded latent. But maybe you can get the same effect without needing to train a model at all.

snakch commented 3 years ago

OK, thanks! I'll play around with these ideas when I have a bit more time. Interesting stuff!

In the meantime I'm training a ReStyle model. It seems that face alignment is better there (though I suspect that's due to the slightly different loss parameters, since alignment seems to be preserved from the first iteration). I guess that's ML for you :P

snakch commented 3 years ago

So just as an update: what you suggested above is actually the same as doing truncation on the decoder, if I'm not mistaken. It was very quick to add that parameter back in, and here are a couple of results.

Below, the first column is real, the second is the pSp output, and the third one is with 0.7 truncation. All in all, not bad I think.

[screenshot: real vs. pSp output vs. 0.7 truncation]
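For reference, "adding that parameter back in" amounts to something like the following (a sketch assuming the rosinality-style StyleGAN2 generator used as the decoder in this repo, which accepts truncation and truncation_latent arguments):

import torch

def decode_with_truncation(net, x, truncation=0.7):
    with torch.no_grad():
        # Encode the input with the trained pSp model and get its W+ codes.
        _, codes = net(x, return_latents=True, randomize_noise=False)
        # Decode with truncation: the codes get pulled towards the mean latent
        # before being fed through the synthesis network.
        mean_latent = net.decoder.mean_latent(4096)
        images, _ = net.decoder([codes], input_is_latent=True,
                                randomize_noise=False,
                                truncation=truncation,
                                truncation_latent=mean_latent)
    return images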

yuval-alaluf commented 3 years ago

The rightmost images definitely seem to better capture the style, but they don't look much like the inputs. Are these inversions from the initial steps of training?

P.S. Nice trick with the truncation. It seems like a lot of people struggle with performing the translation while keeping the style, so something like this could definitely help in other cases.

snakch commented 3 years ago

Thanks!

Indeed, it was at the end of training. I think the issue is that when truncating, you're really moving all latent layers towards the average face. Probably one should only do that for the later layers, or at least for the layers that don't encode features that really matter for identity. I might experiment with that later; I'm running some other experiments at the moment.
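In case it's useful to anyone, the layer-wise version would look roughly like this (the layer split and alpha are guesses that would need tuning):

import torch

def truncate_later_layers(net, x, alpha=0.7, start_layer=8):
    with torch.no_grad():
        # W+ codes of shape [batch, n_styles, 512] from the trained pSp model.
        _, codes = net(x, return_latents=True, randomize_noise=False)
        # Pull only the later (finer) style layers towards the average latent,
        # leaving the coarse layers that carry pose/identity untouched.
        w_avg = net.latent_avg.expand_as(codes)
        codes[:, start_layer:] = (w_avg[:, start_layer:]
                                  + alpha * (codes[:, start_layer:] - w_avg[:, start_layer:]))
        images, _ = net.decoder([codes], input_is_latent=True, randomize_noise=False)
    return images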

snakch commented 3 years ago

So I've made a few tweaks which I think yield good results.

The starting point was the model trained on unpaired data, which, while preserving likeness, had lost the style and had this uncanny-valley effect:

[screenshot: outputs of the unpaired model]

So I decided to switch tack and create a bunch of paired data. The problem there is that style is preserved, but not identity, since the paired samples don't themselves preserve identity.
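(For concreteness, generating synthetic pairs of this sort looks roughly like the following; this is just a sketch of one way to do it, with g_real and g_blend standing for the original real-face generator and the blended "drawn" generator, both rosinality-style.)

import torch

def sample_pair(g_real, g_blend, batch_size=4, device='cuda'):
    with torch.no_grad():
        # Sample one latent and decode it with both generators so the two
        # images share the same underlying code.
        z = torch.randn(batch_size, 512, device=device)
        w = g_real.style(z)                               # shared W code
        x_real, _ = g_real([w], input_is_latent=True)     # "input" side of the pair
        x_drawn, _ = g_blend([w], input_is_latent=True)   # "target" side of the pair
    return x_real, x_drawn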

[image: 9000]

So I decided to change the ID loss terms: I added the ability to apply an ID loss against the input image, not just the paired target image.
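Roughly, the change amounts to something like this inside the coach's loss computation (a sketch of the kind of term, assuming the repo's IDLoss criterion, whose forward takes (y_hat, y, x); the 0.5 weight is just an example value):

loss_id, sim_improvement, id_logs = self.id_loss(y_hat, y, x)   # identity vs. the paired "drawn" target
loss_id_input, _, _ = self.id_loss(y_hat, x, x)                 # identity vs. the original input face
loss += self.opts.id_lambda * loss_id + 0.5 * loss_id_input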

[image: 51700]

Looking good! Except that this does not generalise so well to the domain of real images (i.e. images not generated by StyleGAN). E.g.:

[screenshot: results on real (non-StyleGAN) images]

Again, it makes sense... the model never saw real data. So I decided to finetune it on unpaired FFHQ data with basically only ID losses and almost no L2 or LPIPS loss, in order to preserve the style. After a few iterations:

[screenshot: results after finetuning on FFHQ]

Much better, I think! Anyway sorry for the long-winded post. I'm sure there's a lesson in there somewhere or something simpler I could have done, but I can't see it yet. Still I think the results are interesting!

yuval-alaluf commented 3 years ago

I think the results are pretty cool! I really like the flow, so I appreciate you sharing this; I found it useful myself and am sure other people will too. I personally like the idea of integrating both synthetic and real images in the training. I've done this myself at times and have seen other people use it as well, so it's cool to see that it helped a bit here too.

snakch commented 3 years ago

All made possible by having such a nicely organised and legible codebase :) I'll close the issue now.