eladrich / pixel2style2pixel

Official Implementation for "Encoding in Style: a StyleGAN Encoder for Image-to-Image Translation" (CVPR 2021) presenting the pixel2style2pixel (pSp) framework
https://eladrich.github.io/pixel2style2pixel/
MIT License

Test results are quite different from the training results #264

Closed · binzzheng closed this 2 years ago

binzzheng commented 2 years ago

Hi, thank you for sharing such great work. I applied pSp to a face dataset, RAVDESS, which consists of 2500 videos recorded by 24 professional actors. I use the images of 20 actors as the training set and the images of the remaining 4 as the test set.

I should note that my task is similar to StyleGAN encoding, i.e., I want the network output to be the same as the target. In fact, my input is only slightly different from the target. So I used a setup similar to the standard pSp encoder. Details as follows:

--dataset_type=ffhq_encode \
--exp_dir=/path/to/experiment \
--workers=8 \
--batch_size=8 \
--test_batch_size=8 \
--test_workers=8 \
--val_interval=2500 \
--save_interval=5000 \
--encoder_type=GradualStyleEncoder \
--start_from_latent_avg \
--lpips_lambda=0.8 \
--l2_lambda=1 \
--id_lambda=0.1
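
For context, a dataset type like this is hooked up through the repo's config files; below is a minimal sketch of such an entry. The `ravdess_encode` name and the path keys are placeholders I made up, and the dict layout assumes `configs/data_configs.py` still follows its existing `ffhq_encode` example:

```python
# Sketch of registering RAVDESS frames in configs/data_configs.py.
# 'ravdess_encode' and the 'ravdess_*' path keys are placeholders;
# the layout mirrors the repo's 'ffhq_encode' entry.
from configs import transforms_config
from configs.paths_config import dataset_paths

DATASETS = {
    'ravdess_encode': {
        'transforms': transforms_config.EncodeTransforms,
        # source == target: the encoder is asked to reconstruct the input frame
        'train_source_root': dataset_paths['ravdess_train'],
        'train_target_root': dataset_paths['ravdess_train'],
        'test_source_root': dataset_paths['ravdess_test'],
        'test_target_root': dataset_paths['ravdess_test'],
    },
}
```

The corresponding `ravdess_train` / `ravdess_test` entries in `configs/paths_config.py` would then point at the extracted frame folders.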

I trained the network for 200,000 iterations, and the output during training basically achieves my goal. But it performs badly on the test set.

Training results: you can see that pSp works quite well and is able to generate faces that are faithful to the original image. [training result images at iterations 200000 and 205000]

Test results: the generated faces are completely blurry and far from the training results. [test result images 0051_200000 and 0026_200000]

Personally, I think the model may be overfitting. During training I randomly sample a single frame from each of the 2000 training videos (each video contains about 100 frames), e.g., picking one random frame out of frames 0~100, roughly as in the sketch below. Since these 2000 videos only contain the faces of 20 actors, and the training results look fine, I'm guessing there is too little training data? I wonder if it is possible to first train the model on the FFHQ dataset and then finetune it on the RAVDESS dataset.
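
For reference, the per-video sampling looks roughly like this (a minimal sketch; the directory layout and function name are placeholders, not code from the repo):

```python
import os
import random

def sample_one_frame_per_video(video_root):
    """Pick one random frame from each video's extracted-frame folder.

    Assumes a placeholder layout of video_root/<video_id>/<frame>.png,
    i.e. one subfolder of extracted frames per RAVDESS video.
    """
    sampled = []
    for video_id in sorted(os.listdir(video_root)):
        frame_dir = os.path.join(video_root, video_id)
        frames = sorted(f for f in os.listdir(frame_dir) if f.endswith('.png'))
        if frames:
            sampled.append(os.path.join(frame_dir, random.choice(frames)))
    return sampled

# ~2000 training videos -> ~2000 sampled frame paths per pass over the data
train_frames = sample_one_frame_per_video('/path/to/ravdess/train_frames')
```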

I would like to hear your opinion on this situation.

yuval-alaluf commented 2 years ago

It seems like you are indeed overfitting. Even though you have multiple videos of the same actors, the videos themselves are highly correlated, so it makes sense that you would overfit here.

But I think the main problem is that your input images are not aligned. StyleGAN2 only knows how to generate aligned faces, and the images you are passing to your network are not aligned, which explains the artifacts at test time. You don't see these artifacts on the training data because you've overfitted to it and are therefore able to overcome them, but on the test set, which is not aligned and was never seen during training, you get a lot of artifacts.

The solution is to pre-align all your data before training. In any case, I would start training from an encoder trained on FFHQ and then finetune on your data. But either way, you need to align all your data beforehand.
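
A minimal sketch of that kind of pre-alignment step, using dlib's 68-point landmark detector: it assumes the repo's `scripts/align_all_parallel.py` exposes an `align_face(filepath, predictor)` helper (worth verifying against your checkout) and that `shape_predictor_68_face_landmarks.dat` has been downloaded from dlib; all paths below are placeholders.

```python
# Pre-align every frame before training/testing so the data matches
# the FFHQ-style alignment that StyleGAN2 expects.
import os
import dlib
from scripts.align_all_parallel import align_face  # assumed helper from the repo

predictor = dlib.shape_predictor('shape_predictor_68_face_landmarks.dat')

src_dir = '/path/to/ravdess/raw_frames'      # placeholder paths
dst_dir = '/path/to/ravdess/aligned_frames'
os.makedirs(dst_dir, exist_ok=True)

for name in sorted(os.listdir(src_dir)):
    # align_face is expected to return a PIL image of the aligned crop
    aligned = align_face(os.path.join(src_dir, name), predictor)
    aligned.save(os.path.join(dst_dir, os.path.splitext(name)[0] + '.png'))
```

For the finetuning route, loading an FFHQ-trained pSp checkpoint via the training script's `--checkpoint_path` option should work, though the exact flag name is worth double-checking in `options/train_options.py`.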