clovaai/stargan-v2: StarGAN v2 - Official PyTorch Implementation (CVPR 2020)

Issue replicating results #78

Open · NOlivier-Inria opened this issue 4 years ago

NOlivier-Inria commented 4 years ago

I've tried to replicate your results for CelebA-HQ using your instructions.

First, downloading the data:

bash download.sh celeba-hq-dataset
bash download.sh wing
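
For anyone checking their setup: after these downloads, the training command below expects the data in roughly the layout sketched here, with female/male subfolders encoding the two domains (per --num_domains 2). The exact tree is my recollection and worth verifying against your own download:

data/celeba_hq/
├── train/
│   ├── female/
│   └── male/
└── val/
    ├── female/
    └── male/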

Then running the training command given in this repo for 100k iterations (the only difference being --img_size 128):

python main.py --mode train --num_domains 2 --w_hpf 1 \
               --lambda_reg 1 --lambda_sty 1 --lambda_ds 1 --lambda_cyc 1 \
               --train_img_dir data/celeba_hq/train \
               --val_img_dir data/celeba_hq/val --img_size 128

The results I am getting are noticeably worse than those of your pretrained model: [sample image attached]

FID (latent : reference) at 100k: (16.80 : 19.65) instead of (13.73 : 23.84).
LPIPS (latent : reference) at 100k: (0.232 : 0.228) instead of (0.452 : 0.389).
Interestingly, performance is better at 50k iterations: FID of (13.49 : 19.46), LPIPS of (0.303 : 0.252).
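
(For reference, these FID/LPIPS numbers come from the repo's evaluation mode; a command along these lines should regenerate them at 128px. The checkpoint and eval directories are the README defaults, and --resume_iter 100000 is an assumption matching the run above:)

python main.py --mode eval --num_domains 2 --w_hpf 1 \
               --resume_iter 100000 --img_size 128 \
               --train_img_dir data/celeba_hq/train \
               --val_img_dir data/celeba_hq/val \
               --checkpoint_dir expr/checkpoints/celeba_hq \
               --eval_dir expr/eval/celeba_hq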

I use pytorch 1.4.0, torchvision 0.5.0, and cudatoolkit 10.0.130, as in the dependency install instructions: conda install -y pytorch=1.4.0 torchvision=0.5.0 cudatoolkit=10.0 -c pytorch
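
A quick sanity check that the installed versions actually match (plain torch/torchvision attributes, nothing repo-specific):

import torch, torchvision
print(torch.__version__)          # expect 1.4.0
print(torchvision.__version__)    # expect 0.5.0
print(torch.version.cuda)         # CUDA toolkit the wheels were built against, expect 10.0
print(torch.cuda.is_available())  # True if the GPU is visible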

What could explain this behavior? Could it be that 128px training requires different hyperparameters to replicate the results?

yunjey commented 4 years ago

@NOlivier-Inria

Could it be that 128px training requires different hyperparameters to replicate the results?

Yes, it is better to try several hyperparameters for different resolutions. Reducing w_hpf makes the generator produce more diverse images while preserving less of the source identity. Try values of w_hpf lower than 1 (e.g., 0.1, 0.25, 0.5).
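
Concretely, that just means rerunning the training command from above with only the --w_hpf flag lowered, e.g.:

python main.py --mode train --num_domains 2 --w_hpf 0.5 \
               --lambda_reg 1 --lambda_sty 1 --lambda_ds 1 --lambda_cyc 1 \
               --train_img_dir data/celeba_hq/train \
               --val_img_dir data/celeba_hq/val --img_size 128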

NOlivier-Inria commented 4 years ago

Thank you for your answer. I re-trained with lower w_hpf and got slightly better results.
w_hpf = 0.25: FID (15.82 : 18.57), LPIPS (0.260 : 0.241)
w_hpf = 0.01: FID (13.42 : 16.87), LPIPS (0.266 : 0.232)

It is interesting that the FID is slightly better, but the LPIPS remains worse (than both the 256px model and the 50k-iteration 128px one). I suppose that the missing conv layer, compared to a 256px network, can help explain it.
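
To make the "missing conv layer" point concrete: the generator's number of down/up-sampling blocks grows with log2 of the image size in this repo's model.py, so a 128px network has one fewer stage than a 256px one. A minimal sketch of that scaling (the "- 4" offset is my reading of the code and may differ by the extra w_hpf-dependent block):

import math

# Sketch (not the repo's code): how generator depth scales with resolution.
# stargan-v2's model.py derives the number of down/up-sampling blocks from
# log2(img_size); the offset constant here is an assumption.
def num_sampling_blocks(img_size: int) -> int:
    return int(math.log2(img_size)) - 4

print(num_sampling_blocks(256))  # 4
print(num_sampling_blocks(128))  # 3 -> one fewer conv stage at 128px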