NVlabs / stylegan3

Official PyTorch implementation of StyleGAN3

Resuming with a different GPU completely resets the training. #159

Closed: AlizerUncaged closed this issue 2 years ago

AlizerUncaged commented 2 years ago

Describe the bug
I started training on a small dataset of about 100K images at 256x256 resolution on a Tesla T4 on Colab. After roughly 5 hours of training I decided to resume on Kaggle with a Tesla P100, using the same .pkl file from the point where the Colab training ended, but the generated fakes are the same blurry images produced when training first started on Colab. I used the exact same dataset for both trainings.

I am using the following command to train:

!python stylegan3/train.py --outdir "/content/runs" --cfg stylegan3-t --data "/content/anime-faces.zip" --batch-gpu 16 --gpus 1 --batch 32 --snap 5 --gamma 2 --metrics none --resume "/content/network.pkl"
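
A quick way to rule out a bad snapshot before restarting a long run is to load the resume pickle and generate a single image from it. The sketch below assumes the stylegan3 repo is cloned and importable and that /content/network.pkl is the snapshot passed to --resume; if the output looks trained rather than like early blurry noise, the pickle itself contains the trained weights.

```python
# Minimal sanity check (sketch): load the resume snapshot and render one image.
# Paths are taken from the command above; the unconditional-model assumption
# (G.c_dim == 0 for the anime-faces dataset) is a guess.
import torch
import PIL.Image

import dnnlib   # from the stylegan3 repo
import legacy   # from the stylegan3 repo

network_pkl = '/content/network.pkl'   # the snapshot passed to --resume
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

with dnnlib.util.open_url(network_pkl) as f:
    G = legacy.load_network_pkl(f)['G_ema'].to(device)  # EMA generator, the one used for the fakes grids

with torch.no_grad():
    z = torch.randn([1, G.z_dim], device=device)      # random latent
    label = torch.zeros([1, G.c_dim], device=device)  # empty tensor for an unconditional model
    img = G(z, label)                                  # [1, 3, H, W] in roughly [-1, 1]
    img = (img.permute(0, 2, 3, 1) * 127.5 + 128).clamp(0, 255).to(torch.uint8)

PIL.Image.fromarray(img[0].cpu().numpy(), 'RGB').save('resume_check.png')
```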

To Reproduce
Steps to reproduce the behavior:

  1. Start training on Colab (Tesla T4).
  2. Resume training on Kaggle (Tesla P100).

Fakes generated where it left off on Colab: image

Fakes generated after resuming on a different GPU on Kaggle: image

Expected behavior
The generated fakes on the different GPU should be the same as where I left off on Colab.


Additional context
I only resumed via the .pkl file generated by Colab and nothing else.

AlizerUncaged commented 2 years ago

Never mind, I was able to fix it by copying the entire run folder from Colab to Kaggle (the one containing the .json, fakes, and log files) and starting training from the .pkl file inside it.
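
A related note on picking the right snapshot: a StyleGAN3 run directory stores numbered network-snapshot-XXXXXX.pkl files, so when copying a run folder it is worth resuming from the latest one rather than an early pickle. A small sketch under that assumption (the run_dir path below is hypothetical):

```python
# Sketch: find the newest snapshot in a copied run folder so that --resume
# points at the most-trained weights. Assumes the standard StyleGAN3 naming
# scheme network-snapshot-XXXXXX.pkl; the run_dir path is made up.
import glob
import os

run_dir = '/kaggle/working/colab-run'   # hypothetical: wherever the Colab run folder was copied
snapshots = sorted(glob.glob(os.path.join(run_dir, 'network-snapshot-*.pkl')))
assert snapshots, f'no snapshots found in {run_dir}'

latest = snapshots[-1]   # the kimg counter is zero-padded, so a lexicographic sort works
print(f'--resume "{latest}"')
```

The snapshot stores the networks themselves (G, D, and G_ema), so resuming from the latest one carries over the trained weights, even though the kimg counter in the new run will typically start again from zero.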