indigopyj closed this issue 3 years ago.
Hi,
There is definitely something wrong if the loss becomes NaN after the first update. Make sure that you use a batch size that is a multiple of the number of GPUs.
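For reference, here is a minimal sketch of the kind of sanity checks I mean; `batch_size`, `num_gpus`, and `loss` are placeholder names, not identifiers from this repo:

```python
import torch

num_gpus = torch.cuda.device_count()
batch_size = 12  # placeholder value: pick something divisible by num_gpus

# Multi-GPU training splits each batch across the GPUs, so a batch size
# that is not a multiple of the GPU count can leave one GPU with an
# empty or undersized shard.
assert batch_size % num_gpus == 0, (
    f"batch_size={batch_size} is not a multiple of num_gpus={num_gpus}"
)

def check_loss(loss: torch.Tensor) -> None:
    # Call this inside the training loop to stop as soon as the loss degenerates.
    if not torch.isfinite(loss):
        raise RuntimeError(f"Loss became {loss.item()}; aborting this run")
```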
If the problem still appears, it may be worth checking your kaolin build or your PyTorch setup. Which versions are you currently using (PyTorch and CUDA)?
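If it helps, a quick way to report your environment (assuming your kaolin build exposes `__version__`, which recent builds do):

```python
import torch
import kaolin

# Print the details that matter for reproducing the NaN issue.
print("PyTorch:", torch.__version__)
print("CUDA (compiled against):", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())
print("GPU count:", torch.cuda.device_count())
print("kaolin:", kaolin.__version__)
```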
I used CUDA 10.1 and PyTorch 1.7, but the problem was resolved when I downgraded PyTorch from 1.7 to 1.6! Thanks anyway!
I just followed your training steps (I didn't change the code at all) and was training the GAN on the CUB dataset, but I got this result during epoch 0.
I am confused because I changed nothing and just followed your instructions. The only thing I changed was the number of GPUs, from 4 to 3.