JegZheng / truncated-diffusion-probabilistic-models

Pytorch implementation of TDPM
MIT License

loss functions become NAN after training just 4 steps #3

Open WCH102588BY opened 1 year ago

WCH102588BY commented 1 year ago

Hi, authors:

INFO - diffusion.py - 2023-01-04 22:14:24,391 - Epoch: 0, step: 1, loss: 3301.66259765625, implicit loss: 0.6348565220832825, data time: 0.10419130325317383
INFO - diffusion.py - 2023-01-04 22:14:28,334 - Epoch: 0, step: 2, loss: 2478.3759765625, implicit loss: 0.693336009979248, data time: 0.05294513702392578
INFO - diffusion.py - 2023-01-04 22:14:28,975 - Epoch: 0, step: 3, loss: 1856.43701171875, implicit loss: 0.7036036252975464, data time: 0.0358583132425944
INFO - diffusion.py - 2023-01-04 22:14:29,614 - Epoch: 0, step: 4, loss: 1508.215087890625, implicit loss: 0.6875728964805603, data time: 0.026934266090393066
INFO - diffusion.py - 2023-01-04 22:14:30,250 - Epoch: 0, step: 5, loss: nan, implicit loss: nan, data time: 0.02159852981567383
INFO - diffusion.py - 2023-01-04 22:14:30,885 - Epoch: 0, step: 6, loss: nan, implicit loss: nan, data time: 0.018257300059000652
INFO - diffusion.py - 2023-01-04 22:14:31,520 - Epoch: 0, step: 7, loss: nan, implicit loss: nan, data time: 0.015853064400809153
INFO - diffusion.py - 2023-01-04 22:14:32,156 - Epoch: 0, step: 8, loss: nan, implicit loss: nan, data time: 0.013890713453292847
INFO - diffusion.py - 2023-01-04 22:14:32,790 - Epoch: 0, step: 9, loss: nan, implicit loss: nan, data time: 0.012373791800604926

My problem is shown in the log above: both losses become nan after only four steps.

I am training on the CIFAR10 dataset with the default cifar10.yml, changing nothing except batch_size, which I set to 64. Configuration: Python 3.9, torch 1.13.
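Not this repo's code, but a generic PyTorch sketch for localizing where the NaN first appears: enable anomaly detection, check the loss for finiteness before stepping, and clip gradients. The model and optimizer here are placeholders standing in for the real training loop.

```python
import torch

# Report the backward op that produced a NaN/Inf (slow; debugging only).
torch.autograd.set_detect_anomaly(True)

# Placeholder model/optimizer standing in for the repo's training loop.
model = torch.nn.Linear(8, 1)
opt = torch.optim.Adam(model.parameters(), lr=2e-4)

x = torch.randn(4, 8)
loss = model(x).pow(2).mean()

opt.zero_grad()
loss.backward()

# Fail fast the moment the loss stops being finite.
if not torch.isfinite(loss):
    raise RuntimeError(f"non-finite loss: {loss.item()}")

# Cap the gradient norm so one bad batch cannot blow up the weights.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
opt.step()
```

If the loss is already NaN at step 5, bisecting with checks like this (on the loss, on intermediate activations, on parameter gradients) usually narrows the failure down to one op.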

Moreover, I don't quite understand this setting:
channel_base = 32768, # Overall multiplier for the number of channels. With resolution [4, 8, 16, 32], is it 4*8*16*32? How is this value derived?

I hope you can give some advice. Thank you!