NVlabs / NVAE

The Official PyTorch Implementation of "NVAE: A Deep Hierarchical Variational Autoencoder" (NeurIPS 2020 spotlight paper)
https://arxiv.org/abs/2007.03898

NaN values in gradients #29

Open liuem607 opened 3 years ago

liuem607 commented 3 years ago

Hi, in my experiment I used the Moving-MNIST dataset, but I ran into some problems during training that I couldn't find an answer to:

I tried to play with a small network by using only num_latent_scale=1 and num_groups_per_scale=1. I then noticed that no gradients were generated for some parameters, including prior.ftr0, and an error stopped the training.

If I increase num_groups_per_scale from 1 to 2 or more, I still get NaN in some of the gradients in the first iteration, but they go away afterwards and training continues without errors.

I'm wondering if you could provide a hint or clue as to why this behavior happens? Thank you in advance!

arash-vahdat commented 3 years ago

Hi, getting no gradients for num_latent_scale=1 and num_groups_per_scale=1 is strange. By no gradients, do you mean that the gradients were zero or None? If they were zero, do you see any change after some time of training?
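A quick way to tell the two cases apart is to inspect every parameter's .grad right after loss.backward() and before optimizer.step(). The helper below is only an illustrative sketch, not part of this repo (inspect_gradients and model are placeholder names):

```python
import torch

def inspect_gradients(model: torch.nn.Module) -> None:
    """Report, per parameter, whether the gradient is missing (None) or all zeros."""
    for name, param in model.named_parameters():
        if param.grad is None:
            print(f"{name}: grad is None (parameter never reached by backward)")
        elif torch.count_nonzero(param.grad) == 0:
            print(f"{name}: grad is all zeros")
        else:
            print(f"{name}: grad norm = {param.grad.norm().item():.4g}")
```

Calling it once after the backward pass should show which of the two cases prior.ftr0 falls into.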

Getting NaN in the gradients is natural, especially at the beginning of training. We are using mixed precision, which means that most operations are cast to FP16. Because of the lower precision, we may get NaN easily, and it is the job of autocast and grad_scalar to drop these gradient updates and scale the loss so that we don't get NaN.
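For context, this is the standard PyTorch AMP pattern (a generic sketch, not the repo's exact training loop; loader, model_loss and optimizer are placeholders). GradScaler.step() skips the optimizer update whenever the unscaled gradients contain inf/NaN, and GradScaler.update() then lowers the loss scale:

```python
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()  # the repo's grad_scalar plays this role

for x in loader:                       # placeholder data loader
    optimizer.zero_grad()
    with autocast():                   # forward pass runs mostly in FP16
        loss = model_loss(x)           # placeholder for the NVAE loss computation
    scaler.scale(loss).backward()      # backward pass on the scaled loss
    scaler.step(optimizer)             # silently skipped if grads contain inf/NaN
    scaler.update()                    # loss scale shrinks after a skipped step
```

So a NaN gradient in the first few iterations just triggers a skipped step and a smaller loss scale; it does not corrupt the parameters.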

You can disable mixed precision by supplying enabled=False to autocast() at this line: https://github.com/NVlabs/NVAE/blob/38eb9977aa6859c6ee037af370071f104c592695/train.py#L163
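The change is just the enabled flag on the context manager (the forward and loss calls below are illustrative, not copied from train.py):

```python
from torch.cuda.amp import autocast

with autocast(enabled=False):    # keep all wrapped ops in FP32
    output = model(x)            # illustrative forward pass
    loss = compute_loss(output)  # hypothetical loss helper
```

Training will run somewhat slower and use more memory in FP32, but it removes FP16 overflow as a source of NaN gradients.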