Hello,
Has anyone else seen the model output tensors that are entirely NaN during training? It happens to me from time to time, and I would like to know whether it is a problem with the model or with my setup.
I'm currently training a tiny version of the model so that it fits in RAM, so I had to drop some layers from the final stage and reduce the number of heads, the dimensions, etc.
EDIT:
I forgot to mention that I'm training in mixed precision because of memory constraints.
Do you have any idea why this can happen?
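As a point of reference, here is a minimal sketch of one way NaNs can appear under mixed precision, assuming the half-precision format is float16 (the post does not say which format or framework is used, so this is not a diagnosis of this particular run). float16 has a largest finite value of 65504, so an intermediate result above that overflows to inf, and a later operation such as inf - inf yields NaN. The helper name `find_non_finite` is hypothetical and only illustrates how non-finite values could be located.

```python
import numpy as np


def find_non_finite(named_arrays):
    """Return the names of arrays that contain any NaN or inf value.

    `named_arrays` is a hypothetical dict mapping a label to an array,
    for example activations or outputs copied out of the training loop.
    """
    return [
        name
        for name, arr in named_arrays.items()
        if not np.all(np.isfinite(arr))
    ]


# float16 cannot represent values above 65504.
big = np.array([60000.0], dtype=np.float16)

# Silence the overflow/invalid warnings so the demonstration runs quietly.
with np.errstate(over="ignore", invalid="ignore"):
    overflowed = big * np.float16(2.0)   # exceeds the float16 range, becomes inf
    nan_value = overflowed - overflowed  # inf - inf is undefined, becomes NaN

# Report which of the example arrays contain non-finite values.
bad = find_non_finite({"ok": np.ones(2, dtype=np.float16), "out": nan_value})
print(bad)
```

If a check like this flags inf values before the NaNs show up, overflow in half precision would be a plausible suspect; if not, the cause may lie elsewhere (for example the learning rate or the reduced architecture).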