Closed: fhahlbohm closed this issue 6 months ago
This is something that's unclear to me; I don't understand the internals of the GradScaler implementation. I think you can set cache_enabled=False? Please let me know if that works for you.
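To be concrete, cache_enabled is an argument of the torch.autocast context manager. A minimal sketch of what that looks like, with a placeholder model and data:

```python
import torch

# Placeholder model and data, for illustration only.
model = torch.nn.Linear(16, 4).cuda()
inputs = torch.randn(8, 16, device="cuda")
targets = torch.randn(8, 4, device="cuda")

# cache_enabled=False turns off autocast's cast cache, so parameters are
# re-cast to the autocast dtype on every use instead of reusing cached copies.
with torch.autocast(device_type="cuda", dtype=torch.float16, cache_enabled=False):
    loss = torch.nn.functional.mse_loss(model(inputs), targets)
```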
I am also interested in this.
I did not have the time to test the suggested solution yet. Will post an update as soon as I find the time for it!
Have you tried with torch.nn.utils.clip_grad_norm_?
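For reference, with GradScaler the usual pattern is to unscale the gradients before clipping. A rough sketch, using a plain AdamW and placeholder model/data for illustration:

```python
import torch

# Placeholder model, optimizer, and data, for illustration only.
model = torch.nn.Linear(16, 4).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()

inputs = torch.randn(8, 16, device="cuda")
targets = torch.randn(8, 4, device="cuda")

with torch.cuda.amp.autocast():
    loss = torch.nn.functional.mse_loss(model(inputs), targets)

scaler.scale(loss).backward()

# Unscale the gradients in place before clipping, so the norm threshold
# applies to the true gradients rather than the scaled ones.
scaler.unscale_(optimizer)
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

scaler.step(optimizer)  # skips the step if inf/NaN gradients were found
scaler.update()
optimizer.zero_grad(set_to_none=True)
```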
does the current optimizer only work for fp32 training? or does it also work with amp?
So far it seems to be working correctly with GradScaler and autocast in our experiments using the nanogpt codebase.
@fhahlbohm were you able to find a solution for this? I also got NaN losses when training on multiple GPUs.
I am trying to make AdamWScheduleFree work with an optimization pipeline that uses automatic mixed precision (https://pytorch.org/docs/stable/amp.html).
More specifically, forward passes use torch.cuda.amp.autocast and gradients are scaled using torch.cuda.amp.GradScaler. Here is some pseudocode for a training iteration:
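(A simplified sketch; the model, loss function, and data below are placeholders, and the optimizer is assumed to come from the schedulefree package.)

```python
import torch
import schedulefree  # AdamWScheduleFree from the schedulefree package

model = torch.nn.Linear(16, 4).cuda()  # placeholder model
optimizer = schedulefree.AdamWScheduleFree(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()

def train_step(inputs, targets):
    model.train()
    optimizer.train()  # schedule-free optimizers track train/eval mode
    optimizer.zero_grad(set_to_none=True)

    # Forward pass in mixed precision.
    with torch.cuda.amp.autocast():
        loss = torch.nn.functional.mse_loss(model(inputs), targets)

    # Scaled backward pass and optimizer step.
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    return loss.detach()

# Before evaluation, the model and optimizer are switched back:
# model.eval(); optimizer.eval()
```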
Sadly, model outputs outside of training seem to contain NaN values. I saw that the README states additional steps might be necessary for my use case.
Is there an established way of doing this?