Closed · warner-benjamin closed this 2 months ago
This PR adds support for logging the L1 and L2 gradient norms into StableAdamW, following the PyTorch `clip_grad_norm_` calculation method. It appears to slow down training by at most 1%.
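For reference, a minimal sketch of how the norms might be computed in the `clip_grad_norm_` style (per-parameter norms, then the norm of that vector). The helper name and the logging call are illustrative only, not the PR's actual StableAdamW code:

```python
import torch


def grad_norms(parameters):
    """Hypothetical helper: compute L1 and L2 gradient norms the way
    torch.nn.utils.clip_grad_norm_ does — take each parameter's gradient
    norm, then the norm of the stacked per-parameter norms."""
    grads = [p.grad.detach() for p in parameters if p.grad is not None]
    if not grads:
        return torch.tensor(0.0), torch.tensor(0.0)
    # Norm of per-parameter norms; for p=1 and p=2 this equals the norm
    # over all gradient elements concatenated together.
    l1 = torch.linalg.vector_norm(
        torch.stack([torch.linalg.vector_norm(g, 1.0) for g in grads]), 1.0
    )
    l2 = torch.linalg.vector_norm(
        torch.stack([torch.linalg.vector_norm(g, 2.0) for g in grads]), 2.0
    )
    return l1, l2


# Example usage after loss.backward(), with an assumed logger:
# l1_norm, l2_norm = grad_norms(model.parameters())
# logger.log({"grad_norm_l1": l1_norm.item(), "grad_norm_l2": l2_norm.item()})
```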
Checking in the code from the server attributed it to @staghado 🙂
gradient norm logging code looks good to me!
Merging since it's training without issue.