coreylowman / dfdx

Deep learning in Rust, with shape checked tensors and neural networks

Gradient Clipping #902

Open swfsql opened 9 months ago

swfsql commented 9 months ago

Example using clip_norm:

// (...)
// let mut grads = model.alloc_grads();
// let loss = dfdx::losses::cross_entropy_with_logits_loss(prediction_y, y);
grads = loss.backward();

// accumulate the squared L2 norm of all gradients into `norm_squared`,
// then rescale every gradient so the total norm is at most 1e-2
let mut norm_squared = 0.;
model.grads_norm_squared(&grads, &mut norm_squared);
model.grads_clip_norm(&mut grads, norm_squared.sqrt(), 1e-2);

opt.update(&mut model, &grads).unwrap();

Note that clip_norm doesn't change the gradient's "direction", because every grad value is scaled by the same factor, while clip_value does change the direction (some values are clamped while others are left intact). So for gradient descent, where the gradient direction is supposed to be somewhat followed, my guess is that clip_norm is better.
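
To illustrate the difference outside of dfdx, here is a minimal standalone sketch of both strategies over a plain slice of f32 (this is not the dfdx API, just the underlying math): clip_norm multiplies every component by the same factor, while clip_value clamps each component independently.

/// Scales the whole gradient so its L2 norm is at most `max_norm`.
/// Every component is multiplied by the same factor, so the
/// direction of the gradient vector is preserved.
fn clip_norm(grads: &mut [f32], max_norm: f32) {
    let norm = grads.iter().map(|g| g * g).sum::<f32>().sqrt();
    if norm > max_norm {
        let scale = max_norm / norm;
        for g in grads.iter_mut() {
            *g *= scale;
        }
    }
}

/// Clamps each component into [-max_value, max_value] independently.
/// Components over the threshold are changed while the rest are left
/// intact, so the direction of the gradient vector can change.
fn clip_value(grads: &mut [f32], max_value: f32) {
    for g in grads.iter_mut() {
        *g = g.clamp(-max_value, max_value);
    }
}

fn main() {
    let mut a = vec![3.0f32, 4.0];
    clip_norm(&mut a, 1.0);
    // norm was 5, so both components are scaled by 1/5: [0.6, 0.8]
    println!("{a:?}");

    let mut b = vec![3.0f32, 0.5];
    clip_value(&mut b, 1.0);
    // only the first component is clamped: [1.0, 0.5], so the ratio
    // between components (and hence the direction) has changed
    println!("{b:?}");
}

Running this, [3.0, 4.0] keeps its direction under clip_norm, while [3.0, 0.5] rotates under clip_value, which is the distinction the comment above is getting at.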