Adds `Storage` and `Gradients` view/mutating methods.

- Added `dfdx::nn_traits::WithGrads` trait and `dfdx_derives::WithGrads` proc macro, based on `ZeroGrads`.
  - The `ZeroGrads` trait could be merged into `WithGrads` by mostly just merging their methods.
- Added `dfdx_core::tensor::WithStorage` trait.
  - [ ] Change the interface so Cuda can do more with CUDA kernels, and add the necessary kernels.
    - This could be a separate improvement in a future PR. Since grad updates are not made that often, I think leaving things on the CPU isn't too bad.
- Changed some methods on `Gradients`:
  - Exposed `get_mut` as `pub`.
  - Exposed `get_ref` as `pub`, and lowered its requirement from `&mut self` to `&self`.
- Added gradient clamping and clipping methods.
  - [ ] Add examples for all methods (view/mutate grads, clamp and clip); see the sketch below and the `clip_norm` example.
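
In the meantime, here is a minimal sketch of the view/mutate side. It assumes `get_ref`/`get_mut` are keyed by the parameter tensor and return its gradient buffer; `model.weight` and the exact signatures are illustrative, not confirmed against the final API:

```rust
// Hypothetical sketch, not the final API.
// (...)
let mut grads = loss.backward();

// Read-only view of one parameter's gradient buffer; after this PR,
// `get_ref` only requires `&self` on `Gradients`.
let weight_grads = grads.get_ref(&model.weight);
println!("weight grads: {weight_grads:?}");

// Mutable access for in-place edits.
let weight_grads = grads.get_mut(&model.weight);
// (zero, rescale, or otherwise edit `weight_grads` here)
```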
Example using `clip_norm`:

```rust
// (...)
// let loss = dfdx::losses::cross_entropy_with_logits_loss(prediction_y, y);
grads = loss.backward();
// accumulates the squared global grad norm into norm_squared, then applies clip_norm
let mut norm_squared = 0.;
model.grads_norm_squared(&grads, &mut norm_squared);
model.grads_clip_norm(&mut grads, norm_squared.sqrt(), 1e-2);
opt.update(&mut model, &grads).unwrap();
```
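
In this call, `norm_squared.sqrt()` is the current global grad norm and `1e-2` acts as the maximum allowed norm: when the current norm exceeds it, all grad values are scaled down by the same factor so the resulting norm equals `1e-2`.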
Note that `clip_norm` doesn't change the grads' "direction", because all grad values are scaled by the same factor, while `clip_value` does change the direction (some values are changed while others are left intact). So for gradient descent, where the grads' direction is supposed to be somewhat followed, my guess is that `clip_norm` is better. A small numeric sketch of this difference follows.
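
For illustration, a self-contained sketch in plain Rust (independent of dfdx) of the two clipping rules; the inline formulas are my assumption of the usual definitions of norm and value clipping:

```rust
fn main() {
    // Toy gradient with L2 norm 5.0.
    let grad = [3.0_f64, 4.0];
    let norm = grad.iter().map(|g| g * g).sum::<f64>().sqrt();

    // Norm clipping: if the norm exceeds the threshold, scale every
    // value by the same factor, so the direction is preserved.
    let max_norm = 1.0;
    let scale = if norm > max_norm { max_norm / norm } else { 1.0 };
    let by_norm: Vec<f64> = grad.iter().map(|g| g * scale).collect();

    // Value clipping: clamp each value independently, which can rotate
    // the direction because only some components are modified.
    let max_value = 3.5;
    let by_value: Vec<f64> = grad
        .iter()
        .map(|g| g.clamp(-max_value, max_value))
        .collect();

    println!("{by_norm:?}");  // ~[0.6, 0.8] -> same direction as [3, 4]
    println!("{by_value:?}"); // [3.0, 3.5]  -> direction changed
}
```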
Draft state.
Closes #596.