Adds `Storage` and `Gradients` view/mutating methods.

- Added `dfdx::nn_traits::WithGrads` trait and `dfdx_derives::WithGrads` proc macro, based on `ZeroGrads`.
  - The `ZeroGrads` trait could be merged into `WithGrads` by mostly just merging their methods.
- Added `dfdx_core::tensor::WithStorage` trait.
  - [ ] Change the interface so Cuda can do more with CUDA kernels, and add the necessary kernels.
    - This could be a separate improvement in a future PR. Since grad updates are not made that often, I think leaving things on the CPU isn't too bad.
- Changed some methods on `Gradients`:
  - Exposed `get_mut` as `pub`.
  - Exposed `get_ref` as `pub`, and lowered its requirement from `&mut self` to `&self`.
- Added gradient clamping and clipping methods.
  - [ ] Add examples for all methods (view/mutate grads, clamp and clip); see the sketch below and the `clip_norm` example.
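
In the meantime, here is a minimal sketch of the view/mutate side. It assumes `get_ref`/`get_mut` are keyed by the parameter tensor and return its gradient buffer; `model.weight` and the exact signatures are illustrative, not confirmed against the final API:

```rust
// Hypothetical sketch, not the final API.
// (...)
let mut grads = loss.backward();

// Read-only view of one parameter's gradient buffer; after this PR,
// `get_ref` only requires `&self` on `Gradients`.
let weight_grads = grads.get_ref(&model.weight);
println!("weight grads: {weight_grads:?}");

// Mutable access for in-place edits.
let weight_grads = grads.get_mut(&model.weight);
// (zero, rescale, or otherwise edit `weight_grads` here)
```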
Example using `clip_norm`:

```rust
// (...)
// let loss = dfdx::losses::cross_entropy_with_logits_loss(prediction_y, y);
grads = loss.backward();
// accumulates the squared global grad norm into norm_squared, then applies clip_norm
let mut norm_squared = 0.;
model.grads_norm_squared(&grads, &mut norm_squared);
model.grads_clip_norm(&mut grads, norm_squared.sqrt(), 1e-2);
opt.update(&mut model, &grads).unwrap();
```
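
In this call, `norm_squared.sqrt()` is the current global grad norm and `1e-2` acts as the maximum allowed norm: when the current norm exceeds it, all grad values are scaled down by the same factor so the resulting norm equals `1e-2`.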
Note that `clip_norm` doesn't change the grads' "direction", because all grad values are scaled by the same factor, while `clip_value` does change the direction (some values are changed while others are left intact). So for gradient descent, where the grads' direction is supposed to be somewhat followed, my guess is that `clip_norm` is better. A small numeric sketch of this difference follows.
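
For illustration, a self-contained sketch in plain Rust (independent of dfdx) of the two clipping rules; the inline formulas are my assumption of the usual definitions of norm and value clipping:

```rust
fn main() {
    // Toy gradient with L2 norm 5.0.
    let grad = [3.0_f64, 4.0];
    let norm = grad.iter().map(|g| g * g).sum::<f64>().sqrt();

    // Norm clipping: if the norm exceeds the threshold, scale every
    // value by the same factor, so the direction is preserved.
    let max_norm = 1.0;
    let scale = if norm > max_norm { max_norm / norm } else { 1.0 };
    let by_norm: Vec<f64> = grad.iter().map(|g| g * scale).collect();

    // Value clipping: clamp each value independently, which can rotate
    // the direction because only some components are modified.
    let max_value = 3.5;
    let by_value: Vec<f64> = grad
        .iter()
        .map(|g| g.clamp(-max_value, max_value))
        .collect();

    println!("{by_norm:?}");  // ~[0.6, 0.8] -> same direction as [3, 4]
    println!("{by_value:?}"); // [3.0, 3.5]  -> direction changed
}
```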
Draft state.
Closes #596.