Also, `foreach` does not make sense as a setting in our use case.
If we now add more optimizers, we should reconsider what arguments our optimizers should have and check that they make sense.
@ckolbPTB @koflera
Currently we have these parameters:

- `amsgrad`: whether to use the AMSGrad variant of this algorithm from the paper `On the Convergence of Adam and Beyond`, by default False
- `foreach`: whether the `foreach` implementation of the optimizer is used, by default None
- `maximize`: maximize the objective with respect to the params, instead of minimizing, by default False
- `differentiable`: whether autograd should occur through the optimizer step. This is currently not implemented.
- `fused`: whether the fused implementation (CUDA only) is used. Currently torch.float64, torch.float32, torch.float16, and torch.bfloat16 are supported. By default None.
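For context, these names match keyword arguments of `torch.optim.Adam`. A minimal sketch of how they would be passed through, assuming our wrapper simply forwards them to PyTorch (the values shown are just the PyTorch defaults):

```python
import torch

params = [torch.zeros(3, requires_grad=True)]

# Hypothetical forwarding of the settings listed above to torch.optim.Adam.
optimizer = torch.optim.Adam(
    params,
    lr=1e-3,
    amsgrad=False,          # use the AMSGrad variant if True
    foreach=None,           # let PyTorch decide on the foreach implementation
    maximize=False,         # maximize instead of minimize the objective
    differentiable=False,   # allow autograd through the optimizer step
    fused=None,             # fused CUDA implementation if available
)
```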
Suggestion: remove them all and only add them back if we really need them.
I think I agree with the suggestion. The only perhaps useful parameter would be `maximize`, but one can easily implement this by setting the functional `f` to be maximized to `-f` and minimizing that instead.
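A minimal sketch of this workaround (standalone PyTorch with a hypothetical objective `f`, not code from this repository):

```python
import torch

def f(x: torch.Tensor) -> torch.Tensor:
    """Hypothetical objective to maximize; its maximum is at x = 2."""
    return -((x - 2.0) ** 2).sum()

x = torch.zeros(1, requires_grad=True)
optimizer = torch.optim.Adam([x], lr=0.1)

for _ in range(500):
    optimizer.zero_grad()
    loss = -f(x)  # minimizing -f is equivalent to maximizing f
    loss.backward()
    optimizer.step()

print(x)  # approaches 2, the maximizer of f
```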
I would say the only useful parameter is `amsgrad`.
Everything else either does not make sense, does not matter, or can be achieved by setting `f` to `-f`.
We have a `differentiable` flag in the Adam optimizer. This will not work as expected, as we always detach the parameters. Also, we should reconsider which of the optimizer flags we actually support. I removed the `capturable` flag in #193, as it is not useful for us.
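A minimal illustration of why the detach defeats `differentiable` (standalone PyTorch, not the actual optimizer code in this repository):

```python
import torch

# Initial parameter we would like to differentiate the optimization result with respect to.
x0 = torch.tensor([1.0], requires_grad=True)

# If the parameters are detached (as we do after each step), the graph back to x0
# is cut and no gradient can flow through the update, regardless of whether
# differentiable=True is set on the torch optimizer.
x = x0.detach().clone().requires_grad_(True)

loss = (x ** 2).sum()
loss.backward()

print(x.grad)   # tensor([2.]) -- gradient w.r.t. the detached copy
print(x0.grad)  # None -- nothing reaches the original parameter
```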