Open h-spiess opened 5 years ago
I suppose that `p`, instead of being a single parameter, represents a tensor containing all the parameters. Is that so, @gbaydin?
Hello @gbaydin, when `model.parameters()` is passed as an argument to the optimizer, it represents a single parameter group. In this parameter group, `group['params']` contains 2 elements (tensors), i.e. 2 `p`s, for the logreg model. Does that mean that all parameters of the logreg model are represented by these 2 tensors, and that both are updated at each optimization step?
Thanks!
If this is the case, I suppose the updated learning rate would differ between the two parameter tensors at each optimization step.
@gbaydin, could you please clarify this? Thanks.
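For what it's worth, here is a plain-Python sketch of the structure being asked about. This is an assumption on my part: the names `param_groups`, `params`, and `lr` follow PyTorch's `Optimizer` convention, and the toy "tensors" are just nested lists standing in for the logreg weight and bias.

```python
# Plain-Python sketch of the optimizer state under discussion (assumption:
# this mirrors PyTorch's Optimizer, which wraps the iterable returned by
# model.parameters() in a single default parameter group).
weight = [[0.0] * 10]  # stand-in for the logreg weight tensor, shape (1, 10)
bias = [0.0]           # stand-in for the logreg bias tensor, shape (1,)

param_groups = [                       # one group, since a plain iterable was passed
    {
        'params': [weight, bias],      # the 2 `p`s the optimizer loop iterates over
        'lr': 0.1,                     # hyperparameters live on the group
    }
]

print(len(param_groups))               # one parameter group
print(len(param_groups[0]['params']))  # two parameter tensors: weight and bias
```

So under this reading, "2 `p`s" means the weight matrix and the bias vector, and both tensors are visited in the optimizer's inner loop on every step.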
Hey,
First, nice work! :)
I'm referring to the Adam version (AdamHD). SGD doesn't seem to have that problem.
If I understand the paper correctly, the gradient w.r.t. all parameters is used to update the learning rate; the learning rate is then updated once per step and used for the gradient-descent update of all parameters.
In your implementation, however, the learning rate is successively updated w.r.t. the current parameter's gradient (within the optimizer's loop over the parameters) and then directly used for the gradient step on that parameter.
This effectively gives a different learning rate to every parameter, since the rate is modified mid-loop. Only the last parameters processed are updated with a learning rate that has received the "full" hypergradient step.
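A toy numeric sketch of the behaviour I mean (all values are made up; `hyper` stands in for the per-tensor pieces of the hypergradient, and `beta` for the hypergradient learning rate):

```python
# Toy contrast between updating the learning rate inside the per-parameter
# loop vs. once per step from the full hypergradient. All numbers are
# hypothetical; 'hyper' stands in for each tensor's hypergradient piece.
beta = 1.0
hyper = {'w': 0.25, 'b': 0.5}   # per-tensor hypergradient contributions
lr0 = 1.0

# (a) lr updated successively inside the loop over parameters,
#     then used immediately for that parameter's gradient step:
lr = lr0
used_lr = {}
for name in ['w', 'b']:
    lr += beta * hyper[name]    # lr changes mid-loop...
    used_lr[name] = lr          # ...so each tensor sees a different lr

# (b) lr updated once per step from the full hypergradient, as I read the paper:
lr_once = lr0 + beta * (hyper['w'] + hyper['b'])

print(used_lr)   # {'w': 1.25, 'b': 1.75} -- a different lr per tensor
print(lr_once)   # 1.75
```

Under these toy numbers, only the last tensor processed in (a) ends up with the learning rate that absorbed the full hypergradient step, which is the discrepancy I am describing.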
Am I missing something? Thanks for your help :)
Kind regards, Heiner