gbaydin / hypergradient-descent

The updated learning rate is different for every parameter in AdamHD #9

Open h-spiess opened 5 years ago

h-spiess commented 5 years ago

Hey,

First, nice work! :)

I'm referring to the Adam version (AdamHD); the SGD version doesn't seem to have this problem.

If I understand the paper correctly, the gradient w.r.t. all parameters is used to update the learning rate. The learning rate is then updated once per step and can be used to do gradient descent on the parameters.
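
For concreteness, this is the single-learning-rate update I have in mind (my rough transcription of the SGD-HD rule from the paper, with β the hypergradient learning rate; please correct me if I'm misreading it):

```latex
h_t      = \nabla f(\theta_{t-1}) \cdot \nabla f(\theta_{t-2})    % dot product over ALL parameters
\alpha_t = \alpha_{t-1} + \beta \, h_t                            % one learning-rate update per step
\theta_t = \theta_{t-1} - \alpha_t \nabla f(\theta_{t-1})         % one parameter update with that \alpha_t
```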

In your implementation, however, the learning rate is updated successively w.r.t. the current parameter tensor's gradient (inside the optimizer's loop over the parameters) and then used directly for the gradient descent step on that parameter.

This effectively leads to a different learning rate for every parameter, since the rate keeps being modified during the loop. Only the parameters that come last in the loop are updated with a learning rate that has received the "full" hypergradient update.
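
To make the difference concrete, here is a rough SGD-style sketch (not your actual code; `prev_grads` and `hypergrad_lr` are just placeholder names, and it assumes backward() has already populated `p.grad`):

```python
import torch

def step_single_lr(params, prev_grads, lr, hypergrad_lr):
    # What I understand from the paper: accumulate the hypergradient over
    # ALL parameter tensors, update the learning rate once, then take the step.
    h = sum(torch.dot(p.grad.view(-1), g.view(-1)).item()
            for p, g in zip(params, prev_grads))
    lr = lr + hypergrad_lr * h                  # one lr update per optimization step
    for p in params:
        p.data.add_(p.grad, alpha=-lr)
    return lr

def step_per_param_lr(params, prev_grads, lr, hypergrad_lr):
    # What I think the current loop does: the learning rate is nudged per
    # parameter tensor, so tensors later in the loop see a different lr.
    for p, g in zip(params, prev_grads):
        lr = lr + hypergrad_lr * torch.dot(p.grad.view(-1), g.view(-1)).item()
        p.data.add_(p.grad, alpha=-lr)          # lr differs across the loop
    return lr
```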

Am I missing something? Thanks for your help :)

Kind regards, Heiner

harshalmittal4 commented 5 years ago

I suppose that 'p', instead of being a single parameter, represents a tensor containing all the parameters... is that so, @gbaydin?

harshalmittal4 commented 5 years ago

Hello @gbaydin, when model.parameters() is passed as an argument to the optimizer, it forms a single parameter group. In this parameter group, group['params'] contains 2 tensors (i.e. 2 'p's) for the logreg model. So does that mean that all parameters of the logreg model are represented by these 2 tensors and both are updated at each optimization step? Thanks!
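
For example, this is the kind of check I'm basing this on (using a plain torch.optim.SGD just to inspect the groups; the layer sizes are arbitrary):

```python
import torch

model = torch.nn.Linear(784, 10)                    # logreg-style model: one weight, one bias
opt = torch.optim.SGD(model.parameters(), lr=0.1)

print(len(opt.param_groups))                        # 1 parameter group
print(len(opt.param_groups[0]["params"]))           # 2 tensors in group['params']
for p in opt.param_groups[0]["params"]:
    print(p.shape)                                  # torch.Size([10, 784]), torch.Size([10])
```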

harshalmittal4 commented 5 years ago

If that is the case, the updated learning rate would be different for the two parameter tensors in each optimization step, I suppose.

harshalmittal4 commented 5 years ago

@gbaydin Could you please clarify this? Thanks.