Reasoning behind constraint in `grads_and_grad_moms`

MethodsOfMachineLearning / probabilistic_line_search

Probabilistic line search algorithm for stochastic optimization with a TensorFlow interface.

Apache License 2.0

22 stars 2 forks source link

Reasoning behind constraint in `grads_and_grad_moms` #1

Open aomader opened 7 years ago

aomader commented 7 years ago

Hej,

I'm wondering about the reasoning behind this constraint in grads_and_grad_moms. The given variables are indeed used in multiple operations in the loss computation graph, nonetheless that shouldn't hinder it to compute the grad.

Could you shed some light on this?

lballes commented 7 years ago

Hey!

The second moment of gradients, which is needed for the gradient variance computation, is computed implicitly using a "trick" explained in this note. This trick is not directly applicable when a variable has more than one incoming gradients during back-propagation, which is the case whenever a variable is consumed by more than one operation.

Since we currently don't have an efficient way to compute the gradient moment in such a case, we ruled out multiple consumer operations. We are trying to come up with a better implementation of the gradient moment computation; any ideas/help would be highly appreciated :)

aomader commented 7 years ago

@lballes I would say the most straight forward way would be to fall back to an inefficient method to compute the second moment, but warnings.warn() the user that the trick doesn't apply to the supplied objective function.

hughsalimbeni commented 7 years ago

Hi, have you tried a running average estimate like in ADAM:

m_t <- beta1 * m_{t-1} + (1 - beta1) * g
v_t <- beta2 * v_{t-1} + (1 - beta2) * g * g

(taken from here)

The requirement for all operations to be matmul, add and conv2d seems rather restrictive

lballes commented 7 years ago

Hi @hughsalimbeni, we did some general experiments comparing running average estimates to mini-batch estimates and found the former to be quite poor. That's why we went with the mini-batch estimate for this implementation, which is of course very restrictive for practical use in its current implementation. It would definitely be worth exploring how the line search would actually perform using the running average estimates. I don't know whether I personally will find time to do that anytime soon, though.