creiser opened this issue 4 years ago (Open)
Hello @creiser. Thanks for your suggestion. I'm afraid I'm being a little slow to understand precisely which buffers you refer to here? Could you please:

- give me a toy/concrete example, and also
- specify what condition you would like to hold at the end of the loop (or a `diffopt.step` call) and ideally also what actually happens?

A proposal for how this feature would be accessible to the user is welcome (e.g. keyword arg when defining loop? model? context? when using `diffopt.step`? all of the above?).

I'll try to get on it as soon as I have time.
Hello @egrefen,
thanks for the quick reply.
> I'm afraid I'm being a little slow to understand precisely which buffers you refer to here
With buffers I mean, for example, the momentum vector that needs to be stored for SGD; other optimizers have other such buffers or states. Currently you store a "differentiable" version of each of these buffers. That makes sense for the episodic training you need in MAML, where you start from a certain initialization and then "simulate" the inner-loop steps, but other bi-level algorithms update the parameters during the inner loop "permanently" instead of only "simulating" steps.
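For concreteness, here is a minimal sketch of the episodic pattern (the `higher.innerloop_ctx` / `diffopt.step` calls follow your documented API; the model, data, and losses are toy placeholders):

```python
import torch
import higher

model = torch.nn.Linear(4, 1)                                  # toy model
opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
meta_opt = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.randn(8, 4)
y = torch.randn(8, 1)

# Episodic (MAML-style) pattern: every outer iteration re-simulates the
# inner loop from the current initialization, so fresh differentiable
# momentum buffers per episode are exactly what you want.
for outer_step in range(3):
    meta_opt.zero_grad()
    with higher.innerloop_ctx(model, opt, copy_initial_weights=False) as (fmodel, diffopt):
        for inner_step in range(5):
            inner_loss = (fmodel(x) - y).pow(2).mean()
            diffopt.step(inner_loss)
        outer_loss = (fmodel(x) - y).pow(2).mean()
        outer_loss.backward()
    meta_opt.step()
    # The differentiable buffers are discarded together with the context;
    # in the "permanent" pattern I am describing, the buffers should
    # instead survive this point, but as detached constants.
```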
> give me a toy/concrete example, and also
Below follows my code. As you can see, there are two versions of each buffer: a differentiable version and a non-differentiable one. The differentiable version is needed for the computational graph of the current inner-loop update, while the non-differentiable version is kept around to be treated as a constant by the inner-loop updates of subsequent outer-loop iterations.

The difference between my code and yours is that you also write the differentiable version into memory. This makes the computational graph grow with the number of outer-loop steps, unless you reset the inner-loop optimizer, i.e. detach the buffers/states, or do what you are doing right now: destroy the inner-loop optimizer's buffers after each outer-loop iteration.
```python
class DifferentiableSGD():
    def __init__(self, parameters, lr, momentum=0, weight_decay=0, nesterov=False):
        self.parameters = parameters
        self.lr = lr
        self.momentum = momentum
        self.weight_decay = weight_decay
        self.nesterov = nesterov
        if self.momentum != 0:
            self.momentum_buffer = [None for _ in parameters]

    def step(self, grads):
        updated_params = []
        for param_idx, (param, grad) in enumerate(zip(self.parameters, grads)):
            if self.weight_decay != 0:
                grad = grad + param * self.weight_decay
            if self.momentum != 0:
                if self.momentum_buffer[param_idx] is None:
                    # First step: persist a detached (constant) copy of the
                    # gradient as the stored buffer.
                    self.momentum_buffer[param_idx] = grad.clone().detach_()
                    differentiable_buf = grad
                else:
                    # The stored buffer is detached, so it enters the current
                    # graph as a constant.
                    buf = self.momentum_buffer[param_idx]
                    differentiable_buf = buf * self.momentum + grad
                    # Write the new value back without tracking it in the
                    # graph: only the detached version is persisted.
                    buf.data = differentiable_buf
                if self.nesterov:
                    grad = grad.add(differentiable_buf, alpha=self.momentum)
                else:
                    grad = differentiable_buf
            updated_params.append(param - self.lr * grad)
        return updated_params
```
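To make the intended usage concrete, here is a toy sketch of how this optimizer would be driven across outer iterations (the parameter setup and the quadratic losses are placeholders of my choosing):

```python
import torch

torch.manual_seed(0)
weight = torch.randn(1, 4, requires_grad=True)
bias = torch.zeros(1, requires_grad=True)
params = [weight, bias]
inner_opt = DifferentiableSGD(params, lr=1.0, momentum=0.9)

x = torch.randn(8, 4)
y = torch.randn(8, 1)

for outer_step in range(3):
    # Inner step: a differentiable parameter update.
    pred = x @ params[0].t() + params[1]
    inner_loss = (pred - y).pow(2).mean()
    grads = torch.autograd.grad(inner_loss, params, create_graph=True)
    new_params = inner_opt.step(grads)

    # Outer step: differentiate an outer objective through the inner
    # update. Since the stored momentum buffers are detached, this graph
    # spans only the current inner step and does not grow over time.
    outer_loss = sum(p.pow(2).sum() for p in new_params)
    outer_grads = torch.autograd.grad(outer_loss, params)

    # Make the inner update "permanent": the updated parameters become
    # constants that the next outer iteration starts from.
    params = [p.detach().requires_grad_(True) for p in new_params]
    inner_opt.parameters = params
```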
> specify what condition you would like to hold at the end of the loop (or a `diffopt.step` call) and ideally also what actually happens?
There needs to be a detached version of the buffers/states in the inner loop optimizers, which can be used for inner loop updates in coming outer loop iterations.
> a proposal for how this feature would be accessible to the user is welcome (e.g. keyword arg when defining loop? model? context? when using `diffopt.step`? all of the above?)
Will think about it.
I am using a step size of 1 for my inner loop. Nevertheless, I want to use adaptive optimizers in the inner loop and carry the buffers across outer-loop iterations. In the context of e.g. MAML that would not make sense, but for other bi-level optimization problems it is useful. Mathematically, this means we treat the buffers (e.g. the momentum vector) as constants. This can easily be implemented by writing a detached copy into the buffers. Ideally, one should have exact control over when the computational graph of the buffers is cut off.
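One possible shape for that control, sketched as a hypothetical variant of the class above that keeps the buffers differentiable until the caller explicitly detaches them (the `detach_buffers` method and its semantics are my invention, not an existing API):

```python
class DifferentiableSGDWithControl(DifferentiableSGD):
    """Hypothetical variant: the momentum buffers stay differentiable
    until the caller explicitly detaches them."""

    def step(self, grads):
        updated_params = []
        for param_idx, (param, grad) in enumerate(zip(self.parameters, grads)):
            if self.weight_decay != 0:
                grad = grad + param * self.weight_decay
            if self.momentum != 0:
                buf = self.momentum_buffer[param_idx]
                # Keep the differentiable buffer: the graph grows across
                # steps until detach_buffers() is called.
                buf = grad if buf is None else buf * self.momentum + grad
                self.momentum_buffer[param_idx] = buf
                if self.nesterov:
                    grad = grad.add(buf, alpha=self.momentum)
                else:
                    grad = buf
            updated_params.append(param - self.lr * grad)
        return updated_params

    def detach_buffers(self):
        # The caller decides exactly when the buffers become constants,
        # cutting the computational graph at that point.
        if self.momentum != 0:
            self.momentum_buffer = [
                None if buf is None else buf.detach()
                for buf in self.momentum_buffer
            ]
```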