Open MarSaKi opened 4 years ago
me too
For more details, we first provide the correspondence between the math notation in the ReverseHG paper (e.g., $x$) and our code (e.g., `x`). See Algorithm 1 and Equation 11 in the paper.

- $s$ (the inner state): `MetaSGD`'s parameters (i.e., the target parameters).
- $\lambda$ (the meta/hyper-parameters): the parameters optimized by `source_optimizer` in the main file.
- $\alpha$: `alpha_groups`.
- $g$ (the hypergradient of $\lambda$): accumulated in `p.grad`, where `p` is a parameter tensor in $\lambda$.
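To make the numbered steps below easier to follow, here is the Reverse-HG recursion written out in the paper's notation. Here $E$ denotes the final meta/validation objective and $T$ the number of inner steps, and I am assuming $E$ depends on $\lambda$ only through the final state $s_T$; please check Algorithm 1 and Equation 11 in the paper for the exact indexing.

$$\nabla_\lambda E \;=\; \sum_{t=1}^{T} \alpha_t B_t, \qquad \alpha_T = \nabla_s E(s_T), \qquad \alpha_{t-1} = \alpha_t A_t,$$

$$A_t = \frac{\partial \Phi_t(s_{t-1}, \lambda)}{\partial s_{t-1}}, \qquad B_t = \frac{\partial \Phi_t(s_{t-1}, \lambda)}{\partial \lambda}.$$

Going backward in time, the algorithm accumulates $g \leftarrow g + \alpha_t B_t$ and propagates $\alpha \leftarrow \alpha A_t$, which is what the steps below compute.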
Our code does the following:

1. Compute the inner product between $\alpha$ (i.e., `alpha_groups[-1]`) and $\Phi(s,\lambda)$, and store the resulting scalar into `X` (see L74-L82).
2. Compute `X`'s gradients by calling `X.backward()` (see L83-L84). This is the same as a Hessian-vector multiplication. Then, $\alpha B$ is accumulated into $g$ as described in Algorithm 1, and $\alpha A$ is stored in `p.grad`, where `p` is a parameter tensor in $s$. A minimal standalone sketch of this trick is given after these steps.
3. Note that $\Phi$ contains additional terms when weight decay (`wd`) or momentum (`momentum`) is not zero.
4. After `meta_backward`, the gradient of $\lambda$ (i.e., $g$) is stored into the corresponding `p.grad`s, so we just call `source_optimizer.step()` to update $\lambda$.
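To make step 2 concrete, here is a minimal, self-contained sketch of the "inner product, then backward" trick using plain `torch.autograd`. This is not the repository's code: the names (`s`, `lam`, `alpha`, `lr`, `inner_loss`) and the toy loss are made up for illustration, and it assumes vanilla SGD dynamics (no momentum, no weight decay).

```python
# Hypothetical illustration of the vector-Jacobian trick from step 2;
# names and the toy loss are made up, not taken from the repository.
import torch

lr = 0.1
s = torch.randn(5, requires_grad=True)    # target parameters (paper: s)
lam = torch.randn(3, requires_grad=True)  # meta parameters (paper: lambda)
alpha = torch.randn(5)                    # current adjoint vector (paper: alpha)

def inner_loss(s, lam):
    # Toy training loss that couples s and lambda.
    return ((s - lam.sum()) ** 2).sum()

# One step of the dynamics Phi(s, lambda) = s - lr * grad_s L(s, lambda),
# kept on the autograd graph (create_graph=True) so it can be differentiated again.
grad_s = torch.autograd.grad(inner_loss(s, lam), s, create_graph=True)[0]
phi = s - lr * grad_s

# X = <alpha, Phi(s, lambda)>. Backpropagating through this scalar gives both
# vector-Jacobian products in one pass:
#   dX/ds      = alpha^T A,  where A = dPhi/ds      (involves a Hessian-vector product)
#   dX/dlambda = alpha^T B,  where B = dPhi/dlambda
X = (alpha * phi).sum()
alpha_A, alpha_B = torch.autograd.grad(X, (s, lam))

# As in Algorithm 1: alpha_B is accumulated into the hypergradient g,
# and alpha_A becomes the adjoint vector for the previous inner step.
g = alpha_B.clone()
alpha_prev = alpha_A
```

In the actual code, the same two quantities come out of a single `X.backward()` call, which writes them into the parameters' `.grad` fields instead of returning them.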
I think our code is easier to understand when using vanilla SGD (i.e., `wd=0` and `momentum=0`).
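For reference, here is what the update map $\Phi$ looks like in both cases, written with the standard PyTorch SGD rule (weight decay added to the gradient, dampening 0, no Nesterov); this is written out for clarity and is not copied from the code:

$$\text{vanilla SGD:}\qquad \Phi(s, \lambda) = s - \eta\, \nabla_s L(s, \lambda),$$

$$\text{with weight decay } w \text{ and momentum } \mu:\qquad v \leftarrow \mu\, v + \nabla_s L(s, \lambda) + w\, s, \qquad \Phi(s, \lambda) = s - \eta\, v.$$

With momentum, the buffer $v$ effectively becomes part of the state, so the Jacobians $A$ and $B$ pick up extra terms; that is why the vanilla-SGD case is the easiest one to read.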
Can you give more math details about `meta_backward`? I just read the code and the ReverseHG paper, but I couldn't understand your code.