Open MarSaKi opened 4 years ago
me too
For more details, we first provide the correspondence between the math notation in the ReverseHG paper (e.g., $x$) and our code (e.g., `x`). See Algorithm 1 and Equation 11 in the paper.

- $s$ (the inner state): `MetaSGD`'s parameters (i.e., the target parameters).
- $\lambda$ (the meta/hyper-parameters): the parameters optimized by `source_optimizer` in the main file.
- $\alpha$: `alpha_groups`.
- $g$ (the hypergradient of $\lambda$): accumulated in `p.grad`, where `p` is a parameter tensor in $\lambda$.
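To make the numbered steps below easier to follow, here is the Reverse-HG recursion written out in the paper's notation. Here $E$ denotes the final meta/validation objective and $T$ the number of inner steps, and I am assuming $E$ depends on $\lambda$ only through the final state $s_T$; please check Algorithm 1 and Equation 11 in the paper for the exact indexing.

$$\nabla_\lambda E \;=\; \sum_{t=1}^{T} \alpha_t B_t, \qquad \alpha_T = \nabla_s E(s_T), \qquad \alpha_{t-1} = \alpha_t A_t,$$

$$A_t = \frac{\partial \Phi_t(s_{t-1}, \lambda)}{\partial s_{t-1}}, \qquad B_t = \frac{\partial \Phi_t(s_{t-1}, \lambda)}{\partial \lambda}.$$

Going backward in time, the algorithm accumulates $g \leftarrow g + \alpha_t B_t$ and propagates $\alpha \leftarrow \alpha A_t$, which is what the steps below compute.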
Our code does the following:

1. Compute the inner product between $\alpha$ (i.e., `alpha_groups[-1]`) and $\Phi(s,\lambda)$, and store the resulting scalar into `X` (see L74-L82).
2. Compute `X`'s gradients by calling `X.backward()` (see L83-L84). This is the same as a Hessian-vector multiplication. Then, $\alpha B$ is accumulated into $g$ as described in Algorithm 1, and $\alpha A$ is stored in `p.grad`, where `p` is a parameter tensor in $s$. A minimal standalone sketch of this trick is given after these steps.
3. Note that $\Phi$ contains additional terms when weight decay (`wd`) or momentum (`momentum`) is not zero.
4. After `meta_backward`, the gradient of $\lambda$ (i.e., $g$) is stored into the corresponding `p.grad`s, so we just call `source_optimizer.step()` to update $\lambda$.
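To make step 2 concrete, here is a minimal, self-contained sketch of the "inner product, then backward" trick using plain `torch.autograd`. This is not the repository's code: the names (`s`, `lam`, `alpha`, `lr`, `inner_loss`) and the toy loss are made up for illustration, and it assumes vanilla SGD dynamics (no momentum, no weight decay).

```python
# Hypothetical illustration of the vector-Jacobian trick from step 2;
# names and the toy loss are made up, not taken from the repository.
import torch

lr = 0.1
s = torch.randn(5, requires_grad=True)    # target parameters (paper: s)
lam = torch.randn(3, requires_grad=True)  # meta parameters (paper: lambda)
alpha = torch.randn(5)                    # current adjoint vector (paper: alpha)

def inner_loss(s, lam):
    # Toy training loss that couples s and lambda.
    return ((s - lam.sum()) ** 2).sum()

# One step of the dynamics Phi(s, lambda) = s - lr * grad_s L(s, lambda),
# kept on the autograd graph (create_graph=True) so it can be differentiated again.
grad_s = torch.autograd.grad(inner_loss(s, lam), s, create_graph=True)[0]
phi = s - lr * grad_s

# X = <alpha, Phi(s, lambda)>. Backpropagating through this scalar gives both
# vector-Jacobian products in one pass:
#   dX/ds      = alpha^T A,  where A = dPhi/ds      (involves a Hessian-vector product)
#   dX/dlambda = alpha^T B,  where B = dPhi/dlambda
X = (alpha * phi).sum()
alpha_A, alpha_B = torch.autograd.grad(X, (s, lam))

# As in Algorithm 1: alpha_B is accumulated into the hypergradient g,
# and alpha_A becomes the adjoint vector for the previous inner step.
g = alpha_B.clone()
alpha_prev = alpha_A
```

In the actual code, the same two quantities come out of a single `X.backward()` call, which writes them into the parameters' `.grad` fields instead of returning them.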
I think our code is easier to understand when using vanilla SGD (i.e., `wd=0` and `momentum=0`).
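For reference, here is what the update map $\Phi$ looks like in both cases, written with the standard PyTorch SGD rule (weight decay added to the gradient, dampening 0, no Nesterov); this is written out for clarity and is not copied from the code:

$$\text{vanilla SGD:}\qquad \Phi(s, \lambda) = s - \eta\, \nabla_s L(s, \lambda),$$

$$\text{with weight decay } w \text{ and momentum } \mu:\qquad v \leftarrow \mu\, v + \nabla_s L(s, \lambda) + w\, s, \qquad \Phi(s, \lambda) = s - \eta\, v.$$

With momentum, the buffer $v$ effectively becomes part of the state, so the Jacobians $A$ and $B$ pick up extra terms; that is why the vanilla-SGD case is the easiest one to read.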
Can you give more math details about `meta_backward`? I just read the code and the ReverseHG paper, but I couldn't understand your code.