apache / mxnet

Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Scala, Go, Javascript and more
https://mxnet.apache.org
Apache License 2.0

[Discussion] Support Higher-order Gradient #5699

Closed · leezu closed this issue 6 years ago

leezu commented 7 years ago

MXNet currently only supports calculating the gradient of a loss constructed by chaining mxnet.symbol operations. This is unfortunately not enough to implement some more advanced training methods, such as the newly proposed Improved Training of Wasserstein GANs.

For that method the loss function itself will contain a reference to the gradient of the network with respect to the inputs.

(image: gradient penalty term from the WGAN-GP paper)
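For reference, the penalty term in question is roughly λ · E_x̂[(‖∇_x̂ D(x̂)‖₂ − 1)²]: the loss itself contains the norm of a gradient, so training requires differentiating through a gradient.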

Furthermore, for some reinforcement learning methods it is important to compute a Hessian-vector product. The key to computing the Hessian-vector product and the Hessian matrix automatically is the R-operator (the right product of the Jacobian matrix with an input vector). We can refer to the paper "Fast Exact Multiplication by the Hessian" and Theano's documentation of the R-operator for more details.
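Concretely, following Pearlmutter's paper, the R-operator applied to a function f(w) in a direction v is R_v{f(w)} = ∂/∂r f(w + r·v) |_{r=0} = J_f(w)·v, so applying it to the gradient dL/dw yields the Hessian-vector product R_v{dL/dw} = H·v.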

For that, we can add R_forward and R_backward functions to MXNet to compute R(output) and R(gradient) with respect to some parameter w. The workflow would be as follows: as in the usual forward-backward computation of the gradient, we first call net.forward() and net.backward() to get the output/gradient, and then call net.R_forward(param, v) and net.R_backward(param, v) to get the Hessian-vector product.

To compute the full Hessian matrix, we can call the Hessian-vector-product subroutine multiple times, once per basis vector (see the sketch below). We can refer to the Wikipedia article and Theano's implementation of hessian for details.
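As a toy illustration (not an MXNet API): column i of H is H·e_i, so n Hessian-vector products recover the full matrix. In this sketch the HVP is approximated by finite differences of the gradient, H·v ≈ (∇L(x + εv) − ∇L(x))/ε; loss_grad and hvp are hypothetical helpers used only for the example.

```python
import numpy as np

def loss_grad(x):
    # gradient of the toy scalar loss L(x) = sum(x^3), i.e. dL/dx = 3 x^2
    return 3.0 * x ** 2

def hvp(x, v, eps=1e-5):
    # finite-difference Hessian-vector product: H v ~= (grad(x + eps*v) - grad(x)) / eps
    return (loss_grad(x + eps * v) - loss_grad(x)) / eps

x = np.array([1.0, 2.0, 3.0])
n = x.size
# column i of the Hessian is H e_i, so n HVP calls recover the full matrix
H = np.stack([hvp(x, np.eye(n)[i]) for i in range(n)], axis=1)
print(H)  # ~= diag(6 * x) for this toy loss
```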

@sxjscience

piiswrong commented 7 years ago

There is no need for an R-op. We just need to differentiate d(v^T g)/dx, where g = dL/dx.

The problem here is that most ops don't support second-order derivatives. For a start you can try adding FGradient to the FullyConnected layer.
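A minimal sketch of this trick with the Python autograd API, assuming the operators used (here just exp and sum) support second-order gradients, which most ops currently do not:

```python
import mxnet as mx
from mxnet import nd, autograd

x = nd.array([1.0, 2.0, 3.0])
x.attach_grad()
v = nd.array([0.5, 0.5, 0.5])  # direction for the Hessian-vector product

with autograd.record():
    L = nd.exp(x).sum()                              # scalar loss, dL/dx = exp(x)
    g = autograd.grad(L, [x], create_graph=True,
                      retain_graph=True)[0]          # g = dL/dx, kept in the graph
    vg = (v * g).sum()                               # scalar v^T g
vg.backward()                                        # d(v^T g)/dx = H v
print(x.grad)                                        # ~= exp(x) * v since H = diag(exp(x))
```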

sxjscience commented 7 years ago

Would it cost too much memory if we choose to manually compute the second-order derivative?

tqchen commented 7 years ago

It is usually fine as long as the objective is a scalar. You don't really need the Hessian, only functions of the gradient.

nicklhy commented 7 years ago

Is it possible to implement Minpy's autograd feature in MXNet?

ZihengJiang commented 7 years ago

@nicklhy We are working on it; it is still experimental and not released yet: https://github.com/dmlc/mxnet/blob/master/python/mxnet/contrib/autograd.py#L172

furlat commented 7 years ago

Is there any update? Is this being discussed in another thread?

I would like to implement Sobolev Training for Neural Networks (https://arxiv.org/abs/1706.04859), which uses a gradient penalty for knowledge distillation.

furlat commented 7 years ago

Would it work now with gluon?

bravomikekilo commented 7 years ago

Is this problem solved?

yinxiaochuan commented 7 years ago

no

sxjscience commented 6 years ago

I think https://github.com/apache/incubator-mxnet/commit/37a651664854734b396a58598a0a501be5674b7e should have partially solved the problem.