apache / mxnet

Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Scala, Go, Javascript and more
https://mxnet.apache.org
Apache License 2.0

General support of OPs for second-order gradient #10002

Open lightingghost opened 6 years ago

lightingghost commented 6 years ago

Since I saw that MXNet's autograd package supports higher-order gradients, I tried to implement WGAN-GP with MXNet, but I got this error:

mxnet.base.MXNetError: [08:32:17] C:\projects\mxnet-distro-win\mxnet-build\nnvm\src\pass\gradient.cc:187: Operator _backward_Convolution is non-differentiable because it didn't register FGradient attribute.

It seems the convolution operator still does not support higher-order gradients?

sxjscience commented 6 years ago

Supporting higher order gradients is on the agenda. Refer to the discussion in https://github.com/apache/incubator-mxnet/issues/5699. Need to ping Eric @piiswrong

sxjscience commented 6 years ago

I find the issue is similar to https://github.com/apache/incubator-mxnet/issues/9979. Let's move the discussion there.

lightingghost commented 6 years ago

@sxjscience I think #9979 is a question rather than a feature request. In that case, the user should call autograd.grad with create_graph=True to solve the problem. In this case, however, it seems second-order derivatives are simply not implemented for some operators?

piiswrong commented 6 years ago

2nd order gradient is only implemented for a few operators like * and exp.
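
For example, something like the following sketch should already go through (not verified here; it just reuses the autograd.grad pattern from #9979, quoted below): x * x only touches elementwise multiplication, one of the supported operators, whereas x ** 2 would hit the unregistered _backward_power_scalar.

import mxnet.ndarray as nd
from mxnet import autograd

x = nd.array([3.0])
x.attach_grad()
with autograd.record():
    y = x * x  # elementwise mul: its backward pass is itself differentiable
    # keep the first derivative in the graph so it can be differentiated again
    dy_dx = autograd.grad(y, x, create_graph=True, retain_graph=True)[0]
dy_dx.backward()   # second differentiation
print(x.grad)      # expected: [2.], since d^2(x^2)/dx^2 = 2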

aodhan-domhnaill commented 6 years ago

I looked at the code, and it seems the problem is that the current implementation only carries the symbolic math down one level. For example, for sin,

// sin
MXNET_OPERATOR_REGISTER_UNARY_WITH_RSP(sin, cpu, mshadow_op::sin)
MXNET_ADD_SPARSE_OP_ALIAS(sin)
.describe(R"code(Computes the element-wise sine of the input array.

The input should be in radians (:math:`2\pi` rad equals 360 degrees).

.. math::
   sin([0, \pi/4, \pi/2]) = [0, 0.707, 1]

The storage type of ``sin`` output depends upon the input storage type:

   - sin(default) = default
   - sin(row_sparse) = row_sparse

)code" ADD_FILELINE)
.set_attr<nnvm::FGradient>("FGradient", ElemwiseGradUseIn{ "_backward_sin" });

MXNET_OPERATOR_REGISTER_BINARY_WITH_SPARSE_CPU_DR(_backward_sin, unary_bwd<mshadow_op::sin_grad>);

The fact that it registers _backward_sin instead of reusing cos is probably why the differentiation can't be chained any further down.

sxjscience commented 6 years ago

@lightingghost Borrowing the discussion from https://github.com/apache/incubator-mxnet/issues/9979 here:

import mxnet.ndarray as nd
from mxnet import autograd

x = nd.array([3.0])
x.attach_grad()
with autograd.record():
    y = x**2
    y_grad = autograd.grad(y, x, create_graph=True, retain_graph=True)[0]
    z = y_grad ** 2
z.backward()
print(x.grad)  # z.backward() accumulates dz/dx here
MXNetError: [12:44:29] src/pass/gradient.cc:187: Operator _backward_power_scalar is non-differentiable because it didn't register FGradient attribute.

sxjscience commented 6 years ago

@aidan-plenert-macdonald That's correct, we need to check the backward pass of these operators and register the gradient.

sxjscience commented 6 years ago

We need to investigate the current OPs and check whether their backward OPs have registered the gradient. Also, we need to add a new testing utility to help check the correctness of our implementation.

aodhan-domhnaill commented 6 years ago

@sxjscience Would it not be better to register the backward OPs as combinations of the forward ones, so the gradients can be chained indefinitely? Basically, not limiting the functionality to 2nd order, but allowing gradients of arbitrary (nth) order.

sxjscience commented 6 years ago

@aidan-plenert-macdonald Yes, we should do that.

sxjscience commented 6 years ago

@aidan-plenert-macdonald I suggest we first classify the OPs into three categories.

sxjscience commented 6 years ago

@lightingghost @aidan-plenert-macdonald Would you have time to work on this? You may refer to the implementation of transpose, which uses itself as the backward OP https://github.com/apache/incubator-mxnet/blob/master/src/operator/tensor/matrix_op.cc#L315-L333. We can first work on supporting the easier cases like

lightingghost commented 6 years ago

@sxjscience I would love to, but unfortunately I am not familiar with C++ or with MXNet's basic architecture. I need to get acquainted with them first, which may take time.

aodhan-domhnaill commented 6 years ago

@sxjscience Sure. I'll take the unary. I'll be working out of https://github.com/aidan-plenert-macdonald/incubator-mxnet/tree/n-derivative.

sxjscience commented 6 years ago

@lightingghost That's okay. You can ask any questions on GitHub or on the forum (https://discuss.mxnet.io) and we will reply to you ASAP. @aidan-plenert-macdonald Thanks a lot! I'll keep an eye on that. Just cc me in the future PR. Also, you can ask questions in the forum if you have any.

aodhan-domhnaill commented 6 years ago

@sxjscience How good are the current unit tests? If I make the changes, will the unit tests catch them? Also, the Python unit tests are currently failing. Is this expected?

sxjscience commented 6 years ago

The current unit tests should be correct, and we rely on them to test the correctness of PRs. Do you mean that the unit tests fail without changing any code?

aodhan-domhnaill commented 6 years ago

Yeah. See https://github.com/apache/incubator-mxnet/issues/10182

dmacd commented 6 years ago

Hi there. My team is keenly interested in building on MXNet, but we are starting to look at techniques requiring higher-order derivatives (also WGAN-GP, in particular). Is this capability being actively developed?

aodhan-domhnaill commented 6 years ago

@dmacd Probably not as fast as you would like it to be. This is my side-side project, so I put a few hours in every week or so. If you want to take lead on the development, I could help where I can.

feevos commented 6 years ago

Dear all, it would be great to have 2nd-order derivative support. There are some promising GAN training methods that use 2nd-order derivatives (e.g. WGAN-GP, SGA, etc.) for stabilizing training.

Thank you very much for all your efforts.

JohnCalhoun commented 6 years ago

I think this would be a great feature to add to MXNet, and I would like to work to make this happen. I have the C++/Python skills to do so, but I am new to the MXNet backend. I think I can follow the examples of the existing operators that support higher-order gradients (transpose, *, exp). Can someone experienced help me understand how to get started? Maybe @piiswrong?

eric-haibin-lin commented 6 years ago

@sxjscience do you have a design for higher-order gradients? @JohnCalhoun has offered to help implement some ops.

piyushghai commented 6 years ago

@JohnCalhoun Here's a useful developer guide that you can follow to get started : https://cwiki.apache.org/confluence/display/MXNET/Development

aodhan-domhnaill commented 6 years ago

@JohnCalhoun I have experience doing things like this. Below is a little info I have from when I started. Unfortunately, I am doing a lot at the moment, so I can't be the main contributor.

Trig Example

To start, let's just look at the unary trig ops. If you look at this line, you will notice the backward op. Note that it is called _backward_sin instead of just being some kind of reference to cosine.

The macro references a series of macros starting here, then NNVM here. We can see rough usage instructions here.

So it would appear that we need to fix lines like the one below,

.set_attr<nnvm::FGradient>("FGradient", ElemwiseGradUseIn{ "_backward_sin" });

and replace the ElemwiseGradUseIn with something that links the gradients back to back.

It appears that in the following line, the _backward_sin is registered using macros in the same way,

MXNET_OPERATOR_REGISTER_BINARY_WITH_SPARSE_CPU_DR(_backward_sin, unary_bwd<mshadow_op::sin_grad>);

It looks like all of these are registered here. Notice that only the forward op registers a gradient,

// sin
MXNET_OPERATOR_REGISTER_UNARY_WITH_RSP_CSR(sin, cpu, mshadow_op::sin)
.describe(R"code(Computes the element-wise sine of the input array.
The input should be in radians (:math:`2\pi` rad equals 360 degrees).
.. math::
   sin([0, \pi/4, \pi/2]) = [0, 0.707, 1]
The storage type of ``sin`` output depends upon the input storage type:
   - sin(default) = default
   - sin(row_sparse) = row_sparse
   - sin(csr) = csr
)code" ADD_FILELINE)
.set_attr<nnvm::FGradient>("FGradient", ElemwiseGradUseIn{ "_backward_sin" });

MXNET_OPERATOR_REGISTER_BINARY_WITH_SPARSE_CPU_DR(_backward_sin, unary_bwd<mshadow_op::sin_grad>);

but _backward_sin does not register a gradient of its own. This is most likely why we only get one level of gradients.
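
This can be reproduced from Python with the same pattern as the power example above; at the time of writing, a sketch like this should fail with the analogous non-differentiable error for _backward_sin:

import mxnet.ndarray as nd
from mxnet import autograd

x = nd.array([0.5])
x.attach_grad()
with autograd.record():
    y = nd.sin(x)
    # the first differentiation inserts a _backward_sin node into the graph
    dy_dx = autograd.grad(y, x, create_graph=True, retain_graph=True)[0]
dy_dx.backward()   # should raise: _backward_sin registers no FGradient of its own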

Possible Solution

Given this, my naive intuition is that a simple swap of,

.set_attr<nnvm::FGradient>("FGradient", ElemwiseGradUseIn{ "_backward_sin" });

for,

.set_attr<nnvm::FGradient>("FGradient", ElemwiseGradUseIn{ "cos" });

might give second order gradients for sin.

Sorry if that is confusing. I did similar work for TensorFlow, so I can help out as needed.

JohnCalhoun commented 6 years ago

Thank you @aidan-plenert-macdonald! So once I get up to speed and get an environment together, it sounds like the first steps are:

  1. implement the sin FGradient as you have suggested using cos
  2. test that you can then do the second order gradient of sin
  3. Implement the cos FGradient as sin
  4. test that you can do a third order gradient of sin
  5. From there, the work will be implementing this for all operations.

Does that sound about right?

apeforest commented 6 years ago

@JohnCalhoun I am also interested in implementing higher-order derivative auto-calculation in the MXNet backend. In fact, we have already created an epic ticket (https://issues.apache.org/jira/browse/MXNET-978) to track the progress and effort required. I also think we should architect this design carefully to make it scalable and easy to debug in the future. I was wondering if we can coordinate the effort with others who might also be interested: @sxjscience @aidan-plenert-macdonald @samskalicky?

apeforest commented 6 years ago

Another issue related to this feature: https://github.com/apache/incubator-mxnet/issues/12529

aodhan-domhnaill commented 5 years ago

@JohnCalhoun Yeah, you have the right idea. Those ops should be very simple; others may be more complex, but of course start simple and work up.

Here are the tests that sort of test the gradients; mostly they just test that the backward-pass op runs. Similar code here suggests that, somewhere, gradients can be figured out automatically. It's not clear what that line does, though.

@cjolivier01, we are trying to set up second order gradients in MXNet, and in the test code we see that you mention that the engine can automatically figure out the gradient for an op here. We would like to have it automatically figure out the second order gradient and test it. What methods determine the registered gradient of an operator?

@apeforest Loop me in when you plan

JohnCalhoun commented 5 years ago

@aidan-plenert-macdonald Should we write tests that check the accuracy of the gradient calculations and not just that the op runs? Also, it looks like there is not full test coverage of gradients and ops in those files? (This discussion might belong in a different thread.)

aodhan-domhnaill commented 5 years ago

@JohnCalhoun I would say to do testing in 3 different stages. IMO, each stage should be a separate PR.

  1. At first, simply check that each op (or the ops we want to test) has a second-order gradient registered, meaning: get the registered gradient and then assert that the gradient itself also has a gradient.
  2. Run the second-order gradient and assert that the output has the expected shape, e.g. 3x4 matrix in, 3x4 matrix out, or whatever we'd expect.
  3. Finite-difference check on the second-order gradients.

I recommend doing the first one before even implementing any ops. That way, you can quickly check which ops need to be fixed. Perhaps the gradient is wrong, but just forget that at this point. There are a lot of ops to fix. A simple scan to make sure that all the ops have second order gradients will be huge. Once this is all done, we could probably merge to master as is even without checking the gradients.

Once we have a significant amount of ops fixed, then check the shape to catch simple errors. May seem silly, but shape errors are easy to catch and cause a lot of problems downstream.

Only then, do a finite difference test. Hopefully, you will have minimal errors at this point.
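
As a sketch of what stage 3 could look like from the Python side (the helper names here are made up for illustration, not existing test utilities), a central finite difference can be compared against two passes of autograd:

import mxnet.ndarray as nd
from mxnet import autograd

def second_grad(f, x0):
    # Second derivative of scalar f at x0 via two differentiation passes.
    x = nd.array([x0])
    x.attach_grad()
    with autograd.record():
        y = f(x)
        dy_dx = autograd.grad(y, x, create_graph=True, retain_graph=True)[0]
    dy_dx.backward()
    return x.grad.asscalar()

def second_grad_fd(f, x0, eps=1e-2):
    # Central finite-difference estimate of the second derivative.
    f_p = f(nd.array([x0 + eps])).asscalar()
    f_0 = f(nd.array([x0])).asscalar()
    f_m = f(nd.array([x0 - eps])).asscalar()
    return (f_p - 2.0 * f_0 + f_m) / (eps ** 2)

# x * x * x sticks to elementwise mul, whose 2nd-order gradient is registered;
# d^2(x^3)/dx^2 = 6x, so both estimates should be about 12 at x0 = 2.
f = lambda x: x * x * x
assert abs(second_grad(f, 2.0) - second_grad_fd(f, 2.0)) < 0.1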

aodhan-domhnaill commented 5 years ago

@JohnCalhoun I believe I figured out how to get the gradient. See GetBackward()

op_ is set here and defined as nnvm::Op here

JohnCalhoun commented 5 years ago

@aidan-plenert-macdonald I like your idea, so the steps would be:

  1. write test case that
    1. lists all ops
    2. for each op check for gradient
    3. for each gradient check for gradient
  2. PR
  3. begin writing ops, for each add test case that checks shape
  4. PR
  5. write finite difference tests
  6. PR

Writing each op should be self-contained, so once we have a list of all the ops needed, we can distribute individual ops to people who want to help (such as myself and @apeforest) and check off ops as they are finished.

JohnCalhoun commented 5 years ago

Progress on the first part:

TEST(CORE_OP_RUNNER, CheckForGradients) {
  // List every registered operator and report whether it registers an FGradient attribute
  // (1 = gradient registered, 0 = non-differentiable as far as the gradient pass is concerned).
  std::vector<std::string> names = dmlc::Registry<Op>::ListAllNames();
  std::cout << "total ops: " << names.size() << "\n";
  for (const std::string& name : names) {
    const Op* op = dmlc::Registry<Op>::Find(name);
    auto gradient = op->GetAttr<nnvm::FGradient>("FGradient");
    std::cout << op->name << " " << gradient.count(op) << '\n';
  }
}

This is the output: out.txt

aodhan-domhnaill commented 5 years ago

@JohnCalhoun Yeah, that looks really good. I guess we need to figure out whether we can somehow register compound ops so we can do things like cos(x) -> -sin(x). I'll look into that. I'm happy to help write simple ops once we get there. You could make the test assemble a graph of ops to gradients and run something like Tarjan's algorithm to make sure the ops are fully connected, or at least get a count of how many are.

JohnCalhoun commented 5 years ago

@aidan-plenert-macdonald So I have been looking into how the unary ops work. I don't think it is possible to register compound ops. It seems the engine does not do arbitrary differentiation; it just looks up what was registered as the derivative. So -sin(x) has to be its own op, and there is no way to describe it as mul(-1, sin(x)) with mul and sin being ops. Am I understanding this right?

sin(x)->cos(x) is simple but I think that is a corner case.

It might not be possible to support arbitrary-order gradients without writing an endless number of backward* ops. Writing enough ops to support second-order gradients is for sure possible, though.

JohnCalhoun commented 5 years ago

More updates: I tried out the sin(x) -> cos(x) swap, so the code looked like this:

// sin
MXNET_OPERATOR_REGISTER_UNARY_WITH_RSP(sin, cpu, mshadow_op::sin)
MXNET_ADD_SPARSE_OP_ALIAS(sin)
.describe(R"code(Computes the element-wise sine of the input array.

The input should be in radians (:math:`2\pi` rad equals 360 degrees).

.. math::
   sin([0, \pi/4, \pi/2]) = [0, 0.707, 1]

The storage type of ``sin`` output depends upon the input storage type:

   - sin(default) = default
   - sin(row_sparse) = row_sparse

)code" ADD_FILELINE)
.set_attr<nnvm::FGradient>("FGradient", ElemwiseGradUseIn{ "cos" });

MXNET_OPERATOR_REGISTER_BINARY_WITH_SPARSE_CPU_DR(_backward_sin, unary_bwd<mshadow_op::sin_grad>);

But it does not work. This is because ElemwiseGradUseIn{ "cos" } != _backward_sin: the math is similar, but the function signatures don't match. Here is the error:

C++ exception with description "[19:00:10] ../src/io/../operator/elemwise_op_common.h:176: Check failed: in_attrs->size() == static_cast(n_in) (2 vs. 1) in operator

_backward_sin is actually a binary function (it takes the input and the incoming gradient), while cos is a unary operation.

Notice that the line

MXNET_OPERATOR_REGISTER_BINARY_WITH_SPARSE_CPU_DR(_backward_sin, unary_bwd<mshadow_op::sin_grad>);

is registering _backward_sin as a BINARY operator. Digging through the code, that macro does some magic that allows mshadow_op::sin_grad to take in two arguments but ignore the second.

It does not seem feasible to support arbitrary-order gradients, but supporting second-order gradients should still be good to go.

sxjscience commented 5 years ago

@JohnCalhoun @aidan-plenert-macdonald I've added support for the higher-order gradient of sin/cos in https://github.com/apache/incubator-mxnet/pull/12821. We can set the FGradient to be a combination of the forward nodes. If you have time, you can try to support the other unary operators.
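
With that PR in place, the earlier failing sin sketch should turn into something like this (again just a sketch, assuming a build that includes #12821):

import numpy as np
import mxnet.ndarray as nd
from mxnet import autograd

x = nd.array([0.5])
x.attach_grad()
with autograd.record():
    y = nd.sin(x)
    dy_dx = autograd.grad(y, x, create_graph=True, retain_graph=True)[0]  # cos(x)
dy_dx.backward()
print(x.grad.asscalar(), -np.sin(0.5))  # both should be about -0.479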

JohnCalhoun commented 5 years ago

@sxjscience Your example is extremely helpful and shows exactly how to implement the other operators. Is there any good documentation on how nnvm::Node works, or do I just need to study the code more?

Also, it would be great to see how you would write tests for these.

eric-haibin-lin commented 5 years ago

Shall we create a check list of targeted 2nd order grad operators with high, medium and low priority? @sxjscience

sxjscience commented 5 years ago

Yes, we should create one.

JohnCalhoun commented 5 years ago

Where would be the appropriate place to put that checklist? In an issue? Or in a JIRA ticket?

Probably in an issue, so we can link commits to the ops.

sxjscience commented 5 years ago

Issue +1

leopd commented 5 years ago

+1 to prioritizing. I think we should pick an order that allows real applications to be built as quickly as possible. There are lots of applications: better optimization algorithms, GAN training, RL algorithms, neural architecture search. Each of these requires having the second derivative for every op in a network. So we should pick useful network architectures where we can get the 2nd derivative for the entire network.
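
To make that concrete, here is a hedged sketch of the WGAN-GP gradient penalty mentioned earlier (critic is a hypothetical callable, e.g. a Gluon block; nothing below is an existing MXNet utility). Backpropagating the penalty requires a registered FGradient on the backward op of every layer the critic uses, which is exactly what this issue tracks.

import mxnet.ndarray as nd
from mxnet import autograd

def gradient_penalty(critic, real, fake, lambda_gp=10.0):
    # WGAN-GP: lambda * (||d critic(x_hat) / d x_hat||_2 - 1)^2 at random interpolates.
    eps = nd.random.uniform(shape=(real.shape[0], 1, 1, 1))  # assumes NCHW image batches
    x_hat = eps * real + (1.0 - eps) * fake
    x_hat.attach_grad()
    with autograd.record():
        score = critic(x_hat)
        # keep this gradient in the graph so the penalty itself can be backpropagated
        grads = autograd.grad(score, x_hat, create_graph=True, retain_graph=True)[0]
        flat = grads.reshape((grads.shape[0], -1))
        norm = nd.sqrt(nd.sum(flat * flat, axis=1) + 1e-12)
        penalty = lambda_gp * nd.mean((norm - 1.0) ** 2)
    return penalty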

In order to get something working as quickly as possible, that implies to me starting with the simplest useful network architectures, and then moving towards progressively more complex architectures ordered by how useful/important they are. This makes me think the order should be approximately:

Something like that for order of architecture types. But I do think it makes sense to start with MLP since that's the easiest way to get an end-to-end example working, and does cover some interesting real-world use cases. Also MLP requires a pretty short list of ops. I think it's basically:

JohnCalhoun commented 5 years ago

Good note, @leopd. Where would be the appropriate place to track progress and priorities on these 100+ ops? In a GitHub issue? An issue for each op? Or the JIRA tickets?

apeforest commented 5 years ago

There is already an epic created: https://issues.apache.org/jira/browse/MXNET-978. I suggest we add tasks/stories to this epic and prioritize there.

JohnCalhoun commented 5 years ago

I am good with that. I can put together a list of all ops that need FGradients defined.

JohnCalhoun commented 5 years ago

Related question: ignoring prioritization, are there ops where gradients don't make sense, or ops that should not have an FGradient?

The following list has all ops that do not have an FGradient registered. There are 213 of them, so we should iron out a good process for assigning and tracking progress. Some ops will depend on others, some might be removable, etc. list.txt

JohnCalhoun commented 5 years ago

Update: posted the checklist on the JIRA issue https://issues.apache.org/jira/browse/MXNET-978

apeforest commented 5 years ago

@sandeep-krishnamurthy please add label [Operator]