lightingghost opened this issue 6 years ago
Supporting higher-order gradients is on the agenda. Refer to the discussion in https://github.com/apache/incubator-mxnet/issues/5699. Need to ping Eric @piiswrong
I find this issue similar to https://github.com/apache/incubator-mxnet/issues/9979. Let's move the discussion there.
@sxjscience I think #9979 is a question rather than a feature request. In his case, he should use autograd.grad with create_graph=True to solve the problem. In this case, however, it seems the second-order derivative is not implemented for some operators?
The 2nd-order gradient is only implemented for a few operators, like * and exp.
I looked at the code, and it seems the problem is that the current implementation only does the symbolic math one level down. For example, for sin:
// sin
MXNET_OPERATOR_REGISTER_UNARY_WITH_RSP(sin, cpu, mshadow_op::sin)
MXNET_ADD_SPARSE_OP_ALIAS(sin)
.describe(R"code(Computes the element-wise sine of the input array.
The input should be in radians (:math:`2\pi` rad equals 360 degrees).
.. math::
sin([0, \pi/4, \pi/2]) = [0, 0.707, 1]
The storage type of ``sin`` output depends upon the input storage type:
- sin(default) = default
- sin(row_sparse) = row_sparse
)code" ADD_FILELINE)
.set_attr<nnvm::FGradient>("FGradient", ElemwiseGradUseIn{ "_backward_sin" });
MXNET_OPERATOR_REGISTER_BINARY_WITH_SPARSE_CPU_DR(_backward_sin, unary_bwd<mshadow_op::sin_grad>);
The fact that it names it _backward_sin instead of using cos is probably why this can't chain the differentiation down.
@lightingghost Borrowing the discussion from https://github.com/apache/incubator-mxnet/issues/9979 here:
import mxnet.ndarray as nd
from mxnet import autograd
x = nd.array([3.0])
x.attach_grad()
with autograd.record():
y = x**2
y_grad = autograd.grad(y, x, create_graph=True, retain_graph=True)[0]
z = y_grad ** 2
z.backward()
print(x.grad)
MXNetError: [12:44:29] src/pass/gradient.cc:187: Operator _backward_power_scalar is non-differentiable because it didn't register FGradient attribute.
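For reference, if the snippet above ran to completion, x.grad should hold dz/dx, where z = (dy/dx)² = (2x)² = 4x², i.e. 8x = 24 at x = 3. A quick finite-difference check of that expected value in plain Python (no MXNet required):

```python
# Finite-difference sanity check for the example above:
# y = x**2, so y_grad = dy/dx = 2x, z = y_grad**2 = 4x**2, dz/dx = 8x.
def z(x):
    return (2.0 * x) ** 2

eps = 1e-5
x0 = 3.0
dz_dx = (z(x0 + eps) - z(x0 - eps)) / (2 * eps)
print(dz_dx)  # ~24.0, matching 8 * 3
```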
@aidan-plenert-macdonald That's correct, we need to check the backward pass of these operators and register the gradient.
We need to investigate the current OPs and check whether their backward OPs have registered the gradient. Also, we need to add a new testing utility to help check the correctness of our implementation.
@sxjscience Would it not be better to register the backward OPs as combinations of the forward ones so the gradients can be computed endlessly? Basically, not limiting the functionality to 2nd order, but offering the option of going to nth order.
@aidan-plenert-macdonald Yes, we should do that.
@aidan-plenert-macdonald I suggest we should first classify the OPs into three categories.
@lightingghost @aidan-plenert-macdonald Would you have time to work on this? You may refer to the implementation of transpose, which uses itself as the backward OP https://github.com/apache/incubator-mxnet/blob/master/src/operator/tensor/matrix_op.cc#L315-L333. We can first work on supporting the easier cases like
@sxjscience I would love to, but unfortunately I am not familiar with C++, nor with the basic MXNet architecture. I need to get acquainted with them first, which may take time.
@sxjscience Sure. I'll take the unary. I'll be working out of https://github.com/aidan-plenert-macdonald/incubator-mxnet/tree/n-derivative.
@lightingghost That's okay. You can ask any questions on GitHub or in the forum (https://discuss.mxnet.io) and we will reply to you ASAP. @aidan-plenert-macdonald Thanks a lot! I'll keep an eye on that. Just cc me in the future PR. Also, you can ask questions in the forum if you have any.
@sxjscience How good are the current unit tests? If I make the changes will the unit tests catch the change? And currently, the python unit tests are failing. Is this something expected?
The current unit test should be correct and we rely on it to test the correctness of PRs. Do you mean that the unit test fails without changing any code?
Hi there. My team is keenly interested in building on MXNet, but we are starting to look at techniques requiring higher-order derivatives (WGAN-GP in particular). Is this capability being actively developed?
@dmacd Probably not as fast as you would like it to be. This is my side-side project, so I put a few hours in every week or so. If you want to take lead on the development, I could help where I can.
Dear all, it would be great to have 2nd-order derivative support. There are some promising GAN training methods that use 2nd-order derivatives (e.g. WGAN-GP, SGA) for stabilizing training.
Thank you very much for all your efforts.
I think this would be a great feature to add to MXNet, and I would like to work to make this happen. I have the C++/Python skills to do so, but I am new to the MXNet backend. I think I can follow the examples of the existing operators that support higher-order gradients (transpose, *, exp). Can someone experienced help me understand how to get started? Maybe @piiswrong?
@sxjscience do you have a design for higher order gradient? @JohnCalhoun offers to help implement some ops
@JohnCalhoun Here's a useful developer guide that you can follow to get started : https://cwiki.apache.org/confluence/display/MXNET/Development
@JohnCalhoun I have experience doing things like this. Below is a little info I have from when I started. Unfortunately, I am doing a lot at the moment, so I can't be the main contributor.
To start, let's just look at the unary trig ops. If you look at this line, you will notice the backward op. Note that it is called _backward_sin instead of just being a reference to cos.
The macro references a series of macros starting here, then NNVM here. We can see rough usage instructions here.
So it would appear that we need to fix lines like

.set_attr<nnvm::FGradient>("FGradient", ElemwiseGradUseIn{ "_backward_sin" });

and replace the ElemwiseGradUseIn with something that links the gradients back to back.
It appears that in the following line, _backward_sin is registered using macros in the same way:
MXNET_OPERATOR_REGISTER_BINARY_WITH_SPARSE_CPU_DR(_backward_sin, unary_bwd<mshadow_op::sin_grad>);
It looks like all of these are registered here. Notice that only the forward op registers a gradient,
// sin
MXNET_OPERATOR_REGISTER_UNARY_WITH_RSP_CSR(sin, cpu, mshadow_op::sin)
.describe(R"code(Computes the element-wise sine of the input array.
The input should be in radians (:math:`2\pi` rad equals 360 degrees).
.. math::
sin([0, \pi/4, \pi/2]) = [0, 0.707, 1]
The storage type of ``sin`` output depends upon the input storage type:
- sin(default) = default
- sin(row_sparse) = row_sparse
- sin(csr) = csr
)code" ADD_FILELINE)
.set_attr<nnvm::FGradient>("FGradient", ElemwiseGradUseIn{ "_backward_sin" });
MXNET_OPERATOR_REGISTER_BINARY_WITH_SPARSE_CPU_DR(_backward_sin, unary_bwd<mshadow_op::sin_grad>);
but _backward_sin doesn't register a gradient. This is most likely why we only get one level of gradients.
Given this, my naive intuition is that a simple swap of,
.set_attr<nnvm::FGradient>("FGradient", ElemwiseGradUseIn{ "_backward_sin" });
for,
.set_attr<nnvm::FGradient>("FGradient", ElemwiseGradUseIn{ "cos" });
might give second order gradients for sin.
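For the record, the calculus behind the proposed swap is sound; a quick plain-Python finite-difference sanity check of d/dx sin(x) = cos(x):

```python
import math

# Numerical check of the identity behind the proposed swap:
# d/dx sin(x) = cos(x).
eps = 1e-6
x0 = 1.2
fd = (math.sin(x0 + eps) - math.sin(x0 - eps)) / (2 * eps)
print(fd, math.cos(x0))  # the two values agree closely
```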
Sorry if that is confusing. I did similar work to this for Tensorflow, so I can help out as needed.
Thank you @aidan-plenert-macdonald! So once I get up to speed and get an environment together, it sounds like the first steps are:
Does that sound about right?
@JohnCalhoun I am also interested in implementing the higher order derivative auto calculation in MXNet backend. In fact, we have already created an epic ticket (https://issues.apache.org/jira/browse/MXNET-978) to track the progress and effort required. I also think we may want to architect this design carefully in order to make it scalable and easy to debug in the future. I was wondering if we can coordinate the effort together with others who might also be interested @sxjscience @aidan-plenert-macdonald @samskalicky?
Another issue related to this feature: https://github.com/apache/incubator-mxnet/issues/12529
@JohnCalhoun Yeah, you have the right idea. Those should be very simple; other ones may be more complex, but of course start simple and work up.
Here are the tests that sort of test the gradients; mostly they just test that the backward-pass op works. Similar code here suggests that somewhere, gradients can be figured out. It's not clear what that line does, though.
@cjolivier01, we are trying to set up second order gradients in MXNet, and in the test code we see that you mention that the engine can automatically figure out the gradient for an op here. We would like to have it automatically figure out the second order gradient and test it. What methods determine the registered gradient of an operator?
@apeforest Loop me in when you plan
@aidan-plenert-macdonald Should we write tests that check for the accuracy of gradient calculations, and not just that the op runs? Also, it looks like there is not full test coverage of gradients and ops in those files? (This discussion might belong in a different thread.)
@JohnCalhoun I would say to do testing in 3 different stages. IMO, each stage should be a separate PR.
I recommend doing the first one before even implementing any ops. That way, you can quickly check which ops need to be fixed. Perhaps the gradient is wrong, but just forget that at this point. There are a lot of ops to fix. A simple scan to make sure that all the ops have second order gradients will be huge. Once this is all done, we could probably merge to master as is even without checking the gradients.
Once we have a significant amount of ops fixed, then check the shape to catch simple errors. May seem silly, but shape errors are easy to catch and cause a lot of problems downstream.
Only then, do a finite difference test. Hopefully, you will have minimal errors at this point.
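As a sketch of that last stage, a minimal central-difference second-derivative checker in plain Python (the MXNet wiring is omitted; the idea is just to compare autograd's second-order output against this estimate):

```python
import math

def numeric_second_derivative(f, x, eps=1e-4):
    """Central-difference estimate of f''(x)."""
    return (f(x + eps) - 2.0 * f(x) + f(x - eps)) / (eps * eps)

# Example: d^2/dx^2 sin(x) = -sin(x).
x0 = 0.5
approx = numeric_second_derivative(math.sin, x0)
print(approx, -math.sin(x0))  # the two values agree closely
```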
@JohnCalhoun I believe I figured out how to get the gradient. See GetBackward()
@aidan-plenert-macdonald I like your idea, so the steps would be:
Writing each op should be self-contained, so once we have a list of all the ops needed, we can distribute individual ops to people (who want to help, such as myself and @apeforest) and check off ops as they are finished.
Progress on the first part:
TEST(CORE_OP_RUNNER, CheckForGradients) {
  std::vector<std::string> names = dmlc::Registry<Op>::ListAllNames();
  std::cout << "total ops: " << names.size() << "\n";
  for (const std::string& name : names) {
    const Op* op = dmlc::Registry<Op>::Find(name);
    auto gradient = op->GetAttr<nnvm::FGradient>("FGradient");
    std::cout << op->name << " " << gradient.count(op) << "\n";
  }
}
This is the output: out.txt
@JohnCalhoun Yeah, that looks really good. I guess we need to figure out if we can somehow register compound ops so we can do things like cos(x) -> -sin(x). I'll look into that. I'm happy to help write simple ops once we get there. You could make the test assemble a graph of ops to gradients and run something like Tarjan's algorithm to make sure the ops are fully connected, or at least get a count of how many are fully connected.
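A rough plain-Python sketch of that scan, modeling the FGradient registrations as a lookup table (the table entries below are hypothetical stand-ins for illustration, not MXNet's real registrations):

```python
# Hypothetical sketch: model FGradient registrations as a graph
# (op -> its backward op) and count, for each op, how many levels of
# differentiation are available before the chain dead-ends at an op
# with no registered gradient.
registrations = {
    "sin": "_backward_sin",
    "_backward_sin": None,     # no FGradient: chain stops at order 1
    "exp": "exp",              # self-referential: arbitrary order
    "transpose": "transpose",  # ditto
}

def gradient_order(op, limit=10):
    """Number of times op can be differentiated (capped at limit)."""
    order = 0
    while order < limit and registrations.get(op) is not None:
        op = registrations[op]
        order += 1
    return order

for op in ("sin", "exp", "transpose"):
    print(op, gradient_order(op))
```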
@aidan-plenert-macdonald So I have been looking into how the unary ops work. I don't think it is possible to register compound ops. It seems the engine is not doing arbitrary differentiation; it just looks up what was registered as the derivative. So -sin(x) has to be its own op, and there is no way to describe it as mul(-1, sin(x)) with mul and sin being ops. Am I understanding this right?
sin(x) -> cos(x) is simple, but I think that is a corner case.
It might not be possible to support arbitrary-order gradients without writing an endless number of backward* ops. Writing enough ops to support second-order gradients is for sure possible.
More updates: I tried out the sin(x) -> cos(x) change, so the code looked like this:
// sin
MXNET_OPERATOR_REGISTER_UNARY_WITH_RSP(sin, cpu, mshadow_op::sin)
MXNET_ADD_SPARSE_OP_ALIAS(sin)
.describe(R"code(Computes the element-wise sine of the input array.
The input should be in radians (:math:`2\pi` rad equals 360 degrees).
.. math::
sin([0, \pi/4, \pi/2]) = [0, 0.707, 1]
The storage type of ``sin`` output depends upon the input storage type:
- sin(default) = default
- sin(row_sparse) = row_sparse
)code" ADD_FILELINE)
.set_attr<nnvm::FGradient>("FGradient", ElemwiseGradUseIn{ "cos" });
MXNET_OPERATOR_REGISTER_BINARY_WITH_SPARSE_CPU_DR(_backward_sin, unary_bwd<mshadow_op::sin_grad>);
but it does not work. This is because ElemwiseGradUseIn{ "cos" } != _backward_sin: the math is similar, but the function signatures don't match. Here is the error:
C++ exception with description "[19:00:10] ../src/io/../operator/elemwise_op_common.h:176: Check failed: in_attrs->size() == static_cast<size_t>(n_in) (2 vs. 1) in operator"
_backward_sin is actually a binary function (it takes the input and the incoming gradient), while cos is a unary operation.
Notice that the line
MXNET_OPERATOR_REGISTER_BINARY_WITH_SPARSE_CPU_DR(_backward_sin, unary_bwd<mshadow_op::sin_grad>);
is registering _backward_sin as a BINARY operator. Digging through the code, that macro does some magic that allows mshadow_op::sin_grad to take in two arguments but ignore the second.
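A plain-Python sketch of the signature mismatch (unary_bwd here is a stand-in illustrating the chain rule out_grad * f'(x), not the macro's actual implementation):

```python
import math

def unary_bwd(grad_op):
    """Sketch of what a registered backward op computes: it is binary,
    taking the incoming output gradient and the forward input, and
    combining them via the chain rule."""
    def backward(out_grad, x):
        return out_grad * grad_op(x)
    return backward

# _backward_sin(out_grad, x) = out_grad * cos(x): a binary function.
backward_sin = unary_bwd(math.cos)

# Plugging plain cos in as the gradient fails because cos is unary:
# cos(out_grad, x) would raise a TypeError, mirroring the 2-vs-1
# input check in the error above.
print(backward_sin(2.0, 0.0))  # 2.0 * cos(0) = 2.0
```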
It does not seem feasible to try to support arbitrary-order gradients, but supporting second-order gradients should still be good to go.
@JohnCalhoun @aidan-plenert-macdonald I've supported the higher-order gradient of sin/cos in https://github.com/apache/incubator-mxnet/pull/12821. We can set the FGradient to be a combination of the forward nodes. If you have time, you can try to support the other unary operators.
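The idea of building FGradient from forward nodes can be illustrated with a toy plain-Python autodiff (hypothetical classes, not the NNVM API): because sin's gradient graph is made of cos/mul/const nodes that themselves have gradients, it can be differentiated to any order.

```python
import math

# Toy expression nodes; each knows how to evaluate itself and how to
# build its gradient *as another expression graph* of forward nodes.
class Var:
    def eval(self, x): return x
    def grad(self): return Const(1.0)

class Const:
    def __init__(self, c): self.c = c
    def eval(self, x): return self.c
    def grad(self): return Const(0.0)

class Sin:
    def __init__(self, a): self.a = a
    def eval(self, x): return math.sin(self.a.eval(x))
    def grad(self):  # d sin(a) = cos(a) * a'
        return Mul(Cos(self.a), self.a.grad())

class Cos:
    def __init__(self, a): self.a = a
    def eval(self, x): return math.cos(self.a.eval(x))
    def grad(self):  # d cos(a) = -sin(a) * a'
        return Mul(Const(-1.0), Mul(Sin(self.a), self.a.grad()))

class Mul:
    def __init__(self, a, b): self.a, self.b = a, b
    def eval(self, x): return self.a.eval(x) * self.b.eval(x)
    def grad(self):  # product rule
        return Add(Mul(self.a.grad(), self.b), Mul(self.a, self.b.grad()))

class Add:
    def __init__(self, a, b): self.a, self.b = a, b
    def eval(self, x): return self.a.eval(x) + self.b.eval(x)
    def grad(self):
        return Add(self.a.grad(), self.b.grad())

# Because every gradient is itself a graph of forward ops, we can
# differentiate to any order: d^2/dx^2 sin(x) = -sin(x).
f = Sin(Var())
d2f = f.grad().grad()
x0 = 0.7
print(d2f.eval(x0), -math.sin(x0))  # the two values agree
```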
@sxjscience Your example is extremely helpful and provides a great template for how to implement the other operators. Is there any good documentation on how nnvm::Node works, or do I just need to study the code more?
Also, it would be great to see how you would write tests for these.
Shall we create a check list of targeted 2nd order grad operators with high, medium and low priority? @sxjscience
Yes, we should create one.
Where would be the appropriate place to put that checklist? In an issue, or in a JIRA ticket?
Probably in an issue, so we can mark commits against the ops.
Issue +1
+1 to prioritizing. I think we should pick an order that allows real applications to be built as quickly as possible. There are lots of applications, like better optimization algorithms, GAN training, RL algorithms, and neural architecture search. Each of these requires having the second derivative for every op in a network, so we should pick useful network architectures where we can get the 2nd derivative for the entire network.
In order to get something working as quickly as possible, that implies to me starting with the simplest useful network architectures, and then moving towards progressively more complex architectures ordered by how useful/important they are. This makes me think the order should be approximately:
Something like that for the order of architecture types. But I do think it makes sense to start with MLP, since that's the easiest way to get an end-to-end example working, and it does cover some interesting real-world use cases. Also, MLP requires a pretty short list of ops. I think it's basically:
Good note, @leopd. Where would be the appropriate place to track progress and priorities on these 100+ ops? In a GitHub issue? An issue for each op? Or in the JIRA tickets?
There is already an epic created: https://issues.apache.org/jira/browse/MXNET-978 I suggest we add tasks/stories to this epic and prioritize in there.
I am good with that. I can put together a list of all ops that need an FGradient defined.
A related question: ignoring prioritization, are there ops where gradients don't make sense, or ops that should not have an FGradient?
The following list has all the ops that do not have an FGradient registered. There are 213 of them, so we should iron out a good process for assigning and tracking progress. Some ops will be dependent on others, and some might be able to be removed, etc.: list.txt
Update: posted the checklist on the JIRA issue https://issues.apache.org/jira/browse/MXNET-978
@sandeep-krishnamurthy please add label [Operator]
As I saw MXNet has the autograd package to support higher-order gradients, I tried to implement WGAN-GP with MXNet, but I got an error. It seems the convolution operator still does not support higher-order gradients?