We also tried to reduce the size of the softmax with sparse updates:

vector<Expression> w(candsInt.size());
for (unsigned i = 0; i < candsInt.size(); ++i)
  w[i] = lookup(*hg, paramp2c, candsInt[i]);
Expression W = concatenate(w);
Expression x = W * v;
Expression adistecand = log_softmax(x);

and we observe the same problem. Thanks in advance.
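For context, here is a minimal sketch of how the restricted distribution above would typically be turned into a training loss; gold_pos (the position of the gold word within candsInt) is a hypothetical name and not part of the original snippet:

// adistecand is a log-distribution over the candidates in candsInt;
// pick the gold candidate's log-probability and negate it to get the loss.
unsigned gold_pos = 0;  // assumed: index of the gold word within candsInt
Expression loss = -pick(adistecand, gold_pos);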
@elikip Thanks for sending this! This strongly indicates that there is a problem somewhere in the lines quoted in this post. One more question: does it take a long time for this problem to appear? Or is it pretty quick?
Thank you for your quick response! Regarding your question, this problem appears immediately when updating the first mini-batch.
We also tried using the ClassFactoredSoftmaxBuilder available in the DyNet framework, and we observe a similar issue with it. Running without auto-batching works (log-likelihood and perplexity go down), while the same code with auto-batching does not converge and crashes after a short while. Thanks again.
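For readers unfamiliar with it, a rough sketch of how ClassFactoredSoftmaxBuilder (dynet/cfsm-builder.h) is typically used; the constructor arguments and variable names below are from memory and may differ between DyNet versions:

#include "dynet/cfsm-builder.h"

// Build the class-factored softmax from a word-to-cluster file
// (exact constructor signature may vary across versions).
ClassFactoredSoftmaxBuilder cfsm(hidden_dim, cluster_file, word_dict, model);

// Per sentence / minibatch:
cfsm.new_graph(cg);
Expression loss = cfsm.neg_log_softmax(hidden_state, word_id);  // -log p(word | hidden_state)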
FYI: I'm working on this and added some tests of autobatched select_rows and log_softmax (here: https://github.com/clab/dynet/pull/598), but neither of them failed in my environment. If you don't mind, could you check out this PR (I'll merge it into master soon, so you can update from master when that happens), run test/test-nodes, and see whether any tests related to autobatching fail? If none do, then I'll know it's not an environment thing.
It would be even better if you could add a test to test-nodes.cc that reproduces your problem, but no problem if you can't do that; I'll be able to try again tomorrow or so.
I am reasonably sure that this might be a problem with batched matrix multiplication. Adding the following test case:
BOOST_AUTO_TEST_CASE( affine_batch4_gradient ) {
  dynet::ComputationGraph cg;
  Expression x2 = reshape(parameter(cg, param_cube1), Dim({3, 3}, 3));
  Expression inp = input(cg, {3}, ones3_vals);
  Expression y = sqrt(x2 * inp);
  Expression z = sum_batches(sum_elems(y));
  BOOST_CHECK(check_grad(mod, z, 0));
}
makes the test fail even without autobatching. The same happens with affine_transform.
My guess is that (batched matrix) × vector products might be broken. The problem in @elikip's code snippet might be that with autobatching, `W * v` becomes batched in both `W` and `v`, hence the error. I am not 100% sure; I'm going to take a look at the code.
Hmm, thanks for the test. I can take a look too, but not for a while, so if you figure out the problem that'd be great!
Hmm, after further testing the error seems to have stemmed from the use of sqrt, so we're back to square one.
Thanks @neubig! And @pmichel31415: after running ./test/test-nodes we get the following in our environment (@elikip and I share the same environment for this):
unknown location(0): fatal error in "contract3d_1d_gradient": std::runtime_error: InnerProduct3D_1D::forward_dev_impl disabled on CUDA. Comment out DYNET_SKIP_CUDA_CONTRACTIONS in nodes-contract.cc to enable this function.
/dynet/tests/test-nodes.cc(546): last checkpoint
unknown location(0): fatal error in "contract3d_1d_1d_gradient": std::runtime_error: InnerProduct3D_1D_1D::forward_dev_impl disabled on CUDA. Comment out DYNET_SKIP_CUDA_CONTRACTIONS in nodes-contract.cc to enable this function.
/dynet/tests/test-nodes.cc(558): last checkpoint
unknown location(0): fatal error in "restricted_log_softmax_gradient": std::runtime_error: RestrictedLogSoftmax not yet implemented for CUDA (contributions welcome!)
/dynet/tests/test-nodes.cc(743): last checkpoint
unknown location(0): fatal error in "sparsemax_gradient": std::runtime_error: Sparsemax not implemented for CUDA
/dynet/tests/test-nodes.cc(780): last checkpoint
unknown location(0): fatal error in "sparsemax_loss_gradient": std::runtime_error: SparsemaxLoss not implemented for CUDA
/dynet/tests/test-nodes.cc(789): last checkpoint
unknown location(0): fatal error in "trace_of_product_gradient": std::runtime_error: TraceOfProduct not yet implemented for CUDA
/dynet/tests/test-nodes.cc(928): last checkpoint
unknown location(0): fatal error in "kmax_pooling_keq1_gradient": std::runtime_error: KMaxPooling::forward_dev_impl not working on CUDA yet
/u/miguelba/cfsm/dynet/tests/test-nodes.cc(1155): last checkpoint
unknown location(0): fatal error in "kmax_pooling_keq2_gradient": std::runtime_error: KMaxPooling::forward_dev_impl not working on CUDA yet
/dynet/tests/test-nodes.cc(1164): last checkpoint
/dynet/tests/test-nodes.cc(1230): error in "conv2d_same_gradient": check check_grad(mod, z, 0) failed
Does this make sense?
Hmm, all these error messages are known issues (mostly things not implemented in CUDA). They are most probably not related to your problem.
OK, I think I was able to reproduce this to some extent, so I'll try to figure out the source of the problem in the next day or so.
@neubig thanks for this. After your commit we pulled the latest version from master and tried our code with CFSM; it crashed in the same way when we used --dynet-autobatch 1 (with --dynet-autobatch 0 it works).
We also tried on CPU, just in case the GPUs were the problem, and it turns out that the code works with --dynet-autobatch 1! So it seems to be a GPU problem(?). Any idea why? Is there something we can do on our end?
Thanks!
I just made a change in #604 that has the potential to fix this problem. Could you try pulling from master one more time and seeing if this fixes the problem?
Same issue ("Magnitude of gradient is bad: nan"), I'm afraid. We use CUDA 8.0 and cuDNN 5.1.
no autobatching, CFSM, CPU: 35 seconds per batch
autobatching, CFSM, CPU: 9 seconds per batch
no autobatching, CFSM, GPU: 7 seconds per batch
autobatching, CFSM, GPU: does not work
OK, I'll take another look tomorrow (Asia time). One thing that would really help, if you have time, is a test case (similar to the ones in test-nodes.cc or test-rnn.cc) that you can confirm breaks on the GPU.
@neubig we are trying to create the test case. The issue is that we managed to run CFSM with autobatching in the DyNet CFSM example code, so there must be something that interacts with our network, which is more complex. It seems that trainer.update() crashes, while the forward(loss) and backward(loss) calls that happen just before it work. Or maybe the gradients become NaN in backward, and then the update crashes.
In any case, I would expect autobatching and no autobatching to have the same behavior, am I right? If I run the same code without the autobatching flag, it works.
The gradient magnitude error is triggered by the update function, but the issue probably arises from the forward/backward computation.
Actually, in order to identify the source of the problem, it might be worth checking the node/gradient values after the forward/backward pass. So the test would look like this: run the backward pass with full=true (so that gradients are computed for all nodes, not just parameters) and check for NaNs; a sketch follows below. Alternatively, you can just look at the expression/gradient norms instead, since you're only looking for NaNs.
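A minimal sketch of that check, assuming cg is the ComputationGraph, loss is the loss Expression, and exprs is a std::vector<Expression> of the nodes you want to inspect (these names are placeholders, and the value()/gradient() accessors may differ slightly between DyNet versions):

#include <cmath>
#include <iostream>
#include <vector>
#include "dynet/dynet.h"
#include "dynet/expr.h"

// Returns true if any element is NaN or inf.
bool has_bad(const std::vector<float>& v) {
  for (float x : v)
    if (std::isnan(x) || std::isinf(x)) return true;
  return false;
}

void check_graph(dynet::ComputationGraph& cg, dynet::Expression& loss,
                 std::vector<dynet::Expression>& exprs) {
  cg.forward(loss);
  cg.backward(loss, /*full=*/true);  // full=true also computes gradients of non-parameter nodes
  for (size_t i = 0; i < exprs.size(); ++i) {
    if (has_bad(dynet::as_vector(exprs[i].value())))
      std::cerr << "bad value in node " << i << std::endl;
    if (has_bad(dynet::as_vector(exprs[i].gradient())))
      std::cerr << "bad gradient in node " << i << std::endl;
  }
}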
Thanks @pmichel31415! How can I print all the expression values? For example, the LSTMs have inner matrices that we cannot access.
In C++ you can access the weights directly as class attributes, although to be fair the documentation is a bit lacking in this regard. For VanillaLSTMBuilder you should check lstm.param_vars, which is a vector of vectors of expressions, the layout of which is explained here. You can also access all internal states with lstm.h and lstm.c, which are vectors with one element per time step (see the sketch at the end of this comment).
My guess is that the problem has something to do with the softmax rather than the LSTM, but you might as well check everything.
Also, there is a more systematic way of checking every node (resp. parameter) by accessing them through the computation graph (resp. model) directly, but then you'll lose the information about which node is which.
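Here is a sketch of dumping those internals after a forward pass, assuming a VanillaLSTMBuilder named lstm; the exact layout of param_vars, h, and c may differ between DyNet versions (h[t] and c[t] are assumed to hold one Expression per layer):

#include <iostream>
#include <vector>
#include "dynet/lstm.h"

// Squared L2 norm of a flattened tensor, used to spot NaN/inf blow-ups.
double sq_norm(const std::vector<float>& v) {
  double s = 0;
  for (float x : v) s += double(x) * x;
  return s;
}

void dump_lstm(dynet::VanillaLSTMBuilder& lstm) {
  // Per-layer weight matrices and biases.
  for (size_t layer = 0; layer < lstm.param_vars.size(); ++layer)
    for (size_t j = 0; j < lstm.param_vars[layer].size(); ++j)
      std::cerr << "layer " << layer << " param " << j << " |W|^2 = "
                << sq_norm(dynet::as_vector(lstm.param_vars[layer][j].value())) << std::endl;
  // Top-layer hidden and cell states, one entry per time step.
  for (size_t t = 0; t < lstm.h.size(); ++t)
    std::cerr << "t=" << t
              << " |h|^2 = " << sq_norm(dynet::as_vector(lstm.h[t].back().value()))
              << " |c|^2 = " << sq_norm(dynet::as_vector(lstm.c[t].back().value())) << std::endl;
}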
Restricting the class-factored softmax to a simple softmax by just returning cnlp here (https://github.com/clab/dynet/blob/master/dynet/cfsm-builder.cc#L100) works, so yes, I believe that the issue is in the softmax.
The difficult part is that when autobatching is on the behavior is different (in this case it crashes), so there must be something else we are overlooking.
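In other words, the CFSM loss is the sum of a class term and a word-within-class term, and the experiment above keeps only the class term. Schematically (a paraphrase, not a verbatim copy of cfsm-builder.cc; the score/index variable names here are illustrative):

// Inside ClassFactoredSoftmaxBuilder::neg_log_softmax, roughly:
Expression cnlp = pickneglogsoftmax(class_scores, cluster_idx);   // -log p(class | rep)
Expression wnlp = pickneglogsoftmax(word_scores, word_in_class);  // -log p(word | class, rep)
// The experiment above returns only the class term:
return cnlp;  // instead of: return cnlp + wnlp;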
Graham, I confirm that this works for the CFSM issue. Thanks a lot! I'm closing the issue.
Hi,
We successfully implemented a seq2seq model with auto-batching (on GPU) and it works great. We wanted to improve the speed by reducing the size of the softmax:
When not using auto-batching the code works and behaves as expected; however, when using auto-batching we get a runtime error:
what(): Magnitude of gradient is bad: inf
Thank you, Eli