We also tried to reduce the size of the softmax with sparse updates:

vector<Expression> w(candsInt.size());
for (unsigned i = 0; i < candsInt.size(); ++i)
  w[i] = lookup(*hg, paramp2c, candsInt[i]);
Expression W = concatenate(w);
Expression x = W * v;
Expression adistecand = log_softmax(x);

and we observe the same problem. Thanks in advance.
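For context, here is a minimal sketch of how the restricted distribution above would typically be turned into a training loss; gold_pos (the position of the gold word within candsInt) is a hypothetical name and not part of the original snippet:

// adistecand is a log-distribution over the candidates in candsInt;
// pick the gold candidate's log-probability and negate it to get the loss.
unsigned gold_pos = 0;  // assumed: index of the gold word within candsInt
Expression loss = -pick(adistecand, gold_pos);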
@elikip Thanks for sending this! This strongly indicates that there is a problem somewhere in the lines quoted in this post. One more question: does it take a long time for this problem to appear? Or is it pretty quick?
Thank you for your quick response! Regarding your question, this problem appears immediately when updating the first mini-batch.
We also tried using the ClassFactoredSoftmaxBuilder available in the DyNet framework, and we observe a similar issue with it. Running without auto-batching works (log-likelihood and perplexity go down), while the same code with auto-batching does not converge and crashes after a short while. Thanks again.
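For readers unfamiliar with it, a rough sketch of how ClassFactoredSoftmaxBuilder (dynet/cfsm-builder.h) is typically used; the constructor arguments and variable names below are from memory and may differ between DyNet versions:

#include "dynet/cfsm-builder.h"

// Build the class-factored softmax from a word-to-cluster file
// (exact constructor signature may vary across versions).
ClassFactoredSoftmaxBuilder cfsm(hidden_dim, cluster_file, word_dict, model);

// Per sentence / minibatch:
cfsm.new_graph(cg);
Expression loss = cfsm.neg_log_softmax(hidden_state, word_id);  // -log p(word | hidden_state)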
FYI: I'm working on this and added some tests of autobatched select_rows and log_softmax (here: https://github.com/clab/dynet/pull/598), but neither of them failed in my environment. If you don't mind, could you check out this PR (I'll merge it into master soon, so you can update from master when that happens), run test/test-nodes, and see whether any tests related to autobatching fail? If none do, then I'll know it's not an environment thing.
It would be even better if you could add a test to test-nodes.cc that reproduces your problem, but no problem if you can't do that; I'll be able to try again tomorrow or so.
I am reasonably sure that this might be a problem with batched matrix multiplication. Adding the following test case:
BOOST_AUTO_TEST_CASE( affine_batch4_gradient ) {
  dynet::ComputationGraph cg;
  Expression x2 = reshape(parameter(cg, param_cube1), Dim({3, 3}, 3));
  Expression inp = input(cg, {3}, ones3_vals);
  Expression y = sqrt(x2 * inp);
  Expression z = sum_batches(sum_elems(y));
  BOOST_CHECK(check_grad(mod, z, 0));
}
makes the test fail even without autobatching. The same happens with affine_transform.
My guess is that (batched matrix) × vector products might be broken. The problem in @elikip's code snippet might be that with autobatching, `W * v` becomes batched in both `W` and `v`, hence the error. I am not 100% sure; I'm going to take a look at the code.
Hmm, thanks for the test. I can take a look too, but not for a while, so if you figure out the problem that'd be great!
Hmm, after further testing the error seems to have stemmed from the use of sqrt, so we're back to square one.
Thanks @neubig! And @pmichel31415: after running ./test/test-nodes we get the following in our environment (@elikip and I share the same environment for this):
unknown location(0): fatal error in "contract3d_1d_gradient": std::runtime_error: InnerProduct3D_1D::forward_dev_impl disabled on CUDA. Comment out DYNET_SKIP_CUDA_CONTRACTIONS in nodes-contract.cc to enable this function.
/dynet/tests/test-nodes.cc(546): last checkpoint
unknown location(0): fatal error in "contract3d_1d_1d_gradient": std::runtime_error: InnerProduct3D_1D_1D::forward_dev_impl disabled on CUDA. Comment out DYNET_SKIP_CUDA_CONTRACTIONS in nodes-contract.cc to enable this function.
/dynet/tests/test-nodes.cc(558): last checkpoint
unknown location(0): fatal error in "restricted_log_softmax_gradient": std::runtime_error: RestrictedLogSoftmax not yet implemented for CUDA (contributions welcome!)
/dynet/tests/test-nodes.cc(743): last checkpoint
unknown location(0): fatal error in "sparsemax_gradient": std::runtime_error: Sparsemax not implemented for CUDA
/dynet/tests/test-nodes.cc(780): last checkpoint
unknown location(0): fatal error in "sparsemax_loss_gradient": std::runtime_error: SparsemaxLoss not implemented for CUDA
/dynet/tests/test-nodes.cc(789): last checkpoint
unknown location(0): fatal error in "trace_of_product_gradient": std::runtime_error: TraceOfProduct not yet implemented for CUDA
/dynet/tests/test-nodes.cc(928): last checkpoint
unknown location(0): fatal error in "kmax_pooling_keq1_gradient": std::runtime_error: KMaxPooling::forward_dev_impl not working on CUDA yet
/u/miguelba/cfsm/dynet/tests/test-nodes.cc(1155): last checkpoint
unknown location(0): fatal error in "kmax_pooling_keq2_gradient": std::runtime_error: KMaxPooling::forward_dev_impl not working on CUDA yet
/dynet/tests/test-nodes.cc(1164): last checkpoint
/dynet/tests/test-nodes.cc(1230): error in "conv2d_same_gradient": check check_grad(mod, z, 0) failed
Does this make sense?
Hmm, all these error messages are known issues (mostly things not implemented in CUDA). They are most probably not related to your problem.
OK, I think I was able to reproduce this to some extent, so I'll try to figure out the source of the problem in the next day or so.
@neubig thanks for this. After your commit we pulled the latest version from master and tried our code with CFSM; it crashed in the same way when we used --dynet-autobatch 1 (with --dynet-autobatch 0 it works).
We also tried on CPU, just in case the GPUs were the problem, and it turns out that the code works with --dynet-autobatch 1! So it seems to be a GPU problem(?). Any idea why? Is there something we can do on our end?
Thanks!
I just made a change in #604 that has the potential to fix this problem. Could you try pulling from master one more time and seeing if this fixes the problem?
Same issue ("Magnitude of gradient is bad: nan"), I'm afraid. We use CUDA 8.0 and cuDNN 5.1.
no autobatching, CFSM, CPU: 35 seconds per batch
autobatching, CFSM, CPU: 9 seconds per batch
no autobatching, CFSM, GPU: 7 seconds per batch
autobatching, CFSM, GPU: does not work
OK, I'll take another look tomorrow (Asia time). One thing that would really help, if you have time, is a test case (similar to the ones in test-nodes.cc or test-rnn.cc) that you can confirm breaks on the GPU.
@neubig we are trying to create the test case. The issue is that we managed to run CFSM with autobatching in the DyNet CFSM example code, so there must be something that interacts with our network, which is more complex. It seems that trainer.update() crashes, while the forward(loss) and backward(loss) calls that happen just before it work. Or maybe the gradients become NaN in backward, and then the update crashes.
In any case, I would expect autobatching and no autobatching to have the same behavior, am I right? If I run the same code without the autobatching flag, it works.
The gradient magnitude error is triggered by the update function, but the issue probably arises from the forward/backward computation.
Actually, in order to identify the source of the problem, it might be worth checking the node/gradient values after the forward/backward pass. So the test would look like this: run the backward pass with full=true (so that gradients are computed for all nodes, not just parameters) and check for NaNs; a sketch follows below. Alternatively, you can just look at the expression/gradient norms instead, since you're only looking for NaNs.
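A minimal sketch of that check, assuming cg is the ComputationGraph, loss is the loss Expression, and exprs is a std::vector<Expression> of the nodes you want to inspect (these names are placeholders, and the value()/gradient() accessors may differ slightly between DyNet versions):

#include <cmath>
#include <iostream>
#include <vector>
#include "dynet/dynet.h"
#include "dynet/expr.h"

// Returns true if any element is NaN or inf.
bool has_bad(const std::vector<float>& v) {
  for (float x : v)
    if (std::isnan(x) || std::isinf(x)) return true;
  return false;
}

void check_graph(dynet::ComputationGraph& cg, dynet::Expression& loss,
                 std::vector<dynet::Expression>& exprs) {
  cg.forward(loss);
  cg.backward(loss, /*full=*/true);  // full=true also computes gradients of non-parameter nodes
  for (size_t i = 0; i < exprs.size(); ++i) {
    if (has_bad(dynet::as_vector(exprs[i].value())))
      std::cerr << "bad value in node " << i << std::endl;
    if (has_bad(dynet::as_vector(exprs[i].gradient())))
      std::cerr << "bad gradient in node " << i << std::endl;
  }
}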
Thanks @pmichel31415! How can I print all the expression values? For example, the LSTMs have inner matrices that we cannot access.
In C++ you can access the weights directly as class attributes, although to be fair the documentation is a bit lacking in this regard. For VanillaLSTMBuilder you should check lstm.param_vars, which is a vector of vectors of expressions, the layout of which is explained here. You can also access all internal states with lstm.h and lstm.c, which are vectors with one element per time step (see the sketch at the end of this comment).
My guess is that the problem has something to do with the softmax rather than the LSTM, but you might as well check everything.
Also, there is a more systematic way of checking every node (resp. parameter) by accessing them through the computation graph (resp. model) directly, but then you'll lose the information about which node is which.
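Here is a sketch of dumping those internals after a forward pass, assuming a VanillaLSTMBuilder named lstm; the exact layout of param_vars, h, and c may differ between DyNet versions (h[t] and c[t] are assumed to hold one Expression per layer):

#include <iostream>
#include <vector>
#include "dynet/lstm.h"

// Squared L2 norm of a flattened tensor, used to spot NaN/inf blow-ups.
double sq_norm(const std::vector<float>& v) {
  double s = 0;
  for (float x : v) s += double(x) * x;
  return s;
}

void dump_lstm(dynet::VanillaLSTMBuilder& lstm) {
  // Per-layer weight matrices and biases.
  for (size_t layer = 0; layer < lstm.param_vars.size(); ++layer)
    for (size_t j = 0; j < lstm.param_vars[layer].size(); ++j)
      std::cerr << "layer " << layer << " param " << j << " |W|^2 = "
                << sq_norm(dynet::as_vector(lstm.param_vars[layer][j].value())) << std::endl;
  // Top-layer hidden and cell states, one entry per time step.
  for (size_t t = 0; t < lstm.h.size(); ++t)
    std::cerr << "t=" << t
              << " |h|^2 = " << sq_norm(dynet::as_vector(lstm.h[t].back().value()))
              << " |c|^2 = " << sq_norm(dynet::as_vector(lstm.c[t].back().value())) << std::endl;
}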
Restricting the class-factored softmax to a simple softmax by just returning cnlp here (https://github.com/clab/dynet/blob/master/dynet/cfsm-builder.cc#L100) works, so yes, I believe that the issue is in the softmax.
The difficult part is that when autobatching is on the behavior is different (in this case it crashes), so there must be something else we are overlooking.
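In other words, the CFSM loss is the sum of a class term and a word-within-class term, and the experiment above keeps only the class term. Schematically (a paraphrase, not a verbatim copy of cfsm-builder.cc; the score/index variable names here are illustrative):

// Inside ClassFactoredSoftmaxBuilder::neg_log_softmax, roughly:
Expression cnlp = pickneglogsoftmax(class_scores, cluster_idx);   // -log p(class | rep)
Expression wnlp = pickneglogsoftmax(word_scores, word_in_class);  // -log p(word | class, rep)
// The experiment above returns only the class term:
return cnlp;  // instead of: return cnlp + wnlp;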
Graham, I confirm that this works for the CFSM issue. Thanks a lot! I'm closing the issue.
Hi,
We successfully implemented a seq2seq model with auto-batching (on GPU) and it works great. We wanted to improve the speed by reducing the size of the softmax:
When not using auto-batching the code works and behaves as expected; however, when using auto-batching we get a runtime error:
what(): Magnitude of gradient is bad: inf
Thank you, Eli