@khui Not sure what's going on, but I can see you have amazonei_mxnet_p36 enabled there. The EI-based environment is specifically for Elastic Inference, and the pip package installed there is not built for GPU. You should use the GPU-based environment instead, which is mxnet_p36.
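On a standard DLAMI, switching to the GPU environment is roughly the following (a sketch; the version check is only there to confirm the right build is picked up):

source activate mxnet_p36
# confirm the GPU build of MXNet loads and can allocate an array on the GPU
python -c "import mxnet as mx; print(mx.__version__); print(mx.nd.ones((2, 2), ctx=mx.gpu(0)))"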
Thanks @lanking520 for the answer!
I am using Docker, and the MXNet package being used is mxnet-cu92mkl. Do you mean I should instead use mxnet_p36? The conda env from which the Docker container was launched is amazonei_mxnet_p36; however, as mentioned, the bugs happen inside a Docker container, so I am not sure whether amazonei_mxnet_p36 is relevant anymore. But I will try mxnet_p36.
In addition, I tried using the naive engine by setting MXNET_ENGINE_TYPE=NaiveEngine, and got the following, more specific errors:
Traceback (most recent call last):
File "/workdir/code/src/project_main.py", line 154, in <module>
main(args)
File "/workdir/code/src/project_main.py", line 138, in main
do_offline_evaluation=args.do_offline_evaluation)
File "/workdir/code/src/project/estimator/train_pred_eval.py", line 131, in train
ctx=ctx)
File "/workdir/code/src/project/estimator/train_pred_eval.py", line 343, in model_fn
bn_start_logit, bn_end_logit = model(bn_question_tokens, bn_context_tokens)
File "/usr/local/lib/python3.6/dist-packages/mxnet/gluon/block.py", line 540, in __call__
out = self.forward(*args)
File "/workdir/code/src/project/estimator/models/model.py", line 90, in forward
att_f_q, att_f_c = self.model.forward(bn_questions, bn_contexts)
File "/workdir/code/src/project/estimator/models/model.py", line 212, in forward
f_q = self.bilstm_q(em_q)
File "/usr/local/lib/python3.6/dist-packages/mxnet/gluon/block.py", line 540, in __call__
out = self.forward(*args)
File "/usr/local/lib/python3.6/dist-packages/mxnet/gluon/block.py", line 917, in forward
return self.hybrid_forward(ndarray, x, *args, **params)
File "/usr/local/lib/python3.6/dist-packages/mxnet/gluon/rnn/rnn_layer.py", line 234, in hybrid_forward
out = self._forward_kernel(F, inputs, states, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/mxnet/gluon/rnn/rnn_layer.py", line 265, in _forward_kernel
lstm_state_clip_nan=self._lstm_state_clip_nan)
File "<string>", line 145, in RNN
File "/usr/local/lib/python3.6/dist-packages/mxnet/_ctypes/ndarray.py", line 92, in _imperative_invoke
ctypes.byref(out_stypes)))
File "/usr/local/lib/python3.6/dist-packages/mxnet/base.py", line 252, in check_call
raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [09:32:14] src/operator/./cudnn_rnn-inl.h:710: Check failed: e == CUDNN_STATUS_SUCCESS (8 vs. 0) cuDNN: CUDNN_STATUS_EXECUTION_FAILED
Stack trace returned 10 entries:
[bt] (0) /usr/local/lib/python3.6/dist-packages/mxnet/libmxnet.so(+0x3d9c92) [0x7f6a25da7c92]
[bt] (1) /usr/local/lib/python3.6/dist-packages/mxnet/libmxnet.so(+0x3da268) [0x7f6a25da8268]
[bt] (2) /usr/local/lib/python3.6/dist-packages/mxnet/libmxnet.so(+0x5c50cb4) [0x7f6a2b61ecb4]
[bt] (3) /usr/local/lib/python3.6/dist-packages/mxnet/libmxnet.so(+0x5c52af6) [0x7f6a2b620af6]
[bt] (4) /usr/local/lib/python3.6/dist-packages/mxnet/libmxnet.so(+0x33f2924) [0x7f6a28dc0924]
[bt] (5) /usr/local/lib/python3.6/dist-packages/mxnet/libmxnet.so(mxnet::imperative::PushOperator(mxnet::OpStatePtr const&, nnvm::Op const*, nnvm::NodeAttrs const&, mxnet::Context const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::Resource, std::allocator<mxnet::Resource> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<unsigned int, std::allocator<unsigned int> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, mxnet::DispatchMode)::{lambda(mxnet::RunContext, mxnet::engine::CallbackOnComplete)#3}::operator()(mxnet::RunContext, mxnet::engine::CallbackOnComplete) const+0x361) [0x7f6a28b9d791]
[bt] (6) /usr/local/lib/python3.6/dist-packages/mxnet/libmxnet.so(std::_Function_handler<void (mxnet::RunContext), mxnet::imperative::PushOperator(mxnet::OpStatePtr const&, nnvm::Op const*, nnvm::NodeAttrs const&, mxnet::Context const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::Resource, std::allocator<mxnet::Resource> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<unsigned int, std::allocator<unsigned int> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, mxnet::DispatchMode)::{lambda(mxnet::RunContext)#4}>::_M_invoke(std::_Any_data const&, mxnet::RunContext)+0x26) [0x7f6a28b9dde6]
[bt] (7) /usr/local/lib/python3.6/dist-packages/mxnet/libmxnet.so(+0x3121ef3) [0x7f6a28aefef3]
[bt] (8) /usr/local/lib/python3.6/dist-packages/mxnet/libmxnet.so(+0x3125ae5) [0x7f6a28af3ae5]
[bt] (9) /usr/local/lib/python3.6/dist-packages/mxnet/libmxnet.so(+0x31246d9) [0x7f6a28af26d9]
[09:32:14] src/engine/naive_engine.cc:69: Engine shutdown
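For reference, the naive engine above was selected simply by exporting the environment variable before launching the training script (the exact script arguments are omitted here):

export MXNET_ENGINE_TYPE=NaiveEngine   # serial execution, so the stack trace points closer to the failing operator
python /workdir/code/src/project_main.py ...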
As a note, when reproducing the errors in a Jupyter notebook, I got the following errors when trying to print out the loss and compute its mean (after hitting the errors described earlier).
@lanking520 Could you help check the following error messages? Please let me know if they give you any hints. Thanks!!
The loss is:
[ 2.8901496 8.305076 13.280055 4.652643 9.613869 4.837726
5.949163 4.6820254 7.0052347 9.829151 6.4464464 5.3237095
6.1686893 7.799595 10.966969 5.2151794 5.0370407 6.5768747
8.265556 11.412268 6.8640356 5.128555 5.1864567 6.8858347
6.894717 2.467805 8.098482 5.589046 6.557484 11.86685
4.3043194 5.3515797 6.1470346 8.024975 3.422638 16.160294
6.2304115 1.178197 2.866407 3.984875 3.7100368 13.471437
7.4196377 8.543673 8.974239 11.460396 7.1255684 7.1223545
5.4278336 10.207495 5.3622923 7.626067 7.2586136 9.395147
4.973973 7.6694055 10.879036 10.221865 5.520145 11.152739
5.0953455 8.80431 4.323547 7.823736 ]
<NDArray 64 @gpu(0)>
loss.mean()
I got:
---------------------------------------------------------------------------
MXNetError Traceback (most recent call last)
~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/IPython/core/formatters.py in __call__(self, obj)
700 type_pprinters=self.type_printers,
701 deferred_pprinters=self.deferred_printers)
--> 702 printer.pretty(obj)
703 printer.flush()
704 return stream.getvalue()
~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/IPython/lib/pretty.py in pretty(self, obj)
381 if cls in self.type_pprinters:
382 # printer registered in self.type_pprinters
--> 383 return self.type_pprinters[cls](obj, self, cycle)
384 else:
385 # deferred printer
~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/IPython/lib/pretty.py in inner(obj, p, cycle)
559 p.text(',')
560 p.breakable()
--> 561 p.pretty(x)
562 if len(obj) == 1 and type(obj) is tuple:
563 # Special case for 1-item tuples.
~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/IPython/lib/pretty.py in pretty(self, obj)
398 if cls is not object \
399 and callable(cls.__dict__.get('__repr__')):
--> 400 return _repr_pprint(obj, self, cycle)
401
402 return _default_pprint(obj, self, cycle)
~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/IPython/lib/pretty.py in _repr_pprint(obj, p, cycle)
693 """A pprint that just redirects to the normal repr function."""
694 # Find newlines and replace them with p.break_()
--> 695 output = repr(obj)
696 for idx,output_line in enumerate(output.splitlines()):
697 if idx:
~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/ndarray/ndarray.py in __repr__(self)
187 """Returns a string representation of the array."""
188 shape_info = 'x'.join(['%d' % x for x in self.shape])
--> 189 return '\n%s\n<%s %s @%s>' % (str(self.asnumpy()),
190 self.__class__.__name__,
191 shape_info, self.context)
~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/ndarray/ndarray.py in asnumpy(self)
1978 self.handle,
1979 data.ctypes.data_as(ctypes.c_void_p),
-> 1980 ctypes.c_size_t(data.size)))
1981 return data
1982
~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/base.py in check_call(ret)
250 """
251 if ret != 0:
--> 252 raise MXNetError(py_str(_LIB.MXGetLastError()))
253
254
MXNetError: [17:08:34] src/nnvm/legacy_op_util.cc:134: Check failed: fwd_init_
Stack trace returned 10 entries:
[bt] (0) /home/ec2-user/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x40123a) [0x7f7dfd0b623a]
[bt] (1) /home/ec2-user/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x401851) [0x7f7dfd0b6851]
[bt] (2) /home/ec2-user/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x2f786d2) [0x7f7dffc2d6d2]
[bt] (3) /home/ec2-user/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/libmxnet.so(mxnet::imperative::PushOperator(mxnet::OpStatePtr const&, nnvm::Op const*, nnvm::NodeAttrs const&, mxnet::Context const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::Resource, std::allocator<mxnet::Resource> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<unsigned int, std::allocator<unsigned int> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, mxnet::DispatchMode)::{lambda(mxnet::RunContext, mxnet::engine::CallbackOnComplete)#3}::operator()(mxnet::RunContext, mxnet::engine::CallbackOnComplete) const+0x2f0) [0x7f7dffa14bd0]
[bt] (4) /home/ec2-user/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/libmxnet.so(std::_Function_handler<void (mxnet::RunContext), mxnet::imperative::PushOperator(mxnet::OpStatePtr const&, nnvm::Op const*, nnvm::NodeAttrs const&, mxnet::Context const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::Resource, std::allocator<mxnet::Resource> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<unsigned int, std::allocator<unsigned int> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, mxnet::DispatchMode)::{lambda(mxnet::RunContext)#4}>::_M_invoke(std::_Any_data const&, mxnet::RunContext)+0x26) [0x7f7dffa15246]
[bt] (5) /home/ec2-user/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x2cb06d7) [0x7f7dff9656d7]
[bt] (6) /home/ec2-user/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x2cb06d7) [0x7f7dff9656d7]
[bt] (7) /home/ec2-user/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x2cb06d7) [0x7f7dff9656d7]
[bt] (8) /home/ec2-user/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x2cb06d7) [0x7f7dff9656d7]
[bt] (9) /home/ec2-user/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x2cb06d7) [0x7f7dff9656d7]
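As far as I understand, MXNet executes operators asynchronously, so the original cuDNN failure in the LSTM kernel may only surface later, when printing the loss or calling mean() triggers asnumpy() and forces synchronization (which would match the second traceback above). A minimal, self-contained sketch of forcing synchronization right after the backward pass, so the failing step is localized (the sizes here are arbitrary, not my real model):

import mxnet as mx
from mxnet import autograd, gluon

ctx = mx.gpu(0)
lstm = gluon.rnn.LSTM(hidden_size=8, num_layers=1, bidirectional=True)
lstm.initialize(ctx=ctx)
x = mx.nd.random.uniform(shape=(10, 4, 16), ctx=ctx)  # (seq_len, batch, features), default 'TNC' layout
with autograd.record():
    out = lstm(x)
    loss = out.sum()
loss.backward()
mx.nd.waitall()  # block until all queued GPU work finishes, so a cuDNN error is raised here rather than at print time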
@khui If you run the code and get the error inside a Docker container, the issue is very likely related to the container only.
When you said "when reproducing the errors in a jupyter notebook", did you run the Jupyter notebook inside Docker, or on the DLAMI?
Also, I am curious which command you used to start the Docker container.
@mirocody Thanks!
The errors appear when I am using the DLAMI. To debug, I ran a Docker container to rule out a mismatch between the mxnet/cuda/cudnn versions. Since that cause seems unlikely after trying different combinations, I switched back to the DLAMI. The container is started with the following command, after which commands are run inside the container as usual:
nvidia-docker run --rm -it --name gpu_run -v /home/ec2-user/workspace/output:/workdir/output mxnet_gpu bash
The Jupyter notebook is run in the mxnet_p36 env, per the suggestion from @lanking520.
@khui In the DLAMI, we support MXNet with cu90, so I am not sure whether the error you got is related to cu92 and cuDNN. I would suggest trying the latest DLAMI and running your code in the conda env; if it still does not work, you can cut a ticket to us, while the community can still look at this issue to see whether it is related to the MXNet framework.
After setting dropout=0 for the gluon.rnn.LSTM(...), the bug no longer appears, even for training runs lasting more than 200 epochs. Closing this issue for now; I will get back to it and track down the underlying reason when I have time.
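For anyone hitting the same thing, the workaround amounts to constructing the layer without the built-in recurrent dropout, roughly like this (the hidden size and layer count are placeholders, not my real hyperparameters):

from mxnet import gluon

# bidirectional LSTM with the fused dropout disabled (dropout=0),
# which avoided the CUDNN_STATUS_EXECUTION_FAILED crash in my case
bilstm_q = gluon.rnn.LSTM(hidden_size=128, num_layers=2, bidirectional=True, dropout=0)

If dropout between layers is still needed, one alternative is to apply gluon.nn.Dropout explicitly to the LSTM outputs instead.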
I got a similar error.
@yuzhoujianxia Thanks for reporting. Would you mind filing a new bug report for your case?
A smaller batch_size saved me.
Description
When using the GPU, the model can be trained for an epoch or so, and then I get Check failed: e == CUDNN_STATUS_SUCCESS (8 vs. 0) cuDNN: CUDNN_STATUS_EXECUTION_FAILED. This error appears repeatedly when training the model on the GPU, but at different times (epoch/batch) and even from different lines. The same model has been successfully trained on the CPU. Any ideas about the possible reasons?
I also posted on the forum. I am not sure whether this is due to my misuse of MXNet or to an issue in MXNet itself. Sorry for the duplicated posts.
Environment info (Required)
What have you tried to solve it?
FROM nvidia/cuda:9.2-cudnn7-devel-ubuntu16.04
RUN pip install mxnet-cu92mkl gluonnlp "numpy<1.15.0,>=1.8.2" scipy matplotlib mxboard
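For completeness, a quick way to sanity-check the cuDNN RNN path inside a container built from this image is sketched below (the mxnet_gpu tag matches the run command earlier in the thread; the assumption is that a python interpreter with this pip environment is available in the image, as the Dockerfile implies):

# build the image, then run a one-off cuDNN LSTM forward pass on the GPU
docker build -t mxnet_gpu .
nvidia-docker run --rm mxnet_gpu python -c "import mxnet as mx; from mxnet import gluon; l = gluon.rnn.LSTM(8); l.initialize(ctx=mx.gpu(0)); print(l(mx.nd.ones((5, 2, 4), ctx=mx.gpu(0))))"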