@khui Not sure what's going on, but I can see you have amazonei_mxnet_p36 enabled there. The EI-based environment is specifically for Elastic Inference, and the pip package installed there is not built for GPU. You should use the GPU-based environment instead, which is mxnet_p36.
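On a standard DLAMI, switching to the GPU environment is roughly the following (a sketch; the version check is only there to confirm the right build is picked up):

source activate mxnet_p36
# confirm the GPU build of MXNet loads and can allocate an array on the GPU
python -c "import mxnet as mx; print(mx.__version__); print(mx.nd.ones((2, 2), ctx=mx.gpu(0)))"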
Thanks @lanking520 for the answer!
I am using Docker, and the MXNet package being used is mxnet-cu92mkl. Do you mean I should instead use mxnet_p36? The conda env from which the Docker container was launched is amazonei_mxnet_p36; however, as mentioned, the bugs happen inside a Docker container, so I am not sure whether amazonei_mxnet_p36 is relevant anymore. But I will try mxnet_p36.
In addition, I tried using the naive engine by setting MXNET_ENGINE_TYPE=NaiveEngine, and got the following, more specific errors:
Traceback (most recent call last):
File "/workdir/code/src/project_main.py", line 154, in <module>
main(args)
File "/workdir/code/src/project_main.py", line 138, in main
do_offline_evaluation=args.do_offline_evaluation)
File "/workdir/code/src/project/estimator/train_pred_eval.py", line 131, in train
ctx=ctx)
File "/workdir/code/src/project/estimator/train_pred_eval.py", line 343, in model_fn
bn_start_logit, bn_end_logit = model(bn_question_tokens, bn_context_tokens)
File "/usr/local/lib/python3.6/dist-packages/mxnet/gluon/block.py", line 540, in __call__
out = self.forward(*args)
File "/workdir/code/src/project/estimator/models/model.py", line 90, in forward
att_f_q, att_f_c = self.model.forward(bn_questions, bn_contexts)
File "/workdir/code/src/project/estimator/models/model.py", line 212, in forward
f_q = self.bilstm_q(em_q)
File "/usr/local/lib/python3.6/dist-packages/mxnet/gluon/block.py", line 540, in __call__
out = self.forward(*args)
File "/usr/local/lib/python3.6/dist-packages/mxnet/gluon/block.py", line 917, in forward
return self.hybrid_forward(ndarray, x, *args, **params)
File "/usr/local/lib/python3.6/dist-packages/mxnet/gluon/rnn/rnn_layer.py", line 234, in hybrid_forward
out = self._forward_kernel(F, inputs, states, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/mxnet/gluon/rnn/rnn_layer.py", line 265, in _forward_kernel
lstm_state_clip_nan=self._lstm_state_clip_nan)
File "<string>", line 145, in RNN
File "/usr/local/lib/python3.6/dist-packages/mxnet/_ctypes/ndarray.py", line 92, in _imperative_invoke
ctypes.byref(out_stypes)))
File "/usr/local/lib/python3.6/dist-packages/mxnet/base.py", line 252, in check_call
raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [09:32:14] src/operator/./cudnn_rnn-inl.h:710: Check failed: e == CUDNN_STATUS_SUCCESS (8 vs. 0) cuDNN: CUDNN_STATUS_EXECUTION_FAILED
Stack trace returned 10 entries:
[bt] (0) /usr/local/lib/python3.6/dist-packages/mxnet/libmxnet.so(+0x3d9c92) [0x7f6a25da7c92]
[bt] (1) /usr/local/lib/python3.6/dist-packages/mxnet/libmxnet.so(+0x3da268) [0x7f6a25da8268]
[bt] (2) /usr/local/lib/python3.6/dist-packages/mxnet/libmxnet.so(+0x5c50cb4) [0x7f6a2b61ecb4]
[bt] (3) /usr/local/lib/python3.6/dist-packages/mxnet/libmxnet.so(+0x5c52af6) [0x7f6a2b620af6]
[bt] (4) /usr/local/lib/python3.6/dist-packages/mxnet/libmxnet.so(+0x33f2924) [0x7f6a28dc0924]
[bt] (5) /usr/local/lib/python3.6/dist-packages/mxnet/libmxnet.so(mxnet::imperative::PushOperator(mxnet::OpStatePtr const&, nnvm::Op const*, nnvm::NodeAttrs const&, mxnet::Context const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::Resource, std::allocator<mxnet::Resource> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<unsigned int, std::allocator<unsigned int> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, mxnet::DispatchMode)::{lambda(mxnet::RunContext, mxnet::engine::CallbackOnComplete)#3}::operator()(mxnet::RunContext, mxnet::engine::CallbackOnComplete) const+0x361) [0x7f6a28b9d791]
[bt] (6) /usr/local/lib/python3.6/dist-packages/mxnet/libmxnet.so(std::_Function_handler<void (mxnet::RunContext), mxnet::imperative::PushOperator(mxnet::OpStatePtr const&, nnvm::Op const*, nnvm::NodeAttrs const&, mxnet::Context const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::Resource, std::allocator<mxnet::Resource> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<unsigned int, std::allocator<unsigned int> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, mxnet::DispatchMode)::{lambda(mxnet::RunContext)#4}>::_M_invoke(std::_Any_data const&, mxnet::RunContext)+0x26) [0x7f6a28b9dde6]
[bt] (7) /usr/local/lib/python3.6/dist-packages/mxnet/libmxnet.so(+0x3121ef3) [0x7f6a28aefef3]
[bt] (8) /usr/local/lib/python3.6/dist-packages/mxnet/libmxnet.so(+0x3125ae5) [0x7f6a28af3ae5]
[bt] (9) /usr/local/lib/python3.6/dist-packages/mxnet/libmxnet.so(+0x31246d9) [0x7f6a28af26d9]
[09:32:14] src/engine/naive_engine.cc:69: Engine shutdown
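For reference, the naive engine above was selected simply by exporting the environment variable before launching the training script (the exact script arguments are omitted here):

export MXNET_ENGINE_TYPE=NaiveEngine   # serial execution, so the stack trace points closer to the failing operator
python /workdir/code/src/project_main.py ...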
As a note, when reproducing the errors in a Jupyter notebook, I got the following errors when trying to print out the loss and compute its mean (after hitting the errors described earlier).
@lanking520 Could you help check the following error messages? Please let me know if they give you any hints. Thanks!!
The loss is:
[ 2.8901496 8.305076 13.280055 4.652643 9.613869 4.837726
5.949163 4.6820254 7.0052347 9.829151 6.4464464 5.3237095
6.1686893 7.799595 10.966969 5.2151794 5.0370407 6.5768747
8.265556 11.412268 6.8640356 5.128555 5.1864567 6.8858347
6.894717 2.467805 8.098482 5.589046 6.557484 11.86685
4.3043194 5.3515797 6.1470346 8.024975 3.422638 16.160294
6.2304115 1.178197 2.866407 3.984875 3.7100368 13.471437
7.4196377 8.543673 8.974239 11.460396 7.1255684 7.1223545
5.4278336 10.207495 5.3622923 7.626067 7.2586136 9.395147
4.973973 7.6694055 10.879036 10.221865 5.520145 11.152739
5.0953455 8.80431 4.323547 7.823736 ]
<NDArray 64 @gpu(0)>
loss.mean()
I got:
---------------------------------------------------------------------------
MXNetError Traceback (most recent call last)
~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/IPython/core/formatters.py in __call__(self, obj)
700 type_pprinters=self.type_printers,
701 deferred_pprinters=self.deferred_printers)
--> 702 printer.pretty(obj)
703 printer.flush()
704 return stream.getvalue()
~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/IPython/lib/pretty.py in pretty(self, obj)
381 if cls in self.type_pprinters:
382 # printer registered in self.type_pprinters
--> 383 return self.type_pprinters[cls](obj, self, cycle)
384 else:
385 # deferred printer
~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/IPython/lib/pretty.py in inner(obj, p, cycle)
559 p.text(',')
560 p.breakable()
--> 561 p.pretty(x)
562 if len(obj) == 1 and type(obj) is tuple:
563 # Special case for 1-item tuples.
~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/IPython/lib/pretty.py in pretty(self, obj)
398 if cls is not object \
399 and callable(cls.__dict__.get('__repr__')):
--> 400 return _repr_pprint(obj, self, cycle)
401
402 return _default_pprint(obj, self, cycle)
~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/IPython/lib/pretty.py in _repr_pprint(obj, p, cycle)
693 """A pprint that just redirects to the normal repr function."""
694 # Find newlines and replace them with p.break_()
--> 695 output = repr(obj)
696 for idx,output_line in enumerate(output.splitlines()):
697 if idx:
~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/ndarray/ndarray.py in __repr__(self)
187 """Returns a string representation of the array."""
188 shape_info = 'x'.join(['%d' % x for x in self.shape])
--> 189 return '\n%s\n<%s %s @%s>' % (str(self.asnumpy()),
190 self.__class__.__name__,
191 shape_info, self.context)
~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/ndarray/ndarray.py in asnumpy(self)
1978 self.handle,
1979 data.ctypes.data_as(ctypes.c_void_p),
-> 1980 ctypes.c_size_t(data.size)))
1981 return data
1982
~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/base.py in check_call(ret)
250 """
251 if ret != 0:
--> 252 raise MXNetError(py_str(_LIB.MXGetLastError()))
253
254
MXNetError: [17:08:34] src/nnvm/legacy_op_util.cc:134: Check failed: fwd_init_
Stack trace returned 10 entries:
[bt] (0) /home/ec2-user/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x40123a) [0x7f7dfd0b623a]
[bt] (1) /home/ec2-user/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x401851) [0x7f7dfd0b6851]
[bt] (2) /home/ec2-user/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x2f786d2) [0x7f7dffc2d6d2]
[bt] (3) /home/ec2-user/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/libmxnet.so(mxnet::imperative::PushOperator(mxnet::OpStatePtr const&, nnvm::Op const*, nnvm::NodeAttrs const&, mxnet::Context const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::Resource, std::allocator<mxnet::Resource> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<unsigned int, std::allocator<unsigned int> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, mxnet::DispatchMode)::{lambda(mxnet::RunContext, mxnet::engine::CallbackOnComplete)#3}::operator()(mxnet::RunContext, mxnet::engine::CallbackOnComplete) const+0x2f0) [0x7f7dffa14bd0]
[bt] (4) /home/ec2-user/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/libmxnet.so(std::_Function_handler<void (mxnet::RunContext), mxnet::imperative::PushOperator(mxnet::OpStatePtr const&, nnvm::Op const*, nnvm::NodeAttrs const&, mxnet::Context const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::Resource, std::allocator<mxnet::Resource> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<unsigned int, std::allocator<unsigned int> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, mxnet::DispatchMode)::{lambda(mxnet::RunContext)#4}>::_M_invoke(std::_Any_data const&, mxnet::RunContext)+0x26) [0x7f7dffa15246]
[bt] (5) /home/ec2-user/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x2cb06d7) [0x7f7dff9656d7]
[bt] (6) /home/ec2-user/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x2cb06d7) [0x7f7dff9656d7]
[bt] (7) /home/ec2-user/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x2cb06d7) [0x7f7dff9656d7]
[bt] (8) /home/ec2-user/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x2cb06d7) [0x7f7dff9656d7]
[bt] (9) /home/ec2-user/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x2cb06d7) [0x7f7dff9656d7]
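As far as I understand, MXNet executes operators asynchronously, so the original cuDNN failure in the LSTM kernel may only surface later, when printing the loss or calling mean() triggers asnumpy() and forces synchronization (which would match the second traceback above). A minimal, self-contained sketch of forcing synchronization right after the backward pass, so the failing step is localized (the sizes here are arbitrary, not my real model):

import mxnet as mx
from mxnet import autograd, gluon

ctx = mx.gpu(0)
lstm = gluon.rnn.LSTM(hidden_size=8, num_layers=1, bidirectional=True)
lstm.initialize(ctx=ctx)
x = mx.nd.random.uniform(shape=(10, 4, 16), ctx=ctx)  # (seq_len, batch, features), default 'TNC' layout
with autograd.record():
    out = lstm(x)
    loss = out.sum()
loss.backward()
mx.nd.waitall()  # block until all queued GPU work finishes, so a cuDNN error is raised here rather than at print time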
@khui If you run the code and get the error inside a Docker container, the issue is very likely related to the container only.
When you said "when reproducing the errors in a jupyter notebook", did you run the Jupyter notebook inside Docker, or on the DLAMI?
Also, I am curious which command you used to start the Docker container.
@mirocody Thanks!
The errors appear when I am using the DLAMI. To debug, I ran a Docker container to rule out a mismatch between the mxnet/cuda/cudnn versions. Since that cause seems unlikely after trying different combinations, I switched back to the DLAMI. The container is started with the following command, after which commands are run inside the container as usual:
nvidia-docker run --rm -it --name gpu_run -v /home/ec2-user/workspace/output:/workdir/output mxnet_gpu bash
The Jupyter notebook is run in the mxnet_p36 env, per the suggestion from @lanking520.
@khui In the DLAMI, we support MXNet with cu90, so I am not sure whether the error you got is related to cu92 and cuDNN. I would suggest trying the latest DLAMI and running your code in the conda env; if it still does not work, you can cut a ticket to us, while the community can still look at this issue to see whether it is related to the MXNet framework.
After setting dropout=0 for the gluon.rnn.LSTM(...), the bug no longer appears, even for training runs lasting more than 200 epochs. Closing this issue for now; I will get back to it and track down the underlying reason when I have time.
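For anyone hitting the same thing, the workaround amounts to constructing the layer without the built-in recurrent dropout, roughly like this (the hidden size and layer count are placeholders, not my real hyperparameters):

from mxnet import gluon

# bidirectional LSTM with the fused dropout disabled (dropout=0),
# which avoided the CUDNN_STATUS_EXECUTION_FAILED crash in my case
bilstm_q = gluon.rnn.LSTM(hidden_size=128, num_layers=2, bidirectional=True, dropout=0)

If dropout between layers is still needed, one alternative is to apply gluon.nn.Dropout explicitly to the LSTM outputs instead.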
I got a similar error.
@yuzhoujianxia Thanks for reporting. Would you mind filing a new bug report for your case?
A smaller batch_size saved me.
Description
When using the GPU, the model can be trained for an epoch or so, and then I get Check failed: e == CUDNN_STATUS_SUCCESS (8 vs. 0) cuDNN: CUDNN_STATUS_EXECUTION_FAILED. This error appears repeatedly when training the model on the GPU, but at different times (epoch/batch) and even from different lines. The same model has been successfully trained on the CPU. Any ideas about the possible reasons?
I also posted on the forum. I am not sure whether this is due to my misuse of MXNet or to an issue in MXNet itself. Sorry for the duplicated posts.
Environment info (Required)
What have you tried to solve it?
FROM nvidia/cuda:9.2-cudnn7-devel-ubuntu16.04
RUN pip install mxnet-cu92mkl gluonnlp "numpy<1.15.0,>=1.8.2" scipy matplotlib mxboard
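For completeness, a quick way to sanity-check the cuDNN RNN path inside a container built from this image is sketched below (the mxnet_gpu tag matches the run command earlier in the thread; the assumption is that a python interpreter with this pip environment is available in the image, as the Dockerfile implies):

# build the image, then run a one-off cuDNN LSTM forward pass on the GPU
docker build -t mxnet_gpu .
nvidia-docker run --rm mxnet_gpu python -c "import mxnet as mx; from mxnet import gluon; l = gluon.rnn.LSTM(8); l.initialize(ctx=mx.gpu(0)); print(l(mx.nd.ones((5, 2, 4), ctx=mx.gpu(0))))"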