apache / mxnet

Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Scala, Go, Javascript and more
https://mxnet.apache.org
Apache License 2.0

GPU tests are unstable #12453

Open lebeg opened 6 years ago

lebeg commented 6 years ago

Description

Multiple CI jobs were failing with CUDA memory problems:

http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-10921/23/pipeline/

http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/master/1550/pipeline/

Message

Check failed: (err) == (cudaSuccess) Name: mxnet_generic_kernel ErrStr:an illegal memory access was encountered
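For context on why the kernel named in this message is often not the real culprit: MXNet enqueues GPU work asynchronously, so an illegal memory access triggered by one operator frequently gets reported at a later, unrelated kernel launch or at a synchronization point. A minimal sketch, assuming a single available GPU (the ndarray calls below are illustrative, not taken from the failing tests):

```python
# Minimal sketch (assumes mx.gpu(0) is available): MXNet executes GPU kernels
# asynchronously, so a CUDA error raised by one operator may only surface at a
# later kernel launch or at an explicit synchronization point such as waitall().
import mxnet as mx

a = mx.nd.ones((2, 2), ctx=mx.gpu(0))
b = a * 2          # enqueued asynchronously; any CUDA error is deferred
mx.nd.waitall()    # deferred errors from earlier kernels surface here
```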

Log with context

test_operator_gpu.test_countsketch ... [INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=104987558 to reproduce.
ERROR
test_operator_gpu.test_sparse_nd_basic ... [INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=2134146737 to reproduce.
ERROR
test_operator_gpu.test_exc_multiple_waits ... ok
test_operator_gpu.test_lstm_bidirectional ... [INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=200476953 to reproduce.
ERROR
test_operator_gpu.test_sparse_nd_setitem ... [INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=2082345391 to reproduce.
ERROR
test_operator_gpu.test_exc_post_fail ... ok
test_operator_gpu.test_gru_sym ... [INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1532640391 to reproduce.
ERROR
test_operator_gpu.test_exc_mutable_var_fail ... ok
test_operator_gpu.test_sparse_nd_slice ... [INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1828661033 to reproduce.
ERROR
test_operator_gpu.test_ndarray_elementwise ... [INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1460065938 to reproduce.
ERROR
test_operator_gpu.test_gru_bidirectional ... [INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=16762643 to reproduce.
ERROR
test_operator_gpu.test_ndarray_elementwisesum ... [06:59:47] src/operator/tensor/./.././../common/../operator/mxnet_op.h:622: Check failed: (err) == (cudaSuccess) Name: mxnet_generic_kernel ErrStr:an illegal memory access was encountered
/work/runtime_functions.sh: line 639:     8 Aborted                 (core dumped) nosetests-2.7 $NOSE_COVERAGE_ARGUMENTS --with-xunit --xunit-file nosetests_gpu.xml --verbose tests/python/gpu
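The [INFO] lines above record the per-test seeds, so a single failing test can in principle be replayed in isolation instead of re-running the whole GPU suite. A minimal sketch, assuming the test harness honors MXNET_TEST_SEED as the log suggests (the test name and seed below are copied from the log above):

```python
# Hedged reproduction sketch: re-run one failing GPU test with the seed from
# the log. Assumes nosetests-2.7 and the MXNet source tree are present locally
# and that the harness reads MXNET_TEST_SEED, as the [INFO] lines indicate.
import os
import subprocess

env = dict(os.environ, MXNET_TEST_SEED="104987558")  # seed reported for test_countsketch
subprocess.run(
    ["nosetests-2.7", "--verbose",
     "tests/python/gpu/test_operator_gpu.py:test_countsketch"],
    env=env,
    check=False,
)
```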
vrakesh commented 6 years ago

@lebeg Thanks for reporting this. @mxnet-label-bot [Build, Breaking, Test]

aaronmarkham commented 6 years ago

This failed on my PR: http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-12540/1/pipeline

larroy commented 6 years ago

This is failing again on a p3.2xlarge GPU instance.

time ci/build.py --docker-registry mxnetci --platform ubuntu_build_cuda --docker-build-retries 3 --shm-size 500m /work/runtime_functions.sh build_ubuntu_gpu_mkldnn && time ci/build.py --docker-registry mxnetci --nvidiadocker --platform ubuntu_gpu --docker-build-retries 3 --shm-size 500m /work/runtime_functions.sh unittest_ubuntu_python3_gpu

ERROR
test_operator_gpu.test_ndarray_equal ... [INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1470664036 to reproduce.
ERROR
test_operator_gpu.test_size_array ... [INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1016858059 to reproduce.
ERROR
test invalid sparse operator will throw a exception ... ok
test_operator_gpu.test_ndarray_not_equal ... ok
test_operator_gpu.test_nadam ... [INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=440311246 to reproduce.
ERROR
test check_format for sparse ndarray ... [13:03:09] src/operator/tensor/./.././../common/../operator/mxnet_op.h:649: Check failed: (err) == (cudaSuccess) Name: mxnet_generic_kernel ErrStr:too many resources requested for launch
/work/runtime_functions.sh: line 722: 8 Aborted (core dumped) nosetests-3.4 $NOSE_COVERAGE_ARGUMENTS --with-xunit --xunit-file nosetests_gpu.xml --verbose tests/python/gpu
build.py: 2018-10-16 13:03:10,500 Waiting for status of container dd18847ed3fd for 600 s.
build.py: 2018-10-16 13:03:10,644 Container exit status: {'StatusCode': 134, 'Error': None}
build.py: 2018-10-16 13:03:10,644 Stopping container: dd18847ed3fd
build.py: 2018-10-16 13:03:10,646 Removing container: dd18847ed3fd
build.py: 2018-10-16 13:03:10,716 Execution of ['/work/runtime_functions.sh', 'unittest_ubuntu_python3_gpu'] failed with status: 134

ChaiBapchya commented 5 years ago

CI failed with a similar error: http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-12749/67/pipeline/1111

FAIL
test_operator_gpu.test_ndarray_lesser ... [08:27:30] src/operator/tensor/./.././../common/../operator/mxnet_op.h:649: Check failed: (err) == (cudaSuccess) Name: mxnet_generic_kernel ErrStr:an illegal memory access was encountered
/work/runtime_functions.sh: line 718:     8 Aborted                 (core dumped) nosetests-3.4 $NOSE_COVERAGE_ARGUMENTS --with-xunit --xunit-file nosetests_gpu.xml --verbose tests/python/gpu
larroy commented 5 years ago

Can we close this for now?

lebeg commented 5 years ago

Seems it is still failing from time to time, right?

larroy commented 5 years ago

Can we close this? @szha

jzhou316 commented 5 years ago

I had the same problem in some of my NMT experiments running on multiple GPUs on p3.2xlarge. It ran sometimes but failed other times, and the error was not consistent in where it occurred or what messages it displayed. I tested every part of my code without finding any problems. It could be my fault, but is it possible that the issue is with MXNet?

Some of the error messages:

[18:03:47] src/operator/tensor/./.././../common/../operator/mxnet_op.h:680: Check failed: (err) == (cudaSuccess) Name: mxnet_generic_kernel ErrStr:an illegal memory access was encountered
[18:15:03] src/resource.cc:313: Ignore CUDA Error [18:15:03] src/common/random_generator.cu:70: Check failed: e == cudaSuccess || e == cudaErrorCudartUnloading CUDA: an illegal memory access was encountered
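When the failure moves around like this, one common way to localize it is to force synchronous execution so the first offending operator raises the error itself. A minimal sketch, assuming the MXNET_ENGINE_TYPE and CUDA_LAUNCH_BLOCKING environment variables behave as documented (the small ndarray op is only a placeholder for the real NMT training step):

```python
# Debugging sketch: force synchronous execution so the illegal memory access is
# reported at the operator that caused it, not at a later unrelated kernel.
# Assumes MXNET_ENGINE_TYPE=NaiveEngine and CUDA_LAUNCH_BLOCKING=1 behave as
# documented; both must be set before importing mxnet.
import os
os.environ["MXNET_ENGINE_TYPE"] = "NaiveEngine"   # serialize MXNet's execution engine
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"          # make CUDA kernel launches blocking

import mxnet as mx

x = mx.nd.ones((2, 2), ctx=mx.gpu(0))             # placeholder for the real workload
(x + 1).wait_to_read()                            # errors now surface at this exact op
```

Running the reproducer under cuda-memcheck can also help pinpoint the faulting kernel, at the cost of a large slowdown.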
larroy commented 5 years ago

@jzhou316 Thanks for pointing this out. Could you give more info about the environment in which this happened? Is it running on EC2? How difficult do you think it is to reproduce? Is there a way to reproduce it every time?

Thanks.
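As a hedged pointer on what environment details usually help triage, a short Python snippet along these lines collects most of them (the exact fields are illustrative, not an official template):

```python
# Illustrative environment report: MXNet version, Python/platform info, and the
# number of visible GPUs. Assumes a reasonably recent MXNet where
# mx.context.num_gpus() is available.
import platform
import mxnet as mx

print("mxnet   :", mx.__version__)
print("python  :", platform.python_version())
print("platform:", platform.platform())
print("num_gpus:", mx.context.num_gpus())
```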