lebeg opened this issue 6 years ago
@lebeg Thanks for reporting this. @mxnet-label-bot [Build, Breaking, Test]
This is failing again on a p3.2xlarge GPU instance.
time ci/build.py --docker-registry mxnetci --platform ubuntu_build_cuda --docker-build-retries 3 --shm-size 500m /work/runtime_functions.sh build_ubuntu_gpu_mkldnn && time ci/build.py --docker-registry mxnetci --nvidiadocker --platform ubuntu_gpu --docker-build-retries 3 --shm-size 500m /work/runtime_functions.sh unittest_ubuntu_python3_gpu
ERROR
test_operator_gpu.test_ndarray_equal ... [INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1470664036 to reproduce.
ERROR
test_operator_gpu.test_size_array ... [INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1016858059 to reproduce.
ERROR
test invalid sparse operator will throw a exception ... ok
test_operator_gpu.test_ndarray_not_equal ... ok
test_operator_gpu.test_nadam ... [INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=440311246 to reproduce.
ERROR
test check_format for sparse ndarray ... [13:03:09] src/operator/tensor/./.././../common/../operator/mxnet_op.h:649: Check failed: (err) == (cudaSuccess) Name: mxnet_generic_kernel ErrStr:too many resources requested for launch
/work/runtime_functions.sh: line 722: 8 Aborted (core dumped) nosetests-3.4 $NOSE_COVERAGE_ARGUMENTS --with-xunit --xunit-file nosetests_gpu.xml --verbose tests/python/gpu
build.py: 2018-10-16 13:03:10,500 Waiting for status of container dd18847ed3fd for 600 s.
build.py: 2018-10-16 13:03:10,644 Container exit status: {'StatusCode': 134, 'Error': None}
build.py: 2018-10-16 13:03:10,644 Stopping container: dd18847ed3fd
build.py: 2018-10-16 13:03:10,646 Removing container: dd18847ed3fd
build.py: 2018-10-16 13:03:10,716 Execution of ['/work/runtime_functions.sh', 'unittest_ubuntu_python3_gpu'] failed with status: 134
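For reference, a minimal way to retry one of the failing tests with the seed the log reports (a sketch: it assumes an MXNet source checkout inside the same ubuntu_gpu container, nosetests-3.4 on the PATH as in CI, and that test_ndarray_equal lives in tests/python/gpu/test_operator_gpu.py):

# re-run only the test that errored, pinning the seed printed above
MXNET_TEST_SEED=1470664036 nosetests-3.4 --verbose tests/python/gpu/test_operator_gpu.py:test_ndarray_equal

This re-runs a single test deterministically, but it does not recreate the memory pressure of the full suite, so failures that depend on what ran before it may not reproduce this way.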
CI failed with a similar error: http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-12749/67/pipeline/1111
FAIL
test_operator_gpu.test_ndarray_lesser ... [08:27:30] src/operator/tensor/./.././../common/../operator/mxnet_op.h:649: Check failed: (err) == (cudaSuccess) Name: mxnet_generic_kernel ErrStr:an illegal memory access was encountered
/work/runtime_functions.sh: line 718: 8 Aborted (core dumped) nosetests-3.4 $NOSE_COVERAGE_ARGUMENTS --with-xunit --xunit-file nosetests_gpu.xml --verbose tests/python/gpu
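If someone wants to chase the illegal memory access itself, one option is to run the offending test under cuda-memcheck, which usually names the faulting kernel and the bad address (a sketch: it assumes the CUDA toolkit's cuda-memcheck binary is available inside the GPU container and uses the test path from the log):

# cuda-memcheck is slow, so run it on a single test rather than the whole suite
cuda-memcheck nosetests-3.4 --verbose tests/python/gpu/test_operator_gpu.py:test_ndarray_lesser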
Can we close this for now?
Seems it's still failing from time to time, right?
Can we close this? @szha
I had the same problem in some of my NMT experiments running on multiple GPUs on p3.2xlarge. It ran sometimes but failed other times, and the error was not consistent in where it occurred or what message it displayed. I tested every part of my code without finding any problems. It could be my fault, but is it possible that the issue is with MXNet?
Some of the error messages:
[18:03:47] src/operator/tensor/./.././../common/../operator/mxnet_op.h:680: Check failed: (err) == (cudaSuccess) Name: mxnet_generic_kernel ErrStr:an illegal memory access was encountered
[18:15:03] src/resource.cc:313: Ignore CUDA Error [18:15:03] src/common/random_generator.cu:70: Check failed: e == cudaSuccess || e == cudaErrorCudartUnloading CUDA: an illegal memory access was encountered
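A debugging note on this class of error: because the MXNet engine executes GPU operators asynchronously, the Python line where the exception surfaces is often not the operator that actually faulted. Forcing synchronous execution usually makes the failure show up at the real culprit, at the cost of speed (a sketch using the standard MXNET_ENGINE_TYPE and CUDA_LAUNCH_BLOCKING environment variables; train_nmt.py is just a placeholder for the actual training script):

MXNET_ENGINE_TYPE=NaiveEngine CUDA_LAUNCH_BLOCKING=1 python train_nmt.py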
@jzhou316 thanks for pointing this out. Could you give more info about the environment in which this happened? Is it running on EC2? How difficult do you think it is to reproduce? Is there a way to reproduce it every time?
Thanks.
Description
Multiple CI jobs were failing with CUDA memory problems:
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-10921/23/pipeline/
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/master/1550/pipeline/
Message
Log with context