apache / mxnet

Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Scala, Go, Javascript and more
https://mxnet.apache.org
Apache License 2.0

[CI] illegal memory access #15925

Open ChaiBapchya opened 5 years ago

ChaiBapchya commented 5 years ago

Multiple GPU tests fail with an illegal memory access error:

Name: mxnet_generic_kernel ErrStr:an illegal memory access was encountered

PR: #15736
Pipeline: http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-15736/9/pipeline

Excerpt:

test_operator_gpu.test_np_flatten ... [INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1294594193 to reproduce.
ERROR
test_operator_gpu.test_np_linspace ... [22:52:29] src/operator/tensor/./.././../common/../operator/mxnet_op.h:845: Check failed: (err) == (cudaSuccess) Name: mxnet_generic_kernel ErrStr:an illegal memory access was encountered
/work/runtime_functions.sh: line 880:     6 Aborted                 (core dumped) nosetests-2.7 $NOSE_COVERAGE_ARGUMENTS $NOSE_TIMER_ARGUMENTS --with-xunit --xunit-file nosetests_gpu.xml --verbose tests/python/gpu
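
A minimal sketch of how one of these failures might be re-run locally, assuming a GPU build of MXNet and `nosetests` on the PATH. `MXNET_TEST_SEED` and the test name come from the excerpt above. `MXNET_ENGINE_TYPE=NaiveEngine` and `CUDA_LAUNCH_BLOCKING=1` are added on the assumption that serialising execution helps attribute the asynchronous CUDA error to the operator that actually caused it, since the test that reports the failure is not necessarily the one that triggered it. The helper name `build_repro` and the nose test id are illustrative and not part of the MXNet CI tooling.

```python
import os
import shlex


def build_repro(test_id, seed, base_env=None):
    """Return (env, argv) for re-running a single test with a fixed seed.

    Hypothetical helper for illustration only.
    """
    env = dict(os.environ if base_env is None else base_env)
    # Seed reported by the test harness in the CI log.
    env["MXNET_TEST_SEED"] = str(seed)
    # Assumption: run operators synchronously so the failing call is
    # reported where it happens rather than at a later sync point.
    env["MXNET_ENGINE_TYPE"] = "NaiveEngine"
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    argv = ["nosetests", "--verbose", test_id]
    return env, argv


if __name__ == "__main__":
    env, argv = build_repro(
        "tests/python/gpu/test_operator_gpu.py:test_np_flatten",
        1294594193,
        base_env={},
    )
    # Print a shell command line that could be pasted into a terminal.
    prefix = " ".join(f"{k}={shlex.quote(v)}" for k, v in sorted(env.items()))
    print(prefix, " ".join(shlex.quote(a) for a in argv))
```

To actually run the test, the returned values could be passed to `subprocess.run(argv, env=env)` after merging in the current environment (the default when `base_env` is omitted).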

mxnet-label-bot commented 5 years ago

Hey, this is the MXNet Label Bot. Thank you for submitting the issue! I will try and suggest some labels so that the appropriate MXNet community members can help resolve it. Here are my recommended labels: Bug, CI

ChaiBapchya commented 5 years ago

@mxnet-label-bot add [CI, Bug]

ChaiBapchya commented 5 years ago

Another occurrence on the same PR: http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-15736/13/pipeline/316

aaronmarkham commented 5 years ago

Happened here too: http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-16496/1/pipeline/294

wkcn commented 4 years ago

reproduce.
Setting test np/mx/python random seeds, use MXNET_TEST_SEED=976443772 to reproduce.
ERROR
test_operator_gpu.test_np_linalg_slogdet ... [INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1853898693 to reproduce.
Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1853898693 to reproduce.
ERROR
test_operator_gpu.test_np_linalg_svd ... [INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1913742322 to reproduce.
Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1913742322 to reproduce.
ERROR
test_operator_gpu.test_np_linspace ... [22:08:30] src/operator/tensor/./.././../common/../operator/mxnet_op.h:1113: Check failed: (err) == (cudaSuccess) Name: mxnet_generic_kernel ErrStr:an illegal memory access was encountered
/work/runtime_functions.sh: line 1114: 146 Aborted (core dumped) nosetests-3.4 $NOSE_COVERAGE_ARGUMENTS

http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/master/1359/pipeline
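
When a run produces many `ERROR` lines like the ones above, it can help to pull the test names, seeds, and failing kernel names out of the console log automatically. The following is a rough sketch that assumes the log lines keep the format shown in this thread; the function names `extract_seeds` and `extract_failed_kernels` are made up for illustration.

```python
import re

# Matches "<module>.<test> ... [INFO] Setting test np/mx/python random seeds,
# use MXNET_TEST_SEED=<n> to reproduce." as printed in the CI log.
SEED_RE = re.compile(
    r"(test_\w+\.\w+) \.\.\. \[INFO\] Setting test np/mx/python random seeds, "
    r"use MXNET_TEST_SEED=(\d+) to reproduce\."
)

# Matches the kernel name reported by the failed CUDA status check.
KERNEL_RE = re.compile(r"Check failed: \(err\) == \(cudaSuccess\) Name: (\w+) ErrStr:")


def extract_seeds(log_text):
    """Return a list of (test_name, seed) pairs found in the log text."""
    return [(m.group(1), int(m.group(2))) for m in SEED_RE.finditer(log_text)]


def extract_failed_kernels(log_text):
    """Return the kernel names reported by failed cudaSuccess checks."""
    return KERNEL_RE.findall(log_text)


if __name__ == "__main__":
    sample = (
        "test_operator_gpu.test_np_linalg_svd ... [INFO] Setting test "
        "np/mx/python random seeds, use MXNET_TEST_SEED=1913742322 to reproduce.\n"
        "ERROR\n"
    )
    for name, seed in extract_seeds(sample):
        print(name, seed)
```

The regular expressions also work on a log that has been flattened onto a single line, because they do not depend on line boundaries.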

leezu commented 4 years ago

Could the CI issue be related to https://github.com/apache/incubator-mxnet/issues/17713? That one can be reproduced deterministically on a G4 instance.

ChaiBapchya commented 4 years ago

A G4 instance with CUDA 10.0, that is?

leezu commented 4 years ago

Yes

szha commented 4 years ago

http://jenkins.mxnet-ci.amazon-ml.com/blue/rest/organizations/jenkins/pipelines/mxnet-validation/pipelines/windows-gpu/branches/PR-18146/runs/30/nodes/108/steps/154/log/?start=0

szha commented 4 years ago

http://jenkins.mxnet-ci.amazon-ml.com/blue/rest/organizations/jenkins/pipelines/mxnet-validation/pipelines/windows-gpu/branches/PR-18146/runs/33/nodes/109/steps/155/log/?start=0

leezu commented 4 years ago

@ChaiBapchya does this issue still occur on the dev environment with the updated AMI (in particular, with updated drivers)?

Given that the issue in https://github.com/apache/incubator-mxnet/issues/17713 was due to a bug in CUDA, it seems possible that this issue is due to a bug in the driver.

CC @zhreshold
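
To compare runs before and after a driver update, it may be useful to record the GPU model and driver version next to each test result. Below is a small sketch that assumes `nvidia-smi` is installed on the CI host and uses its `--query-gpu` CSV output; the helper names `parse_gpu_rows` and `query_gpus` are hypothetical.

```python
import subprocess

# nvidia-smi query that prints one "name, driver_version" row per GPU.
QUERY = ["nvidia-smi", "--query-gpu=name,driver_version", "--format=csv,noheader"]


def parse_gpu_rows(text):
    """Parse 'name, driver_version' CSV rows into a list of dicts."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # Split on the last comma so a comma in the GPU name does not break parsing.
        name, driver = [field.strip() for field in line.rsplit(",", 1)]
        rows.append({"name": name, "driver_version": driver})
    return rows


def query_gpus():
    """Return GPU name and driver version per device, or [] if unavailable."""
    try:
        result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    except (FileNotFoundError, subprocess.CalledProcessError):
        # nvidia-smi is missing or failed; report no GPUs rather than raising.
        return []
    return parse_gpu_rows(result.stdout)


if __name__ == "__main__":
    for gpu in query_gpus():
        print(gpu["name"], gpu["driver_version"])
```

Logging this once at the start of a CI job would make it easier to tell whether a given failure happened on the old or the updated AMI.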

ChaiBapchya commented 4 years ago

It hasn't occurred so far (15 tests on commits merged into master for the unix-gpu pipeline). I will keep monitoring and report back.

szha commented 4 years ago

http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Fcentos-gpu/detail/PR-18562/8/pipeline