ChaiBapchya opened this issue 5 years ago
Hey, this is the MXNet Label Bot. Thank you for submitting the issue! I will try and suggest some labels so that the appropriate MXNet community members can help resolve it. Here are my recommended labels: Bug, CI
@mxnet-label-bot add [CI, Bug]
```
reproduce. Setting test np/mx/python random seeds, use MXNET_TEST_SEED=976443772 to reproduce.
ERROR
test_operator_gpu.test_np_linalg_slogdet ... [INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1853898693 to reproduce.
Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1853898693 to reproduce.
ERROR
test_operator_gpu.test_np_linalg_svd ... [INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1913742322 to reproduce.
Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1913742322 to reproduce.
ERROR
test_operator_gpu.test_np_linspace ... [22:08:30] src/operator/tensor/./.././../common/../operator/mxnet_op.h:1113: Check failed: (err) == (cudaSuccess) Name: mxnet_generic_kernel ErrStr:an illegal memory access was encountered
/work/runtime_functions.sh: line 1114: 146 Aborted (core dumped) nosetests-3.4 $NOSE_COVERAGE_ARGUMENTS
```
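For reference, a failing test from that list can usually be re-run locally with the seed printed in the log. This is only a sketch: it assumes a GPU build of MXNet, the standard tests/python/gpu layout, and the same nose runner used by the CI script.

```bash
# Hypothetical local reproduction of one of the failing tests, using the
# seed reported in the CI log (MXNET_TEST_SEED pins the np/mx/python RNGs).
export MXNET_TEST_SEED=1853898693
nosetests-3.4 -v tests/python/gpu/test_operator_gpu.py:test_np_linalg_slogdet
```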
Could this CI issue be related to https://github.com/apache/incubator-mxnet/issues/17713? That one can be reproduced deterministically on a G4 instance.
A G4 instance with CUDA 10.0, that is?
Yes
@ChaiBapchya does this issue still occur on the dev environment with the updated AMI (in particular with updated drivers)?
Given that the issue in https://github.com/apache/incubator-mxnet/issues/17713 was due to a bug in CUDA, it seems possible that this issue is due to a bug in the driver (see the version-check sketch below).
CC @zhreshold
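To confirm what the instance is actually running, the driver and CUDA versions can be checked directly; a minimal sketch, assuming standard NVIDIA tooling is present on the AMI:

```bash
# Report the GPU model and driver version (nvidia-smi ships with the driver).
nvidia-smi --query-gpu=name,driver_version --format=csv
# Report the CUDA toolkit version, if the toolkit is installed on the instance.
nvcc --version
```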
It hasn't occurred so far [15 unix-gpu pipeline runs on commits merged into master]. Will keep monitoring and get back.
Multiple gpu tests fail due to illegal memory access
PR - #15736
Pipeline - http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-15736/9/pipeline
Excerpt: