apache / mxnet

Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Scala, Go, Javascript and more
https://mxnet.apache.org
Apache License 2.0

test_operator_gpu.test_embedding_with_type 'an illegal memory access was encountered' #17713

Closed (leezu closed this issue 4 years ago)

leezu commented 4 years ago

Description

The Embedding operator exercised by test_operator_gpu.test_embedding_with_type deterministically triggers an illegal memory access error on a G4 instance.
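
For context, here is a minimal sketch of the operator call the test exercises (illustrative only, not the actual test code; the (5, 5) shape follows the test's arguments):

    import mxnet as mx
    import numpy as np

    N, D = 5, 5  # vocabulary size and embedding dimension, as in the test's (5, 5) call
    data = mx.nd.array(np.random.randint(0, N, size=(4,)), ctx=mx.gpu(0))
    weight = mx.nd.random.uniform(shape=(N, D), ctx=mx.gpu(0))
    out = mx.nd.Embedding(data=data, weight=weight, input_dim=N, output_dim=D)
    out.asnumpy()  # forces a device-to-host copy; the illegal access surfaces on this sync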

Error Message

% nosetests --verbose --stop ../tests/python/gpu/test_operator_gpu.py -m test_embedding_with_type
/home/ubuntu/src/mxnet-master/tests/python/gpu/test_operator_gpu.py:2402: SyntaxWarning: "is" with a literal. Did you mean "=="?
  if req_dict['data'] is 'write':
/home/ubuntu/src/mxnet-master/tests/python/gpu/test_operator_gpu.py:2404: SyntaxWarning: "is" with a literal. Did you mean "=="?
  if req_dict['grid'] is 'write':
/home/ubuntu/src/mxnet-master/tests/python/gpu/../unittest/test_operator.py:3738: SyntaxWarning: "is" with a literal. Did you mean "=="?
  assert_almost_equal(exe.outputs[0], np_out, rtol=1e-2 if dtype is 'float16' else 1e-5, atol=1e-5)
/home/ubuntu/src/mxnet-master/tests/python/gpu/../unittest/test_operator.py:3897: SyntaxWarning: "is" with a literal. Did you mean "=="?
  npy_out = l1norm(in_data, i) if order is 1 else l2norm(in_data, i)
/home/ubuntu/src/mxnet-master/tests/python/gpu/../unittest/test_operator.py:3898: SyntaxWarning: "is" with a literal. Did you mean "=="?
  npy_out_backward = np.sign(in_data) if order is 1 else in_data/npy_out
/home/ubuntu/src/mxnet-master/tests/python/gpu/../unittest/test_operator.py:3914: SyntaxWarning: "is" with a literal. Did you mean "=="?
  npy_out = l1norm(in_data, (i, i+1)) if order is 1 else l2norm(in_data, (i, i+1))
/home/ubuntu/src/mxnet-master/tests/python/gpu/../unittest/test_operator.py:3915: SyntaxWarning: "is" with a literal. Did you mean "=="?
  npy_out_backward = np.sign(in_data) if order is 1 else in_data/npy_out
/home/ubuntu/src/mxnet-master/tests/python/gpu/../unittest/test_numpy_op.py:2293: SyntaxWarning: "is" with a literal. Did you mean "=="?
  if len(func_data) is 4:
/home/ubuntu/src/mxnet-master/tests/python/gpu/../unittest/test_numpy_op.py:2763: SyntaxWarning: "is" with a literal. Did you mean "=="?
  if axis_size is 0:
/home/ubuntu/src/mxnet-master/tests/python/gpu/../unittest/test_numpy_op.py:2775: SyntaxWarning: "is" with a literal. Did you mean "=="?
  sections = 7 if shape[axis] is 0 else shape[axis]
/home/ubuntu/src/mxnet-master/tests/python/gpu/../unittest/test_numpy_op.py:2816: SyntaxWarning: "is" with a literal. Did you mean "=="?
  if axis_size is 0:
/home/ubuntu/src/mxnet-master/tests/python/gpu/../unittest/test_numpy_op.py:2836: SyntaxWarning: "is" with a literal. Did you mean "=="?
  sections = 7 if x.shape[axis] is 0 else random.randint(1,x.shape[axis])
/home/ubuntu/src/mxnet-master/tests/python/gpu/../unittest/test_numpy_op.py:2871: SyntaxWarning: "is" with a literal. Did you mean "=="?
  if axis_size is 0:
/home/ubuntu/src/mxnet-master/tests/python/gpu/../unittest/test_numpy_op.py:2888: SyntaxWarning: "is" with a literal. Did you mean "=="?
  sections = 7 if axis_size is 0 else axis_size
/home/ubuntu/src/mxnet-master/tests/python/gpu/../unittest/test_numpy_op.py:6194: SyntaxWarning: "is" with a literal. Did you mean "=="?
  if axis is -1:
/home/ubuntu/src/mxnet-master/tests/python/gpu/../unittest/test_random.py:579: SyntaxWarning: "is" with a literal. Did you mean "=="?
  if len(x.shape) is 1:
/home/ubuntu/src/mxnet-master/tests/python/gpu/../unittest/test_sparse_ndarray.py:487: SyntaxWarning: "is" with a literal. Did you mean "=="?
  if default_context().device_type is 'gpu':
/home/ubuntu/src/mxnet-master/tests/python/gpu/../unittest/test_sparse_ndarray.py:1043: SyntaxWarning: "is" with a literal. Did you mean "=="?
  if default_context().device_type is 'gpu':
/home/ubuntu/src/mxnet-master/tests/python/gpu/../unittest/test_sparse_operator.py:363: SyntaxWarning: "is" with a literal. Did you mean "=="?
  if ((lhs_stype is 'default' and rhs_stype is 'row_sparse') or
/home/ubuntu/src/mxnet-master/tests/python/gpu/../unittest/test_sparse_operator.py:363: SyntaxWarning: "is" with a literal. Did you mean "=="?
  if ((lhs_stype is 'default' and rhs_stype is 'row_sparse') or
/home/ubuntu/src/mxnet-master/tests/python/gpu/../unittest/test_sparse_operator.py:364: SyntaxWarning: "is" with a literal. Did you mean "=="?
  (lhs_stype is 'default' and rhs_stype is 'csr') or
/home/ubuntu/src/mxnet-master/tests/python/gpu/../unittest/test_sparse_operator.py:364: SyntaxWarning: "is" with a literal. Did you mean "=="?
  (lhs_stype is 'default' and rhs_stype is 'csr') or
/home/ubuntu/src/mxnet-master/tests/python/gpu/../unittest/test_sparse_operator.py:365: SyntaxWarning: "is" with a literal. Did you mean "=="?
  (lhs_stype is 'row_sparse' and rhs_stype is 'row_sparse') and (rhs_density == 0.0)):
/home/ubuntu/src/mxnet-master/tests/python/gpu/../unittest/test_sparse_operator.py:365: SyntaxWarning: "is" with a literal. Did you mean "=="?
  (lhs_stype is 'row_sparse' and rhs_stype is 'row_sparse') and (rhs_density == 0.0)):
/home/ubuntu/src/mxnet-master/tests/python/gpu/../unittest/test_sparse_operator.py:387: SyntaxWarning: "is" with a literal. Did you mean "=="?
  if ((lhs_stype is 'row_sparse' and rhs_stype is 'row_sparse') and (lhs_density == 0.0)):
/home/ubuntu/src/mxnet-master/tests/python/gpu/../unittest/test_sparse_operator.py:387: SyntaxWarning: "is" with a literal. Did you mean "=="?
  if ((lhs_stype is 'row_sparse' and rhs_stype is 'row_sparse') and (lhs_density == 0.0)):
/home/ubuntu/src/mxnet-master/tests/python/gpu/../unittest/test_sparse_operator.py:1272: SyntaxWarning: "is" with a literal. Did you mean "=="?
  if default_context().device_type is 'gpu':
/home/ubuntu/src/mxnet-master/tests/python/gpu/../unittest/test_sparse_operator.py:1797: SyntaxWarning: "is" with a literal. Did you mean "=="?
  if dim is 2:
/home/ubuntu/src/mxnet-master/tests/python/gpu/../unittest/test_sparse_operator.py:1805: SyntaxWarning: "is" with a literal. Did you mean "=="?
  stypes = ([all_stypes[pick_type]] if pick_side is 0 else []) + stypes + ([all_stypes[pick_type]] if pick_side is 1 else [])
/home/ubuntu/src/mxnet-master/tests/python/gpu/../unittest/test_sparse_operator.py:1805: SyntaxWarning: "is" with a literal. Did you mean "=="?
  stypes = ([all_stypes[pick_type]] if pick_side is 0 else []) + stypes + ([all_stypes[pick_type]] if pick_side is 1 else [])
[INFO] Setting module np/mx/python random seeds, use MXNET_MODULE_SEED=94434600 to reproduce.
test_operator_gpu.test_embedding_with_type ... [23:05:48] ../src/base.cc:84: Upgrade advisory: this mxnet has been built against cuDNN lib version 7501, which is older than the oldest version tested by CI (7600).  Set MXNET_CUDNN_LIB_CHECKING=0 to quiet this warning.
[23:05:53] ../src/executor/graph_executor.cc:1981: Subgraph backend MKLDNN is activated.
[23:05:53] ../src/executor/graph_executor.cc:1981: Subgraph backend MKLDNN is activated.
[23:05:53] ../src/executor/graph_executor.cc:1981: Subgraph backend MKLDNN is activated.
[23:05:53] ../src/executor/graph_executor.cc:1981: Subgraph backend MKLDNN is activated.
[23:05:53] ../src/executor/graph_executor.cc:1981: Subgraph backend MKLDNN is activated.
[23:05:53] ../src/executor/graph_executor.cc:1981: Subgraph backend MKLDNN is activated.
[23:05:53] ../src/executor/graph_executor.cc:1981: Subgraph backend MKLDNN is activated.
[23:05:53] ../src/executor/graph_executor.cc:1981: Subgraph backend MKLDNN is activated.
[23:05:53] ../src/executor/graph_executor.cc:1981: Subgraph backend MKLDNN is activated.
[23:05:53] ../src/executor/graph_executor.cc:1981: Subgraph backend MKLDNN is activated.
[23:05:53] ../src/executor/graph_executor.cc:1981: Subgraph backend MKLDNN is activated.
[23:05:53] ../src/executor/graph_executor.cc:1981: Subgraph backend MKLDNN is activated.
[INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=457645343 to reproduce.
ERROR

======================================================================
ERROR: test_operator_gpu.test_embedding_with_type
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/ubuntu/.pyenv/versions/3.8.2/lib/python3.8/site-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/home/ubuntu/src/mxnet-master/tests/python/gpu/../unittest/common.py", line 215, in test_new
    orig_test(*args, **kwargs)
  File "/home/ubuntu/src/mxnet-master/tests/python/gpu/test_operator_gpu.py", line 1639, in test_embedding_with_type
    test_embedding_helper(data_types, weight_types, 5, 5)
  File "/home/ubuntu/src/mxnet-master/tests/python/gpu/test_operator_gpu.py", line 1634, in test_embedding_helper
    check_consistency(sym, ctx_list, grad_req={'embedding_data': 'null','embedding_weight': 'write'},
  File "/home/ubuntu/src/mxnet-master/python/mxnet/test_utils.py", line 1572, in check_consistency
    assert_almost_equal(arr, gtarr, rtol=rtol, atol=atol, equal_nan=equal_nan)
  File "/home/ubuntu/src/mxnet-master/python/mxnet/test_utils.py", line 602, in assert_almost_equal
    if output.asnumpy() == 1:
  File "/home/ubuntu/src/mxnet-master/python/mxnet/ndarray/ndarray.py", line 2558, in asnumpy
    check_call(_LIB.MXNDArraySyncCopyToCPU(
  File "/home/ubuntu/src/mxnet-master/python/mxnet/base.py", line 246, in check_call
    raise get_last_ffi_error()
mxnet.base.MXNetError: Traceback (most recent call last):
  File "../include/mshadow/./stream_gpu-inl.h", line 62
CUDA: Check failed: e == cudaSuccess: an illegal memory access was encountered
-------------------- >> begin captured logging << --------------------
root: INFO: NumPy-shape semantics has been activated in your code. This is required for creating and manipulating scalar and zero-size tensors, which were not supported in MXNet before, as in the official NumPy library. Please DO NOT manually deactivate this semantics while using `mxnet.numpy` and `mxnet.numpy_extension` modules.
common: INFO: Setting module np/mx/python random seeds, use MXNET_MODULE_SEED=94434600 to reproduce.
common: INFO: Setting test np/mx/python random seeds, use MXNET_TEST_SEED=457645343 to reproduce.
--------------------- >> end captured logging << ---------------------

To Reproduce

nosetests --verbose tests/python/gpu/test_operator_gpu.py -m test_embedding_with_type

Steps to reproduce

  1. On an EC2 G4 instance, compile MXNet with CUDA support, for example: CC=clang-9 CXX=clang++-9 cmake -GNinja -DUSE_MKLDNN=1 -DUSE_CUDA=ON .. ; ninja
  2. Run nosetests --verbose --stop ../tests/python/gpu/test_operator_gpu.py -m test_embedding_with_type

Environment

----------Python Info----------
Version      : 3.8.2
Compiler     : GCC 7.4.0
Build        : ('default', 'Feb 27 2020 21:58:06')
Arch         : ('64bit', 'ELF')
------------Pip Info-----------
Version      : 19.2.3
Directory    : /home/ubuntu/.pyenv/versions/3.8.2/lib/python3.8/site-packages/pip
----------MXNet Info-----------
Version      : 1.6.0
Directory    : /home/ubuntu/src/mxnet-master/python/mxnet
Num GPUs     : 1
Hashtag not found. Not installed from pre-built package.
----------System Info----------
Platform     : Linux-4.15.0-1054-aws-x86_64-with-glibc2.27
system       : Linux
node         : ip-172-31-54-76
release      : 4.15.0-1054-aws
version      : #56-Ubuntu SMP Thu Nov 7 16:15:59 UTC 2019
----------Hardware Info----------
machine      : x86_64
processor    : x86_64
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              64
On-line CPU(s) list: 0-63
Thread(s) per core:  2
Core(s) per socket:  16
Socket(s):           2
NUMA node(s):        2
Vendor ID:           GenuineIntel
CPU family:          6
Model:               85
Model name:          Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
Stepping:            7
CPU MHz:             3223.515
BogoMIPS:            5000.00
Hypervisor vendor:   KVM
Virtualization type: full
L1d cache:           32K
L1i cache:           32K
L2 cache:            1024K
L3 cache:            36608K
NUMA node0 CPU(s):   0-15,32-47
NUMA node1 CPU(s):   16-31,48-63
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke avx512_vnni
----------Network Test----------
Setting timeout: 10
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0017 sec, LOAD: 0.5014 sec.
Timing for GluonNLP GitHub: https://github.com/dmlc/gluon-nlp, DNS: 0.0002 sec, LOAD: 0.4109 sec.
Timing for GluonNLP: http://gluon-nlp.mxnet.io, DNS: 0.1224 sec, LOAD: 0.0909 sec.
Timing for D2L: http://d2l.ai, DNS: 0.0020 sec, LOAD: 0.0959 sec.
Timing for D2L (zh-cn): http://zh.d2l.ai, DNS: 0.0077 sec, LOAD: 0.1124 sec.
Timing for FashionMNIST: https://repo.mxnet.io/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.1721 sec, LOAD: 0.4875 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0021 sec, LOAD: 0.0662 sec.
Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.0020 sec, LOAD: 0.0678 sec.
leezu commented 4 years ago

This may be a bug in CUDA 10.0; I can't reproduce it on CUDA 10.1. However, the CUDA 10.1 release notes (https://docs.nvidia.com/cuda/archive/10.1/cuda-toolkit-release-notes/index.html) don't seem to list a related fix, so it may nevertheless be a bug in MXNet.
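
For the record, the locally installed CUDA toolkit version can be confirmed with:

    nvcc --version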

ptrendx commented 4 years ago

@MoisesHer Could you take a look?

MoisesHer commented 4 years ago

Yes, will take a look ASAP

MoisesHer commented 4 years ago

We have investigated this issue and found a bug in the CUDA 10.0 compiler that affects only code generated for the NVIDIA Turing architecture (SM_75). The NVIDIA compiler team confirmed that this bug was fixed in the CUDA 10.1 compiler (and beyond).

We suggest removing the SM_75 architecture when building MXNet with CUDA Toolkit 10.0, while keeping SM_70. Code generated for SM_70 is forward-compatible with Turing, so Turing GPUs can execute it without any problem. This compatibility is documented here: https://docs.nvidia.com/cuda/turing-compatibility-guide/index.html#turing-volta-compatibility
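
A sketch of that workaround, assuming the cmake cache variable is MXNET_CUDA_ARCH as in MXNet's distribution config files (check the config file itself for the exact name):

    # Build with CUDA 10.0 generating only SM_70 device code; Turing (SM_75)
    # GPUs then run the SM_70 binary via forward compatibility.
    cmake -GNinja -DUSE_MKLDNN=1 -DUSE_CUDA=ON -DMXNET_CUDA_ARCH="7.0" ..
    ninja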

leezu commented 4 years ago

@MoisesHer thanks for investigating the issue. Could you adapt https://github.com/apache/incubator-mxnet/blob/master/config/distribution/linux_cu100.cmake#L36 accordingly and add a comment inline?

szha commented 4 years ago

Running this test with the naive engine also consistently produces the illegal memory access, so the asynchronous dependency engine is unlikely to be the cause.
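
For reference, the naive engine is selected with the MXNET_ENGINE environment variable:

    MXNET_ENGINE=NaiveEngine nosetests --verbose tests/python/gpu/test_operator_gpu.py -m test_embedding_with_type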

leezu commented 4 years ago

Closing, as this is a CUDA bug.