Closed: leezu closed this issue 4 years ago
This may be a bug in CUDA 10.0; it can't be reproduced on 10.1. However, https://docs.nvidia.com/cuda/archive/10.1/cuda-toolkit-release-notes/index.html doesn't list a related fix, so it may nevertheless be a bug in MXNet.
@MoisesHer Could you take a look?
Yes, will take a look ASAP
We have investigated this issue and found a bug in the CUDA 10.0 compiler that affects only code generated for the NVIDIA Turing architecture, i.e. SM_75. The NVIDIA compiler team confirmed this bug is fixed in the CUDA 10.1 compiler and later.
We suggest removing the SM_75 architecture when building MXNet with the CUDA 10.0 toolkit, while keeping SM_70. Note that code generated for SM_70 is forward compatible with Turing, so Turing GPUs can execute it without any problem. This compatibility is documented here: https://docs.nvidia.com/cuda/turing-compatibility-guide/index.html#turing-volta-compatibility
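For illustration, the forward compatibility works because the driver can JIT-compile embedded PTX for newer architectures at load time. A hedged sketch of nvcc flags that target SM_70 while staying runnable on Turing (file names are hypothetical):

```sh
# Generate SM_70 SASS plus compute_70 PTX; the driver JIT-compiles the PTX
# for SM_75 (Turing) at load time, avoiding the buggy SM_75 code generation.
nvcc -gencode arch=compute_70,code=sm_70 \
     -gencode arch=compute_70,code=compute_70 \
     -c kernel.cu -o kernel.o
```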
@MoisesHer thanks for investigating the issue. Could you adapt https://github.com/apache/incubator-mxnet/blob/master/config/distribution/linux_cu100.cmake#L36 accordingly and add a comment inline?
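For reference, a hedged sketch of what that change might look like (the variable name and architecture list follow the pattern used in MXNet's config/distribution files, but the file's actual contents may differ):

```cmake
# CUDA 10.0 nvcc miscompiles code for SM_75 (Turing); drop 7.5 here and rely
# on SM_70 forward compatibility (PTX JIT) on Turing GPUs instead.
set(MXNET_CUDA_ARCH "3.0;5.0;6.0;7.0" CACHE STRING "Cuda architectures")
```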
Testing this case with the naive engine consistently produces an illegal memory access.
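For anyone reproducing this, the naive engine executes operators synchronously, so the failure surfaces at the offending call rather than at a later synchronization point. MXNET_ENGINE_TYPE is MXNet's documented environment variable for selecting the engine:

```sh
# Run the failing test with synchronous (naive) execution to localize the error.
MXNET_ENGINE_TYPE=NaiveEngine nosetests --verbose tests/python/gpu/test_operator_gpu.py -m test_embedding_with_type
```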
Closing, as it's a CUDA bug.
Description
The Embedding operator in test_operator_gpu.test_embedding_with_type deterministically triggers an illegal memory access error on a G4 instance.

Error Message
To Reproduce
nosetests --verbose tests/python/gpu/test_operator_gpu.py -m test_embedding_with_type
Steps to reproduce

1. Build: CC=clang-9 CXX=clang++-9 cmake -GNinja -DUSE_MKLDNN=1 -DUSE_CUDA=ON .. ; ninja
2. Run: nosetests --verbose --stop ../tests/python/gpu/test_operator_gpu.py -m test_embedding_with_type
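For context, a minimal standalone sketch of the lookup the test exercises (hypothetical shapes and sizes; this is not the test itself, and it assumes a CUDA-enabled MXNet build on a Turing GPU such as the T4 in a G4 instance):

```python
import mxnet as mx
import numpy as np

ctx = mx.gpu(0)
vocab, dim = 10, 4

# Integer indices into the embedding table, placed on the GPU.
data = mx.nd.array(np.random.randint(0, vocab, (2, 10)), ctx=ctx)
weight = mx.nd.random.normal(shape=(vocab, dim), ctx=ctx)

# The Embedding lookup; with SM_75 code built by CUDA 10.0, this is where
# the illegal memory access was observed.
out = mx.nd.Embedding(data, weight, input_dim=vocab, output_dim=dim)
out.wait_to_read()  # force synchronization so the error surfaces here
print(out.shape)
```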
Environment