apache / mxnet

Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Scala, Go, Javascript and more
https://mxnet.apache.org

cuda memcheck failures with different cuda versions #15273

Open anirudh2290 opened 5 years ago

anirudh2290 commented 5 years ago

This was encountered during work on the PR: https://github.com/apache/incubator-mxnet/pull/15118. This is also related to https://github.com/apache/incubator-mxnet/issues/10988.

There are a lot of cuda-memcheck failures when MXNet is built with CUDA-10.0 that I don't see when it is built with CUDA-9.2.

On CUDA-9.2:

cuda-memcheck nosetests -v tests/python/gpu/test_operator_gpu.py:test_embedding
========= CUDA-MEMCHECK
/usr/local/lib/python2.7/dist-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
Assertion failure at kmp_runtime.cpp(6481): __kmp_team_pool == __null.
OMP: Error #13: Assertion failure at kmp_runtime.cpp(6481).
OMP: Hint: Please submit a bug report with this message, compile and run commands used, and machine configuration info including native compiler and operating system versions. Faster response will be obtained by including all program sources. For information on submitting this issue, please see https://bugs.llvm.org/.
Assertion failure at kmp_runtime.cpp(6481): __kmp_team_pool == __null.
OMP: Error #13: Assertion failure at kmp_runtime.cpp(6481).
OMP: Hint: Please submit a bug report with this message, compile and run commands used, and machine configuration info including native compiler and operating system versions. Faster response will be obtained by including all program sources. For information on submitting this issue, please see https://bugs.llvm.org/.
Assertion failure at kmp_runtime.cpp(6481): __kmp_team_pool == __null.
OMP: Error #13: Assertion failure at kmp_runtime.cpp(6481).
OMP: Hint: Please submit a bug report with this message, compile and run commands used, and machine configuration info including native compiler and operating system versions. Faster response will be obtained by including all program sources. For information on submitting this issue, please see https://bugs.llvm.org/.
Assertion failure at kmp_runtime.cpp(6481): __kmp_team_pool == __null.
OMP: Error #13: Assertion failure at kmp_runtime.cpp(6481).
OMP: Hint: Please submit a bug report with this message, compile and run commands used, and machine configuration info including native compiler and operating system versions. Faster response will be obtained by including all program sources. For information on submitting this issue, please see https://bugs.llvm.org/.
[INFO] Setting module np/mx/python random seeds, use MXNET_MODULE_SEED=809559325 to reproduce.
test_operator_gpu.test_embedding ... ok

----------------------------------------------------------------------
Ran 1 test in 26.204s

OK
========= ERROR SUMMARY: 0 errors
cuda-memcheck nosetests -v tests/python/gpu/test_operator_gpu.py:test_broadcast
========= CUDA-MEMCHECK
/usr/local/lib/python2.7/dist-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
Assertion failure at kmp_runtime.cpp(6481): __kmp_team_pool == __null.
OMP: Error #13: Assertion failure at kmp_runtime.cpp(6481).
OMP: Hint: Please submit a bug report with this message, compile and run commands used, and machine configuration info including native compiler and operating system versions. Faster response will be obtained by including all program sources. For information on submitting this issue, please see https://bugs.llvm.org/.
Assertion failure at kmp_runtime.cpp(6481): __kmp_team_pool == __null.
OMP: Error #13: Assertion failure at kmp_runtime.cpp(6481).
OMP: Hint: Please submit a bug report with this message, compile and run commands used, and machine configuration info including native compiler and operating system versions. Faster response will be obtained by including all program sources. For information on submitting this issue, please see https://bugs.llvm.org/.
Assertion failure at kmp_runtime.cpp(6481): __kmp_team_pool == __null.
OMP: Error #13: Assertion failure at kmp_runtime.cpp(6481).
OMP: Hint: Please submit a bug report with this message, compile and run commands used, and machine configuration info including native compiler and operating system versions. Faster response will be obtained by including all program sources. For information on submitting this issue, please see https://bugs.llvm.org/.
Assertion failure at kmp_runtime.cpp(6481): __kmp_team_pool == __null.
OMP: Error #13: Assertion failure at kmp_runtime.cpp(6481).
OMP: Hint: Please submit a bug report with this message, compile and run commands used, and machine configuration info including native compiler and operating system versions. Faster response will be obtained by including all program sources. For information on submitting this issue, please see https://bugs.llvm.org/.
[INFO] Setting module np/mx/python random seeds, use MXNET_MODULE_SEED=1395797003 to reproduce.
test_operator_gpu.test_broadcast ... ok

----------------------------------------------------------------------
Ran 1 test in 75.909s

OK
========= ERROR SUMMARY: 0 errors
  cuda-memcheck nosetests -v tests/python/gpu/test_operator_gpu.py:test_countsketch
========= CUDA-MEMCHECK
/usr/local/lib/python2.7/dist-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
Assertion failure at kmp_runtime.cpp(6481): __kmp_team_pool == __null.
OMP: Error #13: Assertion failure at kmp_runtime.cpp(6481).
OMP: Hint: Please submit a bug report with this message, compile and run commands used, and machine configuration info including native compiler and operating system versions. Faster response will be obtained by including all program sources. For information on submitting this issue, please see https://bugs.llvm.org/.
Assertion failure at kmp_runtime.cpp(6481): __kmp_team_pool == __null.
OMP: Error #13: Assertion failure at kmp_runtime.cpp(6481).
OMP: Hint: Please submit a bug report with this message, compile and run commands used, and machine configuration info including native compiler and operating system versions. Faster response will be obtained by including all program sources. For information on submitting this issue, please see https://bugs.llvm.org/.
Assertion failure at kmp_runtime.cpp(6481): __kmp_team_pool == __null.
OMP: Error #13: Assertion failure at kmp_runtime.cpp(6481).
OMP: Hint: Please submit a bug report with this message, compile and run commands used, and machine configuration info including native compiler and operating system versions. Faster response will be obtained by including all program sources. For information on submitting this issue, please see https://bugs.llvm.org/.
Assertion failure at kmp_runtime.cpp(6481): __kmp_team_pool == __null.
OMP: Error #13: Assertion failure at kmp_runtime.cpp(6481).
OMP: Hint: Please submit a bug report with this message, compile and run commands used, and machine configuration info including native compiler and operating system versions. Faster response will be obtained by including all program sources. For information on submitting this issue, please see https://bugs.llvm.org/.
[INFO] Setting module np/mx/python random seeds, use MXNET_MODULE_SEED=837487385 to reproduce.
test_operator_gpu.test_countsketch ... ok

----------------------------------------------------------------------
Ran 1 test in 44.046s

OK
========= ERROR SUMMARY: 0 errors

On CUDA-10.0:

/usr/local/lib/python2.7/dist-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
Assertion failure at kmp_runtime.cpp(6481): __kmp_team_pool == __null.
OMP: Error #13: Assertion failure at kmp_runtime.cpp(6481).
OMP: Hint: Please submit a bug report with this message, compile and run commands used, and machine configuration info including native compiler and operating system versions. Faster response will be obtained by including all program sources. For information on submitting this issue, please see https://bugs.llvm.org/.
Assertion failure at kmp_runtime.cpp(6481): __kmp_team_pool == __null.
OMP: Error #13: Assertion failure at kmp_runtime.cpp(6481).
OMP: Hint: Please submit a bug report with this message, compile and run commands used, and machine configuration info including native compiler and operating system versions. Faster response will be obtained by including all program sources. For information on submitting this issue, please see https://bugs.llvm.org/.
Assertion failure at kmp_runtime.cpp(6481): __kmp_team_pool == __null.
OMP: Error #13: Assertion failure at kmp_runtime.cpp(6481).
OMP: Hint: Please submit a bug report with this message, compile and run commands used, and machine configuration info including native compiler and operating system versions. Faster response will be obtained by including all program sources. For information on submitting this issue, please see https://bugs.llvm.org/.
Assertion failure at kmp_runtime.cpp(6481): __kmp_team_pool == __null.
OMP: Error #13: Assertion failure at kmp_runtime.cpp(6481).
OMP: Hint: Please submit a bug report with this message, compile and run commands used, and machine configuration info including native compiler and operating system versions. Faster response will be obtained by including all program sources. For information on submitting this issue, please see https://bugs.llvm.org/.
[INFO] Setting module np/mx/python random seeds, use MXNET_MODULE_SEED=739806880 to reproduce.
test_operator_gpu.test_countsketch ... [INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=2146529091 to reproduce.
ERROR

======================================================================
ERROR: test_operator_gpu.test_countsketch
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/home/ubuntu/experimentals/1.4_release/tests/python/gpu/../unittest/common.py", line 177, in test_new
    orig_test(*args, **kwargs)
  File "/home/ubuntu/experimentals/1.4_release/tests/python/gpu/test_operator_gpu.py", line 95, in test_countsketch
    check_countsketch(in_dim, out_dim, n)
  File "/home/ubuntu/experimentals/1.4_release/tests/python/gpu/test_operator_gpu.py", line 82, in check_countsketch
    check_symbolic_backward(sym, locations, [out_grad], [a], rtol=1e-3, atol=1e-5, ctx=mx.gpu(0))
  File "/home/ubuntu/experimentals/1.4_release/python/mxnet/test_utils.py", line 1191, in check_symbolic_backward
    grads = {k: v.asnumpy() for k, v in args_grad_data.items()}
  File "/home/ubuntu/experimentals/1.4_release/python/mxnet/test_utils.py", line 1191, in <dictcomp>
    grads = {k: v.asnumpy() for k, v in args_grad_data.items()}
  File "/home/ubuntu/experimentals/1.4_release/python/mxnet/ndarray/ndarray.py", line 1996, in asnumpy
    ctypes.c_size_t(data.size)))
  File "/home/ubuntu/experimentals/1.4_release/python/mxnet/base.py", line 253, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
MXNetError: [23:04:54] ../include/mshadow/./stream_gpu-inl.h:62: Check failed: e == cudaSuccess: CUDA: unspecified launch failure
Stack trace:
  [bt] (0) /home/ubuntu/experimentals/1.4_release/python/mxnet/../../build/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x34) [0x7f633d1d1642]
  [bt] (1) /home/ubuntu/experimentals/1.4_release/python/mxnet/../../build/libmxnet.so(mshadow::Stream<mshadow::gpu>::Wait()+0x168) [0x7f633d3489d0]
  [bt] (2) /home/ubuntu/experimentals/1.4_release/python/mxnet/../../build/libmxnet.so(+0x2a763f5) [0x7f633d3993f5]
  [bt] (3) /home/ubuntu/experimentals/1.4_release/python/mxnet/../../build/libmxnet.so(+0x2a7be78) [0x7f633d39ee78]
  [bt] (4) /home/ubuntu/experimentals/1.4_release/python/mxnet/../../build/libmxnet.so(std::function<void (mxnet::RunContext, mxnet::engine::CallbackOnComplete)>::operator()(mxnet::RunContext, mxnet::engine::CallbackOnComplete) const+0x56) [0x7f633d33f322]
  [bt] (5) /home/ubuntu/experimentals/1.4_release/python/mxnet/../../build/libmxnet.so(mxnet::engine::ThreadedEngine::ExecuteOprBlock(mxnet::RunContext, mxnet::engine::OprBlock*)+0x3b1) [0x7f633d355df3]
  [bt] (6) /home/ubuntu/experimentals/1.4_release/python/mxnet/../../build/libmxnet.so(void mxnet::engine::ThreadedEnginePerDevice::GPUWorker<(dmlc::ConcurrentQueueType)0>(mxnet::Context, bool, mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)0>*, std::shared_ptr<dmlc::ManualEvent> const&)+0x231) [0x7f633d35bcef]
  [bt] (7) /home/ubuntu/experimentals/1.4_release/python/mxnet/../../build/libmxnet.so(mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#4}::operator()() const::{lambda(std::shared_ptr<dmlc::ManualEvent>)#1}::operator()(dmlc::ManualEvent) const+0x50) [0x7f633d357990]
  [bt] (8) /home/ubuntu/experimentals/1.4_release/python/mxnet/../../build/libmxnet.so(std::_Function_handler<void (std::shared_ptr<dmlc::ManualEvent>), mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#4}::operator()() const::{lambda(std::shared_ptr<dmlc::ManualEvent>)#1}>::_M_invoke(std::_Any_data const&, std::shared_ptr<dmlc::ManualEvent>&&)+0x5c) [0x7f633d35ec56]

-------------------- >> begin captured logging << --------------------
common: INFO: Setting module np/mx/python random seeds, use MXNET_MODULE_SEED=739806880 to reproduce.
common: INFO: Setting test np/mx/python random seeds, use MXNET_TEST_SEED=2146529091 to reproduce.
--------------------- >> end captured logging << ---------------------

----------------------------------------------------------------------
Ran 1 test in 49.733s

FAILED (errors=1)

cuda-memcheck output: more than 1000 errors, for example:

Program hit cudaErrorInvalidDeviceFunction (error 8) due to "invalid device function" on CUDA API call to cudaFuncSetAttribute.

When I change to CUDA 10.1 these errors go away. Note that I have only observed them when building with make and DEV=1 (in particular the --Werror cross-execution-space-call nvcc flag). I think we should also update the centos7 Docker image to run on CUDA 10.1.
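To make the failure mode concrete, here is a minimal sketch (placeholder kernel, not MXNet code): one common way cudaFuncSetAttribute returns cudaErrorInvalidDeviceFunction is when the loaded binary contains no device code (and no compatible PTX) for the GPU it runs on, for example when the -gencode/-arch flags used at build time do not cover the card's compute capability.

// Minimal sketch with a placeholder kernel; not taken from MXNet.
// If the fatbin has no code or compatible PTX for the running GPU's
// architecture, cudaFuncSetAttribute fails with cudaErrorInvalidDeviceFunction
// ("invalid device function"), which is what cuda-memcheck reports above.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummy_kernel(float *out) { out[threadIdx.x] = 1.0f; }

int main() {
  // The memcheck report points at a cudaFuncSetAttribute call; as an example,
  // request a larger dynamic shared memory limit for the kernel.
  cudaError_t err = cudaFuncSetAttribute(
      reinterpret_cast<const void *>(dummy_kernel),
      cudaFuncAttributeMaxDynamicSharedMemorySize, 48 * 1024);
  if (err != cudaSuccess) {
    // Prints "invalid device function" when the device code does not match
    // the GPU, e.g. due to mismatched -gencode/-arch build flags.
    std::printf("cudaFuncSetAttribute failed: %s\n", cudaGetErrorString(err));
    return 1;
  }
  std::printf("cudaFuncSetAttribute succeeded\n");
  return 0;
}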

EDIT: I still see issues for countsketch on 10.1 when run under cuda-memcheck, but these appear to be addressable issues within the operator itself. This is different from 10.0, where multiple operators are impacted and the failures seem difficult to address.

@marcoabreu @stu1130

mxnet-label-bot commented 5 years ago

Hey, this is the MXNet Label Bot. Thank you for submitting the issue! I will try and suggest some labels so that the appropriate MXNet community members can help resolve it. Here are my recommended labels: Cuda, Installation, Build

ptrendx commented 5 years ago

Do you also see those errors when testing an operator that does not have issues?

anirudh2290 commented 5 years ago

I tested with broadcast, countsketch, and embedding. All ops failed with this error:

Program hit cudaErrorInvalidDeviceFunction (error 8) due to "invalid device function" on CUDA API call to cudaFuncSetAttribute.

On 10.1, broadcast and embedding have no issues; countsketch had an out-of-bounds read that is specific to the operator (see the sketch at the end of this comment).

But the cuda-memcheck "invalid device function" issue happened for all three ops I tested on CUDA 10.0.
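For reference, a minimal sketch of the kind of defect behind an out-of-bounds read report (hypothetical kernel, not the actual countsketch code); cuda-memcheck typically flags it as an invalid __global__ read:

// Hypothetical kernel, not the countsketch operator: a missing bounds check
// like this is what cuda-memcheck reports as an invalid __global__ read.
#include <cuda_runtime.h>

__global__ void copy_no_bounds_check(const float *in, float *out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  // Bug: no "if (i < n)" guard, so surplus threads read past the allocation.
  out[i] = in[i];
}

int main() {
  const int n = 100;
  float *in = nullptr, *out = nullptr;
  cudaMalloc(&in, n * sizeof(float));       // 100 input elements
  cudaMalloc(&out, 128 * sizeof(float));    // output sized for the full launch
  // 128 threads for 100 input elements: threads 100..127 read out of bounds.
  copy_no_bounds_check<<<1, 128>>>(in, out, n);
  cudaDeviceSynchronize();
  cudaFree(in);
  cudaFree(out);
  return 0;
}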

leleamol commented 5 years ago

@mxnet-label-bot add [Cuda]