apache / mxnet

Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Scala, Go, Javascript and more
https://mxnet.apache.org
Apache License 2.0

AMP: an illegal memory access was encountered #18743

Open chengyuz opened 4 years ago

chengyuz commented 4 years ago

Description

I followed the AMP tutorial (https://mxnet.apache.org/api/python/docs/tutorials/performance/backend/amp.html) to enable AMP in my project, but training fails with the following error:

INFO:root:----------------------------------------------------------------------------------------------------
INFO:root:Using AMP
INFO:root:Features in transition 1: 96 -> 96
INFO:root:Features in transition 2: 192 -> 192
INFO:root:Features in transition 3: 448 -> 448
[11:43:40] /media/apache-mxnet-src-1.6.0-incubating/src/io/iter_image_recordio_2.cc:178: ImageRecordIOParser2: ./dataset/imagenet200/rec/train.rec, use 30 threads for decoding..
[11:43:42] /media/apache-mxnet-src-1.6.0-incubating/src/io/iter_image_recordio_2.cc:178: ImageRecordIOParser2: ./dataset/imagenet200/rec/val.rec, use 30 threads for decoding..
[11:44:05] /media/apache-mxnet-src-1.6.0-incubating/src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:97: Running performance tests to find the best convolution algorithm, this can take a while... (set the environment variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
[11:44:10] /media/apache-mxnet-src-1.6.0-incubating/src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:97: Running performance tests to find the best convolution algorithm, this can take a while... (set the environment variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
[11:44:18] /media/apache-mxnet-src-1.6.0-incubating/src/kvstore/././comm.h:744: only 0 out of 2 GPU pairs are enabled direct access. It may affect the performance. You can set MXNET_ENABLE_GPU_P2P=0 to turn it off
[11:44:18] /media/apache-mxnet-src-1.6.0-incubating/src/kvstore/././comm.h:753: ..
[11:44:18] /media/apache-mxnet-src-1.6.0-incubating/src/kvstore/././comm.h:753: ..
Traceback (most recent call last):
  File "scripts/train_imagenet.py", line 807, in <module>
    main()
  File "scripts/train_imagenet.py", line 803, in main
    train(context)
  File "scripts/train_imagenet.py", line 736, in train
    trainer.step(batch_size)
  File "/media/apache-mxnet-src-1.6.0-incubating/python/mxnet/gluon/trainer.py", line 334, in step
    self._allreduce_grads()
  File "/media/apache-mxnet-src-1.6.0-incubating/python/mxnet/gluon/trainer.py", line 364, in _allreduce_grads
    self._kvstore.push(i, param.list_grad(), priority=-i)
  File "/media/apache-mxnet-src-1.6.0-incubating/python/mxnet/kvstore.py", line 234, in push
    self.handle, mx_uint(len(ckeys)), ckeys, cvals, ctypes.c_int(priority)))
  File "/media/apache-mxnet-src-1.6.0-incubating/python/mxnet/base.py", line 255, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [11:44:18] /media/apache-mxnet-src-1.6.0-incubating/src/storage/./pooled_storage_manager.h:164: cudaMalloc failed: an illegal memory access was encountered
Stack trace:
  [bt] (0) /media/apache-mxnet-src-1.6.0-incubating/python/mxnet/../../build/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x43) [0x7f500e8f9493]
  [bt] (1) /media/apache-mxnet-src-1.6.0-incubating/python/mxnet/../../build/libmxnet.so(mxnet::storage::GPUPooledStorageManager::Alloc(mxnet::Storage::Handle)+0x245) [0x7f50113b6775]
  [bt] (2) /media/apache-mxnet-src-1.6.0-incubating/python/mxnet/../../build/libmxnet.so(mxnet::StorageImpl::Alloc(mxnet::Storage::Handle)+0x59) [0x7f50113b8c79]
  [bt] (3) /media/apache-mxnet-src-1.6.0-incubating/python/mxnet/../../build/libmxnet.so(mxnet::NDArray::NDArray(mxnet::TShape const&, mxnet::Context, bool, int)+0x52b) [0x7f500e91272b]
  [bt] (4) /media/apache-mxnet-src-1.6.0-incubating/python/mxnet/../../build/libmxnet.so(mxnet::kvstore::CommDevice::Reduce(int, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, int)+0x277) [0x7f500ebb5eb7]
  [bt] (5) /media/apache-mxnet-src-1.6.0-incubating/python/mxnet/../../build/libmxnet.so(mxnet::kvstore::KVStoreLocal::PushImpl(std::vector<int, std::allocator<int> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, int)+0x11d) [0x7f500ebb9f5d]
  [bt] (6) /media/apache-mxnet-src-1.6.0-incubating/python/mxnet/../../build/libmxnet.so(MXKVStorePush+0x105) [0x7f500e903845]
  [bt] (7) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c) [0x7f504603fdae]
  [bt] (8) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call+0x22f) [0x7f504603f71f]
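The trace ends in cudaMalloc, but CUDA reports errors from earlier asynchronous kernel launches at a later API call, so the allocation itself may not be the real culprit. Below is a minimal sketch, not a confirmed fix, of relaunching the training script with settings that may surface the error closer to its origin. The helper names `debug_env` and `run_training` are hypothetical. `CUDA_LAUNCH_BLOCKING` and `MXNET_ENGINE_TYPE=NaiveEngine` are documented CUDA and MXNet settings respectively, and `MXNET_CUDNN_AUTOTUNE_DEFAULT` is the variable named in the log above.

```python
import os
import subprocess
import sys


def debug_env(base=None):
    """Return a copy of `base` (default: os.environ) with debugging flags set.

    The input mapping is not modified.
    """
    env = dict(os.environ if base is None else base)
    # Make CUDA kernel launches synchronous so errors are raised at the launch site.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    # Run MXNet operators synchronously on a single thread (slow, for debugging only).
    env["MXNET_ENGINE_TYPE"] = "NaiveEngine"
    # Skip the cuDNN convolution autotuning pass mentioned in the log.
    env["MXNET_CUDNN_AUTOTUNE_DEFAULT"] = "0"
    return env


def run_training(script="scripts/train_imagenet.py", extra_args=()):
    """Launch the training script (path taken from the traceback) with debug flags."""
    cmd = [sys.executable, script, *extra_args]
    return subprocess.run(cmd, env=debug_env()).returncode
```

Calling `run_training()` from the project root reruns the same script with these flags. If the failure then moves from `kvstore.push` to a specific operator, that operator is a more likely source of the illegal access.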

Environment

MXNet 1.6.0 (built from source), GTX 2080, Python 3.6.9

szha commented 4 years ago

How do you reproduce the error?