apache / mxnet


CUDA: Check failed: e == cudaSuccess: misaligned address with 3-layer BERT pretraining #19155

Open szhengac opened 3 years ago

szhengac commented 3 years ago

When pretraining a 3-layer BERT model with GluonNLP 0.10 on one p3.24dn instance with 32GB of GPU memory, I received CUDA: Check failed: e == cudaSuccess: misaligned address. With a total batch size of 128, training uses 11GB of GPU memory and no error occurs. But when I slightly increase the total batch size to 176, or double it to 256, I get the error. I have cherry-picked https://github.com/apache/incubator-mxnet/pull/17767.

@sxjscience you may want to try this setting in the numpy version.

szha commented 3 years ago

What's the mxnet version/commit?

szhengac commented 3 years ago

I have tried 5b506000310fd6bc5852bf4e41c0ce03ccc64013 and 1.8.0. I get the error when the vocab size is 52000, and it does not happen when the vocab size is 32000. So I guess the embedding layer may have something to do with it.

sxjscience commented 3 years ago

Let me check it later today. Does it only happen in fp16?

szhengac commented 3 years ago

I only tried fp16.

sxjscience commented 3 years ago

Would you try again with export MXNET_SAFE_ACCUMULATION=0? That way we can check whether it's caused by https://github.com/apache/incubator-mxnet/pull/18385/files.

szhengac commented 3 years ago

5b50600 does not include that PR.

sxjscience commented 3 years ago

So the error still appears on master even when https://github.com/apache/incubator-mxnet/pull/18385/files is included?

szhengac commented 3 years ago

I didn't use 2.0, as it is not compatible with GluonNLP 0.10.

szhengac commented 3 years ago

But 1.8.0 contains #18385.

szhengac commented 3 years ago

When I skip the attention layers and keep only the embedding layer and the final dense layer, I get a Segmentation fault with batch size 256; batch size 128 is fine.

sxjscience commented 3 years ago

@szhengac Usually, this error message is caused by the forced alignment constraint in CUDA. For example, for the two lines linked below, we must ensure that:

https://github.com/apache/incubator-mxnet/blob/5b506000310fd6bc5852bf4e41c0ce03ccc64013/src/operator/tensor/indexing_op-inl.cuh#L306-L307

CHECK_EQ(reinterpret_cast<size_t>(workspace->dptr_) % sizeof(IndexType), 0);
CHECK_EQ(reinterpret_cast<size_t>(workspace->dptr_ + unique_bytes) % sizeof(IndexType), 0);
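
To see why such a check fires, here is a minimal, self-contained C++ sketch (hypothetical names, not MXNet source): a sub-buffer carved out of a byte pool at an offset that is not a multiple of the index type's size ends up misaligned, which is exactly what the checks above would catch.

#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
  using IndexType = int64_t;                 // wide type the kernel reads with
  std::vector<uint8_t> pool(1024);           // stand-in for a pooled workspace
  uint8_t* workspace = pool.data() + 6;      // carved at a 6-byte offset: not 8-byte aligned
  std::size_t addr = reinterpret_cast<std::size_t>(workspace);
  std::printf("addr %% sizeof(IndexType) = %zu\n", addr % sizeof(IndexType));
  // A CHECK_EQ(addr % sizeof(IndexType), 0) here would fail and flag the misalignment.
  return addr % sizeof(IndexType) == 0 ? 0 : 1;
}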

Also, usually this type of error message can be captured by MSHADOW_CUDA_POST_KERNEL_CHECK.

https://github.com/apache/incubator-mxnet/blob/5b506000310fd6bc5852bf4e41c0ce03ccc64013/src/operator/tensor/indexing_op.cu#L804-L809

Would you try adding MSHADOW_CUDA_POST_KERNEL_CHECK like this? It helps us locate CUDA kernel-launch errors.

MSHADOW_CUDA_POST_KERNEL_CHECK(EmbeddingGradKernel);
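
For context, the macro's job is to surface a pending CUDA error immediately after the launch, with the kernel's name attached, instead of letting it show up later at an unrelated call. A rough standalone sketch of that idea (hypothetical POST_KERNEL_CHECK macro, not the mshadow implementation):

#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Hypothetical stand-in for a post-kernel-launch check: peek at the last CUDA
// error and report it together with the kernel's name.
#define POST_KERNEL_CHECK(kernel_name)                                   \
  do {                                                                   \
    cudaError_t err = cudaPeekAtLastError();                             \
    if (err != cudaSuccess) {                                            \
      std::fprintf(stderr, "%s failed: %s\n", #kernel_name,              \
                   cudaGetErrorString(err));                             \
      std::exit(EXIT_FAILURE);                                           \
    }                                                                    \
  } while (0)

int main() {
  // Placed right after a kernel launch, this names the failing kernel.
  POST_KERNEL_CHECK(EmbeddingGradKernel);
  return 0;
}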

sxjscience commented 3 years ago

Also, another check you may do is to ensure that the memory addresses of grad_in.dptr_ and grad_out.dptr_ are aligned with LType: https://github.com/apache/incubator-mxnet/blob/5b506000310fd6bc5852bf4e41c0ce03ccc64013/src/operator/tensor/indexing_op.cu#L804-L809

CHECK_EQ(reinterpret_cast<size_t>(grad_in.dptr_) % sizeof(LType), 0);
CHECK_EQ(reinterpret_cast<size_t>(grad_out.dptr_) % sizeof(LType), 0);
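
If aborting is not desirable while narrowing this down, a hypothetical logging helper (not an MXNet API) can record the alignment of both buffers instead:

#include <cstddef>
#include <cstdio>

// Hypothetical debugging helper: print how two gradient buffers line up
// against the vectorized load type instead of failing hard.
template <typename LType>
void log_alignment(const void* grad_in, const void* grad_out) {
  std::printf("grad_in %% %zu = %zu, grad_out %% %zu = %zu\n",
              sizeof(LType), reinterpret_cast<std::size_t>(grad_in) % sizeof(LType),
              sizeof(LType), reinterpret_cast<std::size_t>(grad_out) % sizeof(LType));
}

int main() {
  int buf[8] = {0};
  log_alignment<double>(buf, buf + 1);  // buf + 1 is 4 bytes past buf, so at most one is 8-byte aligned
  return 0;
}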

ptrendx commented 3 years ago

@szhengac Could you post a larger excerpt from the crash / give repro instructions?

szhengac commented 3 years ago

@ptrendx here is more of the error message:

[1,3]<stderr>:[ip-172-31-3-104:27783] *** Process received signal ***
[1,3]<stderr>:[ip-172-31-3-104:27783] Signal: Segmentation fault (11)
[1,3]<stderr>:[ip-172-31-3-104:27783] Signal code: Address not mapped (1)
[1,3]<stderr>:[ip-172-31-3-104:27783] Failing at address: 0xfffffffffffffffc
[1,3]<stderr>:[ip-172-31-3-104:27783] [ 0] /lib64/libpthread.so.0(+0x117e0)[0x7f16244fc7e0]
[1,3]<stderr>:[ip-172-31-3-104:27783] [ 1]
[1,0]<stderr>:[ip-172-31-3-104:27780] *** Process received signal ***
[1,0]<stderr>:[ip-172-31-3-104:27780] Signal: Segmentation fault (11)
[1,0]<stderr>:[ip-172-31-3-104:27780] Signal code: Address not mapped (1)
[1,0]<stderr>:[ip-172-31-3-104:27780] Failing at address: 0x10e11781c
[1,0]<stderr>:[ip-172-31-3-104:27780] [ 0]
[1,0]<stderr>:/lib64/libpthread.so.0(+0x117e0)[0x7f10182ee7e0]
[1,0]<stderr>:[ip-172-31-3-104:27780] [ 1]
[1,3]<stderr>:/home/ec2-user/mxnet-private/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet7storage30GPUPooledRoundedStorageManager5AllocEPNS_7Storage6HandleE+0x9d)[0x7f15ce34d61d]
[1,3]<stderr>:[ip-172-31-3-104:27783] [ 2]
[1,0]<stderr>:/home/ec2-user/mxnet-private/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet7storage30GPUPooledRoundedStorageManager5AllocEPNS_7Storage6HandleE+0x9d)[0x7f0fc217f61d]
[1,0]<stderr>:[ip-172-31-3-104:27780] [ 2]
[1,3]<stderr>:/home/ec2-user/mxnet-private/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet11StorageImpl5AllocEPNS_7Storage6HandleE+0x4a)[0x7f15ce34fd9a]
[1,3]<stderr>:[ip-172-31-3-104:27783] [ 3]
[1,0]<stderr>:/home/ec2-user/mxnet-private/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet11StorageImpl5AllocEPNS_7Storage6HandleE+0x4a)[0x7f0fc2181d9a]
[1,0]<stderr>:[ip-172-31-3-104:27780] [ 3]
[1,5]<stderr>:[ip-172-31-3-104:27785] *** Process received signal ***

It would take a few steps for you to reproduce it. You need to use gluon-nlp 0.9 or 0.10 and prepare a sentencepiece vocab of size 52000. Then add num_layers=3 to https://github.com/dmlc/gluon-nlp/blob/3fbe9619d9e68bc665f73c8cdf683213c6edd4d6/scripts/bert/pretraining_utils.py#L72 and run a distributed training job with Horovod on a single node.

szhengac commented 3 years ago

I have tried Xingjian's suggestion of adding the checks to the embedding op, but got no message from them.