dmlc / gluon-nlp

NLP made easy
https://nlp.gluon.ai/
Apache License 2.0

Problem with BERT pre-training: how to train on multiple GPUs #1508

Open yangshuo0323 opened 3 years ago

yangshuo0323 commented 3 years ago

Description

image

Seek help:

I have read the guidance, but I still don't know how to run it. Could you help me, or point me to the correct instructions or suggestions? Thanks.

leezu commented 3 years ago

Please provide the complete error message

yangshuo0323 commented 3 years ago

Please provide the complete error message

the whole message:

[1,5]<stderr>:[21:43:10] src/storage/storage.cc:110: Using GPUPooledRoundedStorageManager.
[1,4]<stderr>:[21:43:10] src/storage/storage.cc:110: Using GPUPooledRoundedStorageManager.
[1,7]<stderr>:[21:43:10] src/storage/storage.cc:110: Using GPUPooledRoundedStorageManager.
[1,6]<stderr>:[21:43:10] src/storage/storage.cc:110: Using GPUPooledRoundedStorageManager.
[1,2]<stderr>:[21:43:11] src/storage/storage.cc:110: Using GPUPooledRoundedStorageManager.
[1,1]<stderr>:[21:43:11] src/storage/storage.cc:110: Using GPUPooledRoundedStorageManager.
[1,0]<stderr>:[21:43:11] src/storage/storage.cc:110: Using GPUPooledRoundedStorageManager.
[1,3]<stderr>:[21:43:11] src/storage/storage.cc:[1,3]<stderr>:110: Using GPUPooledRoundedStorageManager.
[1,7]<stderr>:INFO:root:Model created
[1,7]<stderr>:DEBUG:root:Random seed set to 91
[1,7]<stderr>:INFO:root:Begin process dataset......
[1,7]<stderr>:INFO:root:args.num_buckets: 1, num_workers: 8, rank: 7
[1,7]<stderr>:INFO:root:400 files are found.
[1,4]<stderr>:INFO:root:Model created
[1,4]<stderr>:DEBUG:root:Random seed set to 580
[1,4]<stderr>:INFO:root:Begin process dataset......
[1,4]<stderr>:INFO:root:args.num_buckets: 1, num_workers: 8, rank: 4
[1,4]<stderr>:INFO:root:400 files are found.
[1,6]<stderr>:INFO:root:Model created
[1,6]<stderr>:DEBUG:root:Random seed set to 555
[1,6]<stderr>:INFO:root:Begin process dataset......
[1,6]<stderr>:INFO:root:args.num_buckets: 1, num_workers: 8, rank: 6
[1,6]<stderr>:INFO:root:400 files are found.
[1,5]<stderr>:INFO:root:Model created
[1,5]<stderr>:DEBUG:root:Random seed set to 185
[1,5]<stderr>:INFO:root:Begin process dataset......
[1,5]<stderr>:INFO:root:args.num_buckets: 1, num_workers: 8, rank: 5
[1,5]<stderr>:INFO:root:400 files are found.
[1,7]<stderr>:[node106:26504:0:26504] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x30)
[1,7]<stderr>:==== backtrace ====
[1,7]<stderr>:    0  /usr/lib/libucs.so.0(+0x1fcec) [0x7f5c21681cec]
[1,7]<stderr>:    1  /usr/lib/libucs.so.0(+0x1ff64) [0x7f5c21681f64]
[1,7]<stderr>:    2  /lib/x86_64-linux-gnu/libpthread.so.0(pthread_mutex_lock+0x4) [0x7f5e1fe55d44]
[1,7]<stderr>:    3  /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/mxnet/libmxnet.so(_ZN5mxnet6engine11ThreadedVar21AppendWriteDependencyEPNS0_8OprBlockE+0x44) [0x7f5dc2100564]
[1,7]<stderr>:    4  /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/mxnet/libmxnet.so(_ZN5mxnet6engine14ThreadedEngine4PushEPNS0_3OprENS_7ContextEib+0x280) [0x7f5dc2103790]
[1,7]<stderr>:    5  /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/mxnet/libmxnet.so(_ZN5mxnet6engine14ThreadedEngine9PushAsyncESt8functionIFvNS_10RunContextENS0_18CallbackOnCompleteEEENS_7ContextERKSt6vectorIPNS0_3VarESaISA_EESE_NS_10FnPropertyEiPKcb+0x131) [0x7f5dc20fbed1]
[1,7]<stderr>:    6  /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/mxnet/libmxnet.so(_ZN5mxnet10CopyFromToERKNS_7NDArrayES2_ib+0xaf4) [0x7f5dc20d69d4]
[1,7]<stderr>:    7  /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/horovod/mxnet/mpi_lib.cpython-37m-x86_64-linux-gnu.so(_ZN7horovod5mxnet29PushHorovodOperationCudaOnCPUENS_6common7Request11RequestTypeEPN5mxnet7NDArrayES6_PKcii+0xe6f) [0x7f5c3750818f]
[1,7]<stderr>:    8  /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/horovod/mxnet/mpi_lib.cpython-37m-x86_64-linux-gnu.so(horovod_mxnet_broadcast_async+0x54) [0x7f5c374ffd84]
[1,7]<stderr>:    9  /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/lib-dynload/../../libffi.so.7(+0x69dd) [0x7f5e1ee829dd]
[1,7]<stderr>:   10  /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/lib-dynload/../../libffi.so.7(+0x6067) [0x7f5e1ee82067]
[1,7]<stderr>:   11  /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/lib-dynload/_ctypes.cpython-37m-x86_64-linux-gnu.so(_ctypes_callproc+0x2ce) [0x7f5e200b327e]
[1,7]<stderr>:   12  /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/lib-dynload/_ctypes.cpython-37m-x86_64-linux-gnu.so(+0x12cb4) [0x7f5e200b3cb4]
[1,7]<stderr>:   13  python(_PyObject_FastCallKeywords+0x48b) [0x55d260c8500b]
[1,7]<stderr>:   14  python(_PyEval_EvalFrameDefault+0x51d1) [0x55d260ce99a1]
[1,7]<stderr>:   15  python(_PyEval_EvalCodeWithName+0x2f9) [0x55d260c2d2b9]
[1,7]<stderr>:   16  python(_PyFunction_FastCallKeywords+0x387) [0x55d260c7d497]
[1,7]<stderr>:   17  python(_PyEval_EvalFrameDefault+0x14ea) [0x55d260ce5cba]
[1,7]<stderr>:   18  python(_PyEval_EvalCodeWithName+0x2f9) [0x55d260c2d2b9]
[1,7]<stderr>:   19  python(_PyFunction_FastCallKeywords+0x387) [0x55d260c7d497]
[1,7]<stderr>:   20  python(_PyEval_EvalFrameDefault+0x14ea) [0x55d260ce5cba]
[1,7]<stderr>:   21  python(_PyFunction_FastCallKeywords+0xfb) [0x55d260c7d20b]
[1,7]<stderr>:   22  python(_PyEval_EvalFrameDefault+0x416) [0x55d260ce4be6]
[1,7]<stderr>:   23  python(_PyEval_EvalCodeWithName+0x2f9) [0x55d260c2d2b9]
[1,7]<stderr>:   24  python(PyEval_EvalCodeEx+0x44) [0x55d260c2e1d4]
[1,7]<stderr>:   25  python(PyEval_EvalCode+0x1c) [0x55d260c2e1fc]
[1,7]<stderr>:   26  python(+0x22bf44) [0x55d260d43f44]
[1,7]<stderr>:   27  python(PyRun_FileExFlags+0xa1) [0x55d260d4e2b1]
[1,7]<stderr>:   28  python(PyRun_SimpleFileExFlags+0x1c3) [0x55d260d4e4a3]
[1,7]<stderr>:   29  python(+0x2375d5) [0x55d260d4f5d5]
[1,7]<stderr>:   30  python(_Py_UnixMain+0x3c) [0x55d260d4f6fc]
[1,7]<stderr>:   31  /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0) [0x7f5e1faa2840]
[1,7]<stderr>:   32  python(+0x1dc3c0) [0x55d260cf43c0]
[1,7]<stderr>:===================
[1,4]<stderr>:[node106:26501:0:26501] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x30)
[1,4]<stderr>:==== backtrace ====
[1,4]<stderr>:    0  /usr/lib/libucs.so.0(+0x1fcec) [0x7f5fb1eb6cec]
[1,4]<stderr>:    1  /usr/lib/libucs.so.0(+0x1ff64) [0x7f5fb1eb6f64]
[1,4]<stderr>:    2  /lib/x86_64-linux-gnu/libpthread.so.0(pthread_mutex_lock+0x4) [0x7f61b05e1d44]
[1,4]<stderr>:    3  /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/mxnet/libmxnet.so(_ZN5mxnet6engine11ThreadedVar21AppendWriteDependencyEPNS0_8OprBlockE+0x44) [0x7f615288c564]
[1,4]<stderr>:    4  /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/mxnet/libmxnet.so(_ZN5mxnet6engine14ThreadedEngine4PushEPNS0_3OprENS_7ContextEib+0x280) [0x7f615288f790]
[1,4]<stderr>:    5  /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/mxnet/libmxnet.so(_ZN5mxnet6engine14ThreadedEngine9PushAsyncESt8functionIFvNS_10RunContextENS0_18CallbackOnCompleteEEENS_7ContextERKSt6vectorIPNS0_3VarESaISA_EESE_NS_10FnPropertyEiPKcb+0x131) [0x7f6152887ed1]
[1,4]<stderr>:    6  /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/mxnet/libmxnet.so(_ZN5mxnet10CopyFromToERKNS_7NDArrayES2_ib+0xaf4) [0x7f61528629d4]
[1,4]<stderr>:    7  /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/horovod/mxnet/mpi_lib.cpython-37m-x86_64-linux-gnu.so(_ZN7horovod5mxnet29PushHorovodOperationCudaOnCPUENS_6common7Request11RequestTypeEPN5mxnet7NDArrayES6_PKcii+0xe6f) [0x7f5fc7ca718f]
[1,4]<stderr>:    8  /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/horovod/mxnet/mpi_lib.cpython-37m-x86_64-linux-gnu.so(horovod_mxnet_broadcast_async+0x54) [0x7f5fc7c9ed84]
[1,4]<stderr>:    9  /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/lib-dynload/../../libffi.so.7(+0x69dd) [0x7f61af60e9dd]
[1,4]<stderr>:   10  /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/lib-dynload/../../libffi.so.7(+0x6067) [0x7f61af60e067]
[1,4]<stderr>:   11  /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/lib-dynload/_ctypes.cpython-37m-x86_64-linux-gnu.so(_ctypes_callproc+0x2ce) [0x7f61b083f27e]
[1,4]<stderr>:   12  /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/lib-dynload/_ctypes.cpython-37m-x86_64-linux-gnu.so(+0x12cb4) [0x7f61b083fcb4]
[1,4]<stderr>:   13  python(_PyObject_FastCallKeywords+0x48b) [0x55e8922d700b]
[1,4]<stderr>:   14  python(_PyEval_EvalFrameDefault+0x51d1) [0x55e89233b9a1]
[1,4]<stderr>:   15  python(_PyEval_EvalCodeWithName+0x2f9) [0x55e89227f2b9]
[1,4]<stderr>:   16  python(_PyFunction_FastCallKeywords+0x387) [0x55e8922cf497]
[1,4]<stderr>:   17  python(_PyEval_EvalFrameDefault+0x14ea) [0x55e892337cba]
[1,4]<stderr>:   18  python(_PyEval_EvalCodeWithName+0x2f9) [0x55e89227f2b9]
[1,4]<stderr>:   19  python(_PyFunction_FastCallKeywords+0x387) [0x55e8922cf497]
[1,4]<stderr>:   20  python(_PyEval_EvalFrameDefault+0x14ea) [0x55e892337cba]
[1,4]<stderr>:   21  python(_PyFunction_FastCallKeywords+0xfb) [0x55e8922cf20b]
[1,4]<stderr>:   22  python(_PyEval_EvalFrameDefault+0x416) [0x55e892336be6]
[1,4]<stderr>:   23  python(_PyEval_EvalCodeWithName+0x2f9) [0x55e89227f2b9]
[1,4]<stderr>:   24  python(PyEval_EvalCodeEx+0x44) [0x55e8922801d4]
[1,4]<stderr>:   25  python(PyEval_EvalCode+0x1c) [0x55e8922801fc]
[1,4]<stderr>:   26  python(+0x22bf44) [0x55e892395f44]
[1,4]<stderr>:   27  python(PyRun_FileExFlags+0xa1) [0x55e8923a02b1]
[1,4]<stderr>:   28  python(PyRun_SimpleFileExFlags+0x1c3) [0x55e8923a04a3]
[1,4]<stderr>:   29  python(+0x2375d5) [0x55e8923a15d5]
[1,4]<stderr>:   30  python(_Py_UnixMain+0x3c) [0x55e8923a16fc]
[1,4]<stderr>:   31  /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0) [0x7f61b022e840]
[1,4]<stderr>:   32  python(+0x1dc3c0) [0x55e8923463c0]
[1,4]<stderr>:===================
[1,5]<stderr>:[node106:26502:0:26502] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x30)
[1,5]<stderr>:==== backtrace ====
[1,6]<stderr>:[node106:26503:0:26503] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x30)
[1,6]<stderr>:==== backtrace ====
[1,5]<stderr>:    0  /usr/lib/libucs.so.0(+0x1fcec) [0x7f40f065bcec]
[1,5]<stderr>:    1  /usr/lib/libucs.so.0(+0x1ff64) [0x7f40f065bf64]
[1,5]<stderr>:    2  /lib/x86_64-linux-gnu/libpthread.so.0(pthread_mutex_lock+0x4) [0x7f42ead77d44]
[1,5]<stderr>:    3  /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/mxnet/libmxnet.so(_ZN5mxnet6engine11ThreadedVar21AppendWriteDependencyEPNS0_8OprBlockE+0x44) [0x7f428d022564]
[1,5]<stderr>:    4  /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/mxnet/libmxnet.so(_ZN5mxnet6engine14ThreadedEngine4PushEPNS0_3OprENS_7ContextEib+0x280) [0x7f428d025790]
[1,5]<stderr>:    5  /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/mxnet/libmxnet.so(_ZN5mxnet6engine14ThreadedEngine9PushAsyncESt8functionIFvNS_10RunContextENS0_18CallbackOnCompleteEEENS_7ContextERKSt6vectorIPNS0_3VarESaISA_EESE_NS_10FnPropertyEiPKcb+0x131) [0x7f428d01ded1]
[1,5]<stderr>:    6  /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/mxnet/libmxnet.so(_ZN5mxnet10CopyFromToERKNS_7NDArrayES2_ib+0xaf4) [0x7f428cff89d4]
[1,5]<stderr>:    7  /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/horovod/mxnet/mpi_lib.cpython-37m-x86_64-linux-gnu.so(_ZN7horovod5mxnet29PushHorovodOperationCudaOnCPUENS_6common7Request11RequestTypeEPN5mxnet7NDArrayES6_PKcii+0xe6f) [0x7f410243a18f]
[1,5]<stderr>:    8  /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/horovod/mxnet/mpi_lib.cpython-37m-x86_64-linux-gnu.so(horovod_mxnet_broadcast_async+0x54) [0x7f4102431d84]
[1,5]<stderr>:    9  /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/lib-dynload/../../libffi.so.7(+0x69dd) [0x7f42e9da49dd]
[1,5]<stderr>:   10  /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/lib-dynload/../../libffi.so.7(+0x6067) [0x7f42e9da4067]
[1,5]<stderr>:   11  /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/lib-dynload/_ctypes.cpython-37m-x86_64-linux-gnu.so(_ctypes_callproc+0x2ce) [0x7f42eafd527e]
[1,5]<stderr>:   12  /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/lib-dynload/_ctypes.cpython-37m-x86_64-linux-gnu.so(+0x12cb4) [0x7f42eafd5cb4]
[1,5]<stderr>:   13  python(_PyObject_FastCallKeywords+0x48b) [0x564d0453c00b]
[1,5]<stderr>:   14  python(_PyEval_EvalFrameDefault+0x51d1) [0x564d045a09a1]
[1,5]<stderr>:   15  python(_PyEval_EvalCodeWithName+0x2f9) [0x564d044e42b9]
[1,5]<stderr>:   16  python(_PyFunction_FastCallKeywords+0x387) [0x564d04534497]
[1,5]<stderr>:   17  python(_PyEval_EvalFrameDefault+0x14ea) [0x564d0459ccba]
[1,5]<stderr>:   18  python(_PyEval_EvalCodeWithName+0x2f9) [0x564d044e42b9]
[1,5]<stderr>:   19  python(_PyFunction_FastCallKeywords+0x387) [0x564d04534497]
[1,5]<stderr>:   20  python(_PyEval_EvalFrameDefault+0x14ea) [0x564d0459ccba]
[1,5]<stderr>:   21  python(_PyFunction_FastCallKeywords+0xfb) [0x564d0453420b]
[1,5]<stderr>:   22  python(_PyEval_EvalFrameDefault+0x416) [0x564d0459bbe6]
[1,5]<stderr>:   23  python(_PyEval_EvalCodeWithName+0x2f9) [0x564d044e42b9]
[1,5]<stderr>:   24  python(PyEval_EvalCodeEx+0x44) [0x564d044e51d4]
[1,5]<stderr>:   25  python(PyEval_EvalCode+0x1c) [0x564d044e51fc]
[1,5]<stderr>:   26  python(+0x22bf44) [0x564d045faf44]
[1,5]<stderr>:   27  python(PyRun_FileExFlags+0xa1) [0x564d046052b1]
[1,5]<stderr>:   28  python(PyRun_SimpleFileExFlags+0x1c3) [0x564d046054a3]
[1,5]<stderr>:   29  python(+0x2375d5) [0x564d046065d5]
[1,5]<stderr>:   30  python(_Py_UnixMain+0x3c) [0x564d046066fc]
[1,5]<stderr>:   31  /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0) [0x7f42ea9c4840]
[1,5]<stderr>:   32  python(+0x1dc3c0) [0x564d045ab3c0]
[1,5]<stderr>:===================
[1,6]<stderr>:    0  /usr/lib/libucs.so.0(+0x1fcec) [0x7f1a6c25bcec]
[1,6]<stderr>:    1  /usr/lib/libucs.so.0(+0x1ff64) [0x7f1a6c25bf64]
[1,6]<stderr>:    2  /lib/x86_64-linux-gnu/libpthread.so.0(pthread_mutex_lock+0x4) [0x7f1c66a2ad44]
[1,6]<stderr>:    3  /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/mxnet/libmxnet.so(_ZN5mxnet6engine11ThreadedVar21AppendWriteDependencyEPNS0_8OprBlockE+0x44) [0x7f1c08cd5564]
[1,6]<stderr>:    4  /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/mxnet/libmxnet.so(_ZN5mxnet6engine14ThreadedEngine4PushEPNS0_3OprENS_7ContextEib+0x280) [0x7f1c08cd8790]
[1,6]<stderr>:    5  /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/mxnet/libmxnet.so(_ZN5mxnet6engine14ThreadedEngine9PushAsyncESt8functionIFvNS_10RunContextENS0_18CallbackOnCompleteEEENS_7ContextERKSt6vectorIPNS0_3VarESaISA_EESE_NS_10FnPropertyEiPKcb+0x131) [0x7f1c08cd0ed1]
[1,6]<stderr>:    6  /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/mxnet/libmxnet.so(_ZN5mxnet10CopyFromToERKNS_7NDArrayES2_ib+0xaf4) [0x7f1c08cab9d4]
[1,6]<stderr>:    7  /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/horovod/mxnet/mpi_lib.cpython-37m-x86_64-linux-gnu.so(_ZN7horovod5mxnet29PushHorovodOperationCudaOnCPUENS_6common7Request11RequestTypeEPN5mxnet7NDArrayES6_PKcii+0xe6f) [0x7f1a7e0e118f]
[1,6]<stderr>:    8  /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/horovod/mxnet/mpi_lib.cpython-37m-x86_64-linux-gnu.so(horovod_mxnet_broadcast_async+0x54) [0x7f1a7e0d8d84]
[1,6]<stderr>:    9  /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/lib-dynload/../../libffi.so.7(+0x69dd) [0x7f1c65a579dd]
[1,6]<stderr>:   10  /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/lib-dynload/../../libffi.so.7(+0x6067) [0x7f1c65a57067]
[1,6]<stderr>:   11  /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/lib-dynload/_ctypes.cpython-37m-x86_64-linux-gnu.so(_ctypes_callproc+0x2ce) [0x7f1c66c8827e]
[1,6]<stderr>:   12  /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/lib-dynload/_ctypes.cpython-37m-x86_64-linux-gnu.so(+0x12cb4) [0x7f1c66c88cb4]
[1,6]<stderr>:   13  python(_PyObject_FastCallKeywords+0x48b) [0x562df52e800b]
[1,6]<stderr>:   14  python(_PyEval_EvalFrameDefault+0x51d1) [0x562df534c9a1]
[1,6]<stderr>:   15  python(_PyEval_EvalCodeWithName+0x2f9) [0x562df52902b9]
[1,6]<stderr>:   16  python(_PyFunction_FastCallKeywords+0x387) [0x562df52e0497]
[1,6]<stderr>:   17  python(_PyEval_EvalFrameDefault+0x14ea) [0x562df5348cba]
[1,6]<stderr>:   18  python(_PyEval_EvalCodeWithName+0x2f9) [0x562df52902b9]
[1,6]<stderr>:   19  python(_PyFunction_FastCallKeywords+0x387) [0x562df52e0497]
[1,6]<stderr>:   20  python(_PyEval_EvalFrameDefault+0x14ea) [0x562df5348cba]
[1,6]<stderr>:   21  python(_PyFunction_FastCallKeywords+0xfb) [0x562df52e020b]
[1,6]<stderr>:   22  python(_PyEval_EvalFrameDefault+0x416) [0x562df5347be6]
[1,6]<stderr>:   23  python(_PyEval_EvalCodeWithName+0x2f9) [0x562df52902b9]
[1,6]<stderr>:   24  python(PyEval_EvalCodeEx+0x44) [0x562df52911d4]
[1,6]<stderr>:   25  python(PyEval_EvalCode+0x1c) [0x562df52911fc]
[1,6]<stderr>:   26  python(+0x22bf44) [0x562df53a6f44]
[1,6]<stderr>:   27  python(PyRun_FileExFlags+0xa1) [0x562df53b12b1]
[1,6]<stderr>:   28  python(PyRun_SimpleFileExFlags+0x1c3) [0x562df53b14a3]
[1,6]<stderr>:   29  python(+0x2375d5) [0x562df53b25d5]
[1,6]<stderr>:   30  python(_Py_UnixMain+0x3c) [0x562df53b26fc]
[1,6]<stderr>:   31  /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0) [0x7f1c66677840]
[1,6]<stderr>:   32  python(+0x1dc3c0) [0x562df53573c0]
[1,6]<stderr>:===================
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 7 with PID 0 on node node106 exited on signal 11 (Segmentation fault).

yangshuo0323 commented 3 years ago

First, I want to make sure: is my method correct for pre-training a BERT model on multiple GPUs? @leezu

leezu commented 3 years ago

Software environment: Python: 3.7.7, CUDA: 10.2
Install MXNet: pip install mxnet-cu102, version is 1.7.0
Download model script: https://github.com/dmlc/gluon-nlp, branch 2.0.

Do you mean that you use the gluon-nlp master branch with MXNet 1.7? That is not supported. You need the MXNet 2 Alpha release https://github.com/apache/incubator-mxnet/releases/v2.0.0-alpha to use the GluonNLP master branch. If you would rather not compile MXNet from source, you can also just follow https://github.com/dmlc/gluon-nlp#installation

yangshuo0323 commented 3 years ago

I use the gluon-nlp 2.0 branch with MXNet 1.7. Is that also unsupported? I will try what you suggest. Thank you.

yangshuo0323 commented 3 years ago

I think my 'mpirun' environment may be wrong, for example the optional parameters:

mpirun -np 8 -H localhost:8 -mca pml ob1 -mca btl ^openib \
       -mca btl_tcp_if_exclude docker0,lo --map-by ppr:4:socket \
       --mca plm_rsh_agent 'ssh -q -o StrictHostKeyChecking=no' \
       -x NCCL_MIN_NRINGS=8 -x NCCL_DEBUG=INFO -x HOROVOD_HIERARCHICAL_ALLREDUCE=1 \
       -x MXNET_SAFE_ACCUMULATION=1 --tag-output \

This may cause problems with inter-process communication. So, which parameters need to be set for multi-GPU training? @leezu
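A launcher problem and a framework problem can look alike in a crash log, so it can help to first run a script that imports neither MXNet nor Horovod under the same mpirun command. The sketch below assumes Open MPI (the -mca options suggest it) and only reads the OMPI_COMM_WORLD_* variables that Open MPI exports to each process. The function name mpi_rank_info is made up for illustration.

```python
import os


def mpi_rank_info(environ=None):
    """Read the rank variables that Open MPI's mpirun exports to each process.

    Assumption: the launcher is Open MPI. Other MPI implementations use
    different variable names, in which case every value comes back as None.
    """
    env = os.environ if environ is None else environ
    variables = {
        "rank": "OMPI_COMM_WORLD_RANK",
        "size": "OMPI_COMM_WORLD_SIZE",
        "local_rank": "OMPI_COMM_WORLD_LOCAL_RANK",
    }
    info = {}
    for name, var in variables.items():
        value = env.get(var)
        info[name] = int(value) if value is not None else None
    return info


if __name__ == "__main__":
    info = mpi_rank_info()
    if info["rank"] is None:
        print("No OMPI_COMM_WORLD_* variables found; not started by Open MPI's mpirun?")
    else:
        print("rank {rank} of {size}, local rank {local_rank}".format(**info))
```

If this prints eight distinct ranks when started with mpirun -np 8, the launcher itself is probably fine, and the segmentation fault more likely comes from the Horovod or MXNet build.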

leezu commented 3 years ago

I use gluon-nlp branch 2.0 with MXNet 1.7. Is it also not supported?

I don't know how this branch was created, but there is actually no gluon-nlp 2.0. cc @szha @sxjscience let's delete the branch? The branch contains commits of GluonNLP 0.x, so yes, it should work with MXNet 1.7

sxjscience commented 3 years ago

I have no idea about the 2.0 branch. We may just delete it.

@yangshuo0323 Feel free to try out the BERT pretraining code in https://github.com/dmlc/gluon-nlp/tree/master/scripts/pretraining/bert

yangshuo0323 commented 3 years ago

I have tried the gluon-nlp 0.10.0 branch, and the same error occurred. So gluon-nlp 0.10.0 and MXNet 1.6.0 or 1.7.0 are compatible, right? I will check the rest of my software environment...

sxjscience commented 3 years ago

That should work. In fact, is it feasible to try out our new version with the custom version of MXNet 2.0 and the GluonNLP master branch?

yangshuo0323 commented 3 years ago

Ok, I will try out the new version of MXNet and GluonNLP. Thank you so much!

sxjscience commented 3 years ago

@yangshuo0323 Thanks! I would encourage you to try our new version, and we can help if you run into any problems training the model. To try the new MXNet, you can install it with one of the following commands:

# Install the version with CUDA 10.1
python3 -m pip install -U --pre "mxnet-cu101>=2.0.0b20210121" -f https://dist.mxnet.io/python

# Install the version with CUDA 10.2
python3 -m pip install -U --pre "mxnet-cu102>=2.0.0b20210121" -f https://dist.mxnet.io/python

# Install the version with CUDA 11
python3 -m pip install -U --pre "mxnet-cu110>=2.0.0b20210121" -f https://dist.mxnet.io/python

# Install the cpu-only version
python3 -m pip install -U --pre "mxnet>=2.0.0b20210121" -f https://dist.mxnet.io/python

Also, you can just clone gluonnlp/master and install via the following command:

python3 -m pip install -U -e ."[extras]"

This will give the nlp_data and nlp_process CLI. You can use nlp_data to download corpus like wikipedia and bookcorpus and

We also recommend installing Horovod via

HOROVOD_GPU_OPERATIONS=NCCL HOROVOD_WITHOUT_GLOO=1 HOROVOD_WITH_MPI=1 HOROVOD_WITH_MXNET=1 HOROVOD_WITHOUT_TENSORFLOW=1 python3 -m pip install --no-cache-dir horovod

After that, feel free to try out the example in https://github.com/dmlc/gluon-nlp/tree/master/scripts/pretraining/bert. We will try to help with any issues you run into.
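Since much of this thread turned on which versions were actually installed, a quick report of the relevant packages may save a round trip. This is only a sketch: the list of distribution names is an assumption taken from the pip commands above, installed_versions is a made-up helper name, and importlib.metadata needs Python 3.8 or newer (on 3.7 the importlib_metadata backport provides the same API).

```python
from importlib import metadata  # Python 3.8+

# Distribution names assumed from the pip commands in this thread.
CANDIDATES = (
    "mxnet",
    "mxnet-cu101",
    "mxnet-cu102",
    "mxnet-cu110",
    "gluonnlp",
    "horovod",
)


def installed_versions(names=CANDIDATES):
    """Return {distribution name: version string, or None if not installed}."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print("{:<12} {}".format(name, version or "not installed"))
```

More than one mxnet* entry showing a version at the same time would be worth fixing before looking any further, because those wheels install the same Python package.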

yangshuo0323 commented 3 years ago

The previous error was caused by an incorrect Horovod installation, which may not have been built with the HOROVOD_WITH_MXNET environment variable set.
Thanks to everyone who gave me advice above. I look forward to trying the new version as you suggest.
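For anyone who lands here with the same backtrace, a basic check of the Horovod build can be sketched as follows. It only tests whether horovod.mxnet imports and which communication backends it reports, so passing it does not prove the build is correct. The helper name horovod_mxnet_status is made up, and the probe functions are looked up defensively in case a given Horovod version does not expose them. Horovod's own horovodrun --check-build command gives a similar summary from the shell.

```python
import importlib


def horovod_mxnet_status():
    """Report whether horovod.mxnet imports and which backends it says were built."""
    try:
        hvd = importlib.import_module("horovod.mxnet")
    except Exception as exc:  # missing package or extension not built for MXNet
        return {"importable": False, "detail": repr(exc)}

    status = {"importable": True}
    for probe in ("mpi_built", "nccl_built", "gloo_built"):
        fn = getattr(hvd, probe, None)
        try:
            status[probe] = fn() if callable(fn) else None
        except Exception as exc:  # keep going if one probe fails
            status[probe] = repr(exc)
    return status


if __name__ == "__main__":
    for key, value in horovod_mxnet_status().items():
        print("{}: {}".format(key, value))
```

If the import fails, or the MPI and NCCL probes come back false after an install that requested them, reinstalling Horovod with --no-cache-dir and the build variables from the command above is a reasonable next step.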
