Open yangshuo0323 opened 3 years ago
Please provide the complete error message
The whole message:
[1,5]<stderr>:[21:43:10] src/storage/storage.cc:110: Using GPUPooledRoundedStorageManager.
[1,4]<stderr>:[21:43:10] src/storage/storage.cc:110: Using GPUPooledRoundedStorageManager.
[1,7]<stderr>:[21:43:10] src/storage/storage.cc:110: Using GPUPooledRoundedStorageManager.
[1,6]<stderr>:[21:43:10] src/storage/storage.cc:110: Using GPUPooledRoundedStorageManager.
[1,2]<stderr>:[21:43:11] src/storage/storage.cc:110: Using GPUPooledRoundedStorageManager.
[1,1]<stderr>:[21:43:11] src/storage/storage.cc:110: Using GPUPooledRoundedStorageManager.
[1,0]<stderr>:[21:43:11] src/storage/storage.cc:110: Using GPUPooledRoundedStorageManager.
[1,3]<stderr>:[21:43:11] src/storage/storage.cc:[1,3]<stderr>:110: Using GPUPooledRoundedStorageManager.
[1,7]<stderr>:INFO:root:Model created
[1,7]<stderr>:DEBUG:root:Random seed set to 91
[1,7]<stderr>:INFO:root:Begin process dataset......
[1,7]<stderr>:INFO:root:args.num_buckets: 1, num_workers: 8, rank: 7
[1,7]<stderr>:INFO:root:400 files are found.
[1,4]<stderr>:INFO:root:Model created
[1,4]<stderr>:DEBUG:root:Random seed set to 580
[1,4]<stderr>:INFO:root:Begin process dataset......
[1,4]<stderr>:INFO:root:args.num_buckets: 1, num_workers: 8, rank: 4
[1,4]<stderr>:INFO:root:400 files are found.
[1,6]<stderr>:INFO:root:Model created
[1,6]<stderr>:DEBUG:root:Random seed set to 555
[1,6]<stderr>:INFO:root:Begin process dataset......
[1,6]<stderr>:INFO:root:args.num_buckets: 1, num_workers: 8, rank: 6
[1,6]<stderr>:INFO:root:400 files are found.
[1,5]<stderr>:INFO:root:Model created
[1,5]<stderr>:DEBUG:root:Random seed set to 185
[1,5]<stderr>:INFO:root:Begin process dataset......
[1,5]<stderr>:INFO:root:args.num_buckets: 1, num_workers: 8, rank: 5
[1,5]<stderr>:INFO:root:400 files are found.
[1,7]<stderr>:[node106:26504:0:26504] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x30)
[1,7]<stderr>:==== backtrace ====
[1,7]<stderr>: 0 /usr/lib/libucs.so.0(+0x1fcec) [0x7f5c21681cec]
[1,7]<stderr>: 1 /usr/lib/libucs.so.0(+0x1ff64) [0x7f5c21681f64]
[1,7]<stderr>: 2 /lib/x86_64-linux-gnu/libpthread.so.0(pthread_mutex_lock+0x4) [0x7f5e1fe55d44]
[1,7]<stderr>: 3 /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/mxnet/libmxnet.so(_ZN5mxnet6engine11ThreadedVar21AppendWriteDependencyEPNS0_8OprBlockE+0x44) [0x7f5dc2100564]
[1,7]<stderr>: 4 /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/mxnet/libmxnet.so(_ZN5mxnet6engine14ThreadedEngine4PushEPNS0_3OprENS_7ContextEib+0x280) [0x7f5dc2103790]
[1,7]<stderr>: 5 /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/mxnet/libmxnet.so(_ZN5mxnet6engine14ThreadedEngine9PushAsyncESt8functionIFvNS_10RunContextENS0_18CallbackOnCompleteEEENS_7ContextERKSt6vectorIPNS0_3VarESaISA_EESE_NS_10FnPropertyEiPKcb+0x131) [0x7f5dc20fbed1]
[1,7]<stderr>: 6 /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/mxnet/libmxnet.so(_ZN5mxnet10CopyFromToERKNS_7NDArrayES2_ib+0xaf4) [0x7f5dc20d69d4]
[1,7]<stderr>: 7 /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/horovod/mxnet/mpi_lib.cpython-37m-x86_64-linux-gnu.so(_ZN7horovod5mxnet29PushHorovodOperationCudaOnCPUENS_6common7Request11RequestTypeEPN5mxnet7NDArrayES6_PKcii+0xe6f) [0x7f5c3750818f]
[1,7]<stderr>: 8 /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/horovod/mxnet/mpi_lib.cpython-37m-x86_64-linux-gnu.so(horovod_mxnet_broadcast_async+0x54) [0x7f5c374ffd84]
[1,7]<stderr>: 9 /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/lib-dynload/../../libffi.so.7(+0x69dd) [0x7f5e1ee829dd]
[1,7]<stderr>: 10 /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/lib-dynload/../../libffi.so.7(+0x6067) [0x7f5e1ee82067]
[1,7]<stderr>: 11 /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/lib-dynload/_ctypes.cpython-37m-x86_64-linux-gnu.so(_ctypes_callproc+0x2ce) [0x7f5e200b327e]
[1,7]<stderr>: 12 /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/lib-dynload/_ctypes.cpython-37m-x86_64-linux-gnu.so(+0x12cb4) [0x7f5e200b3cb4]
[1,7]<stderr>: 13 python(_PyObject_FastCallKeywords+0x48b) [0x55d260c8500b]
[1,7]<stderr>: 14 python(_PyEval_EvalFrameDefault+0x51d1) [0x55d260ce99a1]
[1,7]<stderr>: 15 python(_PyEval_EvalCodeWithName+0x2f9) [0x55d260c2d2b9]
[1,7]<stderr>: 16 python(_PyFunction_FastCallKeywords+0x387) [0x55d260c7d497]
[1,7]<stderr>: 17 python(_PyEval_EvalFrameDefault+0x14ea) [0x55d260ce5cba]
[1,7]<stderr>: 18 python(_PyEval_EvalCodeWithName+0x2f9) [0x55d260c2d2b9]
[1,7]<stderr>: 19 python(_PyFunction_FastCallKeywords+0x387) [0x55d260c7d497]
[1,7]<stderr>: 20 python(_PyEval_EvalFrameDefault+0x14ea) [0x55d260ce5cba]
[1,7]<stderr>: 21 python(_PyFunction_FastCallKeywords+0xfb) [0x55d260c7d20b]
[1,7]<stderr>: 22 python(_PyEval_EvalFrameDefault+0x416) [0x55d260ce4be6]
[1,7]<stderr>: 23 python(_PyEval_EvalCodeWithName+0x2f9) [0x55d260c2d2b9]
[1,7]<stderr>: 24 python(PyEval_EvalCodeEx+0x44) [0x55d260c2e1d4]
[1,7]<stderr>: 25 python(PyEval_EvalCode+0x1c) [0x55d260c2e1fc]
[1,7]<stderr>: 26 python(+0x22bf44) [0x55d260d43f44]
[1,7]<stderr>: 27 python(PyRun_FileExFlags+0xa1) [0x55d260d4e2b1]
[1,7]<stderr>: 28 python(PyRun_SimpleFileExFlags+0x1c3) [0x55d260d4e4a3]
[1,7]<stderr>: 29 python(+0x2375d5) [0x55d260d4f5d5]
[1,7]<stderr>: 30 python(_Py_UnixMain+0x3c) [0x55d260d4f6fc]
[1,7]<stderr>: 31 /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0) [0x7f5e1faa2840]
[1,7]<stderr>: 32 python(+0x1dc3c0) [0x55d260cf43c0]
[1,7]<stderr>:===================
[1,4]<stderr>:[node106:26501:0:26501] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x30)
[1,4]<stderr>:==== backtrace ====
[1,4]<stderr>: 0 /usr/lib/libucs.so.0(+0x1fcec) [0x7f5fb1eb6cec]
[1,4]<stderr>: 1 /usr/lib/libucs.so.0(+0x1ff64) [0x7f5fb1eb6f64]
[1,4]<stderr>: 2 /lib/x86_64-linux-gnu/libpthread.so.0(pthread_mutex_lock+0x4) [0x7f61b05e1d44]
[1,4]<stderr>: 3 /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/mxnet/libmxnet.so(_ZN5mxnet6engine11ThreadedVar21AppendWriteDependencyEPNS0_8OprBlockE+0x44) [0x7f615288c564]
[1,4]<stderr>: 4 /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/mxnet/libmxnet.so(_ZN5mxnet6engine14ThreadedEngine4PushEPNS0_3OprENS_7ContextEib+0x280) [0x7f615288f790]
[1,4]<stderr>: 5 /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/mxnet/libmxnet.so(_ZN5mxnet6engine14ThreadedEngine9PushAsyncESt8functionIFvNS_10RunContextENS0_18CallbackOnCompleteEEENS_7ContextERKSt6vectorIPNS0_3VarESaISA_EESE_NS_10FnPropertyEiPKcb+0x131) [0x7f6152887ed1]
[1,4]<stderr>: 6 /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/mxnet/libmxnet.so(_ZN5mxnet10CopyFromToERKNS_7NDArrayES2_ib+0xaf4) [0x7f61528629d4]
[1,4]<stderr>: 7 /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/horovod/mxnet/mpi_lib.cpython-37m-x86_64-linux-gnu.so(_ZN7horovod5mxnet29PushHorovodOperationCudaOnCPUENS_6common7Request11RequestTypeEPN5mxnet7NDArrayES6_PKcii+0xe6f) [0x7f5fc7ca718f]
[1,4]<stderr>: 8 /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/horovod/mxnet/mpi_lib.cpython-37m-x86_64-linux-gnu.so(horovod_mxnet_broadcast_async+0x54) [0x7f5fc7c9ed84]
[1,4]<stderr>: 9 /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/lib-dynload/../../libffi.so.7(+0x69dd) [0x7f61af60e9dd]
[1,4]<stderr>: 10 /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/lib-dynload/../../libffi.so.7(+0x6067) [0x7f61af60e067]
[1,4]<stderr>: 11 /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/lib-dynload/_ctypes.cpython-37m-x86_64-linux-gnu.so(_ctypes_callproc+0x2ce) [0x7f61b083f27e]
[1,4]<stderr>: 12 /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/lib-dynload/_ctypes.cpython-37m-x86_64-linux-gnu.so(+0x12cb4) [0x7f61b083fcb4]
[1,4]<stderr>: 13 python(_PyObject_FastCallKeywords+0x48b) [0x55e8922d700b]
[1,4]<stderr>: 14 python(_PyEval_EvalFrameDefault+0x51d1) [0x55e89233b9a1]
[1,4]<stderr>: 15 python(_PyEval_EvalCodeWithName+0x2f9) [0x55e89227f2b9]
[1,4]<stderr>: 16 python(_PyFunction_FastCallKeywords+0x387) [0x55e8922cf497]
[1,4]<stderr>: 17 python(_PyEval_EvalFrameDefault+0x14ea) [0x55e892337cba]
[1,4]<stderr>: 18 python(_PyEval_EvalCodeWithName+0x2f9) [0x55e89227f2b9]
[1,4]<stderr>: 19 python(_PyFunction_FastCallKeywords+0x387) [0x55e8922cf497]
[1,4]<stderr>: 20 python(_PyEval_EvalFrameDefault+0x14ea) [0x55e892337cba]
[1,4]<stderr>: 21 python(_PyFunction_FastCallKeywords+0xfb) [0x55e8922cf20b]
[1,4]<stderr>: 22 python(_PyEval_EvalFrameDefault+0x416) [0x55e892336be6]
[1,4]<stderr>: 23 python(_PyEval_EvalCodeWithName+0x2f9) [0x55e89227f2b9]
[1,4]<stderr>: 24 python(PyEval_EvalCodeEx+0x44) [0x55e8922801d4]
[1,4]<stderr>: 25 python(PyEval_EvalCode+0x1c) [0x55e8922801fc]
[1,4]<stderr>: 26 python(+0x22bf44) [0x55e892395f44]
[1,4]<stderr>: 27 python(PyRun_FileExFlags+0xa1) [0x55e8923a02b1]
[1,4]<stderr>: 28 python(PyRun_SimpleFileExFlags+0x1c3) [0x55e8923a04a3]
[1,4]<stderr>: 29 python(+0x2375d5) [0x55e8923a15d5]
[1,4]<stderr>: 30 python(_Py_UnixMain+0x3c) [0x55e8923a16fc]
[1,4]<stderr>: 31 /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0) [0x7f61b022e840]
[1,4]<stderr>: 32 python(+0x1dc3c0) [0x55e8923463c0]
[1,4]<stderr>:===================
[1,5]<stderr>:[node106:26502:0:26502] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x30)
[1,5]<stderr>:==== backtrace ====
[1,6]<stderr>:[node106:26503:0:26503] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x30)
[1,6]<stderr>:==== backtrace ====
[1,5]<stderr>: 0 /usr/lib/libucs.so.0(+0x1fcec) [0x7f40f065bcec]
[1,5]<stderr>: 1 /usr/lib/libucs.so.0(+0x1ff64) [0x7f40f065bf64]
[1,5]<stderr>: 2 /lib/x86_64-linux-gnu/libpthread.so.0(pthread_mutex_lock+0x4) [0x7f42ead77d44]
[1,5]<stderr>: 3 /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/mxnet/libmxnet.so(_ZN5mxnet6engine11ThreadedVar21AppendWriteDependencyEPNS0_8OprBlockE+0x44) [0x7f428d022564]
[1,5]<stderr>: 4 /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/mxnet/libmxnet.so(_ZN5mxnet6engine14ThreadedEngine4PushEPNS0_3OprENS_7ContextEib+0x280) [0x7f428d025790]
[1,5]<stderr>: 5 /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/mxnet/libmxnet.so(_ZN5mxnet6engine14ThreadedEngine9PushAsyncESt8functionIFvNS_10RunContextENS0_18CallbackOnCompleteEEENS_7ContextERKSt6vectorIPNS0_3VarESaISA_EESE_NS_10FnPropertyEiPKcb+0x131) [0x7f428d01ded1]
[1,5]<stderr>: 6 /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/mxnet/libmxnet.so(_ZN5mxnet10CopyFromToERKNS_7NDArrayES2_ib+0xaf4) [0x7f428cff89d4]
[1,5]<stderr>: 7 /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/horovod/mxnet/mpi_lib.cpython-37m-x86_64-linux-gnu.so(_ZN7horovod5mxnet29PushHorovodOperationCudaOnCPUENS_6common7Request11RequestTypeEPN5mxnet7NDArrayES6_PKcii+0xe6f) [0x7f410243a18f]
[1,5]<stderr>: 8 /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/horovod/mxnet/mpi_lib.cpython-37m-x86_64-linux-gnu.so(horovod_mxnet_broadcast_async+0x54) [0x7f4102431d84]
[1,5]<stderr>: 9 /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/lib-dynload/../../libffi.so.7(+0x69dd) [0x7f42e9da49dd]
[1,5]<stderr>: 10 /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/lib-dynload/../../libffi.so.7(+0x6067) [0x7f42e9da4067]
[1,5]<stderr>: 11 /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/lib-dynload/_ctypes.cpython-37m-x86_64-linux-gnu.so(_ctypes_callproc+0x2ce) [0x7f42eafd527e]
[1,5]<stderr>: 12 /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/lib-dynload/_ctypes.cpython-37m-x86_64-linux-gnu.so(+0x12cb4) [0x7f42eafd5cb4]
[1,5]<stderr>: 13 python(_PyObject_FastCallKeywords+0x48b) [0x564d0453c00b]
[1,5]<stderr>: 14 python(_PyEval_EvalFrameDefault+0x51d1) [0x564d045a09a1]
[1,5]<stderr>: 15 python(_PyEval_EvalCodeWithName+0x2f9) [0x564d044e42b9]
[1,5]<stderr>: 16 python(_PyFunction_FastCallKeywords+0x387) [0x564d04534497]
[1,5]<stderr>: 17 python(_PyEval_EvalFrameDefault+0x14ea) [0x564d0459ccba]
[1,5]<stderr>: 18 python(_PyEval_EvalCodeWithName+0x2f9) [0x564d044e42b9]
[1,5]<stderr>: 19 python(_PyFunction_FastCallKeywords+0x387) [0x564d04534497]
[1,5]<stderr>: 20 python(_PyEval_EvalFrameDefault+0x14ea) [0x564d0459ccba]
[1,5]<stderr>: 21 python(_PyFunction_FastCallKeywords+0xfb) [0x564d0453420b]
[1,5]<stderr>: 22 python(_PyEval_EvalFrameDefault+0x416) [0x564d0459bbe6]
[1,5]<stderr>: 23 python(_PyEval_EvalCodeWithName+0x2f9) [0x564d044e42b9]
[1,5]<stderr>: 24 python(PyEval_EvalCodeEx+0x44) [0x564d044e51d4]
[1,5]<stderr>: 25 python(PyEval_EvalCode+0x1c) [0x564d044e51fc]
[1,5]<stderr>: 26 python(+0x22bf44) [0x564d045faf44]
[1,5]<stderr>: 27 python(PyRun_FileExFlags+0xa1) [0x564d046052b1]
[1,5]<stderr>: 28 python(PyRun_SimpleFileExFlags+0x1c3) [0x564d046054a3]
[1,5]<stderr>: 29 python(+0x2375d5) [0x564d046065d5]
[1,5]<stderr>: 30 python(_Py_UnixMain+0x3c) [0x564d046066fc]
[1,5]<stderr>: 31 /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0) [0x7f42ea9c4840]
[1,5]<stderr>: 32 python(+0x1dc3c0) [0x564d045ab3c0]
[1,5]<stderr>:===================
[1,6]<stderr>: 0 /usr/lib/libucs.so.0(+0x1fcec) [0x7f1a6c25bcec]
[1,6]<stderr>: 1 /usr/lib/libucs.so.0(+0x1ff64) [0x7f1a6c25bf64]
[1,6]<stderr>: 2 /lib/x86_64-linux-gnu/libpthread.so.0(pthread_mutex_lock+0x4) [0x7f1c66a2ad44]
[1,6]<stderr>: 3 /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/mxnet/libmxnet.so(_ZN5mxnet6engine11ThreadedVar21AppendWriteDependencyEPNS0_8OprBlockE+0x44) [0x7f1c08cd5564]
[1,6]<stderr>: 4 /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/mxnet/libmxnet.so(_ZN5mxnet6engine14ThreadedEngine4PushEPNS0_3OprENS_7ContextEib+0x280) [0x7f1c08cd8790]
[1,6]<stderr>: 5 /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/mxnet/libmxnet.so(_ZN5mxnet6engine14ThreadedEngine9PushAsyncESt8functionIFvNS_10RunContextENS0_18CallbackOnCompleteEEENS_7ContextERKSt6vectorIPNS0_3VarESaISA_EESE_NS_10FnPropertyEiPKcb+0x131) [0x7f1c08cd0ed1]
[1,6]<stderr>: 6 /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/mxnet/libmxnet.so(_ZN5mxnet10CopyFromToERKNS_7NDArrayES2_ib+0xaf4) [0x7f1c08cab9d4]
[1,6]<stderr>: 7 /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/horovod/mxnet/mpi_lib.cpython-37m-x86_64-linux-gnu.so(_ZN7horovod5mxnet29PushHorovodOperationCudaOnCPUENS_6common7Request11RequestTypeEPN5mxnet7NDArrayES6_PKcii+0xe6f) [0x7f1a7e0e118f]
[1,6]<stderr>: 8 /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/horovod/mxnet/mpi_lib.cpython-37m-x86_64-linux-gnu.so(horovod_mxnet_broadcast_async+0x54) [0x7f1a7e0d8d84]
[1,6]<stderr>: 9 /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/lib-dynload/../../libffi.so.7(+0x69dd) [0x7f1c65a579dd]
[1,6]<stderr>: 10 /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/lib-dynload/../../libffi.so.7(+0x6067) [0x7f1c65a57067]
[1,6]<stderr>: 11 /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/lib-dynload/_ctypes.cpython-37m-x86_64-linux-gnu.so(_ctypes_callproc+0x2ce) [0x7f1c66c8827e]
[1,6]<stderr>: 12 /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/lib-dynload/_ctypes.cpython-37m-x86_64-linux-gnu.so(+0x12cb4) [0x7f1c66c88cb4]
[1,6]<stderr>: 13 python(_PyObject_FastCallKeywords+0x48b) [0x562df52e800b]
[1,6]<stderr>: 14 python(_PyEval_EvalFrameDefault+0x51d1) [0x562df534c9a1]
[1,6]<stderr>: 15 python(_PyEval_EvalCodeWithName+0x2f9) [0x562df52902b9]
[1,6]<stderr>: 16 python(_PyFunction_FastCallKeywords+0x387) [0x562df52e0497]
[1,6]<stderr>: 17 python(_PyEval_EvalFrameDefault+0x14ea) [0x562df5348cba]
[1,6]<stderr>: 18 python(_PyEval_EvalCodeWithName+0x2f9) [0x562df52902b9]
[1,6]<stderr>: 19 python(_PyFunction_FastCallKeywords+0x387) [0x562df52e0497]
[1,6]<stderr>: 20 python(_PyEval_EvalFrameDefault+0x14ea) [0x562df5348cba]
[1,6]<stderr>: 21 python(_PyFunction_FastCallKeywords+0xfb) [0x562df52e020b]
[1,6]<stderr>: 22 python(_PyEval_EvalFrameDefault+0x416) [0x562df5347be6]
[1,6]<stderr>: 23 python(_PyEval_EvalCodeWithName+0x2f9) [0x562df52902b9]
[1,6]<stderr>: 24 python(PyEval_EvalCodeEx+0x44) [0x562df52911d4]
[1,6]<stderr>: 25 python(PyEval_EvalCode+0x1c) [0x562df52911fc]
[1,6]<stderr>: 26 python(+0x22bf44) [0x562df53a6f44]
[1,6]<stderr>: 27 python(PyRun_FileExFlags+0xa1) [0x562df53b12b1]
[1,6]<stderr>: 28 python(PyRun_SimpleFileExFlags+0x1c3) [0x562df53b14a3]
[1,6]<stderr>: 29 python(+0x2375d5) [0x562df53b25d5]
[1,6]<stderr>: 30 python(_Py_UnixMain+0x3c) [0x562df53b26fc]
[1,6]<stderr>: 31 /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0) [0x7f1c66677840]
[1,6]<stderr>: 32 python(+0x1dc3c0) [0x562df53573c0]
[1,6]<stderr>:===================
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 7 with PID 0 on node node106 exited on signal 11 (Segmentation fault).
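Because all ranks write to stderr at the same time, their backtraces interleave in the output above (ranks 5 and 6, for example). Below is a minimal sketch for untangling such a log, assuming it was produced with mpirun's --tag-output flag as in this thread. The function names group_by_rank and crashed_ranks and both regular expressions are illustrative and not part of any tool.

```python
import re
from collections import defaultdict

# Matches the "[job,rank]<stream>:" prefix that mpirun --tag-output adds.
TAG = re.compile(r"^\[(\d+),(\d+)\]<(stdout|stderr)>:(.*)$")
# Matches one backtrace frame, e.g. " 3 /path/libmxnet.so(symbol+0x44) [0x7f...]".
FRAME = re.compile(r"^\s*(\d+)\s+(\S+?)\((.*?)\)\s+\[0x[0-9a-f]+\]$")


def group_by_rank(lines):
    """Return {rank: [message, ...]} with the mpirun tag stripped."""
    per_rank = defaultdict(list)
    for line in lines:
        match = TAG.match(line.rstrip("\n"))
        if match:
            per_rank[int(match.group(2))].append(match.group(4))
    return dict(per_rank)


def crashed_ranks(per_rank):
    """Return {rank: [(frame_no, library, symbol), ...]} for ranks that caught a signal."""
    report = {}
    for rank, messages in per_rank.items():
        if not any("Caught signal" in message for message in messages):
            continue
        frames = []
        for message in messages:
            frame = FRAME.match(message)
            if frame:
                library = frame.group(2).rsplit("/", 1)[-1]  # keep the file name only
                frames.append((int(frame.group(1)), library, frame.group(3)))
        report[rank] = frames
    return report


if __name__ == "__main__":
    # Three lines copied from the log above.
    sample = [
        "[1,5]<stderr>:[node106:26502:0:26502] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x30)",
        "[1,5]<stderr>: 0 /usr/lib/libucs.so.0(+0x1fcec) [0x7f40f065bcec]",
        "[1,7]<stderr>:INFO:root:Model created",
    ]
    for rank, frames in crashed_ranks(group_by_rank(sample)).items():
        print(rank, frames)
```

Grouped this way, every crashing rank shown in the log follows the same path: horovod_mxnet_broadcast_async calls into MXNet's CopyFromTo and faults inside pthread_mutex_lock.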
First, I want to make sure: is my method correct for pre-training a BERT model on multiple GPUs? @leezu
Software environment: Python 3.7.7, CUDA 10.2. Install MXNet: pip install mxnet-cu102 (version 1.7.0). Download model script: https://github.com/dmlc/gluon-nlp, branch 2.0.
Do you mean that you are using the gluon-nlp master branch with MXNet 1.7? That's not supported. You need the MXNet 2 alpha release (https://github.com/apache/incubator-mxnet/releases/v2.0.0-alpha) to use the GluonNLP master branch. If you'd rather not compile MXNet from source, you can also just follow https://github.com/dmlc/gluon-nlp#installation
I use gluon-nlp branch 2.0 with MXNet 1.7. Is that also not supported? I will try what you suggest. Thank you.
I think my 'mpirun' setup may be wrong, for example the optional parameters:
mpirun -np 8 -H localhost:8 -mca pml ob1 -mca btl ^openib \
-mca btl_tcp_if_exclude docker0,lo --map-by ppr:4:socket \
--mca plm_rsh_agent 'ssh -q -o StrictHostKeyChecking=no' \
-x NCCL_MIN_NRINGS=8 -x NCCL_DEBUG=INFO -x HOROVOD_HIERARCHICAL_ALLREDUCE=1 \
-x MXNET_SAFE_ACCUMULATION=1 --tag-output \
This may cause problems with inter-process communication. So, which parameters need to be set for multi-GPU training? @leezu
I use gluon-nlp branch 2.0 with MXNet 1.7. Is it also not supported?
I don't know how this branch was created, but there is actually no gluon-nlp 2.0. cc @szha @sxjscience let's delete the branch? The branch contains commits of GluonNLP 0.x, so yes, it should work with MXNet 1.7
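The pairing rule stated in this thread can be written down as a small check. This is only a sketch of what was said here (GluonNLP 0.x with MXNet 1.x, GluonNLP master with MXNet 2.x). The labels "0.x" and "master" and the function names are hypothetical, and this is not an official compatibility matrix.

```python
# Assumption taken from this thread only: which MXNet major version
# each GluonNLP line is expected to run on.
EXPECTED_MXNET_MAJOR = {"0.x": 1, "master": 2}


def mxnet_major(version):
    """Return the leading integer of a version string such as '1.7.0' or '2.0.0b20210121'."""
    return int(version.split(".", 1)[0])


def is_supported(gluonnlp_line, mxnet_version):
    """Return True if the MXNet version matches the GluonNLP line ('0.x' or 'master')."""
    return mxnet_major(mxnet_version) == EXPECTED_MXNET_MAJOR[gluonnlp_line]


if __name__ == "__main__":
    print(is_supported("0.x", "1.7.0"))     # the "2.0" branch holds GluonNLP 0.x commits
    print(is_supported("master", "1.7.0"))  # the combination reported as unsupported
```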
I have no idea about the 2.0 branch. We may just delete it.
@yangshuo0323 Feel free to try out the BERT pretraining code in https://github.com/dmlc/gluon-nlp/tree/master/scripts/pretraining/bert
I have tried gluon-nlp branch 0.10.0, and the same error occurred. So gluon-nlp (0.10.0) and MXNet (1.6.0 or 1.7.0) are compatible, right? I will check the rest of my software environment...
That should work. In fact, is it feasible to try out our new version with the custom version of MXNet 2.0 and the GluonNLP master branch?
Ok, I will try out the new version of MXNet and GluonNLP. Thank you so much!
@yangshuo0323 Thanks! I'd encourage you to try our new version, and we can help if you run into any problems training the model. To try the new MXNet, install it with one of the following commands:
# Install the version with CUDA 10.1
python3 -m pip install -U --pre "mxnet-cu101>=2.0.0b20210121" -f https://dist.mxnet.io/python
# Install the version with CUDA 10.2
python3 -m pip install -U --pre "mxnet-cu102>=2.0.0b20210121" -f https://dist.mxnet.io/python
# Install the version with CUDA 11
python3 -m pip install -U --pre "mxnet-cu110>=2.0.0b20210121" -f https://dist.mxnet.io/python
# Install the cpu-only version
python3 -m pip install -U --pre "mxnet>=2.0.0b20210121" -f https://dist.mxnet.io/python
Also, you can just clone gluonnlp/master and install via the following command:
python3 -m pip install -U -e ."[extras]"
This will give you the nlp_data and nlp_process CLIs. You can use nlp_data to download corpora such as Wikipedia and BookCorpus.
Also, we recommend installing Horovod via
HOROVOD_GPU_OPERATIONS=NCCL HOROVOD_WITHOUT_GLOO=1 HOROVOD_WITH_MPI=1 HOROVOD_WITH_MXNET=1 HOROVOD_WITHOUT_TENSORFLOW=1 python3 -m pip install --no-cache-dir horovod
After that, feel free to try out the example in https://github.com/dmlc/gluon-nlp/tree/master/scripts/pretraining/bert. We will try to help with any issues that you run into.
The previous error was caused by an incorrect Horovod installation, which may not have been built with the HOROVOD_WITH_MXNET environment variable set.
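One way to confirm this kind of problem before launching a job is to ask Horovod what it was built with. The sketch below assumes that importing horovod.mxnet fails when MXNet support is missing, and it relies on the mpi_built() and nccl_built() helpers that the module exposes. The function name horovod_mxnet_status is illustrative. The horovodrun --check-build command gives a similar summary from the shell.

```python
def horovod_mxnet_status():
    """Report whether horovod.mxnet imports and which backends it was built with."""
    try:
        import horovod.mxnet as hvd
    except (ImportError, OSError) as error:
        # Raised when Horovod is missing, was built without MXNet support,
        # or its extension cannot be loaded against the installed MXNet.
        return {"importable": False, "error": str(error)}
    return {
        "importable": True,
        "mpi_built": bool(hvd.mpi_built()),
        "nccl_built": bool(hvd.nccl_built()),
    }


if __name__ == "__main__":
    print(horovod_mxnet_status())
```

If the import fails or a needed backend is reported as not built, reinstalling Horovod with the HOROVOD_WITH_MXNET=1 command given earlier in the thread is the first thing to try.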
Thanks to everyone who gave me advice above.
I will be glad to try the new version as you advise.
Description
Software environment: Python 3.7.7, CUDA 10.2
Install MXNet: pip install mxnet-cu102 (version 1.7.0)
Download model script: https://github.com/dmlc/gluon-nlp (branch 2.0)
Script: gluon-nlp/scripts/bert/run_pretraining.py
Guide: https://nlp.gluon.ai/model_zoo/bert/index.html#bert-model-zoo
Seek help:
I have read the guidance, but I still don't know how to run it. Could you help me, or point me to the correct instructions or suggestions? Thanks.