Bluefog-Lib / bluefog

Distributed and decentralized training framework for PyTorch over graph
https://bluefog-lib.github.io/bluefog/
Apache License 2.0

Segmentation fault of running Win Ops on pytorch_cifar10_resnet example #11

Closed by BichengYing 4 years ago

BichengYing commented 4 years ago

BLUEFOG_WIN_ON_CPU=1 BLUEFOG_OPS_ON_CPU=1 mpirun -n 4 python examples/pytorch_cifar10_resnet.py

produces:

[Gyes:02749] Signal: Segmentation fault (11)
[Gyes:02749] Signal code: Address not mapped (1)
[Gyes:02749] Failing at address: 0x10
[Gyes:02749] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x12890)[0x7fef41977890]
[Gyes:02749] [ 1] /home/kun/.openmpi/lib/openmpi/mca_btl_smcuda.so(mca_btl_smcuda_prepare_src+0x270)[0x7feed353f910]
[Gyes:02749] [ 2] /home/kun/.openmpi/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send_request_schedule_once+0x1ac)[0x7feed2c7c7bc]
[Gyes:02749] [ 3] /home/kun/.openmpi/lib/openmpi/mca_pml_ob1.so(+0x152b8)[0x7feed2c802b8]
[Gyes:02749] [ 4] /home/kun/.openmpi/lib/openmpi/mca_btl_smcuda.so(mca_btl_smcuda_component_progress+0x1e7)[0x7feed3542667]
[Gyes:02749] [ 5] /home/kun/.openmpi/lib/libopen-pal.so.13(opal_progress+0x4a)[0x7feedc82958a]
[Gyes:02749] [ 6] /home/kun/.openmpi/lib/openmpi/mca_osc_pt2pt.so(+0x10d55)[0x7feed2640d55]
[Gyes:02749] [ 7] /home/kun/.openmpi/lib/openmpi/mca_osc_pt2pt.so(+0x1132d)[0x7feed264132d]
[Gyes:02749] [ 8] /home/kun/.openmpi/lib/libmpi.so.12(PMPI_Win_unlock+0x77)[0x7feedcb5e8d7]
[Gyes:02749] [ 9] /home/kun/projects/NBF/bluefog/torch/mpi_lib.cpython-37m-x86_64-linux-gnu.so(_ZN7bluefog6common13MPIController6WinPutERNS0_16TensorTableEntryE+0x202)[0x7feedd018d12]
[Gyes:02749] [10] /home/kun/projects/NBF/bluefog/torch/mpi_lib.cpython-37m-x86_64-linux-gnu.so(_ZN7bluefog6common11RunLoopOnceERNS0_18BluefogGlobalStateE+0x39d)[0x7feedd01f87d]
[Gyes:02749] [11] /home/kun/projects/NBF/bluefog/torch/mpi_lib.cpython-37m-x86_64-linux-gnu.so(_ZN7bluefog6common20BackgroundThreadLoopERNS0_18BluefogGlobalStateE+0x110)[0x7feedd01ff70]
[Gyes:02749] [12] /home/kun/anaconda3/envs/bluefog/lib/python3.7/site-packages/torch/../../../libstdc++.so.6(+0xc819d)[0x7fef30b5c19d]
[Gyes:02749] [13] /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db)[0x7fef4196c6db]
[Gyes:02749] [14] /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f)[0x7fef4169588f]
[Gyes:02749] End of error message

BichengYing commented 4 years ago

After switching to OpenMPI 4.0.2, this problem disappeared.
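
For anyone hitting the same crash, a quick way to see which Open MPI launcher is on the PATH before running the example is sketched below. This is not part of Bluefog: the helper names `parse_open_mpi_version` and `installed_open_mpi_version` are made up for illustration, and the parsing assumes the usual `mpirun (Open MPI) X.Y.Z` banner printed by `mpirun --version`. The 4.0.2 constant is simply the version reported to work in this thread; nothing here establishes that it is the minimum working version.

```python
import re
import shutil
import subprocess

# Version reported to work in this thread (assumption: not necessarily the minimum).
KNOWN_GOOD = (4, 0, 2)


def parse_open_mpi_version(text):
    """Extract (major, minor, patch) from `mpirun --version` output, or None."""
    match = re.search(r"\(Open MPI\)\s+(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        return None  # not Open MPI, or an unexpected banner format
    return tuple(int(part) for part in match.groups())


def installed_open_mpi_version(launcher="mpirun"):
    """Return the version of the Open MPI launcher found on PATH, or None."""
    if shutil.which(launcher) is None:
        return None  # no launcher installed or not on PATH
    result = subprocess.run(
        [launcher, "--version"], capture_output=True, text=True, check=False
    )
    return parse_open_mpi_version(result.stdout + result.stderr)


if __name__ == "__main__":
    version = installed_open_mpi_version()
    if version is None:
        print("Could not detect an Open MPI mpirun on PATH.")
    elif version < KNOWN_GOOD:
        print(f"Open MPI {version} is older than the version reported to work {KNOWN_GOOD}.")
    else:
        print(f"Open MPI {version} is at least the version reported to work.")
```

Note that this only inspects the launcher; the library that Bluefog was actually built and linked against may differ if several MPI installations are present.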