Bluefog-Lib / bluefog

Distributed and decentralized training framework for PyTorch over graph
https://bluefog-lib.github.io/bluefog/
Apache License 2.0

some error happened #108

Closed: northhj closed this issue 2 years ago

northhj commented 2 years ago

I installed bluefog-0.3.0 on a server with ten 2080 Ti GPUs. When I run the [bluefog-tutorial] notebook "Applying BlueFog on Deep Learning problem(High Level API Introduction).ipynb" after launching with the command 'ibfrun start -np 4', an error occurs when I run the 'Start decentralized trainning' cell. The error is as follows:

Failed, NCCL error bluefog/common/nccl_controller.cc:750 'invalid usage'
Failed, NCCL error bluefog/common/nccl_controller.cc:750 'invalid usage'
Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted.
2022-04-13 00:30:04.355 [KernelNanny.0] Parent 45415 exited with status None.
2022-04-13 00:30:04.356 [KernelNanny.0] Notifying Hub that our parent has shut down
mpirun noticed that process rank 3 with PID 0 on node Server8 exited on signal 11 (Segmentation fault).
Try to kill ipcontroller process but cannot retrieve its pid. Maybe it is already been stopped.
removed ipengine_config file

My environment is: Ubuntu 18.04, NCCL 2.12, Open MPI 4.0.7, BlueFog 0.3.0. Why does this error happen? Is it because the NCCL version is incompatible? Can anyone help me? Thanks very much!

BichengYing commented 2 years ago

Sorry, the error message is unclear because the call stack is nested: ipyparallel -> IPython -> BlueFog -> NCCL. The message "bluefog/common/nccl_controller.cc:750 'invalid usage'" is raised at ncclRecv, and 'invalid usage' probably means that an input argument is wrong. Unfortunately, I cannot reproduce your error. Can you provide more information? One possible cause is hardware: currently, NCCL-based communication requires that the number of processes be <= the number of GPUs. If that is the root problem, one workaround is to set the environment variable BLUEFOG_OPS_ON_CPU=1 before running ibfrun, which forces BlueFog to use MPI for communication. (This may cost a little performance because of the extra CPU-GPU copies.)
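For readers who want to try the suggested workaround, here is a minimal Python sketch of the launch logic described in this comment. The helper name build_launch and its parameters are hypothetical and not part of BlueFog; only the BLUEFOG_OPS_ON_CPU variable and the 'ibfrun start -np' command come from this thread. The sketch only builds the command and environment and does not start anything.

```python
import os


def build_launch(num_procs, num_gpus, base_env=None):
    """Return (command, env) for starting the ibfrun engines.

    Hypothetical helper: it sets BLUEFOG_OPS_ON_CPU=1 when there are more
    processes than GPUs, which is the case the comment above describes.
    """
    # Copy the environment so the caller's mapping is never modified.
    env = dict(os.environ if base_env is None else base_env)
    if num_procs > num_gpus:
        # Workaround from the thread: use MPI instead of NCCL for communication.
        env["BLUEFOG_OPS_ON_CPU"] = "1"
    command = ["ibfrun", "start", "-np", str(num_procs)]
    return command, env


# num_gpus is supplied by the caller; torch.cuda.device_count() is one way to get it.
command, env = build_launch(num_procs=4, num_gpus=2, base_env={})
print(" ".join(command), env)
```

The returned pair could then be passed to subprocess.run(command, env=env). To force the MPI path regardless of the GPU count, set the variable unconditionally instead.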

lkzs commented 2 years ago

I don't think it is a hardware problem. There are ten 2080 Ti cards in the server, and the command I used was 'ibfrun start -np 4'. Moreover, I did not pass any arguments other than the command above. I really want to solve this problem; what information do you need to solve it?

BichengYing commented 2 years ago

We have discussed this offline. Most likely the CUDA library is not installed properly, which is unrelated to BlueFog, so I will close this issue. If you need more help with this, feel free to re-open it.
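The thread does not record how the CUDA installation was checked. As a rough first pass, the sketch below uses only the Python standard library to see whether common CUDA tools and libraries are visible to the current process. The function name cuda_install_report and the particular set of checks are assumptions made for illustration, not the procedure the maintainers used.

```python
import ctypes.util
import shutil


def cuda_install_report():
    """Report whether common CUDA/NCCL components are discoverable.

    Hypothetical diagnostic: a False entry is a hint to investigate,
    not proof that the installation is broken.
    """
    return {
        # Driver utility, normally installed with the NVIDIA driver.
        "nvidia-smi on PATH": shutil.which("nvidia-smi") is not None,
        # CUDA compiler, normally installed with the CUDA toolkit.
        "nvcc on PATH": shutil.which("nvcc") is not None,
        # Shared libraries as seen by the system's library lookup.
        "libcudart found": ctypes.util.find_library("cudart") is not None,
        "libnccl found": ctypes.util.find_library("nccl") is not None,
    }


for check, ok in cuda_install_report().items():
    print(f"{check}: {'yes' if ok else 'NO'}")
```

Libraries installed in non-standard locations may be reported as missing even though a program linked against them works, so treat the result as a starting point for debugging.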