Closed: northhj closed this issue 2 years ago
Sorry, the error message is unclear because the call stack is nested: ipyparallel -> IPython -> BlueFog -> NCCL.
"bluefog/common/nccl_controller.cc:750 'invalid usage'" ==> this is raised at ncclRecv,
and 'invalid usage' probably means that an argument passed to that call is wrong. Unfortunately, I cannot reproduce your error. Can you provide more information?
One possible cause is hardware. Currently, NCCL-based communication requires that the number of processes be no greater than the number of GPUs.
If that is the root cause, one workaround is to set the environment variable BLUEFOG_OPS_ON_CPU=1
before running ibfrun, which forces BlueFog to use MPI for communication. (This may cost a little performance because of the extra CPU-GPU copies.)
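The check and workaround described above could be scripted roughly as follows. This is only a sketch: the helper names `count_gpus`, `extra_env`, and `launch` are made up for illustration and are not part of BlueFog, and it assumes `nvidia-smi` is on the PATH and that `ibfrun start -np N` is how the cluster is started, as in this thread.

```python
import os
import shutil
import subprocess


def count_gpus():
    """Count GPUs by parsing `nvidia-smi -L`, which prints one line per device."""
    if shutil.which("nvidia-smi") is None:
        return 0  # no NVIDIA driver tools found; treat as "no GPUs visible"
    result = subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True)
    if result.returncode != 0:
        return 0
    return sum(1 for line in result.stdout.splitlines() if line.startswith("GPU "))


def extra_env(num_procs, num_gpus):
    """Return the extra environment variables needed for this process count.

    Hypothetical helper: if there are more processes than GPUs, fall back to
    MPI-based communication via BLUEFOG_OPS_ON_CPU=1 (the variable named above).
    """
    if num_procs > num_gpus:
        return {"BLUEFOG_OPS_ON_CPU": "1"}
    return {}


def launch(num_procs):
    """Start the cluster with `ibfrun start -np <num_procs>` and the chosen env."""
    env = dict(os.environ)
    env.update(extra_env(num_procs, count_gpus()))
    return subprocess.run(["ibfrun", "start", "-np", str(num_procs)], env=env)
```

Calling `launch(4)` on a machine with at least four visible GPUs would start ibfrun with the environment unchanged; with fewer GPUs it would add BLUEFOG_OPS_ON_CPU=1. To force the MPI path regardless of GPU count, set the variable by hand before ibfrun instead.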
I don't think it is a hardware problem. There are ten 2080 Ti GPUs on the server, and the command I used was 'ibfrun start -np 4'. I did not pass any arguments other than that command. I really want to solve this problem. What information do you need to solve it?
We have discussed this offline. Most likely the CUDA library is not installed properly, which is not related to BlueFog, so I will close this issue. If you need more help with this, feel free to re-open it.
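For anyone who hits the same message, a rough first check is whether the dynamic linker can find the GPU libraries at all. The sketch below is an assumption about how one might look, not the diagnosis used offline: `find_gpu_libraries` is a hypothetical helper, and the library names `cuda`, `cudart`, and `nccl` are the usual short names for libcuda, libcudart, and libnccl.

```python
import ctypes.util


def find_gpu_libraries(names=("cuda", "cudart", "nccl")):
    """Map each short library name to what the system linker resolves it to.

    A value of None means the library was not found on the default search
    paths. That is a hint, not proof, of a broken install, because libraries
    in non-standard locations can be missed.
    """
    return {name: ctypes.util.find_library(name) for name in names}


def report(found):
    """Format the lookup result as one human-readable line per library."""
    return [f"{name}: {path if path else 'NOT FOUND'}" for name, path in found.items()]
```

Printing `"\n".join(report(find_gpu_libraries()))` lists each library with its resolved name or NOT FOUND. A missing entry suggests checking the CUDA and NCCL installation paths before looking at BlueFog itself.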
I installed bluefog-0.3.0 on a server with ten 2080 Ti GPUs. When I run the [bluefog-tutorial] notebook "Applying BlueFog on Deep Learning problem(High Level API Introduction).ipynb" with the command 'ibfrun start -np 4', an error occurs when I run the 'Start decentralized trainning' cell. The error is as follows:
Failed, NCCL error bluefog/common/nccl_controller.cc:750 'invalid usage'
Failed, NCCL error bluefog/common/nccl_controller.cc:750 'invalid usage'
Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted.
2022-04-13 00:30:04.355 [KernelNanny.0] Parent 45415 exited with status None.
2022-04-13 00:30:04.356 [KernelNanny.0] Notifying Hub that our parent has shut down
mpirun noticed that process rank 3 with PID 0 on node Server8 exited on signal 11 (Segmentation fault).
Try to kill ipcontroller process but cannot retrieve its pid. Maybe it is already been stopped.
removed ipengine_config file
My environment is: Ubuntu-18.04, Nccl-2.12, Openmpi-4.0.7, Bluefog-0.3.0. Why does this error happen? Is it because the NCCL version is incompatible? Can anyone help me? Thanks very much!