Hi @Fhrozen, thanks for trying ChainerMN. First, which ChainerMN version are you using (1.3 or master)?
About the NCCL issue: NCCL is essentially a black-box library, and the error message "unhandled system error" does not contain much information. I will discuss it with my colleagues anyway.
On the second issue, please try the Open MPI 2.x series (2.1.3 would be a good choice). Open MPI 3.x has a bug in GPU-Direct communication (issue 3792 on the open-mpi/ompi repository, as noted in our issue #221).
From the error message on the first line, it seems you are using the UCX BTL component. The Open MPI bug is in the openib component, so I'm not sure downgrading will really fix the issue, but it is worth trying.
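If you do downgrade, one quick way to confirm which MPI library your Python processes actually link against is to query it through mpi4py (which ChainerMN already depends on). This is only a sanity-check sketch, not part of ChainerMN itself:

```python
# Print the MPI library string seen by this Python process.
# After installing Open MPI 2.1.3 it should report an "Open MPI v2.1.x" line.
from mpi4py import MPI

print(MPI.Get_library_version())
```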
Thanks, Keisuke
Also, could you try again with the environment variable NCCL_DEBUG=INFO? It is not very useful in many cases, but it is better than nothing here.
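As a reference, here is a minimal sketch of one way to make sure the variable reaches every rank; setting it in the script before the communicator is created is just one option, and exporting it through the launcher works as well (the exact placement below is an assumption, not an official recipe):

```python
# Sketch: set NCCL_DEBUG before ChainerMN initializes NCCL, so each process
# prints the NCCL initialization log.
import os

os.environ.setdefault("NCCL_DEBUG", "INFO")  # alternatively: mpiexec -x NCCL_DEBUG=INFO ...

import chainermn

comm = chainermn.create_communicator("pure_nccl")  # NCCL is set up around here
```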
Thanks. Keisuke
@keisukefukuda Thank you for your support. I am using the current version from pip (1.3). I am now running some tests with NCCL and MPI to check the full functionality of the latter, and I am also running some tests with Open MPI 2.x. I will update you as soon as I have any new information.
Hello there:
I am trying to run some examples using Chainer on a multi-node, multi-GPU server and have encountered some problems. To test ChainerMN on the server, I am using example/mnist/train_mnist.py. My setup is:
I already tried every step of the step-by-step troubleshooting guide (https://chainermn.readthedocs.io/en/latest/installation/troubleshooting.html) and found no errors. To test MPI and NCCL, I also built the nccl-tests (https://github.com/NVIDIA/nccl-tests), and they executed normally. I have no problem with mpiexec -n N python train_mnist.py (CPU only), but when I use GPUs I see two behaviors. If I use a single node with multiple GPUs, the output of the program is the following:
The error is also produced when I use a single node and a single GPU:
mpiexec -n 1 python train_mnist.py -g
And when I use multiple nodes (with a single GPU or multiple GPUs each), the program freezes after displaying:

I tried with the pure_nccl and hierarchical communicators. Is there any additional configuration needed to run Chainer in this setup? I already tried a single node with a single GPU (without ChainerMN) and the training finishes correctly. However, when I tried a single node with multiple GPUs (without ChainerMN), the program also freezes and is cancelled due to a timeout.
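For reference, the communicator and GPU setup follows the usual pattern from the ChainerMN MNIST example, roughly along these lines (a sketch only; the training loop itself is omitted):

```python
# Rough sketch of the communicator/GPU setup, following the pattern in the
# ChainerMN MNIST example.
import chainer
import chainermn

comm = chainermn.create_communicator("pure_nccl")  # also tried "hierarchical"
device = comm.intra_rank                           # one GPU per process on each node
chainer.cuda.get_device_from_id(device).use()
```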