The same happens on chainer 3.4.0 / cupy 2.4.0.
I guess it's an NCCL problem... NCCL is a closed-source product, and there's nothing we can do...
I have tested some simple MPI and NCCL C code (no Python or ChainerMN code) on the same cluster and got the same issue, so yes, it should be an NCCL problem.
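For reference, a Python equivalent of that kind of standalone test (just a sketch using mpi4py and cupy.cuda.nccl; the actual test was plain C) looks roughly like this:

```python
# Minimal sketch (assumes mpi4py, cupy built with NCCL support, one process per GPU).
# Run with e.g.: mpiexec -n <nproc> python nccl_init_test.py
from mpi4py import MPI
import cupy
from cupy.cuda import nccl

mpi_comm = MPI.COMM_WORLD
rank = mpi_comm.Get_rank()
size = mpi_comm.Get_size()

# Bind each rank to a GPU (assumes ranks are packed per node).
cupy.cuda.Device(rank % cupy.cuda.runtime.getDeviceCount()).use()

# Rank 0 creates the NCCL unique id; every rank joins one world-sized communicator.
nccl_comm_id = mpi_comm.bcast(nccl.get_unique_id() if rank == 0 else None)
nccl_comm = nccl.NcclCommunicator(size, nccl_comm_id, rank)

# A tiny allreduce to confirm the communicator actually works.
x = cupy.ones(1, dtype=cupy.float32)
y = cupy.zeros_like(x)
nccl_comm.allReduce(x.data.ptr, y.data.ptr, x.size,
                    nccl.NCCL_FLOAT, nccl.NCCL_SUM,
                    cupy.cuda.Stream.null.ptr)
if rank == 0:
    print('allreduce OK on', size, 'processes')
```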
Please try increasing the following environment settings:
As it's an NCCL issue (and we had some offline conversation about it), I'm closing the issue. Please let me know if you still have the problem.
It looks like ChainerMN is initializing nccl_comm globally even when using hierarchical_communicator:
```python
nccl_comm_id = mpi_comm.bcast(nccl.get_unique_id())
nccl_comm = nccl.NcclCommunicator(
    mpi_comm.size, nccl_comm_id, mpi_comm.rank)
```

in https://github.com/chainer/chainermn/blob/master/chainermn/communicators/_communication_utility.py
On the ChainerMN side it seems this should be avoided, as in #224.
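One hypothetical way to express that on the ChainerMN side (a sketch only, not the actual change proposed in #224) is to make the world-wide NCCL communicator lazy, so communicators that never use it never create it:

```python
# Hypothetical sketch (not the actual patch in #224): create the world-wide
# NCCL communicator only on first use instead of unconditionally at init time.
from cupy.cuda import nccl


def init_world_nccl_comm(mpi_comm):
    # Same pattern as in _communication_utility.py: broadcast the unique id
    # over MPI, then build one NCCL communicator spanning every rank.
    nccl_comm_id = mpi_comm.bcast(nccl.get_unique_id())
    return nccl.NcclCommunicator(mpi_comm.size, nccl_comm_id, mpi_comm.rank)


class LazyWorldNcclComm(object):
    """Defers world-wide NCCL init until a communicator actually asks for it."""

    def __init__(self, mpi_comm):
        self.mpi_comm = mpi_comm
        self._nccl_comm = None

    def get(self):
        # Collective call: all ranks must reach the first get() together.
        if self._nccl_comm is None:
            self._nccl_comm = init_world_nccl_comm(self.mpi_comm)
        return self._nccl_comm
```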
ChainerMN works fine with a small number of MPI processes, but once this number goes over 232, _init_comms() fails with an error.
Here's code to reproduce:
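Roughly, a minimal sketch of such a reproducer (the exact script may differ; this assumes one process per GPU and the hierarchical communicator):

```python
# Rough reproducer sketch: the failure appears once the number of MPI
# processes exceeds 232. Run with e.g.: mpiexec -n 256 python repro.py
import chainer.functions as F
import chainer.links as L
import chainermn
import cupy

comm = chainermn.create_communicator('hierarchical')
cupy.cuda.Device(comm.intra_rank).use()

# A tiny model; allreduce_grad is what ends up triggering _init_comms().
model = L.Linear(3, 2)
model.to_gpu()
model.cleargrads()
loss = F.sum(model(cupy.ones((1, 3), dtype=cupy.float32)))
loss.backward()
comm.allreduce_grad(model)

if comm.rank == 0:
    print('allreduce_grad succeeded on', comm.size, 'processes')
```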
the traceback is like this:
Finally, this fault happens on what seems to be a random subset of nodes, i.e. it does not look like faulty GPUs, though several failing processes are "clustered" per node.
Here is an example of the failed processes' placement:
NCCL 2.1.4
CUDA 9.1 / driver 390.30
chainer / cupy 4.0.0b4
CentOS Linux release 7.4.1708 / kernel 3.10.0-693.17.1.el7.x86_64