chainer / chainermn

ChainerMN: Scalable distributed deep learning with Chainer
https://chainer.org
MIT License

chainermn fails on >232 threads with NCCL_ERROR_SYSTEM_ERROR #218

Closed: undertherain closed this issue 6 years ago

undertherain commented 6 years ago

chainermn works fine on a small number of MPI threads, but once this number goes over 232, _init_comms() fails with an error.

Here's code to reproduce:

import chainermn
from chainermn import nccl
from mpi4py import MPI


def main():
    # Hierarchical communicator: NCCL inside each node, MPI across nodes.
    comm = chainermn.create_communicator("hierarchical")
    device = comm.intra_rank  # GPU index on this node
    host = MPI.Get_processor_name()
    print(f"hi from r {comm.rank} of {comm.size} [intra {comm.intra_rank} of {comm.intra_size}] on {host}")
    try:
        # Force NCCL communicator setup; this is what fails beyond 232 ranks.
        comm._init_comms()
    except Exception:
        print(f"FAILED r {comm.rank} of {comm.size} [intra {comm.intra_rank} of {comm.intra_size}] on {host}")
    comm.mpi_comm.Barrier()

    if comm.rank == 0:
        print("Done!")


if __name__ == "__main__":
    main()
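(The exact launch command is not shown in the thread; something like mpiexec -n 240 python ./check.py over 30 nodes with 8 GPUs each would match the placement output below.)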

The traceback looks like this:

Traceback (most recent call last):
  File "./check.py", line 17, in <module>
    main()
  File "./check.py", line 8, in main
    comm._init_comms()
  File "/home/users/alex/opt/lib/python3.6/site-packages/chainermn/communicators/_base.py", line 239, in _init_comms
    use_nccl=self.use_nccl)
  File "/home/users/alex/opt/lib/python3.6/site-packages/chainermn/communicators/_communication_utility.py", line 70, in init_comms
    mpi_comm.size, nccl_comm_id, mpi_comm.rank)
  File "cupy/cuda/nccl.pyx", line 127, in cupy.cuda.nccl.NcclCommunicator.__init__
  File "cupy/cuda/nccl.pyx", line 99, in cupy.cuda.nccl.check_status
cupy.cuda.nccl.NcclError: NCCL_ERROR_SYSTEM_ERROR: unhandled system error

Finally, this fault happens on what seems to be a random subset of nodes, i.e. it does not look like faulty GPUs, though several threads are "clustered" per node.

This is an example of the failed threads' placement:

FAILED r 234 of 240 [intra 2 of 8] on node031
FAILED r 236 of 240 [intra 4 of 8] on node031
FAILED r 151 of 240 [intra 7 of 8] on node020
FAILED r 238 of 240 [intra 6 of 8] on node031
FAILED r 146 of 240 [intra 2 of 8] on node020
FAILED r 232 of 240 [intra 0 of 8] on node031
FAILED r 150 of 240 [intra 6 of 8] on node020
FAILED r 144 of 240 [intra 0 of 8] on node020

NCCL 2.1.4, CUDA 9.1 / driver 390.30, chainer/cupy 4.0.0b4, CentOS Linux release 7.4.1708 / kernel 3.10.0-693.17.1.el7.x86_64

undertherain commented 6 years ago

The same happens with chainer 3.4.0 / cupy 2.4.0.

keisukefukuda commented 6 years ago

I guess it's an NCCL problem... NCCL is a closed-source product, so there's nothing we can do...

tyohei commented 6 years ago

I have tested some simple MPI and NCCL C code (no Python or ChainerMN code) on the same cluster and got the same issue, so yes, it should be an NCCL problem.
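For reference, the failing initialization boils down to one NCCL communicator spanning every MPI rank in the job. Here is a minimal standalone sketch of that step using mpi4py and cupy.cuda.nccl directly (the file layout and the Split_type call are my additions, not code from the thread), assuming one GPU per node-local rank:

# global_nccl_init.py -- sketch of the global NCCL init that fails at scale
import cupy
from cupy.cuda import nccl
from mpi4py import MPI

mpi_comm = MPI.COMM_WORLD

# Bind each rank to one GPU on its node, using the node-local rank.
intra_comm = mpi_comm.Split_type(MPI.COMM_TYPE_SHARED)
cupy.cuda.Device(intra_comm.rank).use()

# Rank 0 generates the NCCL unique id and broadcasts it to all ranks.
nccl_comm_id = mpi_comm.bcast(
    nccl.get_unique_id() if mpi_comm.rank == 0 else None)

# One NCCL communicator over all ranks; this is the call that raises
# NCCL_ERROR_SYSTEM_ERROR once the job grows past ~232 ranks.
nccl_comm = nccl.NcclCommunicator(
    mpi_comm.size, nccl_comm_id, mpi_comm.rank)

mpi_comm.Barrier()
if mpi_comm.rank == 0:
    print("global NCCL communicator initialized")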

keisukefukuda commented 6 years ago

Please try increasing the following environment settings:

keisukefukuda commented 6 years ago

As it's an NCCL issue (and we had some offline conversation about it), I'm closing the issue. Please let me know if you still have the problem.

undertherain commented 6 years ago

It looks like chainermn is initializing nccl_comm globally even when using the hierarchical communicator:

nccl_comm_id = mpi_comm.bcast(nccl.get_unique_id())
nccl_comm = nccl.NcclCommunicator(
    mpi_comm.size, nccl_comm_id, mpi_comm.rank)

in https://github.com/chainer/chainermn/blob/master/chainermn/communicators/_communication_utility.py
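For comparison, initializing NCCL only within each node would keep every NCCL communicator at intra-node size (8 ranks here), no matter how large the whole job is. A rough sketch of that approach, not the actual change in #224, again assuming one GPU per node-local rank:

import cupy
from cupy.cuda import nccl
from mpi4py import MPI

mpi_comm = MPI.COMM_WORLD

# Sub-communicator of the ranks that share a node.
intra_comm = mpi_comm.Split_type(MPI.COMM_TYPE_SHARED)
cupy.cuda.Device(intra_comm.rank).use()

# The unique id is broadcast only inside the node, by the node-local rank 0.
intra_nccl_id = intra_comm.bcast(
    nccl.get_unique_id() if intra_comm.rank == 0 else None)

# NCCL communicator covering only this node's GPUs; the global rank count
# never enters NCCL initialization.
intra_nccl_comm = nccl.NcclCommunicator(
    intra_comm.size, intra_nccl_id, intra_comm.rank)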

undertherain commented 6 years ago

On the chainermn side it seems this should be avoided, as in #224.