chainer / chainermn

ChainerMN: Scalable distributed deep learning with Chainer
https://chainer.org
MIT License

NCCL_ERROR_SYSTEM_ERROR: unhandled system error #285

Closed Fhrozen closed 5 years ago

Fhrozen commented 6 years ago

Hello there:

I am trying to run some examples using Chainer on a multi-node, multi-GPU server and have run into some problems. To test ChainerMN on the server, I am using examples/mnist/train_mnist.py. My setup is:

Server: MARCC (https://www.marcc.jhu.edu)
OS: CentOS 6.9
python: 2.7 (miniconda)
chainer: 4.3
cupy: 4.3
mpi: Open MPI 3.1.1 (CUDA-aware)
nccl: v2.2
glibc: 2.17
gcc: 5.4

I have already tried every step in the step-by-step troubleshooting guide (https://chainermn.readthedocs.io/en/latest/installation/troubleshooting.html) and found no errors. To check NCCL, I also built the nccl-tests (https://github.com/NVIDIA/nccl-tests) to exercise MPI and NCCL together, and they ran normally. I have no problem with mpiexec -n N python train_mnist.py (CPU only), but when I use GPUs I see two behaviors. If I use a single node with multiple GPUs, the output of the program is the following:

[gpu016:112991] pml_ucx.c:226 Error: UCP worker does not support MPI_THREAD_MULTIPLE
[gpu016:112990] pml_ucx.c:226 Error: UCP worker does not support MPI_THREAD_MULTIPLE
0
==========================================
Num process (COMM_WORLD): 2
Using GPUs
Using pure_nccl communicator
Num unit: 1000
Num Minibatch-size: 100
Num epoch: 20
==========================================
1
epoch       main/loss   validation/main/loss  main/accuracy  validation/main/accuracy  elapsed_time
Exception in main training loop: NCCL_ERROR_SYSTEM_ERROR: unhandled system error
Traceback (most recent call last):
Exception in main training loop: NCCL_ERROR_SYSTEM_ERROR: unhandled system error
Traceback (most recent call last):
  File "/home-3/nyalta1@jhu.edu/miniconda2/lib/python2.7/site-packages/chainer/training/trainer.py", line 306, in run
  File "/home-3/nyalta1@jhu.edu/miniconda2/lib/python2.7/site-packages/chainer/training/trainer.py", line 306, in run
    update()
    update()
  File "/home-3/nyalta1@jhu.edu/miniconda2/lib/python2.7/site-packages/chainer/training/updaters/standard_updater.py", line 149, in update
  File "/home-3/nyalta1@jhu.edu/miniconda2/lib/python2.7/site-packages/chainer/training/updaters/standard_updater.py", line 149, in update
    self.update_core()
    self.update_core()
  File "/home-3/nyalta1@jhu.edu/miniconda2/lib/python2.7/site-packages/chainer/training/updaters/standard_updater.py", line 160, in update_core
  File "/home-3/nyalta1@jhu.edu/miniconda2/lib/python2.7/site-packages/chainer/training/updaters/standard_updater.py", line 160, in update_core
    optimizer.update(loss_func, *in_arrays)
    optimizer.update(loss_func, *in_arrays)
  File "/home-3/nyalta1@jhu.edu/miniconda2/lib/python2.7/site-packages/chainermn/optimizers.py", line 30, in update
  File "/home-3/nyalta1@jhu.edu/miniconda2/lib/python2.7/site-packages/chainermn/optimizers.py", line 30, in update
    self.communicator.allreduce_grad(target)
    self.communicator.allreduce_grad(target)
  File "/home-3/nyalta1@jhu.edu/miniconda2/lib/python2.7/site-packages/chainermn/communicators/pure_nccl_communicator.py", line 49, in allreduce_grad
  File "/home-3/nyalta1@jhu.edu/miniconda2/lib/python2.7/site-packages/chainermn/communicators/pure_nccl_communicator.py", line 49, in allreduce_grad
    self._allreduce_grad_async(model, stream)
    self._allreduce_grad_async(model, stream)
  File "/home-3/nyalta1@jhu.edu/miniconda2/lib/python2.7/site-packages/chainermn/communicators/pure_nccl_communicator.py", line 52, in _allreduce_grad_async
  File "/home-3/nyalta1@jhu.edu/miniconda2/lib/python2.7/site-packages/chainermn/communicators/pure_nccl_communicator.py", line 52, in _allreduce_grad_async
    self._init_comms()
    self._init_comms()
  File "/home-3/nyalta1@jhu.edu/miniconda2/lib/python2.7/site-packages/chainermn/communicators/pure_nccl_communicator.py", line 45, in _init_comms
  File "/home-3/nyalta1@jhu.edu/miniconda2/lib/python2.7/site-packages/chainermn/communicators/pure_nccl_communicator.py", line 45, in _init_comms
    self.nccl_comm = _communication_utility.init_nccl_comm(self.mpi_comm)
    self.nccl_comm = _communication_utility.init_nccl_comm(self.mpi_comm)
  File "/home-3/nyalta1@jhu.edu/miniconda2/lib/python2.7/site-packages/chainermn/communicators/_communication_utility.py", line 74, in init_nccl_comm
  File "/home-3/nyalta1@jhu.edu/miniconda2/lib/python2.7/site-packages/chainermn/communicators/_communication_utility.py", line 74, in init_nccl_comm
    return nccl.NcclCommunicator(mpi_comm.size, nccl_comm_id, mpi_comm.rank)
  File "cupy/cuda/nccl.pyx", line 127, in cupy.cuda.nccl.NcclCommunicator.__init__
    return nccl.NcclCommunicator(mpi_comm.size, nccl_comm_id, mpi_comm.rank)
  File "cupy/cuda/nccl.pyx", line 127, in cupy.cuda.nccl.NcclCommunicator.__init__
  File "cupy/cuda/nccl.pyx", line 99, in cupy.cuda.nccl.check_status
  File "cupy/cuda/nccl.pyx", line 99, in cupy.cuda.nccl.check_status
Will finalize trainer extensions and updater before reraising the exception.
Will finalize trainer extensions and updater before reraising the exception.
Traceback (most recent call last):
  File "train_mnist.py", line 123, in <module>
Traceback (most recent call last):
  File "train_mnist.py", line 123, in <module>
    main()
  File "train_mnist.py", line 119, in main
    trainer.run()
    main()
  File "train_mnist.py", line 119, in main
    trainer.run()
  File "/home-3/nyalta1@jhu.edu/miniconda2/lib/python2.7/site-packages/chainer/training/trainer.py", line 320, in run
  File "/home-3/nyalta1@jhu.edu/miniconda2/lib/python2.7/site-packages/chainer/training/trainer.py", line 320, in run
    six.reraise(*sys.exc_info())
    six.reraise(*sys.exc_info())
  File "/home-3/nyalta1@jhu.edu/miniconda2/lib/python2.7/site-packages/chainer/training/trainer.py", line 306, in run
  File "/home-3/nyalta1@jhu.edu/miniconda2/lib/python2.7/site-packages/chainer/training/trainer.py", line 306, in run
    update()
    update()
  File "/home-3/nyalta1@jhu.edu/miniconda2/lib/python2.7/site-packages/chainer/training/updaters/standard_updater.py", line 149, in update
  File "/home-3/nyalta1@jhu.edu/miniconda2/lib/python2.7/site-packages/chainer/training/updaters/standard_updater.py", line 149, in update
    self.update_core()
    self.update_core()
  File "/home-3/nyalta1@jhu.edu/miniconda2/lib/python2.7/site-packages/chainer/training/updaters/standard_updater.py", line 160, in update_core
  File "/home-3/nyalta1@jhu.edu/miniconda2/lib/python2.7/site-packages/chainer/training/updaters/standard_updater.py", line 160, in update_core
    optimizer.update(loss_func, *in_arrays)
    optimizer.update(loss_func, *in_arrays)
  File "/home-3/nyalta1@jhu.edu/miniconda2/lib/python2.7/site-packages/chainermn/optimizers.py", line 30, in update
  File "/home-3/nyalta1@jhu.edu/miniconda2/lib/python2.7/site-packages/chainermn/optimizers.py", line 30, in update
    self.communicator.allreduce_grad(target)
    self.communicator.allreduce_grad(target)
  File "/home-3/nyalta1@jhu.edu/miniconda2/lib/python2.7/site-packages/chainermn/communicators/pure_nccl_communicator.py", line 49, in allreduce_grad
  File "/home-3/nyalta1@jhu.edu/miniconda2/lib/python2.7/site-packages/chainermn/communicators/pure_nccl_communicator.py", line 49, in allreduce_grad
    self._allreduce_grad_async(model, stream)
    self._allreduce_grad_async(model, stream)
  File "/home-3/nyalta1@jhu.edu/miniconda2/lib/python2.7/site-packages/chainermn/communicators/pure_nccl_communicator.py", line 52, in _allreduce_grad_async
  File "/home-3/nyalta1@jhu.edu/miniconda2/lib/python2.7/site-packages/chainermn/communicators/pure_nccl_communicator.py", line 52, in _allreduce_grad_async
    self._init_comms()
    self._init_comms()
  File "/home-3/nyalta1@jhu.edu/miniconda2/lib/python2.7/site-packages/chainermn/communicators/pure_nccl_communicator.py", line 45, in _init_comms
  File "/home-3/nyalta1@jhu.edu/miniconda2/lib/python2.7/site-packages/chainermn/communicators/pure_nccl_communicator.py", line 45, in _init_comms
    self.nccl_comm = _communication_utility.init_nccl_comm(self.mpi_comm)
    self.nccl_comm = _communication_utility.init_nccl_comm(self.mpi_comm)
  File "/home-3/nyalta1@jhu.edu/miniconda2/lib/python2.7/site-packages/chainermn/communicators/_communication_utility.py", line 74, in init_nccl_comm
  File "/home-3/nyalta1@jhu.edu/miniconda2/lib/python2.7/site-packages/chainermn/communicators/_communication_utility.py", line 74, in init_nccl_comm
    return nccl.NcclCommunicator(mpi_comm.size, nccl_comm_id, mpi_comm.rank)
    return nccl.NcclCommunicator(mpi_comm.size, nccl_comm_id, mpi_comm.rank)
  File "cupy/cuda/nccl.pyx", line 127, in cupy.cuda.nccl.NcclCommunicator.__init__
  File "cupy/cuda/nccl.pyx", line 127, in cupy.cuda.nccl.NcclCommunicator.__init__
  File "cupy/cuda/nccl.pyx", line 99, in cupy.cuda.nccl.check_status
  File "cupy/cuda/nccl.pyx", line 99, in cupy.cuda.nccl.check_status
cupy.cuda.nccl.NcclError: NCCL_ERROR_SYSTEM_ERROR: unhandled system error
cupy.cuda.nccl.NcclError: NCCL_ERROR_SYSTEM_ERROR: unhandled system error
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[63211,1],1]
  Exit code:    1
--------------------------------------------------------------------------
Finished with job 28700116
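
For reference, my ChainerMN setup in the script follows the usual pattern from the documentation (this is a sketch from memory, not the exact code of train_mnist.py); as the traceback shows, the NCCL communicator is only created lazily inside the first allreduce_grad, and that is where it fails:

import chainer
import chainermn

# Rough sketch of the example's ChainerMN setup (from the docs, not verbatim).
comm = chainermn.create_communicator('pure_nccl')   # also tried 'hierarchical'
device = comm.intra_rank                            # one GPU per process on a node
chainer.cuda.get_device_from_id(device).use()

optimizer = chainermn.create_multi_node_optimizer(
    chainer.optimizers.Adam(), comm)
# The NCCL communicator itself is initialized on the first
# optimizer.update() -> allreduce_grad(), which is where the traceback ends.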

The error is also produced when I use a single node with a single GPU (mpiexec -n 1 python train_mnist.py -g). And when I use multiple nodes (with one or several GPUs each), the program freezes after displaying:

==========================================
Num process (COMM_WORLD): 2
Using GPUs
Using pure_nccl communicator
Num unit: 1000
Num Minibatch-size: 100
Num epoch: 20
==========================================

I tried with both the pure_nccl and the hierarchical communicator. Is there any additional configuration needed to run Chainer in this setup? I already tried a single node with a single GPU (without ChainerMN) and the training finishes correctly. However, when I try a single node with multiple GPUs (without ChainerMN), the program also freezes and is canceled due to timeout.
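
In case it helps to isolate the problem, I can also try a minimal check outside ChainerMN that mirrors what chainermn's init_nccl_comm does (broadcast an NCCL unique id over MPI, then construct the NCCL communicator), using only cupy and mpi4py; the device-selection line is my own assumption and not taken from the example:

import cupy
from cupy.cuda import nccl
from mpi4py import MPI

mpi_comm = MPI.COMM_WORLD
rank, size = mpi_comm.rank, mpi_comm.size

# Assumption: one GPU per local rank; adjust if ranks per node != GPUs per node.
cupy.cuda.Device(rank % cupy.cuda.runtime.getDeviceCount()).use()

# Rank 0 creates the unique id and broadcasts it, as ChainerMN does internally.
nccl_comm_id = mpi_comm.bcast(nccl.get_unique_id() if rank == 0 else None)
comm = nccl.NcclCommunicator(size, nccl_comm_id, rank)  # the call that fails above

# Tiny in-place allreduce to confirm the communicator actually works.
x = cupy.ones(4, dtype=cupy.float32)
comm.allReduce(x.data.ptr, x.data.ptr, x.size,
               nccl.NCCL_FLOAT, nccl.NCCL_SUM,
               cupy.cuda.Stream.null.ptr)
print(rank, x)

Running it with mpiexec -n 2 should show whether the failure is in NCCL initialization itself or in something ChainerMN-specific.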

keisukefukuda commented 6 years ago

Hi @Fhrozen, thanks for trying ChainerMN. First, which ChainerMN version are you using? (1.3 or master?)

About the NCCL issue: NCCL is essentially a black-box library, and the error message "unhandled system error" does not contain much information. I will discuss it with my colleagues anyway.

On the second issue, please try the Open MPI 2.x series (2.1.3 would be a good choice). Open MPI 3.x has a bug in GPU-Direct communication (issue 3792 on the open-mpi/ompi repository, as noted in our issue #221). From the error messages on the first lines, it seems you are using the UCX (pml_ucx) component. The Open MPI bug is in the openib component, so I'm not sure this will really fix the issue, but it is worth trying.

Thanks, Keisuke

keisukefukuda commented 6 years ago

Also, will you try again with the environment variable NCCL_DEBUG=INFO? It is not very informative in many cases, but it is better than nothing here.
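
You can export it through the MPI launcher (with Open MPI, mpiexec -x NCCL_DEBUG=INFO ...), or set it at the very top of the script; a small sketch, assuming that setting it before the first allreduce is early enough because ChainerMN initializes NCCL lazily there:

import os

# Raise NCCL's log level before any NCCL call is made (assumption: NCCL reads
# the variable when the communicator is first initialized).
os.environ.setdefault('NCCL_DEBUG', 'INFO')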

Thanks. Keisuke

Fhrozen commented 6 years ago

@keisukefukuda Thank you for your support. I am using the current version from pip, 1.3. I am now running some tests with NCCL and MPI to check that the latter works fully. I am also running some tests with Open MPI 2.x. I will update you as soon as I get any new information.