chainer / chainermn

ChainerMN: Scalable distributed deep learning with Chainer
https://chainer.org
MIT License

Multi-GPU training hangs #217

Closed andremoeller closed 6 years ago

andremoeller commented 6 years ago

Hi,

I'm trying to run train_mnist.py with multiple GPUs, but training hangs indefinitely at this point:

mpirun -np 4 python train_mnist.py

Num process (COMM_WORLD): 4
Using GPUs
Using hierarchical communicator
Num unit: 1000
Num Minibatch-size: 100
Num epoch: 20
epoch       main/loss   validation/main/loss  main/accuracy  validation/main/accuracy  elapsed_time

I'm using CUDA 9, NCCL 2, CUDA-aware Open MPI 2.1.2, and these packages:

cupy-cuda90==4.0.0b4
chainer==4.0.0b4
chainercv==0.8.0
chainermn==1.2.0

strace on the mpirun process shows that it's just polling:

write(1, "epoch main/loss validati"..., 100epoch main/loss validation/main/loss main/accuracy validation/main/accuracy elapsed_time ) = 100
clock_gettime(CLOCK_MONOTONIC, {340982, 485569071}) = 0
gettimeofday({1520035802, 195603}, NULL) = 0
poll([{fd=5, events=POLLIN}, {fd=4, events=POLLIN}, {fd=7, events=POLLIN}, {fd=25, events=POLLIN}, {fd=35, events=POLLIN}, {fd=30, events=POLLIN}, {fd=32, events=POLLIN}, {fd=27, events=POLLIN}, {fd=34, events=POLLIN}, {fd=36, events=POLLIN}, {fd=0, events=POLLIN}, {fd=31, events=POLLIN}, {fd=26, events=POLLIN}], 13, -1

Any clues as to what's going wrong, or how I can find out more about what might be causing it?

Thanks.

undertherain commented 6 years ago

I'm having the same problem when running on multiple nodes. Multiple GPUs on one node work well, but on two nodes it freezes in the training loop. On one system I'm switching between different MPI implementations and only one works (I made sure to reinstall all related Python packages without the pip cache after switching to a new MPI implementation). On another system, after a software update, I can't find any MPI that works ><

So far I've figured out that it freezes in self.communicator.broadcast_data(target) of _MultiNodeOptimizer after processing the first batch.

undertherain commented 6 years ago

More precisely, in broadcast_naive() in _communication_utility.py, while trying to do mpi_comm.Bcast(buf), buf being a tuple of a cffi backend buffer and an MPI type :-\ https://github.com/chainer/chainermn/blob/master/chainermn/communicators/_communication_utility.py#L81
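
For reference, the pattern at that line looks roughly like the sketch below (my approximation, not the actual ChainerMN source; the function and argument names are illustrative): each parameter buffer is passed to Bcast as a (buffer, MPI datatype) pair, and the hang occurs inside that call.

# Rough sketch of the broadcast_naive() pattern (illustrative only)
from mpi4py import MPI

def broadcast_params_naive(mpi_comm, arrays, root=0):
    for array in arrays:                   # one Bcast per parameter array
        buf = [array, MPI.FLOAT]           # (buffer, MPI type) pair, as described above
        mpi_comm.Bcast(buf, root=root)     # the reported hang happens inside this call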

undertherain commented 6 years ago

And my Open MPI says it has CUDA support:

ompi_info --parsable --all | grep mpi_built_with_cuda_support:value
mca:mpi:base:param:mpi_built_with_cuda_support:value:true

I'm currently using Open MPI 3.0.0.

check_cuda_aware.c returns an OK status as well.

keisukefukuda commented 6 years ago

Thanks for the reports.

First, Alex, could you check whether your issue is https://github.com/open-mpi/ompi/issues/3972? That issue has been open for more than half a year, and I have just started investigating it myself. For now, we recommend using Open MPI 2.1.2.

@andremoeller, since you are using 2.1.2, that's weird. Which version of ChainerMN are you using? The 1.2 release, or master?

undertherain commented 6 years ago

Keisuke, it does indeed look very much like that. I made a small example of Bcast from GPU memory through mpi4py and cffi, and it freezes once the message size goes over around 1 KB. I'll check their sample a bit later to rule out mpi4py's influence, but it's 99% that Open MPI issue.
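
A minimal repro along these lines might look like the following (my own sketch, not the commenter's actual script; it assumes CuPy plus an mpi4py build that accepts CUDA arrays directly via __cuda_array_interface__, whereas the commenter wrapped the raw GPU pointer through cffi):

# bcast_repro.py: broadcast a GPU buffer from rank 0 through CUDA-aware MPI.
# On the affected Open MPI builds this reportedly hangs once the message
# exceeds roughly 1 KB.
from mpi4py import MPI
import cupy as cp

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

n = 1024  # 4 KB of float32, well above the ~1 KB threshold mentioned above
buf = cp.arange(n, dtype=cp.float32) if rank == 0 else cp.empty(n, dtype=cp.float32)
cp.cuda.Stream.null.synchronize()  # make sure the buffer is ready before MPI touches it

comm.Bcast([buf, MPI.FLOAT], root=0)  # hangs here when the CUDA-aware Bcast is broken
print('rank', rank, 'first element:', float(buf[0]))

Running it with something like "mpirun -np 2 python bcast_repro.py" across two nodes would show whether the hang is independent of ChainerMN.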

Now, about the version: we have 2.1.2 on Tsubame 3 and it was working fine, but it turned out not to support multi-threading, which I need for some I/O work. So I compiled the same version of Open MPI in user space, and I have the same problem with it.

andremoeller commented 6 years ago

Hi Keisuke, I'm using

cupy-cuda90==4.0.0b4
chainer==4.0.0b4
chainercv==0.8.0
chainermn==1.2.0

Thanks.

keisukefukuda commented 6 years ago

@andremoeller, oh, sorry, I missed it in your first comment. Thanks for the info.

keisukefukuda commented 6 years ago

@undertherain I understand that your own user-space build of Open MPI 2.1.2 hangs, even though the system-provided 2.1.2 on Tsubame 3 worked fine.

Is that correct?

Then, hmmm. 🤔 I use 2.1.2 daily on our cluster with InfiniBand and we see no problem. Can you make sure your program hangs on Allreduce? Does it reproduce with a very simple test program?

keisukefukuda commented 6 years ago

@andremoeller,

What interconnect do you use? I guess it's InfiniBand, since you use NCCL. If so, will you try pure_nccl_communicator? It should solve the problem if MPI_Allreduce is the culprit.
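
For example (a sketch, assuming a ChainerMN version that ships the pure_nccl communicator, which needs NCCL 2; train_mnist.py prints "Using hierarchical communicator", so it presumably also has a command-line option to switch communicators):

# Use the NCCL-only communicator so the gradient allreduce runs over NCCL
# instead of calling MPI_Allreduce on GPU buffers; MPI is then used mostly
# for setup and CPU-side communication.
import chainermn

comm = chainermn.create_communicator('pure_nccl')
device = comm.intra_rank  # choose the GPU by the rank within the node
# optimizer = chainermn.create_multi_node_optimizer(base_optimizer, comm)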

keisukefukuda commented 6 years ago

I'm closing the issue, but don't hesitate to re-open it if you still have a problem. Thanks.

ankahira commented 5 years ago

I am having the same issue. It works fine on a single node but hangs on two (or more) nodes on ABCI.

keisukefukuda commented 5 years ago

Hi @ankahira, can you please provide some more details, such as your Chainer/CuPy and MPI versions? It's been a while since this issue was closed. Thanks!

ankahira commented 5 years ago

@keisuke-umezawa I figured out the issue. Unlike Slurm, the cluster manager on ABCI doesn't specify the number of tasks to launch on each node, so it was starting all the tasks on the same node. I forced mpirun to spread ranks across nodes using "mpirun -n 16 --map-by node --oversubscribe --hostfile".

keisukefukuda commented 5 years ago

Great. I guess you can also use the '-N' option of Open MPI, or specify per-node slot counts in the hostfile, like this (an example invocation follows below):

hostA slots=8
hostB slots=8
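
For example (a sketch, assuming 2 nodes with 8 GPUs each and that train_mnist.py is still the script being launched; -N is Open MPI's per-node process count, equivalent to --npernode):

mpirun -np 16 -N 8 --hostfile hostfile python train_mnist.py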

BTW, I'm @keisukefukuda, not @keisuke-umezawa.