I'm having the same problem when running on multiple nodes. Multiple GPUs on one node work well, but on two nodes it freezes in the training loop. On one system I'm trying to switch between different MPIs and only one is working (I've made sure to reinstall all related Python packages without using the pip cache after switching to a new MPI implementation). On another system, after a software update I can't find any MPI that works ><
I've so far figured out that it freezes in self.communicator.broadcast_data(target) of _MultiNodeOptimizer after processing the first batch;
more precisely, in broadcast_naive() in _communication_utility.py while trying to do mpi_comm.Bcast(buf), buf being a tuple of a cffi backend buffer and an MPI type :-\
https://github.com/chainer/chainermn/blob/master/chainermn/communicators/_communication_utility.py#L81
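To make the failing call concrete: roughly, a GPU parameter's raw device pointer is wrapped as a Python buffer and handed to mpi_comm.Bcast. Below is a minimal sketch of that pattern; the helper is illustrative, not ChainerMN's actual code.

```python
import cffi
import cupy as cp
from mpi4py import MPI

def gpu_array_to_buffer(arr):
    # Expose the raw CUDA device pointer as a Python buffer so that mpi4py
    # hands it straight to MPI_Bcast; this only works with a CUDA-aware MPI.
    ffi = cffi.FFI()
    return ffi.buffer(ffi.cast('void *', arr.data.ptr), arr.nbytes)

comm = MPI.COMM_WORLD
param = cp.ones(2048, dtype=cp.float32) if comm.rank == 0 else cp.zeros(2048, dtype=cp.float32)

# Roughly what broadcast_naive() does for each model parameter:
comm.Bcast([gpu_array_to_buffer(param), MPI.FLOAT], root=0)
print('rank', comm.rank, 'sum =', float(param.sum()), flush=True)  # every rank should print 2048.0
```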
and my Open MPI says it has CUDA support:
ompi_info --parsable --all | grep mpi_built_with_cuda_support:value
mca:mpi:base:param:mpi_built_with_cuda_support:value:true
currently using Open MPI 3.0.0
check_cuda_aware.c returns OK status as well
Thanks for the reports.
First, Alex, could you check whether your issue is https://github.com/open-mpi/ompi/issues/3972 ? This issue has been there for more than half a year, and I have just started investigating it myself. Thus, we recommend using Open MPI 2.1.2.
@andremoeller, as you are using 2.1.2, it's weird.
Which version of ChainerMN are you using? The 1.2 release, or the master?
Keisuke, it does indeed very much look like that. I made a small example of Bcast from GPU memory through mpi4py and cffi, and it freezes as the message size goes over around 1K. I will check their sample a bit later to rule out mpi4py influence, but it's 99% that Open MPI issue.
Now about the version: we have 2.1.2 on Tsubame 3 and it was working fine, but it turned out not to support multi-threading, which I need for some I/O stuff. So I've compiled the same version of Open MPI in userspace, and I have the same problem with it.
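A sketch of the kind of test I mean is below; the sizes and host names are just placeholders, and the pointer-wrapping is the same cffi trick as above.

```python
# Sweep message sizes to see where the CUDA-aware Bcast stalls.
# Run across two nodes, e.g.: mpirun -np 2 -host nodeA,nodeB python bcast_sweep.py
import cffi
import cupy as cp
from mpi4py import MPI

ffi = cffi.FFI()
comm = MPI.COMM_WORLD

for n in (256, 512, 1024, 2048, 4096, 1 << 20):
    arr = cp.full(n, comm.rank, dtype=cp.float32)
    buf = ffi.buffer(ffi.cast('void *', arr.data.ptr), arr.nbytes)
    comm.Bcast([buf, MPI.FLOAT], root=0)
    # If the broadcast hangs, the last printed size marks the threshold.
    print('rank', comm.rank, 'finished Bcast of', n, 'floats', flush=True)
```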
Hi Keisuke, I'm using
cupy-cuda90==4.0.0b4 chainer==4.0.0b4 chainercv==0.8.0 chainermn==1.2.0
Thanks.
@andremoeller, oh, sorry, I missed it in your first comment. Thanks for the info.
@undertherain I understand that the system-provided 2.1.2 on Tsubame 3 works fine, but the same version you compiled yourself in userspace hangs.
Is that correct?
Then, hmmm. 🤔 I use 2.1.2 daily on our cluster with InfiniBand and we see no problem.
Can you make sure your program hangs on Allreduce? Does it reproduce with a very simple test case program?
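Something along these lines should be enough as a test case. This is a sketch, again using the cffi pointer-wrapping trick from above; the launch flags and hostnames depend on your site.

```python
# Minimal multi-node MPI_Allreduce check on GPU memory.
# Run with something like: mpirun -np 2 --map-by node python allreduce_check.py
import cffi
import cupy as cp
from mpi4py import MPI

ffi = cffi.FFI()
comm = MPI.COMM_WORLD

send = cp.ones(1 << 20, dtype=cp.float32)
recv = cp.zeros_like(send)
sbuf = ffi.buffer(ffi.cast('void *', send.data.ptr), send.nbytes)
rbuf = ffi.buffer(ffi.cast('void *', recv.data.ptr), recv.nbytes)

comm.Allreduce([sbuf, MPI.FLOAT], [rbuf, MPI.FLOAT], op=MPI.SUM)
# Each element of recv should equal the number of ranks if the allreduce works.
print('rank', comm.rank, 'recv[0] =', float(recv[0]), flush=True)
```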
@andremoeller, what interconnect do you use? I guess it's InfiniBand because you use NCCL. If so, will you try pure_nccl_communicator? It should solve the problem if MPI_Allreduce is the problem.
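For reference, that is a one-line change where the communicator is created; the sketch below assumes a setup like the ChainerMN examples, and it requires NCCL 2.

```python
import chainermn

# 'pure_nccl' routes the gradient allreduce through NCCL instead of
# MPI_Allreduce, so it avoids relying on a CUDA-aware MPI_Allreduce.
comm = chainermn.create_communicator('pure_nccl')
device = comm.intra_rank  # one GPU per MPI process, as in the examples
```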
I'm closing the issue, but don't hesitate to re-open it if you guys still have a problem. Thanks.
I am having the same issue. It works fine on a single node but hangs on 2 (multiple) nodes on ABCI.
Hi @ankahira, can you please provide some more details, such as your Chainer/CuPy & MPI versions? It's been a while since this issue was closed. Thanks!
@keisuke-umezawa I figured out the issue. Unlike Slurm, the cluster manager on ABCI doesn't specify the number of tasks to launch on each node, so it was starting all the tasks on the same node. I forced mpirun to start ranks on different nodes using "mpirun -n 16 --map-by node --oversubscribe --hostfile"
Great, I guess you can also use the '-N' option of Open MPI, or specify the number of processes per node in the hostfile, like:
hostA slots=8
hostB slots=8
BTW, I'm @keisukefukuda , not keisuke-umezawa.
Hi,
I'm trying to run train_mnist.py with multiple GPUs, but training hangs indefinitely at this point:
mpirun -np 4 python train_mnist.py
I'm using CUDA 9, NCCL 2, CUDA-aware Open MPI 2.1.2, and these: cupy-cuda90==4.0.0b4 chainer==4.0.0b4 chainercv==0.8.0 chainermn==1.2.0
strace on the mpirun says it's just polling. Any clues as to what's going wrong, or how I can figure out more about what might be going wrong?
Thanks.