kyleabeauchamp opened this issue 11 years ago (status: Open)
Yes, this is EXACTLY the behavior I'm seeing with CUDA on our new Exxact node. Peter and I are trying to track down the issue.
It looks like nvcc is spin-waiting on a lock held by the other processes. Can you do a 'ps xauwww | grep nvcc' when these processes are hanging and see if this is what you're seeing too?
This was Peter Eastman's suggestion. I still need to try the tests he mentions.
I have a simple mpi4py test I'll send you as well.
John
---------- Forwarded message ----------
From: Peter Eastman <peastman@stanford.edu>
Let's consider what we know here.
nvcc is being successfully launched, since we can see it in the ps output. Furthermore, we can see it's using 99% of a core, so it clearly is doing something. Assuming it's spinning while waiting for a lock (a reasonable hypothesis, but not at all certain), it's a lock that nvcc itself looks for, not anything in OpenMM.
Waiting for nvcc to finish in one process does not allow it to work when called from another process. So if it's a lock, that lock is not released when nvcc itself exits but its parent process keeps running. (Or there's a bug somewhere that keeps it from realizing the lock has been released.)
We haven't determined yet whether the parent process exiting allows nvcc to succeed in another process.
A second script that does not use MPI but is otherwise similar (it launches several processes, each of which compiles kernels at the same time) does work. This seems very odd. Are there any other obvious differences in what the scripts are doing? Does MPI do anything "strange" to the processes it creates?
What happens if you launch two independent MPI jobs at the same time, each of which creates a single process?
What MPI implementation are you using? Is it one of the ones that has special CUDA features built in?
Peter
Here's what's running:
kyleb@amd6core:~$ ps xauwww | grep nvcc
kyleb 29699 0.0 0.0 4400 604 pts/1 S+ 16:48 0:00 sh -c "/usr/local/cuda/bin/nvcc" --ptx --machine 64 -arch=sm_30 -o "/tmp/openmmTempKernel0x1950f80.ptx" --use_fast_math "/tmp/openmmTempKernel0x1950f80.cu" 2> "/tmp/openmmTempKernel0x1950f80.log"
kyleb 29701 99.1 0.0 6680 732 pts/1 R+ 16:48 1:21 /usr/local/cuda/bin/nvcc --ptx --machine 64 -arch sm_30 -o /tmp/openmmTempKernel0x1950f80.ptx --use_fast_math /tmp/openmmTempKernel0x1950f80.cu
kyleb 30062 0.0 0.0 13584 900 pts/2 S+ 16:49 0:00 grep --color=auto nvcc
mpitest.py seems to hang as well.
kyleb@amd6core:~$ ps xauwww | grep nvcc
kyleb 30386 0.0 0.0 4400 604 pts/1 S+ 16:52 0:00 sh -c "/usr/local/cuda/bin/nvcc" --ptx --machine 64 -arch=sm_30 -o "/tmp/openmmTempKernel0x2034060.ptx" --use_fast_math "/tmp/openmmTempKernel0x2034060.cu" 2> "/tmp/openmmTempKernel0x2034060.log"
kyleb 30389 99.8 0.0 6684 736 pts/1 R+ 16:52 1:15 /usr/local/cuda/bin/nvcc --ptx --machine 64 -arch sm_30 -o /tmp/openmmTempKernel0x2034060.ptx --use_fast_math /tmp/openmmTempKernel0x2034060.cu
kyleb 30561 0.0 0.0 13584 896 pts/2 S+ 16:53 0:00 grep --color=auto nvcc
It looks like nvcc is at 99.1% CPU utilization, which suggested to Peter and me some sort of spin lock.
Even if we manually set each process's CudaTempDirectory platform property to be a different directory, this isn't sufficient to get past the spin-lock.
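For reference, a minimal mpi4py/OpenMM test along these lines might look like the sketch below. This is a hypothetical reconstruction for illustration, not the actual mpitest.py or mpitest3.py; it assumes the OpenMM 5.x-era simtk.openmm Python API (where CudaDeviceIndex and CudaTempDirectory were CUDA platform properties) plus mpi4py, and points each rank at its own GPU and its own temp directory.

#!/usr/bin/env python
# Hedged sketch of a per-rank OpenMM context-creation test (hypothetical
# reconstruction; assumes OpenMM 5.x-era simtk.openmm and mpi4py).
import os
import time
from mpi4py import MPI
from simtk import openmm

rank = MPI.COMM_WORLD.rank
size = MPI.COMM_WORLD.size
deviceid = rank  # one GPU per MPI rank

# Give each rank its own kernel temp directory (e.g. CUDA0, CUDA1).
tempdir = 'CUDA%d' % deviceid
if not os.path.exists(tempdir):
    os.makedirs(tempdir)

platform = openmm.Platform.getPlatformByName('CUDA')
properties = {'CudaDeviceIndex': str(deviceid), 'CudaTempDirectory': tempdir}
print('rank %d/%d platform CUDA deviceid %d' % (rank, size, deviceid))

# A minimal one-particle system; creating the Context is what triggers
# the nvcc kernel compilation that hangs.
system = openmm.System()
system.addParticle(39.9)
integrator = openmm.VerletIntegrator(0.002)

print('rank %d/%d creating context...' % (rank, size))
start = time.time()
context = openmm.Context(system, integrator, platform, properties)
print('rank %d/%d context created in %.3f s' % (rank, size, time.time() - start))

Run with something like 'mpirun -np 2 ./mpitest3.py'; the prints are meant to match the rank/deviceid/timing lines shown later in this thread.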
OK, so this simple example can recapitulate the buggy behavior, which is good.
Can you run two copies of the mpitest.py with one process each and see if that works?
I'm off to a meeting for an hour or so, I'll run it this evening.
Thanks!
I ran this in two screens and it runs fine.
mpirun -np 1 ~/src/yank/src/mpitest.py
Great! I think this means either (1) the fact that the mpirun-spawned processes have a parent process is causing trouble, or (2) mpirun is doing something to the spawned processes that makes them different from shell-spawned processes, and that difference in turn trips up nvcc and its locking.
Actually, I wonder if it has something to do with the fact that the mpirun processes are launched at exactly the same time. This could cause some random number seed to be the same for generating temporary file names...
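Purely as an illustration of that hypothesis (not OpenMM's actual naming scheme; the ps output above shows temp names built from a pointer value like 0x1950f80), two processes seeded with the same whole-second timestamp would produce identical "random" file names:

# Illustrative only: time-based seeding collides across simultaneous launches.
import random
import time

seed = int(time.time())   # identical in processes started within the same second
rng = random.Random(seed)
name = 'openmmTempKernel%08x.cu' % rng.getrandbits(32)
print(name)               # two such processes would print the same name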
Still hangs
kyleb@amd6core:~$ mpirun -np 2 ~/src/yank/src/mpitest3.py
rank 1/2 platform CUDA deviceid 1
rank 0/2 platform CUDA deviceid 0
rank 1/2 creating context...
rank 0/2 creating context...
rank 0/2 context created in 6.027 s
Does the same thing for me. Do you at least see some additional temporary directories being created and used? And can you send the output of 'ps xauwww | grep nvcc'?
kyleb@amd6core:~$ ps xauwww | grep nvcc
kyleb 20905 0.0 0.0 4396 596 pts/0 S+ 18:17 0:00 sh -c "/usr/local/cuda/bin/nvcc" --ptx --machine 64 -arch=sm_30 -o "CUDA1/openmmTempKernel0x19cd340.ptx" --use_fast_math "CUDA1/openmmTempKernel0x19cd340.cu" 2> "CUDA1/openmmTempKernel0x19cd340.log"
kyleb 20907 99.5 0.0 6680 732 pts/0 R+ 18:17 3:11 /usr/local/cuda/bin/nvcc --ptx --machine 64 -arch sm_30 -o CUDA1/openmmTempKernel0x19cd340.ptx --use_fast_math CUDA1/openmmTempKernel0x19cd340.cu
kyleb 21241 0.0 0.0 13580 888 pts/1 S+ 18:20 0:00 grep --color=auto nvcc
I see CUDA0 and CUDA1
I'm running the mpirun that comes with Canopy 1.0.0. Are you by chance running the same version?
[chodera@node05 src]$ mpirun --version
HYDRA build details:
Version: 1.4.1
Release Date: Wed Aug 24 14:40:04 CDT 2011
kyleb@kb-intel:~/dat/ala-lvbp/amber99$ mpirun --version
HYDRA build details:
Version: 1.4.1p1
Release Date: Thu Sep 1 13:53:02 CDT 2011
CC: gcc
CXX: c++
F77: gfortran
F90: f95
I'm using Anaconda, not Canopy.
I'm curious if the choice of "launcher" has any impact. If you're up to trying a few of the launchers, that would provide some useful information, I think:
Launch options:
-launcher launcher to use ( ssh rsh fork slurm ll lsf sge manual persist)
I think ssh/rsh and fork may be the important ones to try.
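For example, something like the following (the -launcher flag is taken from the mpirun help text above, combined with the command used earlier in the thread):

mpirun -launcher ssh -np 2 ~/src/yank/src/mpitest3.py
mpirun -launcher fork -np 2 ~/src/yank/src/mpitest3.py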
So I just tried the Ubuntu 12.04 mpich mpirun and found the same results. It's also HYDRA 1.4.1, though.
I tried ssh and fork. Same.
OK, thanks. Still not at all sure what is going on here. Independent processes seem to work totally fine when accessing different GPUs...
I've tried this on two different systems (cluster and desktop) now. I'm finding that Yank hangs when creating the second cached context object. Have you seen anything like this before?