kyleabeauchamp opened this issue 11 years ago (status: Open)
Yes, this is EXACTLY the behavior I'm seeing with CUDA on our new Exxact node. Peter and I are trying to track down the issue.
It looks like nvcc is spin-waiting on a lock held by the other processes. Can you do a 'ps xauwww | grep nvcc' when these processes are hanging and see if this is what you're seeing too?
This was Peter Eastman's suggestion. I still need to try the tests he mentions.
I have a simple mpi4py test I'll send you as well.
John
---------- Forwarded message ----------
From: Peter Eastman <peastman@stanford.edu>
Let's consider what we know here.
nvcc is being successfully launched, since we can see it in the ps output. Furthermore, we can see it's using 99% of a core, so it clearly is doing something. Assuming it's spinning while waiting for a lock (a reasonable hypothesis, but not at all certain), it's a lock that nvcc itself looks for, not anything in OpenMM.
Waiting for nvcc to finish in one process does not allow it to work when called from another process. So if it's a lock, that lock is not released when nvcc itself exits but its parent process keeps running. (Or there's a bug somewhere that keeps it from realizing the lock has been released.)
We haven't determined yet whether the parent process exiting allows nvcc to succeed in another process.
A second script that does not use MPI but is otherwise similar (it launches several processes, each of which compiles kernels at the same time) does work. This seems very odd. Are there any other obvious differences in what the scripts are doing? Does MPI do anything "strange" to the processes it creates?
What happens if you launch two independent MPI jobs at the same time, each of which creates a single process?
What MPI implementation are you using? Is it one of the ones that has special CUDA features built in?
Peter
Here's what's running:
kyleb@amd6core:~$ ps xauwww | grep nvcc
kyleb 29699 0.0 0.0 4400 604 pts/1 S+ 16:48 0:00 sh -c "/usr/local/cuda/bin/nvcc" --ptx --machine 64 -arch=sm_30 -o "/tmp/openmmTempKernel0x1950f80.ptx" --use_fast_math "/tmp/openmmTempKernel0x1950f80.cu" 2> "/tmp/openmmTempKernel0x1950f80.log"
kyleb 29701 99.1 0.0 6680 732 pts/1 R+ 16:48 1:21 /usr/local/cuda/bin/nvcc --ptx --machine 64 -arch sm_30 -o /tmp/openmmTempKernel0x1950f80.ptx --use_fast_math /tmp/openmmTempKernel0x1950f80.cu
kyleb 30062 0.0 0.0 13584 900 pts/2 S+ 16:49 0:00 grep --color=auto nvcc
mpitest.py seems to hang as well.
kyleb@amd6core:~$ ps xauwww | grep nvcc
kyleb 30386 0.0 0.0 4400 604 pts/1 S+ 16:52 0:00 sh -c "/usr/local/cuda/bin/nvcc" --ptx --machine 64 -arch=sm_30 -o "/tmp/openmmTempKernel0x2034060.ptx" --use_fast_math "/tmp/openmmTempKernel0x2034060.cu" 2> "/tmp/openmmTempKernel0x2034060.log"
kyleb 30389 99.8 0.0 6684 736 pts/1 R+ 16:52 1:15 /usr/local/cuda/bin/nvcc --ptx --machine 64 -arch sm_30 -o /tmp/openmmTempKernel0x2034060.ptx --use_fast_math /tmp/openmmTempKernel0x2034060.cu
kyleb 30561 0.0 0.0 13584 896 pts/2 S+ 16:53 0:00 grep --color=auto nvcc
It looks like nvcc is at 99.1% CPU utilization, which suggested to Peter and me some sort of spin lock.
Even if we manually set each process's CudaTempDirectory platform property to be a different directory, this isn't sufficient to get past the spin-lock.
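For reference, a minimal mpi4py/OpenMM test along these lines might look like the sketch below. This is a hypothetical reconstruction for illustration, not the actual mpitest.py or mpitest3.py; it assumes the OpenMM 5.x-era simtk.openmm Python API (where CudaDeviceIndex and CudaTempDirectory were CUDA platform properties) plus mpi4py, and points each rank at its own GPU and its own temp directory.

#!/usr/bin/env python
# Hedged sketch of a per-rank OpenMM context-creation test (hypothetical
# reconstruction; assumes OpenMM 5.x-era simtk.openmm and mpi4py).
import os
import time
from mpi4py import MPI
from simtk import openmm

rank = MPI.COMM_WORLD.rank
size = MPI.COMM_WORLD.size
deviceid = rank  # one GPU per MPI rank

# Give each rank its own kernel temp directory (e.g. CUDA0, CUDA1).
tempdir = 'CUDA%d' % deviceid
if not os.path.exists(tempdir):
    os.makedirs(tempdir)

platform = openmm.Platform.getPlatformByName('CUDA')
properties = {'CudaDeviceIndex': str(deviceid), 'CudaTempDirectory': tempdir}
print('rank %d/%d platform CUDA deviceid %d' % (rank, size, deviceid))

# A minimal one-particle system; creating the Context is what triggers
# the nvcc kernel compilation that hangs.
system = openmm.System()
system.addParticle(39.9)
integrator = openmm.VerletIntegrator(0.002)

print('rank %d/%d creating context...' % (rank, size))
start = time.time()
context = openmm.Context(system, integrator, platform, properties)
print('rank %d/%d context created in %.3f s' % (rank, size, time.time() - start))

Run with something like 'mpirun -np 2 ./mpitest3.py'; the prints are meant to match the rank/deviceid/timing lines shown later in this thread.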
OK, so this simple example can recapitulate the buggy behavior, which is good.
Can you run two copies of the mpitest.py with one process each and see if that works?
I'm off to a meeting for an hour or so, I'll run it this evening.
Thanks!
I ran this in two screens and it runs fine.
mpirun -np 1 ~/src/yank/src/mpitest.py
Great! I think this means either (1) the fact that the mpirun-spawned processes have a parent process is causing trouble, or (2) mpirun is doing something to the spawned processes that makes them different from shell-spawned processes, and that difference in turn trips up nvcc and its locking.
Actually, I wonder if it has something to do with the fact that the mpirun processes are launched at exactly the same time. This could cause some random number seed to be the same for generating temporary file names...
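Purely as an illustration of that hypothesis (not OpenMM's actual naming scheme; the ps output above shows temp names built from a pointer value like 0x1950f80), two processes seeded with the same whole-second timestamp would produce identical "random" file names:

# Illustrative only: time-based seeding collides across simultaneous launches.
import random
import time

seed = int(time.time())   # identical in processes started within the same second
rng = random.Random(seed)
name = 'openmmTempKernel%08x.cu' % rng.getrandbits(32)
print(name)               # two such processes would print the same name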
Still hangs
kyleb@amd6core:~$ mpirun -np 2 ~/src/yank/src/mpitest3.py
rank 1/2 platform CUDA deviceid 1
rank 0/2 platform CUDA deviceid 0
rank 1/2 creating context...
rank 0/2 creating context...
rank 0/2 context created in 6.027 s
Does the same thing for me. Do you at least see some additional temporary directories being created and used? And can you send the output of 'ps xauwww | grep nvcc'?
kyleb@amd6core:~$ ps xauwww | grep nvcc
kyleb 20905 0.0 0.0 4396 596 pts/0 S+ 18:17 0:00 sh -c "/usr/local/cuda/bin/nvcc" --ptx --machine 64 -arch=sm_30 -o "CUDA1/openmmTempKernel0x19cd340.ptx" --use_fast_math "CUDA1/openmmTempKernel0x19cd340.cu" 2> "CUDA1/openmmTempKernel0x19cd340.log"
kyleb 20907 99.5 0.0 6680 732 pts/0 R+ 18:17 3:11 /usr/local/cuda/bin/nvcc --ptx --machine 64 -arch sm_30 -o CUDA1/openmmTempKernel0x19cd340.ptx --use_fast_math CUDA1/openmmTempKernel0x19cd340.cu
kyleb 21241 0.0 0.0 13580 888 pts/1 S+ 18:20 0:00 grep --color=auto nvcc
I see CUDA0 and CUDA1
I'm running the mpirun that comes with Canopy 1.0.0. Are you by chance running the same version?
[chodera@node05 src]$ mpirun --version
HYDRA build details:
Version: 1.4.1
Release Date: Wed Aug 24 14:40:04 CDT 2011
kyleb@kb-intel:~/dat/ala-lvbp/amber99$ mpirun --version
HYDRA build details:
Version: 1.4.1p1
Release Date: Thu Sep 1 13:53:02 CDT 2011
CC: gcc
CXX: c++
F77: gfortran
F90: f95
I'm using Anaconda, not Canopy.
I'm curious if the choice of "launcher" has any impact. If you're up to trying a few of the launchers, that would provide some useful information, I think:
Launch options:
-launcher launcher to use ( ssh rsh fork slurm ll lsf sge manual persist)
I think ssh/rsh and fork may be the important ones to try.
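For example, something like the following (the -launcher flag is taken from the mpirun help text above, combined with the command used earlier in the thread):

mpirun -launcher ssh -np 2 ~/src/yank/src/mpitest3.py
mpirun -launcher fork -np 2 ~/src/yank/src/mpitest3.py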
So I just tried the Ubuntu 12.04 mpich mpirun and found the same results. It's also HYDRA 1.4.1, though.
I tried ssh and fork. Same.
OK, thanks. Still not at all sure what is going on here. Independent processes seem to work totally fine when accessing different GPUs...
I've tried this on two different systems (cluster and desktop) now. I'm finding that Yank hangs when creating the second cached context object. Have you seen anything like this before?