karalets opened this issue 8 years ago
Why do you feel the GPUs are not in use?
Because I checked for the running gpu jobs and found very few.
How are you checking? I do not see that. Removing username from the listing below.
7205572.hal-sched1.loc USER gpu i_hsa_dansylamid 13454 3 12 -- 72:00:00 R 49:58:48 gpu-1-16/15,20,24,34+gpu-1-17/2-3,13,24+gpu-2-7/32-35
7205573.hal-sched1.loc USER gpu i_hsa_dansylglyc 20518 3 12 -- 72:00:00 R 49:57:32 gpu-1-7/32-35+gpu-2-6/13,28,31-32+gpu-2-9/11-12,20,27
7205574.hal-sched1.loc USER gpu i_hsa_indomethac 13848 3 12 -- 72:00:00 R 49:55:40 gpu-1-13/27,29,32,34+gpu-1-15/15-17,28+gpu-1-11/32-35
7205575.hal-sched1.loc USER gpu i_hsa_lapatinib 5463 3 12 -- 72:00:00 R 49:54:25 gpu-1-12/25,33-35+gpu-1-10/32-35+gpu-1-6/28-30,35
7212730.hal-sched1.loc USER gpu i_hsa_naproxen 23815 3 12 -- 72:00:00 R 26:58:28 gpu-2-12/15,32-34+gpu-1-4/19,32-34+gpu-3-9/32-35
7212731.hal-sched1.loc USER gpu i_hsa_phenylbuta 993 3 12 -- 72:00:00 R 26:09:30 gpu-2-11/32-35+gpu-1-14/32-35+gpu-2-5/19-20,25-26
7212733.hal-sched1.loc USER gpu i_hsa_ponatinib 17700 3 12 -- 72:00:00 R 26:08:09 gpu-2-17/23,33-35+gpu-3-8/7,33-35+gpu-1-5/2,32-34
7212769.hal-sched1.loc USER gpu hsa_dansylamide 1332 2 8 -- 48:00:00 R 25:12:37 gpu-2-16/24,32-34+gpu-1-8/29,31,33,35
Double checking but that sure seems like a fair chunk.
The program is called "yank" and I show it on lots of GPUs.
yank is a program from our lab, and does use lots of GPUs. I'm looking into trying to figure out exactly which GPUs are free at the moment.
Oddly enough, there are some GPUs that are running jobs from two different people, which shouldn't be happening:
gpu-2-5
Tue May 17 17:11:59 2016
+------------------------------------------------------+
| NVIDIA-SMI 352.39 Driver Version: 352.39 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX TIT... Off | 0000:03:00.0 Off | N/A |
| 22% 53C P2 83W / 250W | 521MiB / 12287MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX TIT... Off | 0000:04:00.0 Off | N/A |
| 22% 51C P2 82W / 250W | 522MiB / 12287MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 GeForce GTX TIT... Off | 0000:83:00.0 Off | N/A |
| 22% 57C P2 89W / 250W | 522MiB / 12287MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 GeForce GTX TIT... Off | 0000:84:00.0 Off | N/A |
| 22% 55C P2 81W / 250W | 521MiB / 12287MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 6897 C /cbio/jclab/home/misik/anaconda/bin/python 201MiB |
| 0 13653 C /cbio/jclab/home/luirink/anaconda/bin/python 170MiB |
| 0 31953 C /cbio/jclab/home/misik/anaconda/bin/python 122MiB |
| 1 6896 C /cbio/jclab/home/misik/anaconda/bin/python 201MiB |
| 1 13652 C /cbio/jclab/home/luirink/anaconda/bin/python 171MiB |
| 1 31952 C /cbio/jclab/home/misik/anaconda/bin/python 122MiB |
| 2 6895 C /cbio/jclab/home/misik/anaconda/bin/python 201MiB |
| 2 13651 C /cbio/jclab/home/luirink/anaconda/bin/python 171MiB |
| 2 31951 C /cbio/jclab/home/misik/anaconda/bin/python 122MiB |
| 3 6894 C /cbio/jclab/home/misik/anaconda/bin/python 201MiB |
| 3 13650 C /cbio/jclab/home/luirink/anaconda/bin/python 170MiB |
| 3 31950 C /cbio/jclab/home/misik/anaconda/bin/python 122MiB |
+-----------------------------------------------------------------------------+
gpu-2-11
Tue May 17 17:12:04 2016
+------------------------------------------------------+
| NVIDIA-SMI 352.39 Driver Version: 352.39 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX TITAN Off | 0000:03:00.0 Off | N/A |
| 36% 54C P0 70W / 250W | 228MiB / 6143MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX TITAN Off | 0000:04:00.0 Off | N/A |
| 36% 56C P0 72W / 250W | 228MiB / 6143MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 GeForce GTX TITAN Off | 0000:83:00.0 Off | N/A |
| 38% 60C P0 72W / 250W | 228MiB / 6143MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 GeForce GTX TITAN Off | 0000:84:00.0 Off | N/A |
| 42% 68C P0 157W / 250W | 227MiB / 6143MiB | 99% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 1049 C /cbio/jclab/home/misik/anaconda/bin/python 80MiB |
| 0 10711 C /cbio/jclab/home/luirink/anaconda/bin/python 129MiB |
| 1 1048 C /cbio/jclab/home/misik/anaconda/bin/python 80MiB |
| 1 10710 C /cbio/jclab/home/luirink/anaconda/bin/python 129MiB |
| 2 1047 C /cbio/jclab/home/misik/anaconda/bin/python 80MiB |
| 2 10709 C /cbio/jclab/home/luirink/anaconda/bin/python 130MiB |
| 3 1046 C /cbio/jclab/home/misik/anaconda/bin/python 80MiB |
| 3 10708 C /cbio/jclab/home/luirink/anaconda/bin/python 128MiB |
+-----------------------------------------------------------------------------+
Alright, I misinterpreted that and thought these were single jobs/GPUs.
Thank you, Paul. I will probably need to deal with this diplomatically ;)
I show luirink as having no running jobs, so her jobs should not still be running on GPUs:
[chodera@mskcc-ln1 ~/scripts]$ qstat -u luirink
[chodera@mskcc-ln1 ~/scripts]$
It looks like this is related to https://github.com/cBio/cbio-cluster/issues/409#issuecomment-218166763
John, do you think it would be possible to keep 5-10 gpus usable (unoccupied by yank) until Friday (NIPS deadline)?
I really will not be needing any more than that, but having just 3, as is the case now, is a bit tight.
I show processes by luirink on gpu-2-5. If the code is not exiting cleanly it will need to be fixed.
luirink 13650 93.9 0.2 308770540 739620 ? Rl May11 8806:31 /cbio/jclab/home/luirink/anaconda/bin/python /cbio/jclab/home/luirink/anaconda/bin/yank run --store=output --verbose --mpi -i 5000 --platform=OpenCL --phase=complex-explicit
luirink 13651 94.4 0.2 308770796 743364 ? Rl May11 8848:40 /cbio/jclab/home/luirink/anaconda/bin/python /cbio/jclab/home/luirink/anaconda/bin/yank run --store=output --verbose --mpi -i 5000 --platform=OpenCL --phase=complex-explicit
luirink 13652 92.8 0.2 308770800 729816 ? Rl May11 8704:07 /cbio/jclab/home/luirink/anaconda/bin/python /cbio/jclab/home/luirink/anaconda/bin/yank run --store=output --verbose --mpi -i 5000 --platform=OpenCL --phase=complex-explicit
luirink 13653 94.2 0.2 308770536 739504 ? Rl May11 8834:58 /cbio/jclab/home/luirink/anaconda/bin/python /cbio/jclab/home/luirink/anaconda/bin/yank run --store=output --verbose --mpi -i 5000 --platform=OpenCL --phase=complex-explicit
But basically @karalets, the statement in this Git that "the cluster seems to prefer to let the gpus gather dust over giving them to me" is incorrect, and I'm still waiting to hear how you determined that so we can resolve the support you actually requested. I show yank consuming all the GPUs.
I see your note @karalets; ignore that request. If you want my method of parsing the qstat output, I can document it.
@MehtapIsik: Can you halt one or more of your yank jobs right now so @karalets can use some GPUs?
Paul @tatarsky, I must have made a mistake as I said earlier: I was just counting the number of jobs, not the number of gpus each of these jobs was using. So please put my comments down to my ignorance of the cluster diagnostics.
Noted your Git response already; our comments just crossed in the ordering. I am happy to share my method of determining this.
I was doing 'qstat | grep gpu' which obviously was suboptimal.
Yeah that doesn't work.
My output was from a cheap quick way. There are others.
qstat -1tnr|grep " gpu "
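A slightly more informative variant of the same idea, assuming the column layout shown above (job id in the first field, the exec-host list in the last, with node assignments joined by "+"), would be something like:
# count how many nodes each running gpu-queue job spans
qstat -1tnr | grep " gpu " | awk '{n=split($NF, nodes, "+"); printf "%-28s %d node(s)\n", $1, n}'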
I'll look into why mpirun is not causing all processes to cleanly exit when terminated.
@tatarsky: Have you seen threads like this one, which note that this might be due to mpirun/mpiexec launching processes in a way that Torque is unable to keep track of? I wonder if there's a different argument we can pass to mpirun that ensures Torque is notified of the processes it should track.
I can check the old Relion notes, which I dimly recall mentioned something similar.
The hydra mpirun has the following options that may be relevant here:
Hydra specific options (treated as global):
Launch options:
-launcher launcher to use ( ssh rsh fork slurm ll lsf sge manual persist)
-launcher-exec executable to use to launch processes
@tatarsky thank you, I will use that henceforth.
@jchodera it would be great if we could get some of the yank jobs to run later, if that does not hurt too much.
(@karalets: I'm in Boston, but @MehtapIsik is just upstairs in Z11 if you want to go find her in person!)
There is probably a better way, but that's my cheap way. The spaces around "gpu" prevent matching the host names, which contain gpu as well.
There's also this:
Resource management kernel options:
-rmk resource management kernel to use ( user slurm ll lsf sge pbs)
This page suggests:
2) Hydra integrates with PBS-like DRMs (PBSPro, Torque). The integration means that, for example, you don't have to provide a list of hosts to mpiexec since the list of granted nodes is obtained automatically. It also uses the native tm interface of PBS to launch and monitor remote processes.
I'm still investigating whether this requires that we specify a specific flag to mpirun.
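If an explicit flag does turn out to be needed, the shape of it would presumably be something like the following (untested sketch; the -rmk value is taken from the Hydra help text above, and the yank command line is just the one from the process listing earlier):
# untested: explicitly select the PBS resource-management kernel so Hydra
# launches and tracks ranks via Torque's tm interface rather than ssh
mpirun -rmk pbs yank run --store=output --verbose --mpi -i 5000 --platform=OpenCL --phase=complex-explicit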
My Relion memory may not be relevant. That issue involved a wrapper script that was not properly providing mpirun with the nodefile, or was blocking its ability to use the one provided by the Torque environment.
At least that is my brief review of the old issue.
The MPI "tm" interface was indeed discussed. But I don't recall a flag for more better exiting. Git #329 has all sorts of stuff in it. Probably of no relevance.
Do you wish all luirink node processes killed off before I scrub the name from the Git page? Or does it help to leave some to debug why they are not dead?
I'm changing the Git name so I can focus on that part. I will mention there is a Tesla sitting idle in cc27, @karalets.
Hmm. I may need to adjust that system's gpu queue oversubscription rule though. Let me check on that.
Something on that system isn't as I expect, so ignore that comment until I can figure it out. The batch overflow jobs are not allowing the gpu to be scheduled there. Even though it is only one gpu, it would be nice to see it used. I will re-open the item involving this system.
Apologies for the delay; I'm in Boston for a conference I have organized.
Do you wish all luirink node processes killed off before I scrub the name from the Git page?
Please do if they have not already terminated!
Debugging the hung process issue will probably involve launching new MPI jobs with various combinations of mpirun arguments, killing them through the queue, and checking if all spawned processes were correctly killed.
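For concreteness, a throwaway test job of that kind might look roughly like this (a sketch; the resource line and walltime are illustrative, not the exact submission script, and the yank flags are the ones from the process listing above):
#!/bin/bash
#PBS -q gpu
#PBS -l nodes=1:ppn=4:gpus=4,walltime=00:10:00
cd $PBS_O_WORKDIR
mpirun yank run --store=output --verbose --mpi -i 5000 --platform=OpenCL --phase=complex-explicit
# then qdel the job (or let it hit walltime) and check the assigned node for survivors, e.g.:
#   ssh <node> "pgrep -u $USER -lf 'yank run'"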
I have some leads based on this thread but it looks like debugging will take days or weeks. Current queue time for 4 GPUs is 15 hours. Will report back after this job [7224425] runs (with 10 min wallclock limit) and terminates.
@karalets : Did you get a hold of @MehtapIsik in person to curtail her jobs? I seem to be unable to reach her via GitHub or email.
I wonder if we need a way for group leaders to kill jobs from group members as an emergency backup for cases like this...?
Finally got a hold of her! She will reduce GPU usage. Sorry for the delay.
I am looking for remaining luirink processes.
I do not see any such processes this morning, unless I am doing something incorrectly.
Sorry to reopen, but I just got some GPUs with a bunch of dead jobs still running on them, even though those jobs are no longer on the scheduler.
For instance on gpu-2-11.
This is pretty serious: the GPUs are effectively dead when this happens. Can everyone who runs these jobs please check periodically for any holdover processes until it is solved automatically? It would suffice to kill these processes manually once a job finishes on the scheduler, especially now that we are aware of the problem.
Best,
Theo
Sorry about this. This has been difficult to debug because long queue times mean I only get to try one thing a day to see if cleanup is problematic. My tests have not been able to reproduce this so far.
I wonder if it would be possible for us to have exclusive batch queue access to the GPUs on a single node to quickly debug over the weekend? Maybe if the gpu label was changed to gpudebug for the weekend?
I do not mean to complain; I will just keep pointing out nodes I come across on which it happens until it is solved. I understand it is tedious.
Is there a way to get a list of all processes running on all GPUs at once? I am checking all possibly affected GPUs one by one by ssh'ing into them. I found a few more GPUs with holdover yank jobs (from my timed-out yank jobs on the queue).
@jchodera changing the node gpu property label will only stop GPU queue requests that ask for a particular type of gpu (e.g. gtx680) from going there. If the qsub does not specify that resource tag, the GPU is still schedulable by any gpu queue request.
But you could at least predict the node to monitor. I can also ADD a property such as "gpudebug" to a set of nodes to help you land on a set of nodes more predictably. Tell me if that helps more.
I can also offline a node but suspect you need Torque to fully debug.
I am happy to change the tag on a single node however but wanted the above clear.
Have you folks considered epilogue scripts to clean up, or does MPI make that not work?
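A minimal, untested sketch of what such a per-node epilogue might look like is below (assuming Torque's default epilogue location and standard epilogue arguments, where $1 is the job id and $2 the job owner; whether this is safe on shared nodes is exactly the open question):
#!/bin/bash
# sketch of /var/spool/torque/mom_priv/epilogue (Torque default path)
# Kill any yank ranks the job owner left behind after the job ended.
# Caveat: on a shared node this would also hit the same user's *other* jobs,
# so a real version would need to match more narrowly (e.g. against the job's
# recorded PIDs or session).
JOBID="$1"
OWNER="$2"
pkill -9 -u "$OWNER" -f "yank run" || true
exit 0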
@MehtapIsik I tend to use pdsh for rapid checks and there is a group defined for gpu nodes:
pdsh -g gpu "ps aux|grep yank"
or grep with whatever pattern you like. Currently the pdsh class "gpu" contains the Fuchs group nodes as well, as they contain gpus but are not in the gpu queue at this time.
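To answer the "all GPUs at once" question above, one option that avoids ssh'ing node by node is to combine pdsh with nvidia-smi's query mode (a sketch; it assumes the same pdsh "gpu" group, and the field names are the ones nvidia-smi documents for its compute-apps query):
# list GPU compute processes on every gpu node in one shot
pdsh -g gpu "nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv,noheader" | sort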
Also there are a fair number of free gpus at the moment.
Ah but slots/ram are a bit short. Let me see if I can tweak something there.
I added the node property gpudebug to a set of titan nodes (gpu-2-14 through gpu-2-17). It should at least help you narrow down which systems a test job will get scheduled on. I LEFT the gtxtitan resource.
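If it helps, steering a test job onto those nodes should just be a matter of adding the new property to the resource request, something like the following (a sketch; the counts and walltime are whatever the test needs, and test_job.sh is a placeholder name):
# request the gpudebug-tagged titan nodes explicitly
qsub -q gpu -l nodes=1:ppn=4:gpus=4:gpudebug:gtxtitan,walltime=00:10:00 test_job.sh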
I do not know enough about user reservations to try, on a Friday, to guarantee you get priority on those nodes. Perhaps next week.
Hmm. And it may not be working as I expect. So if it doesn't seem to do what you need I'll have to revisit it.
There are, however, now several free gpus on the cluster, but I don't know how long that will last.
Thanks! Will debug over the weekend! Traveling back from Boston tonight.
@karalets: This is a serious issue and needs to be addressed ASAP! Thanks for pointing it out!
Hi,
I am trying to get some gpus in interactive mode and I am really having a hard time getting them. Normally this would be fine and I'd wait until the cluster clears up, but... the gpus are simply not really in use, so I am not competing against anybody to get them.
As such, the cluster seems to prefer to let the gpus gather dust over giving them to me. Shocker!
Is there any explanation for this? Or, to be actionable: can I do something to change that?
Best and thanks,
Theo