cBio / cbio-cluster

MSKCC cBio cluster documentation

Yank processes not cleanly exiting #415

Open karalets opened 8 years ago

karalets commented 8 years ago

Hi,

I am trying to get some gpus in interactive mode and I am really having a hard time getting them. Normally this would be fine and I'd wait until the cluster clears up, but... the gpus are simply not really in use, so I am not competing against anybody to get them.

As such, the cluster seems to prefer to let the gpus gather dust over giving them to me. Shocker!

Is there any explanation for this? Or, to be actionable: can I do something to change that?

Best and thanks,

Theo

tatarsky commented 8 years ago

Why do you feel the GPUs are not in use?

karalets commented 8 years ago

Because I checked for the running gpu jobs and found very few.

tatarsky commented 8 years ago

How are you checking? I do not see that. (Username removed from the listing below.)

7205572.hal-sched1.loc  USER       gpu      i_hsa_dansylamid  13454     3     12    --   72:00:00 R  49:58:48   gpu-1-16/15,20,24,34+gpu-1-17/2-3,13,24+gpu-2-7/32-35
7205573.hal-sched1.loc  USER       gpu      i_hsa_dansylglyc  20518     3     12    --   72:00:00 R  49:57:32   gpu-1-7/32-35+gpu-2-6/13,28,31-32+gpu-2-9/11-12,20,27
7205574.hal-sched1.loc  USER       gpu      i_hsa_indomethac  13848     3     12    --   72:00:00 R  49:55:40   gpu-1-13/27,29,32,34+gpu-1-15/15-17,28+gpu-1-11/32-35
7205575.hal-sched1.loc  USER       gpu      i_hsa_lapatinib    5463     3     12    --   72:00:00 R  49:54:25   gpu-1-12/25,33-35+gpu-1-10/32-35+gpu-1-6/28-30,35
7212730.hal-sched1.loc  USER       gpu      i_hsa_naproxen    23815     3     12    --   72:00:00 R  26:58:28   gpu-2-12/15,32-34+gpu-1-4/19,32-34+gpu-3-9/32-35
7212731.hal-sched1.loc  USER       gpu      i_hsa_phenylbuta    993     3     12    --   72:00:00 R  26:09:30   gpu-2-11/32-35+gpu-1-14/32-35+gpu-2-5/19-20,25-26
7212733.hal-sched1.loc  USER       gpu      i_hsa_ponatinib   17700     3     12    --   72:00:00 R  26:08:09   gpu-2-17/23,33-35+gpu-3-8/7,33-35+gpu-1-5/2,32-34
7212769.hal-sched1.loc  USER       gpu      hsa_dansylamide    1332     2      8    --   48:00:00 R  25:12:37   gpu-2-16/24,32-34+gpu-1-8/29,31,33,35

Double checking but that sure seems like a fair chunk.

tatarsky commented 8 years ago

The program is called "yank" and I show it on lots of GPUs.

jchodera commented 8 years ago

yank is a program from our lab, and it does use lots of GPUs. I'm trying to figure out exactly which GPUs are free at the moment.

Oddly enough, there are some GPUs that are running jobs from two different people, which shouldn't be happening:

gpu-2-5
Tue May 17 17:11:59 2016       
+------------------------------------------------------+                       
| NVIDIA-SMI 352.39     Driver Version: 352.39         |                       
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX TIT...  Off  | 0000:03:00.0     Off |                  N/A |
| 22%   53C    P2    83W / 250W |    521MiB / 12287MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX TIT...  Off  | 0000:04:00.0     Off |                  N/A |
| 22%   51C    P2    82W / 250W |    522MiB / 12287MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX TIT...  Off  | 0000:83:00.0     Off |                  N/A |
| 22%   57C    P2    89W / 250W |    522MiB / 12287MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX TIT...  Off  | 0000:84:00.0     Off |                  N/A |
| 22%   55C    P2    81W / 250W |    521MiB / 12287MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      6897    C   /cbio/jclab/home/misik/anaconda/bin/python     201MiB |
|    0     13653    C   /cbio/jclab/home/luirink/anaconda/bin/python   170MiB |
|    0     31953    C   /cbio/jclab/home/misik/anaconda/bin/python     122MiB |
|    1      6896    C   /cbio/jclab/home/misik/anaconda/bin/python     201MiB |
|    1     13652    C   /cbio/jclab/home/luirink/anaconda/bin/python   171MiB |
|    1     31952    C   /cbio/jclab/home/misik/anaconda/bin/python     122MiB |
|    2      6895    C   /cbio/jclab/home/misik/anaconda/bin/python     201MiB |
|    2     13651    C   /cbio/jclab/home/luirink/anaconda/bin/python   171MiB |
|    2     31951    C   /cbio/jclab/home/misik/anaconda/bin/python     122MiB |
|    3      6894    C   /cbio/jclab/home/misik/anaconda/bin/python     201MiB |
|    3     13650    C   /cbio/jclab/home/luirink/anaconda/bin/python   170MiB |
|    3     31950    C   /cbio/jclab/home/misik/anaconda/bin/python     122MiB |
+-----------------------------------------------------------------------------+

gpu-2-11
Tue May 17 17:12:04 2016       
+------------------------------------------------------+                       
| NVIDIA-SMI 352.39     Driver Version: 352.39         |                       
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX TITAN   Off  | 0000:03:00.0     Off |                  N/A |
| 36%   54C    P0    70W / 250W |    228MiB /  6143MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX TITAN   Off  | 0000:04:00.0     Off |                  N/A |
| 36%   56C    P0    72W / 250W |    228MiB /  6143MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX TITAN   Off  | 0000:83:00.0     Off |                  N/A |
| 38%   60C    P0    72W / 250W |    228MiB /  6143MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX TITAN   Off  | 0000:84:00.0     Off |                  N/A |
| 42%   68C    P0   157W / 250W |    227MiB /  6143MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      1049    C   /cbio/jclab/home/misik/anaconda/bin/python      80MiB |
|    0     10711    C   /cbio/jclab/home/luirink/anaconda/bin/python   129MiB |
|    1      1048    C   /cbio/jclab/home/misik/anaconda/bin/python      80MiB |
|    1     10710    C   /cbio/jclab/home/luirink/anaconda/bin/python   129MiB |
|    2      1047    C   /cbio/jclab/home/misik/anaconda/bin/python      80MiB |
|    2     10709    C   /cbio/jclab/home/luirink/anaconda/bin/python   130MiB |
|    3      1046    C   /cbio/jclab/home/misik/anaconda/bin/python      80MiB |
|    3     10708    C   /cbio/jclab/home/luirink/anaconda/bin/python   128MiB |
+-----------------------------------------------------------------------------+

karalets commented 8 years ago

Alright, I misinterpreted that and thought these were single jobs/GPUs.

Thank you, Paul. I will probably need to deal with this diplomatically ;)

jchodera commented 8 years ago

I show luirink as having no running jobs, so her jobs should not still be running on GPUs:

[chodera@mskcc-ln1 ~/scripts]$ qstat -u luirink
[chodera@mskcc-ln1 ~/scripts]$ 

It looks like this is related to https://github.com/cBio/cbio-cluster/issues/409#issuecomment-218166763

karalets commented 8 years ago

John, do you think it would be possible to keep 5-10 GPUs usable (unoccupied by yank) until Friday (NIPS deadline)?

I really will not be needing any more, but just having 3 as is the case now is a bit tight.

tatarsky commented 8 years ago

I show processes by luirink on gpu-2-5. If the code is not exiting cleanly it will need to be fixed.

luirink  13650 93.9  0.2 308770540 739620 ?    Rl   May11 8806:31 /cbio/jclab/home/luirink/anaconda/bin/python /cbio/jclab/home/luirink/anaconda/bin/yank run --store=output --verbose --mpi -i 5000 --platform=OpenCL --phase=complex-explicit
luirink  13651 94.4  0.2 308770796 743364 ?    Rl   May11 8848:40 /cbio/jclab/home/luirink/anaconda/bin/python /cbio/jclab/home/luirink/anaconda/bin/yank run --store=output --verbose --mpi -i 5000 --platform=OpenCL --phase=complex-explicit
luirink  13652 92.8  0.2 308770800 729816 ?    Rl   May11 8704:07 /cbio/jclab/home/luirink/anaconda/bin/python /cbio/jclab/home/luirink/anaconda/bin/yank run --store=output --verbose --mpi -i 5000 --platform=OpenCL --phase=complex-explicit
luirink  13653 94.2  0.2 308770536 739504 ?    Rl   May11 8834:58 /cbio/jclab/home/luirink/anaconda/bin/python /cbio/jclab/home/luirink/anaconda/bin/yank run --store=output --verbose --mpi -i 5000 --platform=OpenCL --phase=complex-explicit

tatarsky commented 8 years ago

But basically @karalets the statement in this Git issue that "the cluster seems to prefer to let the gpus gather dust over giving them to me" is incorrect, and I'm still waiting to hear how you determined that so we can resolve the actual support request. I show yank consuming all the GPUs.

tatarsky commented 8 years ago

I see your note, @karalets. Ignore my question above. If you want my method of parsing the qstat output, I can document it.

jchodera commented 8 years ago

@MehtapIsik: Can you halt one or more of your yank jobs right now so @karalets can use some GPUs?

karalets commented 8 years ago

Paul @tatarsky, as I said earlier, I must have made a mistake: I was just counting the number of jobs, not the number of GPUs each of these jobs was using. So please put my comments down to my ignorance of the cluster diagnostics.

tatarsky commented 8 years ago

Noted your Git response already; the comments just arrived out of order. I am happy to share my method of determining GPU usage.

karalets commented 8 years ago

I was doing 'qstat | grep gpu' which obviously was suboptimal.

tatarsky commented 8 years ago

Yeah that doesn't work.

tatarsky commented 8 years ago

My output was from a cheap, quick method. There are others:

qstat -1tnr|grep " gpu "

jchodera commented 8 years ago

I'll look into why mpirun is not causing all processes to cleanly exit when terminated.

@tatarsky: Have you seen threads like this one noting that this might be due to mpirun/mpiexec launching processes in a way that Torque is unable to keep track of? I wonder if there's a different argument we can pass to mpirun that ensures Torque is notified of the processes it should track.

tatarsky commented 8 years ago

I can check the old Relion notes, in which I dimly recall something similar.

jchodera commented 8 years ago

The hydra mpirun has the following options that may be relevant here:

Hydra specific options (treated as global):

  Launch options:
    -launcher                        launcher to use ( ssh rsh fork slurm ll lsf sge manual persist)
    -launcher-exec                   executable to use to launch processes

karalets commented 8 years ago

@tatarsky thank you, I will use that henceforth.

@jchodera it would be great if we could get some of the yank jobs to run later, if that does not hurt too much.

jchodera commented 8 years ago

(@karalets: I'm in Boston, but @MehtapIsik is just upstairs in Z11 if you want to go find her in person!)

tatarsky commented 8 years ago

There is probably a better way, but that's my cheap way. The spaces around "gpu" prevent the grep from also matching hostnames, which contain "gpu" as well.
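
A slightly sturdier variant (just a sketch; it assumes the queue name is the third whitespace-separated column of qstat -1tnr output, as in the listing earlier in this thread) filters on that column instead of grepping the whole line, so it cannot accidentally match hostnames:

qstat -1tnr | awk '$3 == "gpu"'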

jchodera commented 8 years ago

There's also this:

  Resource management kernel options:
    -rmk                             resource management kernel to use ( user slurm ll lsf sge pbs)

jchodera commented 8 years ago

This page suggests:

2) Hydra integrates with PBS-like DRMs (PBSPro, Torque). The integration means that, for example, you don't have to provide a list of hosts to mpiexec since the list of granted nodes is obtained automatically. It also uses the native tm interface of PBS to launch and monitor remote processes.

I'm still investigating whether this requires us to pass a specific flag to mpirun.
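
For reference, a test invocation along these lines might look like the following. This is only a sketch: the -rmk pbs flag is taken from the option list above, and it is not yet confirmed that it actually makes Torque track the spawned processes.

mpirun -rmk pbs -n 12 yank run --store=output --verbose --mpi -i 5000 --platform=OpenCL --phase=complex-explicit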

tatarsky commented 8 years ago

My Relion memory may not be relevant. That issue involved a wrapper script that was not properly providing mpirun with the nodefile, or was blocking its ability to pick up the one provided by the Torque environment.

At least that is my brief review of the old issue.

The MPI "tm" interface was indeed discussed. But I don't recall a flag for more better exiting. Git #329 has all sorts of stuff in it. Probably of no relevance.

tatarsky commented 8 years ago

Do you wish all luirink node processes killed off before I scrub the name from the Git page? Or does it help to leave some to debug why they are not dead?

tatarsky commented 8 years ago

I'm changing the Git issue name so I can focus on that part. I will mention there is a Tesla sitting idle in cc27, @karalets.

tatarsky commented 8 years ago

Hmm. I may need to adjust that system's gpu queue oversubscription rule though. Let me check on that.

tatarsky commented 8 years ago

Something on that system isn't as I expect, so ignore that comment until I can figure it out. The batch overflow jobs are not allowing the gpu to be scheduled there. Even though it is only one gpu, it would be nice to see it used. I will re-open the item involving this system.

jchodera commented 8 years ago

Apologies for the delay; I'm in Boston for a conference I have organized.

> Do you wish all luirink node processes killed off before I scrub the name from the Git page?

Please do if they have not already terminated!

jchodera commented 8 years ago

Debugging the hung process issue will probably involve launching new MPI jobs with various combinations of mpirun arguments, killing them through the queue, and checking if all spawned processes were correctly killed.
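
As a rough illustration of that loop (a sketch only; the resource request, walltime, and command line are placeholders, not the exact job used):

#!/bin/bash
#PBS -q gpu
#PBS -l nodes=1:ppn=4:gpus=4
#PBS -l walltime=00:10:00
# Launch a short MPI yank job, let the queue kill it at the walltime limit,
# then check the node afterwards for leftover processes, e.g. with
#   pdsh -g gpu "ps aux | grep [y]ank"
cd $PBS_O_WORKDIR
mpirun yank run --store=output --verbose --mpi -i 5000 --platform=OpenCL --phase=complex-explicit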

jchodera commented 8 years ago

I have some leads based on this thread but it looks like debugging will take days or weeks. Current queue time for 4 GPUs is 15 hours. Will report back after this job [7224425] runs (with 10 min wallclock limit) and terminates.

jchodera commented 8 years ago

@karalets : Did you get a hold of @MehtapIsik in person to curtail her jobs? I seem to be unable to reach her via GitHub or email.

jchodera commented 8 years ago

I wonder if we need a way for group leaders to kill jobs from group members as an emergency backup for cases like this...?

jchodera commented 8 years ago

Finally got a hold of her! She will reduce GPU usage. Sorry for the delay.

tatarsky commented 8 years ago

I am looking for remaining luirink processes.

tatarsky commented 8 years ago

I do not see any such processes this morning unless I am doing something incorrect.

karalets commented 8 years ago

Sorry to reopen this, but I just got some GPUs that still have a bunch of leftover processes running even though the jobs are no longer on the scheduler.

For instance on gpu-2-11.

This is pretty serious; the GPUs are effectively dead when this happens. Can everyone who runs these jobs please check periodically for any holdover processes until the problem is solved automatically? It would suffice to kill them manually once a job finishes on the scheduler, especially now that we are aware of the issue.

Best,

Theo

jchodera commented 8 years ago

Sorry about this. This has been difficult to debug because long queue times mean I only get to try one thing a day to see whether cleanup is problematic. My tests have not been able to reproduce this so far.

I wonder if it would be possible for us to have exclusive batch queue access to the GPUs on a single node to quickly debug over the weekend? Maybe if the gpu label was changed to gpudebug for the weekend?

karalets commented 8 years ago

I do not mean to complain; I will just keep pointing out nodes where I come across this until it is solved. I understand it is tedious.

MehtapIsik commented 8 years ago

Is there a way to get a list of all processes running on all GPUs at once? I am checking all possibly affected GPUs one by one by ssh'ing into them. I found a few more GPUs with holdover yank jobs (from my timed-out yank jobs on the queue).

tatarsky commented 8 years ago

@jchodera changing the node gpu property label will only stop GPU queue requests that ask for a particular type of gpu (e.g. gtx680) from going there. If the qsub does not specify that resource tag, the GPU is still schedulable by any gpu queue request.

But you could at least predict the node to monitor. I can also ADD a property such as "gpudebug" to a set of nodes to help you land on a set of nodes more predictably. Tell me if that helps more.

I can also offline a node but suspect you need Torque to fully debug.

I am happy to change the tag on a single node, however; I just wanted the above to be clear.

Have you folks considered epilogue scripts to clean up, or does MPI make that not work?
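
For concreteness, a per-node Torque epilogue for this kind of cleanup could look roughly like the following. This is only a sketch: the argument positions follow the standard Torque epilogue convention ($1 = job id, $2 = job owner), and the install path, process pattern, and signal would all need checking before deploying.

#!/bin/bash
# Installed as mom_priv/epilogue on each compute node; runs after every job.
JOBID=$1
JOBUSER=$2
# Kill any yank processes still owned by the job owner on this node.
pkill -9 -u "$JOBUSER" -f "yank run" || true
exit 0

The obvious caveat is that this would also kill processes belonging to any other job of the same user sharing that node, so it is a blunt instrument.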

tatarsky commented 8 years ago

@MehtapIsik I tend to use pdsh for rapid checks and there is a group defined for gpu nodes:

pdsh -g gpu "ps aux|grep yank"

or whatever grep pattern or method you prefer. Currently the pdsh class "gpu" also contains the Fuchs group nodes, as they contain GPUs but are not in the gpu queue at this time.
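
Another option, if the installed nvidia-smi supports the query flags (an assumption; older drivers may not), is to ask each node directly which processes are on its GPUs:

pdsh -g gpu "nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv,noheader"

That lists PID, process name, and GPU memory usage for every compute process across the gpu group in one pass.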

tatarsky commented 8 years ago

Also there are a fair number of free gpus at the moment.

tatarsky commented 8 years ago

Ah, but slots/RAM are a bit short. Let me see if I can tweak something there.

tatarsky commented 8 years ago

I added the node property gpudebug to a set of titan nodes (gpu-2-14-17). It might at least help you narrow down the systems a test job will get scheduled on. I LEFT the gtxtitan resource in place.

I do not know enough about user reservations to try, on a Friday, to guarantee that you get priority on those nodes. Perhaps next week.
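
Assuming the usual Torque syntax for node properties (worth double-checking on this cluster), a debug job could then be pointed at those nodes with something like:

qsub -I -q gpu -l nodes=1:ppn=4:gpus=4:gpudebug -l walltime=00:10:00

i.e. the gpudebug property is simply appended to the nodes= request alongside the other resources.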

tatarsky commented 8 years ago

Hmm. And it may not be working as I expect. So if it doesn't seem to do what you need I'll have to revisit it.

tatarsky commented 8 years ago

There are, however, now several free GPUs on the cluster, but I don't know how long that will last.

jchodera commented 8 years ago

Thanks! Will debug over the weekend! Traveling back from Boston tonight.

@karalets: This is a serious issue and needs to be addressed ASAP! Thanks for pointing it out!