cBio / cbio-cluster

MSKCC cBio cluster documentation

Yank processes not cleanly exiting #415

Open · karalets opened this issue 8 years ago

karalets commented 8 years ago

Hi,

I am trying to get some GPUs in interactive mode and I am really having a hard time getting them. Normally this would be fine and I'd wait until the cluster clears up, but... the GPUs are simply not really in use, so I am not competing against anybody to get them.

As such, the cluster seems to prefer to let the GPUs gather dust over giving them to me. Shocker!

Is there any explanation for this? Or, to be actionable: can I do something to change that?

Best and thanks,

Theo

jchodera commented 8 years ago

Starting to debug now. Thanks again, @tatarsky!

@karalets: I notice that your job 7249830 has reserved the four GPUs on gpu-2-14, but is only running processes on one of them:

[chodera@gpu-2-14 ~]$ nvidia-smi
Sat May 21 10:43:01 2016       
+------------------------------------------------------+                       
| NVIDIA-SMI 352.39     Driver Version: 352.39         |                       
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX TITAN   Off  | 0000:03:00.0     Off |                  N/A |
| 30%   31C    P8    13W / 250W |     15MiB /  6143MiB |      0%    E. Thread |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX TITAN   Off  | 0000:04:00.0     Off |                  N/A |
| 30%   31C    P8    13W / 250W |     15MiB /  6143MiB |      0%    E. Thread |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX TITAN   Off  | 0000:83:00.0     Off |                  N/A |
| 30%   31C    P8    12W / 250W |     15MiB /  6143MiB |      0%    E. Thread |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX TITAN   Off  | 0000:84:00.0     Off |                  N/A |
| 30%   31C    P8    14W / 250W |    205MiB /  6143MiB |      0%    E. Thread |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    3     29114    C   ...grlab/home/karaletsos/anaconda/bin/python   189MiB |
+-----------------------------------------------------------------------------+

This may be intended, so no worries if so, but it seems like it may be a sign that something is wrong!

jchodera commented 8 years ago

Looks like the 'gtxdebug' and 'gtxtitan' resources are all booked up for at least a day, which will make it very difficult to debug processes that remain on the GPUs (since these show running processes).

I've tried to replicate this problem on the GTX-680s, but am unable to. The jobs all terminate cleanly when Torque kills the master process.

tatarsky commented 8 years ago

Seeing if I can spot some other titans to apply the tag.

jchodera commented 8 years ago

@tatarsky: If sometime this week you have an idea about how to reserve a node of GTX-TITANs for us to use to debug (via torque), I can sit down with @MehtapIsik and interactively try to see if there are any issues with this specifically or with her environment that may be causing this problem. In the meantime, I am unable to reproduce, and can't debug further until GTX-TITAN nodes are free.

@karalets: Will try debugging again tomorrow in case some nodes are free.

tatarsky commented 8 years ago

BTW, the tag is gpudebug. I've added it to three more nodes and removed it from the ones with @karalets's jobs.

tatarsky commented 8 years ago

No, I'm wrong. He's got jobs on lots of the gtxtitans. This will require coordination next week.

tatarsky commented 8 years ago

Well, as you also note, he seems to be requesting four GPUs but nvidia-smi shows only one GPU in use, at least in a spot check of gpu-2-10.

checkjob -v -v 7249831
job 7249831 (RM job '7249831.hal-sched1.local')

AName: STDIN
State: Running 
Creds:  user:karaletsos  group:grlab  class:gpu  qos:preemptorgpu
WallTime:   18:10:38 of 3:00:00:00
SubmitTime: Fri May 20 16:51:17
  (Time Queued  Total: 00:00:03  Eligible: 00:00:03)

StartTime: Fri May 20 16:51:20
TemplateSets:  DEFAULT
Total Requested Tasks: 1
Total Requested Nodes: 0

Req[0]  TaskCount: 1  Partition: MSKCC
Available Memory >= 0  Available Swap >= 0
Opsys: ---  Arch: ---  Features: gtxtitan
Dedicated Resources Per Task: PROCS: 1  GPUS: 4

And unless I'm reading this wrong:

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    3     10435    C   ...grlab/home/(him)/anaconda/bin/python   192MiB |
+-----------------------------------------------------------------------------+
[root@gpu-2-10 ~]# 

jchodera commented 8 years ago

@karalets has reserved all four GPUs on those nodes and locked all of the GPUs in thread-exclusive mode, even though he is only using one:

[chodera@gpu-2-12 ~]$ nvidia-smi
Sat May 21 11:02:53 2016       
+------------------------------------------------------+                       
| NVIDIA-SMI 352.39     Driver Version: 352.39         |                       
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX TITAN   Off  | 0000:03:00.0     Off |                  N/A |
| 30%   27C    P8    13W / 250W |     15MiB /  6143MiB |      0%    E. Thread |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX TITAN   Off  | 0000:04:00.0     Off |                  N/A |
| 30%   30C    P8    12W / 250W |     15MiB /  6143MiB |      0%    E. Thread |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX TITAN   Off  | 0000:83:00.0     Off |                  N/A |
| 30%   29C    P8    13W / 250W |     15MiB /  6143MiB |      0%    E. Thread |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX TITAN   Off  | 0000:84:00.0     Off |                  N/A |
| 30%   29C    P8    14W / 250W |    205MiB /  6143MiB |      0%    E. Thread |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    3     13278    C   ...grlab/home/karaletsos/anaconda/bin/python   189MiB |
+-----------------------------------------------------------------------------+

I'm guessing this is unintentional.

@karalets: Might want to check what's going on with your code.

tatarsky commented 8 years ago

Yep. Seeing the same.

jchodera commented 8 years ago

In the meantime, thanks for the weekend help, @tatarsky, and let's connect up during the week to further debug if needed!

tatarsky commented 8 years ago

Fair enough.

jchodera commented 8 years ago

I've found a free GTX-TITAN-X node (gpu-2-5) and am testing that now.

karalets commented 8 years ago

I have opened up some of them after reading this.

jchodera commented 8 years ago

Thanks!

@karalets: Was the use of only 1 of the 4 GPUs expected?

karalets commented 8 years ago

I sometimes reserve a bunch and use a variable amount when I am trying out new code. This is interactive, so it mimics an environment where I can debug and try out stuff. The merits of not having a devbox, I guess.

jchodera commented 8 years ago

Ah, OK! Thanks for the clarification!

jchodera commented 8 years ago

I can reproduce this on gpu-2-5, the GTX-TITAN-X node! That's a start!

jchodera commented 8 years ago

I've tried this dozens of times, but I can't reproduce the problem consistently. It happened once, but I haven't been able to get it to happen again.

MehtapIsik commented 8 years ago

I suspect it mostly happens when queue jobs time out.

jchodera commented 8 years ago

I'm currently trying to harden the YANK code with an explicit call to MPI.Abort() on interrupt, following this thread.

@tatarsky: Do you know what signal Torque sends when killing jobs that hit their resource limits? The following dump to stdout/stderr suggests a signal 15 (SIGTERM) is sent, and that I should intercept this and make sure MPI.Abort() is explicitly called before actual termination:

=>> PBS: job killed: walltime 1340 exceeded limit 1320
Terminated

=====================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   EXIT CODE: 15
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
=====================================================================================
APPLICATION TERMINATED WITH THE EXIT STRING: Terminated (signal 15)

Are there other signals Torque might send that I should worry about too? I think it might send a SIGKILL a little later if the process still hasn't terminated?
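
For reference, here's a minimal sketch of the kind of handler I have in mind (assuming mpi4py; the function name and messages are illustrative, not the actual YANK change):

import signal
import sys

from mpi4py import MPI

def _abort_on_sigterm(signum, frame):
    # Torque sends SIGTERM when the walltime limit is exceeded; calling
    # MPI.Abort() tears down every rank so no orphaned processes are left
    # holding GPUs on the node.
    sys.stderr.write("Received signal %d; calling MPI.Abort()\n" % signum)
    sys.stderr.flush()
    MPI.COMM_WORLD.Abort(1)

signal.signal(signal.SIGTERM, _abort_on_sigterm)

# ... normal MPI workload continues here ...

The point of MPI.Abort() (as opposed to simply exiting) is that it makes a best attempt to abort all tasks in the communicator, not just the rank that caught the signal.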

tatarsky commented 8 years ago

I believe SIGTERM is correct. And it does indeed send SIGKILL after some period, which I dimly recall is also tunable but defaults to 60 seconds.

There is also a command called qsig which, IIRC, can send signals to a job if you wanted to test.

I believe it also logs when it sends the signal with something like this:

20160329:03/29/2016 10:07:56;0008;PBS_Server.2241;Job;7058923.hal-sched1.local;Job sent signal SIGTERM on delete

tatarsky commented 8 years ago

"I suspect it mostly happens when queue jobs time out."

I assume this means when the jobs hit a walltime limit and are killed. Is there a major problem in simply setting the walltime higher and letting the jobs complete without such an event?

Or is the walltime limit being used to control the usage of the job?

It's a subtle point, but why not set the walltime to a value that better matches the needs of the job?

jchodera commented 8 years ago

The actual jobs may take many days, but the walltime limit is being used to break the jobs into more queue-neighbor-friendly chunks. So it is a significant problem if our code doesn't cleanly exit when requested to do so!

Still tinkering with MPI.Abort() calls...