karalets opened this issue 8 years ago (status: Open)
Starting to debug now. Thanks again, @tatarsky!
@karalets: I notice that your job 7249830 has reserved the four GPUs on gpu-2-14, but is only running processes on one of them:
[chodera@gpu-2-14 ~]$ nvidia-smi
Sat May 21 10:43:01 2016
+------------------------------------------------------+
| NVIDIA-SMI 352.39 Driver Version: 352.39 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX TITAN Off | 0000:03:00.0 Off | N/A |
| 30% 31C P8 13W / 250W | 15MiB / 6143MiB | 0% E. Thread |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX TITAN Off | 0000:04:00.0 Off | N/A |
| 30% 31C P8 13W / 250W | 15MiB / 6143MiB | 0% E. Thread |
+-------------------------------+----------------------+----------------------+
| 2 GeForce GTX TITAN Off | 0000:83:00.0 Off | N/A |
| 30% 31C P8 12W / 250W | 15MiB / 6143MiB | 0% E. Thread |
+-------------------------------+----------------------+----------------------+
| 3 GeForce GTX TITAN Off | 0000:84:00.0 Off | N/A |
| 30% 31C P8 14W / 250W | 205MiB / 6143MiB | 0% E. Thread |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 3 29114 C ...grlab/home/karaletsos/anaconda/bin/python 189MiB |
+-----------------------------------------------------------------------------+
This may be intended, so no worries if so---but it seems like it may be a sign something is wrong!
Looks like the 'gtxdebug' and gtxtitan resources are all booked up for at least a day, which will make it very difficult to debug processes that remain on the GPUs (since these nodes show running processes).
I've tried to replicate this problem on the GTX-680s, but am unable to. The jobs all terminate cleanly when torque kills the master process.
Seeing if I can spot some other titans to apply the tag.
@tatarsky: If sometime this week you have an idea about how to reserve a node of GTX-TITANs for us to use to debug (via torque), I can sit down with @MehtapIsik and interactively try to see if there are any issues with this specifically or with her environment that may be causing this problem. In the meantime, I am unable to reproduce, and can't debug further until GTX-TITAN nodes are free.
@karalets: Will try debugging again tomorrow in case some nodes are free.
BTW the tag is gpudebug. I've added it to three more and removed it from the ones with @karalets jobs.
No, I'm wrong. He's got jobs on lots of the gtxtitans. This will require coordination next week.
Well, also, as you note, he seems to be requesting four GPUs, but nvidia-smi shows only one GPU in use, at least in a spot check of gpu-2-10.
checkjob -v -v 7249831
job 7249831 (RM job '7249831.hal-sched1.local')
AName: STDIN
State: Running
Creds: user:karaletsos group:grlab class:gpu qos:preemptorgpu
WallTime: 18:10:38 of 3:00:00:00
SubmitTime: Fri May 20 16:51:17
(Time Queued Total: 00:00:03 Eligible: 00:00:03)
StartTime: Fri May 20 16:51:20
TemplateSets: DEFAULT
Total Requested Tasks: 1
Total Requested Nodes: 0
Req[0] TaskCount: 1 Partition: MSKCC
Available Memory >= 0 Available Swap >= 0
Opsys: --- Arch: --- Features: gtxtitan
Dedicated Resources Per Task: PROCS: 1 GPUS: 4
And, unless I'm reading this wrong:
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 3 10435 C ...grlab/home/(him)/anaconda/bin/python 192MiB |
+-----------------------------------------------------------------------------+
[root@gpu-2-10 ~]#
@karalets has reserved all four GPUs on those nodes and locked all of the GPUs in thread-exclusive mode, even though he is only using one:
[chodera@gpu-2-12 ~]$ nvidia-smi
Sat May 21 11:02:53 2016
+------------------------------------------------------+
| NVIDIA-SMI 352.39 Driver Version: 352.39 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX TITAN Off | 0000:03:00.0 Off | N/A |
| 30% 27C P8 13W / 250W | 15MiB / 6143MiB | 0% E. Thread |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX TITAN Off | 0000:04:00.0 Off | N/A |
| 30% 30C P8 12W / 250W | 15MiB / 6143MiB | 0% E. Thread |
+-------------------------------+----------------------+----------------------+
| 2 GeForce GTX TITAN Off | 0000:83:00.0 Off | N/A |
| 30% 29C P8 13W / 250W | 15MiB / 6143MiB | 0% E. Thread |
+-------------------------------+----------------------+----------------------+
| 3 GeForce GTX TITAN Off | 0000:84:00.0 Off | N/A |
| 30% 29C P8 14W / 250W | 205MiB / 6143MiB | 0% E. Thread |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 3 13278 C ...grlab/home/karaletsos/anaconda/bin/python 189MiB |
+-----------------------------------------------------------------------------+
I'm guessing this is unintentional.
@karalets: Might want to check what's going on with your code.
Yep. Seeing the same.
In the meantime, thanks for the weekend help, @tatarsky, and let's connect up during the week to further debug if needed!
Fair enough.
I've found a free GTX-TITAN-X node (gpu-2-5) and am testing that now.
I have opened up some of them after reading this.
Thanks!
@karalets: Was the use of 1/4 GPUs expected?
I sometimes reserve a bunch and use a variable amount when I am trying out new code. This is interactive, so it mimics an environment where I can debug and try out stuff. The merits of not having a devbox, I guess.
Ah, OK! Thanks for the clarification!
I can reproduce this on gpu-2-5, the GTX-TITAN-X node! That's a start!
I've tried this dozens of times, but I can't seem to consistently reproduce this problem. It happened once, but I don't seem to be able to get it to happen again.
I suspect it mostly happens when queue jobs time out.
I'm currently trying to harden the YANK code with an explicit call to MPI.Abort() on interrupt, following this thread.
@tatarsky: Do you know what signal Torque sends when killing jobs that hit their resource limits? The following dump to stdout/stderr suggests a signal 15 (SIGTERM) is sent, and that I should intercept this and make sure MPI.Abort() is explicitly called before actual termination:
=>> PBS: job killed: walltime 1340 exceeded limit 1320
Terminated
=====================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= EXIT CODE: 15
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
=====================================================================================
APPLICATION TERMINATED WITH THE EXIT STRING: Terminated (signal 15)
Are there other signals Torque might send that I should worry about too? I think it might send a SIGKILL a little later if the process still hasn't terminated?
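For concreteness, here's a minimal sketch of what I have in mind (assuming mpi4py; the handler name is illustrative, not the actual YANK code):

# Minimal sketch (assuming mpi4py): install a SIGTERM handler that calls
# MPI.Abort() so all ranks are torn down and no orphaned processes are left
# holding the GPUs.
import signal
import sys

from mpi4py import MPI


def _abort_on_term(signum, frame):
    # Torque sends SIGTERM first when a resource limit is hit; SIGKILL
    # (which cannot be caught) follows later if the job is still running.
    sys.stderr.write("Caught signal %d; calling MPI.Abort()\n" % signum)
    MPI.COMM_WORLD.Abort(1)


signal.signal(signal.SIGTERM, _abort_on_term)

# ... normal simulation work would go here ...

Since SIGKILL can't be caught, something like this would only help during the grace period before Torque escalates.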
I believe SIGTERM is correct. And it does indeed send SIGKILL after some period which I dimly recall is also tunable but defaults to 60 seconds.
There is also a command called qsig which can send signals to a job, IIRC, if you wanted to test.
I believe it also logs when it sends the signal with something like this:
20160329:03/29/2016 10:07:56;0008;PBS_Server.2241;Job;7058923.hal-sched1.local;Job sent signal SIGTERM on delete
"I suspect it mostly happens when queue jobs time out."
I assume this means when the jobs hit a walltime limit and are killed. Is there a major problem with simply setting the walltime higher and letting the jobs complete without such an event?
Or is the walltime limit being used to control the usage of the job?
It's a subtle point, but why not set the walltime to a value that better matches the needs of the job?
The actual jobs may take many days, but the walltime limit is being used to break the jobs into more queue-neighbor-friendly chunks. So it is a significant problem if our code doesn't cleanly exit when requested to do so!
Still tinkering with MPI.Abort() calls...
Hi,
I am trying to get some gpus in interactive mode and I am really having a hard time getting them. Normally this would be fine and I'd wait until the cluster clears up, but... the gpus are simply not really in use, so I am not competing against anybody to get them.
As such, the cluster seems to prefer to let the gpus gather dust over giving them to me. Shocker!
Is there any explanation for this? Or, to be actionable: can I do something to change that?
Best and thanks,
Theo