corcra opened this issue 8 years ago
Try the active queue.
qsub -I -q active -l walltime=01:00:00 -l nodes=1:ppn=1:gpus=4:shared:gtxtitans
You shouldn't need to use the active queue---the constraints should still work. The active queue just has different priorities. Hm...
Using the active queue didn't fix it.
Although, I just managed to get 'good' (aka conforming to my request) GPUs on gg06 and gg01. Included nodes=1:ppn=1 in the qsub call, although I don't see how that should be relevant...
This worked correctly for me when I included nodes=1:ppn=4:
qsub -I -l walltime=04:00:00,nodes=1:ppn=4:gpus=4:shared:gtxtitan -l mem=4G -q gpu
I wonder why omitting the nodes=1:ppn=X gives incorrect resource requests...
Oh, there are some problems with GPU spillover though. I was allocated gpu-2-14 and found the GPUs are tied up with something already:
[chodera@gpu-2-14 ~]$ nvidia-smi
Thu Jun 9 13:41:35 2016
+------------------------------------------------------+
| NVIDIA-SMI 352.39 Driver Version: 352.39 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX TITAN Off | 0000:03:00.0 Off | N/A |
| 30% 34C P8 14W / 250W | 5872MiB / 6143MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX TITAN Off | 0000:04:00.0 Off | N/A |
| 30% 33C P8 14W / 250W | 5771MiB / 6143MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 GeForce GTX TITAN Off | 0000:83:00.0 Off | N/A |
| 30% 34C P8 14W / 250W | 85MiB / 6143MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 GeForce GTX TITAN Off | 0000:84:00.0 Off | N/A |
| 30% 34C P8 14W / 250W | 85MiB / 6143MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 23613 C /usr/bin/python 5854MiB |
| 1 5220 C /usr/bin/python 5683MiB |
| 1 23613 C /usr/bin/python 68MiB |
| 2 23613 C /usr/bin/python 68MiB |
| 3 23613 C /usr/bin/python 68MiB |
+-----------------------------------------------------------------------------+
This seems to be a docker job that is using GPUs but didn't request them, or is still running after supposedly being killed by torque:
1164 5107 0.0 0.0 153340 9356 pts/0 Sl+ Jun08 0:01 docker run -it -v /usr/lib64/libcuda.so:/usr/lib64/libcuda.so -v /usr/lib64/libcuda.so.1:/usr/lib64/libcuda.so.1 -v /usr/lib64/libcuda.so.352.39:/usr/lib64/libcuda.so.352.39 --device /dev/nvidia-uvm:/dev/nvidia-uvm --device /dev/nvidia0:/dev/nvidia0 --device /dev/nvidia1:/dev/nvidia1 --device /dev/nvidia2:/dev/nvidia2 --device /dev/nvidia3:/dev/nvidia3 --device /dev/nvidia4:/dev/nvidia4 --device /dev/nvidia5:/dev/nvidia5 --device /dev/nvidia6:/dev/nvidia6 --device /dev/nvidia7:/dev/nvidia7 --device /dev/nvidiactl:/dev/nvidiactl --env LD_LIBRARY_PATH=/opt/mpich2/gcc/eth/lib:/opt/gnu/gcc/4.8.1/lib64:/opt/gnu/gcc/4.8.1/lib:/opt/gnu/gmp/lib:/opt/gnu/mpc/lib:/opt/gnu/mpfr/lib:/usr/lib64/ --env CUDA_VISIBLE_DEVICES=1 -v /cbio/grlab/home/dresdnerg/software:/mnt/software -v /cbio/grlab/home/dresdnerg/projects/tissue-microarray-resnet/:/mnt/tma-resnet -it gmd:cudnn4 ipython
It's impossible to tell who is/was running that docker job, but they are processing data in @gideonite's directory.
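As an aside, for anyone hitting the same "GPUs already busy" symptom, a couple of quick checks can identify who is holding GPU memory before digging into docker. This is only a sketch: the --query-compute-apps fields are an assumption for this driver generation, and PID 23613 is the one from the listing above.

# Per-process GPU memory usage in machine-readable form:
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv
# PIDs that have the NVIDIA device files open (fuser comes from psmisc):
fuser -v /dev/nvidia*
# Who owns a suspect PID and how long it has been running:
ps -o user,ppid,etime,cmd -p 23613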
I will look in a moment. I have been on the road all morning.
I was running a docker container in an active session on gpu-2-14 but the process should have stopped using GPU resources sometime yesterday evening. Perhaps nvidia-smi is reporting memory which is "allocated but collectible," though I'm not sure that makes sense or is a valid state to be in.
I requested the node by running qsub -I -l nodes=1:gpus=1:gtxtitan:docker:shared -q active. Should I have done something different? @jchodera
I have a dim memory of seeing this before: without the "nodes" stanza, qsub does not do what is expected. I would need to locate the GitHub issue or Torque ticket that matches that part of my memory. As for the docker item, I'd have to investigate that as well if you feel the state of the card is wrong.
The docker flag seems to be working fine!
For the gpu-2-14 docker and nvidia GPU resources item, lsof shows that these processes appear to still have the nvidia devices open:
ipython 23613 root mem REG 0,32 35460148 /dev/nvidia2 (path dev=0,5, inode=21968)
ipython 23613 root mem REG 0,32 35460147 /dev/nvidia1 (path dev=0,5, inode=21673)
ipython 23613 root mem REG 0,32 35460146 /dev/nvidia0 (path dev=0,5, inode=21282)
ipython 23613 root mem REG 0,32 35460149 /dev/nvidia3 (path dev=0,5, inode=21979)
Those are docker processes (note the root owner). One way to narrow it down further, besides this being the only docker job on the system, is that the cwd shows:
/proc/23613/cwd -> /tf_data
That's within the chroot.
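The exact lsof invocation isn't shown above; something along these lines, run as root on the node, would produce a similar listing (treat the specific form as an assumption):

# List every process that has an NVIDIA device node open or mapped:
lsof /dev/nvidia*
# Or restrict to the suspect PID:
lsof -p 23613 | grep nvidia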
And if we expand that docker instance a bit, we see that ipython is associated with that one:
# docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
b9d872ee4a3a b.gcr.io/tensorflow/tensorflow:latest-devel-gpu "/bin/bash" 2 days ago Up 2 days 6006/tcp, 8888/tcp thirsty_tesla
# docker top b9d872ee4a3a
UID PID PPID C STIME TTY TIME CMD
root 18030 5210 0 Jun06 pts/1 00:00:00 /bin/bash
root 23613 18030 28 Jun07 pts/1 14:20:17 /usr/bin/python /usr/local/bin/ipython
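For completeness, one way to confirm the host-PID-to-container mapping without docker top is docker inspect; the container ID here is the one from the listing above, so this is just a sketch of the cross-check.

# Print the container's init PID on the host (should match the 18030 bash above):
docker inspect --format '{{.State.Pid}}' b9d872ee4a3a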
So I guess the question here is: "why isn't this correct when you've both, I believe, requested "shared" mode for the nvidia cards?" Or do I misunderstand that aside?
@corcra please note this is NOT related to your item.
I have reproduced this as you folks have: leaving off "nodes=X" appears to result in the behavior. Now I'm trying to remember where I remember this from.
Things pointing at /tf_data are mine; I currently have two docker jobs running... is this complicating matters?
No, I don't think it's complicating things. I think basically the syntax you've used at the start of this doesn't work properly, and we've talked about it before. I'm just trying to locate that conversation.
Ah, we may have noted something similar in #275 and Adaptive assigned me a bug number after confirming a resource parsing error. Let me see if I can spot anything on that. I don't recall ever seeing that bug being fixed.
They believe it's basically the same bug and that nodes=X is required at this time in that release. He is, however, checking where the bug number went for the one I reported to the developers many moons ago, as it seems to have fallen out of existence.
My preference with "enforced syntax" is that the parser should tell you it's wrong and not just "do something random" ;) I know I'm weird that way.
This is confirmed back in their bug system but not addressed. Please use nodes=X in qsub resource requests.
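Pulling the workaround together: always spell out the nodes stanza. A request in the style of the ones that worked earlier in this thread would look like the following; the walltime and memory values are just placeholders.

# Works: nodes/ppn given explicitly, so the GPU properties are parsed correctly
qsub -I -q gpu -l walltime=04:00:00,nodes=1:ppn=4:gpus=4:shared:gtxtitan -l mem=4G
# Triggers the parsing bug: bare gpus= request with no nodes= stanza
qsub -I -q gpu -l gpus=4:gtxtitan:docker:shared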
I am confused by / failing to write qsub commands that get the correct resources. For example, I ran:
qsub -I -q gpu -l gpus=4:gtxtitan:docker:shared
and got this setup (gpu-1-5, fwiw): the GPUs aren't shared, and aren't gtxtitans... what's going on here? I need both non-exclusive access and gtxtitan (or >gtx680 at least) to run Tensorflow, so this is problematic.
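A rough sketch of how to sanity-check what Torque actually handed you once the interactive session starts, before launching Tensorflow. Whether PBS_GPUFILE and CUDA_VISIBLE_DEVICES are populated depends on the cluster's Torque/prologue configuration, so treat those as assumptions.

# GPU slots Torque believes it assigned to this job (if the prologue sets it):
cat $PBS_GPUFILE
echo $CUDA_VISIBLE_DEVICES
# What the node actually has, and whether the cards are in shared ("Default") compute mode:
nvidia-smi -L
nvidia-smi --query-gpu=index,name,compute_mode --format=csv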