cBio / cbio-cluster

MSKCC cBio cluster documentation

getting the wrong GPUs #422

Open corcra opened 8 years ago

corcra commented 8 years ago

I'm confused by / failing to write qsub commands that get me the correct resources. For example, I ran qsub -I -q gpu -l gpus=4:gtxtitan:docker:shared and got this setup (on gpu-1-5, fwiw):

+------------------------------------------------------+
| NVIDIA-SMI 352.39     Driver Version: 352.39         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 680     Off  | 0000:03:00.0     N/A |                  N/A |
| 30%   32C    P8    N/A /  N/A |     48MiB /  4095MiB |     N/A    E. Thread |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 680     Off  | 0000:04:00.0     N/A |                  N/A |
| 30%   31C    P8    N/A /  N/A |     48MiB /  4095MiB |     N/A    E. Thread |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 680     Off  | 0000:83:00.0     N/A |                  N/A |
| 30%   31C    P8    N/A /  N/A |     48MiB /  4095MiB |     N/A    E. Thread |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX 680     Off  | 0000:84:00.0     N/A |                  N/A |
| 30%   30C    P8    N/A /  N/A |     48MiB /  4095MiB |     N/A    E. Thread |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0                  Not Supported                                         |
|    1                  Not Supported                                         |
|    2                  Not Supported                                         |
|    3                  Not Supported                                         |
+-----------------------------------------------------------------------------+

The GPUs aren't shared, and they aren't gtxtitans... what's going on here? I need both non-exclusive mode and gtxtitan (or at least something better than gtx680) to run TensorFlow, so this is problematic.
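
For reference, a quick way to confirm what the scheduler actually handed a job, from inside the interactive session, is something like the following (assuming Torque exports PBS_GPUFILE and CUDA_VISIBLE_DEVICES on this cluster as it usually does):

# list the cards by model as the driver sees them
nvidia-smi -L
# show which device indices the job was actually granted
echo $CUDA_VISIBLE_DEVICES
cat $PBS_GPUFILE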

nhgirija commented 8 years ago

Try the active queue.

qsub -I -q active -l walltime=01:00:00 -l nodes=1:ppn=1:gpus=4:shared:gtxtitans

jchodera commented 8 years ago

You shouldn't need to use the active queue---the constraints should still work. The active queue just has different priorities. Hm...

corcra commented 8 years ago

Using the active queue didn't fix it.

However, I just managed to get 'good' (i.e. conforming to my request) GPUs on gg06 and gg01. I included nodes=1:ppn=1 in the qsub call, although I don't see why that should be relevant...

jchodera commented 8 years ago

This worked correctly for me when I included nodes=1:ppn=4:

qsub -I -l walltime=04:00:00,nodes=1:ppn=4:gpus=4:shared:gtxtitan -l mem=4G -q gpu

I wonder why omitting the nodes=1:ppn=X gives incorrect resource requests...
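
It might help to compare what Torque actually parsed in each case. From inside the interactive job (assuming qstat -f is allowed from the compute nodes here), something like:

# show the resource list Torque recorded for the current job
qstat -f $PBS_JOBID | grep -i resource_list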

jchodera commented 8 years ago

Oh, there are some problems with GPU spillover though. I was allocated gpu-2-14 and found the GPUs are tied up with something already:


[chodera@gpu-2-14 ~]$ nvidia-smi
Thu Jun  9 13:41:35 2016       
+------------------------------------------------------+                       
| NVIDIA-SMI 352.39     Driver Version: 352.39         |                       
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX TITAN   Off  | 0000:03:00.0     Off |                  N/A |
| 30%   34C    P8    14W / 250W |   5872MiB /  6143MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX TITAN   Off  | 0000:04:00.0     Off |                  N/A |
| 30%   33C    P8    14W / 250W |   5771MiB /  6143MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX TITAN   Off  | 0000:83:00.0     Off |                  N/A |
| 30%   34C    P8    14W / 250W |     85MiB /  6143MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX TITAN   Off  | 0000:84:00.0     Off |                  N/A |
| 30%   34C    P8    14W / 250W |     85MiB /  6143MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0     23613    C   /usr/bin/python                               5854MiB |
|    1      5220    C   /usr/bin/python                               5683MiB |
|    1     23613    C   /usr/bin/python                                 68MiB |
|    2     23613    C   /usr/bin/python                                 68MiB |
|    3     23613    C   /usr/bin/python                                 68MiB |
+-----------------------------------------------------------------------------+

This seems to be a docker job that is using GPUs it didn't request, or one that is still running after supposedly being killed by Torque:

1164      5107  0.0  0.0 153340  9356 pts/0    Sl+  Jun08   0:01 docker run -it -v /usr/lib64/libcuda.so:/usr/lib64/libcuda.so -v /usr/lib64/libcuda.so.1:/usr/lib64/libcuda.so.1 -v /usr/lib64/libcuda.so.352.39:/usr/lib64/libcuda.so.352.39 --device /dev/nvidia-uvm:/dev/nvidia-uvm --device /dev/nvidia0:/dev/nvidia0 --device /dev/nvidia1:/dev/nvidia1 --device /dev/nvidia2:/dev/nvidia2 --device /dev/nvidia3:/dev/nvidia3 --device /dev/nvidia4:/dev/nvidia4 --device /dev/nvidia5:/dev/nvidia5 --device /dev/nvidia6:/dev/nvidia6 --device /dev/nvidia7:/dev/nvidia7 --device /dev/nvidiactl:/dev/nvidiactl --env LD_LIBRARY_PATH=/opt/mpich2/gcc/eth/lib:/opt/gnu/gcc/4.8.1/lib64:/opt/gnu/gcc/4.8.1/lib:/opt/gnu/gmp/lib:/opt/gnu/mpc/lib:/opt/gnu/mpfr/lib:/usr/lib64/ --env CUDA_VISIBLE_DEVICES=1 -v /cbio/grlab/home/dresdnerg/software:/mnt/software -v /cbio/grlab/home/dresdnerg/projects/tissue-microarray-resnet/:/mnt/tma-resnet -it gmd:cudnn4 ipython
jchodera commented 8 years ago

It's impossible to tell who is/was running that docker job, but they are processing data in @gideonite's directory.
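
One way to attribute that PID to a specific container (a sketch, assuming the standard Docker cgroup layout on these nodes):

# the cgroup path of a containerized process embeds the container ID
cat /proc/23613/cgroup
# then match that ID against the running containers
docker ps --no-trunc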

tatarsky commented 8 years ago

I will look in a moment. I have been on the road all morning.

gideonite commented 8 years ago

I was running a docker container in an active session on gpu-2-14 but the process should have stopped using GPU resources sometime yesterday evening. Perhaps nvidia-smi is reporting memory which is "allocated but collectible," though I'm not sure that makes sense or is a valid state to be in.

I requested the node by running qsub -I -l nodes=1:gpus=1:gtxtitan:docker:shared -q active. Should I have done something different? @jchodera

tatarsky commented 8 years ago

I have a dim memory of seeing this before: without the "nodes" stanza, qsub does not do what is expected. I would need to locate the GitHub issue or Torque ticket that matches that memory. As for the docker item, I'd have to investigate that as well if you feel the state of the card is wrong.

corcra commented 8 years ago

The docker flag seems to be working fine!

tatarsky commented 8 years ago

For the gpu-2-14 docker and nvidia GPU resources item I show via lsof that these processes appear to still have nvidia devices open.

ipython   23613      root  mem       REG               0,32              35460148 /dev/nvidia2 (path dev=0,5, inode=21968)
ipython   23613      root  mem       REG               0,32              35460147 /dev/nvidia1 (path dev=0,5, inode=21673)
ipython   23613      root  mem       REG               0,32              35460146 /dev/nvidia0 (path dev=0,5, inode=21282)
ipython   23613      root  mem       REG               0,32              35460149 /dev/nvidia3 (path dev=0,5, inode=21979)

Those are docker processes (note the root user). One way to narrow it down, besides it being the only docker job on the system, is that the cwd shows:

/proc/23613/cwd -> /tf_data

That's within the chroot.

And if we expand that docker instance a bit, we see that the ipython process is associated with it:

# docker ps
CONTAINER ID        IMAGE                                             COMMAND             CREATED             STATUS              PORTS                NAMES
b9d872ee4a3a        b.gcr.io/tensorflow/tensorflow:latest-devel-gpu   "/bin/bash"         2 days ago          Up 2 days           6006/tcp, 8888/tcp   thirsty_tesla       

 # docker top b9d872ee4a3a
UID                 PID                 PPID                C                   STIME               TTY                 TIME                CMD
root                18030               5210                0                   Jun06               pts/1               00:00:00            /bin/bash
root                23613               18030               28                  Jun07               pts/1               14:20:17            /usr/bin/python /usr/local/bin/ipython

So I guess the question here is: why isn't this correct when, I believe, you've both requested "shared" mode for the nvidia cards? Or do I misunderstand that aside?

@corcra please note this is NOT related to your item.

tatarsky commented 8 years ago

I have reproduced what you folks have seen: leaving off "nodes=X" appears to result in this behavior. Now I'm trying to remember where I've seen this before.

corcra commented 8 years ago

Things pointing at /tf_data are mine; I currently have two docker jobs running... is this complicating matters?

tatarsky commented 8 years ago

No, I don't think it's complicating things. I think the syntax you used at the start of this issue simply doesn't work properly, and we've talked about it before. I'm just trying to locate that conversation.

tatarsky commented 8 years ago

Ah, we may have noted something similar in #275, and Adaptive assigned me a bug number after confirming a resource-parsing error. Let me see if I can spot anything on that; I don't recall ever seeing that bug fixed.

tatarsky commented 8 years ago

They believe it's basically the same bug, and that nodes=X is required at this time in that release. He is, however, checking where the bug number I reported many moons ago went for the developers to fix, as it seems to have fallen out of existence.

My preference with "enforced syntax" is that the parser should tell you it's wrong rather than just "do something random" ;) I know I'm weird that way.

tatarsky commented 8 years ago

This is confirmed to still be in their bug system, but it has not been addressed. Please use nodes=X in qsub resource requests for now.
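
For reference, a request in the working form (a sketch based on the commands above; adjust walltime, ppn, and mem to your needs):

qsub -I -q gpu -l walltime=01:00:00 -l nodes=1:ppn=1:gpus=4:gtxtitan:docker:shared -l mem=4G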