Closed lzamparo closed 8 years ago
Short answer: yep!
The wiki contains some references to what you seek. I think it's pretty current.
https://github.com/cBio/cbio-cluster/wiki/MSKCC-cBio-Cluster-User-Guide#selecting-specific-gpu-types
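In practice that means adding the GPU-type node property to the resource request. Something along these lines, as a sketch of the Torque syntax (the job script name is a placeholder; check the wiki page above for the exact form the scheduler expects):

```bash
# Request one node carrying the "gtxtitan" property, with 4 cores and 1 GPU.
# Sketch of Torque/PBS syntax; "my_gpu_job.pbs" is a placeholder script name.
qsub -l nodes=1:ppn=4:gpus=1:gtxtitan my_gpu_job.pbs
```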
To determine which systems have which GPU types, you can use the (admittedly hard to read) `pbsnodes` output and grep for the attribute, or frankly just ssh to the node and run `nvidia-smi -L`.
For example, gpu-2-17 has GTX Titans (`gtxtitan`) and gpu-1-11 has GTX 680s (`gtx680`).
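For instance, a quick check by hand (real node name from this cluster; the output lines are just an illustration of the `nvidia-smi -L` format, with UUIDs elided):

```bash
# List the GPUs on a candidate node by name.
ssh gpu-2-17 nvidia-smi -L
# Example output format:
#   GPU 0: GeForce GTX TITAN (UUID: GPU-...)
```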
BTW, are you sure you didn't just not use enough walltime?
`Job exceeded its walltime limit. Job was aborted`
For the job that 'succeeded', it ran through 10 epochs of training before the walltime expired, which is enough of a success for me; the model works. When I run it for real, I'll just ask for more walltime and use checkpoints to save incremental versions of my model.
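Something like this for the real run, as a sketch (the walltime value and script name are placeholders; the training script is assumed to write periodic checkpoints):

```bash
# Resubmit with a longer walltime; checkpointing inside the script means an
# expired limit only costs the epochs since the last save.
# "run_basset_train.pbs" and 72 hours are placeholders.
qsub -l nodes=1:ppn=4:gpus=1 -l walltime=72:00:00 run_basset_train.pbs
```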
For the job that failed, it recorded 0s of CPU time:
`02/03/2016 14:18:47 S Exit_status=1 resources_used.cput=00:00:00`
So I think the GTX Titan vs. GTX 680 difference is more likely to be at the root of the problem. I'll try this job again, requesting a GTX Titan-bearing node (via your `nvidia-smi -L` tip).
FWIW, grepping the output of `pbsnodes` was really useful: `pbsnodes | grep -B 4 gtxtitan | less` returns the list of Titan-enabled nodes:
```
gpu-1-16  state = job-exclusive  power_state = Running  np = 32
          properties = batch,gtxtitan,nv352,docker
gpu-2-5   state = free           power_state = Running  np = 32
          properties = batch,gtxtitanx,nv352,docker
gpu-2-10  state = job-exclusive  power_state = Running  np = 32
          properties = batch,gtxtitan,nv352,docker
gpu-2-11  state = free           power_state = Running  np = 32
          properties = batch,gtxtitan,nv352,docker
gpu-2-12  state = job-exclusive  power_state = Running  np = 32
          properties = batch,gtxtitan,nv352,docker
gpu-2-13  state = job-exclusive  power_state = Running  np = 32
          properties = batch,gtxtitan,nv352,docker
gpu-2-14  state = job-exclusive  power_state = Running  np = 32
          properties = batch,gtxtitan,nv352,docker
gpu-2-15  state = job-exclusive  power_state = Running  np = 32
          properties = batch,gtxtitan,nv352,docker
gpu-2-16  state = free           power_state = Running  np = 32
          properties = batch,gtxtitan,nv352,nv352,docker
gpu-2-17  state = free           power_state = Running  np = 32
          properties = batch,gtxtitan,nv352,docker
```
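If you only want the hostnames, a slightly tighter pipeline (same idea; note the pattern also matches the `gtxtitanx` node, so anchor it if that matters):

```bash
# In raw pbsnodes output the node name sits on its own line, so this prints
# just the names of nodes whose properties mention "gtxtitan".
pbsnodes | grep -B 4 gtxtitan | grep '^gpu-'
```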
Yep. Those come right out of the nodes file, and I've been the one adding that "docker" one for many moons.
OK, I've run two parallel experiments, one on a `gtx680` node and another on a `gtxtitan` node. The Titan node job seems to be running as expected, and the gtx680 job has failed in the same way as I pasted above.
So it's either a problem with generated CUDA code living somewhere within the image that runs on a Titan but not a GTX 680, or some other difference between the two.
GTX 680 is CUDA Compute Capability 3.0, while the Titan is 3.5: https://en.wikipedia.org/wiki/CUDA
But I thought the only difference was dynamic parallelism, so maybe it does have to do with CUDA code cached in the image somewhere?
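If the cutorch/cunn kernels baked into the image were only compiled for sm_35, a compute 3.0 card would raise exactly this kind of "invalid device function" error. A sketch of how one might check and work around that, assuming `cuobjdump` and `nvcc` are available in the image and with illustrative file paths:

```bash
# Inspect which GPU architectures a compiled CUDA library actually contains.
# (Library path is illustrative; adjust to wherever cutorch installed it.)
cuobjdump --list-elf /root/torch/install/lib/lua/5.1/libcutorch.so

# When compiling CUDA code by hand, target both capabilities so one binary
# runs on a GTX 680 (sm_30) and a GTX Titan (sm_35). "kernel.cu" is a placeholder.
nvcc -gencode arch=compute_30,code=sm_30 \
     -gencode arch=compute_35,code=sm_35 \
     -c kernel.cu -o kernel.o
```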
Hrm. Probably some cached CUDA code in the image; I'll have to figure out how to purge it somehow.
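One way to do that, as a sketch (assuming the image still has the CUDA toolkit and luarocks available, so the rocks can be rebuilt from source inside a container running on the GTX 680 node):

```bash
# Rebuild the CUDA-facing Torch packages inside the container so their
# kernels are recompiled against the locally visible device.
luarocks install cutorch
luarocks install cunn
```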
I believe the original question here is answered (yes, you can select specific GPU nodes). If additional items are needed, let's open a new issue specific to them. Closing this one.
I'm encountering a strange error where a Torch job sometimes fails, and sometimes succeeds. The error message I get is:
```
/root/torch/install/bin/luajit: /root/torch/install/share/lua/5.1/nn/THNN.lua:555: cuda runtime error (8) : invalid device function at /tmp/luarocks_cutorch-scm-1-9825/cutorch/lib/THC/THCTensorMath.cu:25
stack traceback:
    [C]: in function 'v'
    /root/torch/install/share/lua/5.1/nn/THNN.lua:555: in function 'SpatialConvolutionMM_updateOutput'
    /root/torch/install/share/lua/5.1/nn/SpatialConvolution.lua:100: in function 'updateOutput'
    /root/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'
    ./convnet.lua:655: in function 'opfunc'
    /root/torch/install/share/lua/5.1/optim/rmsprop.lua:32: in function 'rmsprop'
    ./convnet.lua:687: in function 'train_epoch'
    /root/Basset/src/basset_train.lua:149: in main chunk
    [C]: in function 'dofile'
    /root/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
    [C]: at 0x00406670
```
Since nothing in the Docker image has changed from one run to another, and it's a CUDA runtime error, my hunch is that this is the result of a CUDA version mismatch between what's in my Docker image and what's exposed by mapping the GPU compute node's devices into the container.
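A quick way to compare the two sides, as a sketch (the image name is a placeholder, and it assumes `nvcc` is on the PATH inside the image):

```bash
# Driver version visible on the host compute node.
nvidia-smi
# CUDA toolkit version baked into the image ("my_torch_image" is a placeholder).
docker run --rm my_torch_image nvcc --version
```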
Here's the tracejob output from the working job:
And here's the tracejob output from the failing job:
So, when submitting a GPU job, can we ask for specific types of GPUs? And is there a difference between what's in `gpu-2-17/14` and `gpu-1-11/25`?
Thanks for any light you can shine on this,
L.