cBio / cbio-cluster

MSKCC cBio cluster documentation

Can we ask for specific GPU nodes? #371

Closed lzamparo closed 8 years ago

lzamparo commented 8 years ago

I'm encountering a strange error where a Torch job sometimes fails, and sometimes succeeds. The error message I get is:

    /root/torch/install/bin/luajit: /root/torch/install/share/lua/5.1/nn/THNN.lua:555: cuda runtime error (8) : invalid device function at /tmp/luarocks_cutorch-scm-1-9825/cutorch/lib/THC/THCTensorMath.cu:25
    stack traceback:
        [C]: in function 'v'
        /root/torch/install/share/lua/5.1/nn/THNN.lua:555: in function 'SpatialConvolutionMM_updateOutput'
        /root/torch/install/share/lua/5.1/nn/SpatialConvolution.lua:100: in function 'updateOutput'
        /root/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'
        ./convnet.lua:655: in function 'opfunc'
        /root/torch/install/share/lua/5.1/optim/rmsprop.lua:32: in function 'rmsprop'
        ./convnet.lua:687: in function 'train_epoch'
        /root/Basset/src/basset_train.lua:149: in main chunk
        [C]: in function 'dofile'
        /root/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
        [C]: at 0x00406670

Since nothing in the Docker image has changed from one run to another, and it's a CUDA runtime error, my hunch is that this is the result of a CUDA version mismatch between what's in my Docker image and what's exposed by mapping the GPU compute node's devices to the virtual devices in the image.
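One way I could check that hunch (a sketch only; the image name below is a placeholder, and it assumes `nvcc` and the usual `/dev/nvidia*` device files are available) is to compare the driver on the node with the toolkit baked into the image:

```bash
# Compare the driver on the compute node with the CUDA toolkit inside the image.
# "my-basset-image" is a placeholder for whatever the image is actually called.

# Driver version reported on the GPU node itself
nvidia-smi | head -n 3

# Toolkit version inside the container (assumes nvcc is installed in the image)
docker run --rm \
  --device /dev/nvidiactl --device /dev/nvidia-uvm --device /dev/nvidia0 \
  my-basset-image nvcc --version
```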

Here's the tracejob output from the working job:

    [zamparol@mskcc-ln1 ~]$ tracejob 6871681
    /var/spool/torque/mom_logs/20160203: No such file or directory
    /var/spool/torque/sched_logs/20160203: No such file or directory

    Job: 6871681.mskcc-fe1.local

    02/03/2016 05:36:03 S obit received - updating final job usage info
    02/03/2016 05:36:03 S preparing to send 'a' mail for job 6871681.mskcc-fe1.local to zamparol@mskcc-ln1.local (Job exceeded its walltime limit. Job was aborted
    02/03/2016 05:36:03 S Updated mailto from job owner: 'zamparol@mskcc-ln1.local'
    02/03/2016 05:36:03 S job exit status -11 handled
    02/03/2016 05:36:03 S Exit_status=-11 resources_used.cput=00:00:00 resources_used.energy_used=0 resources_used.mem=14148kb resources_used.vmem=367648kb resources_used.walltime=12:00:05
    02/03/2016 05:36:03 S on_job_exit task assigned to job
    02/03/2016 05:36:03 S req_jobobit completed
    02/03/2016 05:36:03 A user=zamparol group=cllab jobname=myjob queue=gpu ctime=1454452539 qtime=1454452539 etime=1454452539 start=1454452558 owner=zamparol@mskcc-ln1.local exec_host=gpu-2-17/14 Resource_List.neednodes=1:ppn=1:gpus=1:docker Resource_List.nodect=1 Resource_List.nodes=1:ppn=1:gpus=1:docker Resource_List.walltime=12:00:00
    >session=1854 total_execution_slots=1 unique_node_count=1 end=1454495763 Exit_status=-11 resources_used.cput=00:00:00
    >resources_used.energy_used=0 resources_used.mem=14148kb resources_used.vmem=367648kb resources_used.walltime=12:00:05

And here's the tracejob output from the failing job:

    [zamparol@mskcc-ln1 ~]$ tracejob 6871701
    /var/spool/torque/mom_logs/20160203: No such file or directory
    /var/spool/torque/sched_logs/20160203: No such file or directory

    Job: 6871701.mskcc-fe1.local

    02/03/2016 14:18:38 A queue=gpu
    02/03/2016 14:18:43 S child reported success for job after 0 seconds (dest=???), rc=0
    02/03/2016 14:18:43 S preparing to send 'b' mail for job 6871701.mskcc-fe1.local to zamparol@mskcc-ln1.local (---)
    02/03/2016 14:18:43 A user=zamparol group=cllab jobname=myjob queue=gpu ctime=1454527118 qtime=1454527118 etime=1454527118 start=1454527123 owner=zamparol@mskcc-ln1.local exec_host=gpu-1-11/25 Resource_List.neednodes=1:ppn=1:gpus=1:docker Resource_List.nodect=1 Resource_List.nodes=1:ppn=1:gpus=1:docker Resource_List.walltime=12:00:00
    02/03/2016 14:18:47 S obit received - updating final job usage info
    02/03/2016 14:18:47 S job exit status 1 handled
    02/03/2016 14:18:47 S Exit_status=1 resources_used.cput=00:00:00 resources_used.energy_used=0 resources_used.mem=0kb resources_used.vmem=0kb resources_used.walltime=00:00:04
    02/03/2016 14:18:47 S preparing to send 'e' mail for job 6871701.mskcc-fe1.local to zamparol@mskcc-ln1.local (Exit_status=1
    02/03/2016 14:18:47 S on_job_exit task assigned to job
    02/03/2016 14:18:47 S req_jobobit completed
    02/03/2016 14:18:47 A user=zamparol group=cllab jobname=myjob queue=gpu ctime=1454527118 qtime=1454527118 etime=1454527118 start=1454527123 owner=zamparol@mskcc-ln1.local exec_host=gpu-1-11/25 Resource_List.neednodes=1:ppn=1:gpus=1:docker Resource_List.nodect=1 Resource_List.nodes=1:ppn=1:gpus=1:docker Resource_List.walltime=12:00:00
    >session=30580 total_execution_slots=1 unique_node_count=1 end=1454527127 Exit_status=1 resources_used.cput=00:00:00
    >resources_used.energy_used=0 resources_used.mem=0kb resources_used.vmem=0kb resources_used.walltime=00:00:04

So, when submitting a GPU job, can we ask for specific types of GPUs? And is there a difference between what's in gpu-2-17/14 and gpu-1-11/25?

Thanks for any light you can shine on this,

L.

tatarsky commented 8 years ago

Short answer: yep!

The wiki contains some references to what you seek. I think it's pretty current.

https://github.com/cBio/cbio-cluster/wiki/MSKCC-cBio-Cluster-User-Guide#selecting-specific-gpu-types

And to determine which systems have which GPU types, you can use the (admittedly hard-to-read) `pbsnodes` output and grep for the attribute, or frankly just ssh to the node and run `nvidia-smi -L`.

For example, gpu-2-17 has GTX Titans (gtxtitan), while gpu-1-11 has GTX 680s (gtx680).
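Roughly, selecting by property in your resource request looks like this (a sketch only; the wiki section above has the exact form we document, and the script name and walltime here are just placeholders):

```bash
# Ask Torque for one GPU on a node carrying the gtxtitan property
# (myjob.sh and the walltime value are placeholders)
qsub -q gpu -l nodes=1:ppn=1:gpus=1:gtxtitan:docker,walltime=12:00:00 myjob.sh
```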

tatarsky commented 8 years ago

BTW, are you sure you didn't simply request too little walltime?

`Job exceeded its walltime limit. Job was aborted`

lzamparo commented 8 years ago

> BTW, are you sure you didn't simply request too little walltime?

For the job that 'succeeded', it ran through 10 epochs of training before the walltime expired, which is enough of a success for me; the model works. When I run it for real, I'll just ask for more walltime and use checkpoints to save incremental versions of my model.

For the job that failed, I used 0s of walltime:

02/03/2016 14:18:47 S Exit_status=1 resources_used.cput=00:00:00

So I think the GTX Titan vs. GTX 680 difference is more likely to be at the root of the problem. I'll try this job again, requesting a GTX Titan-bearing node (via your nvidia-smi -L tip).
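For the record, here's roughly the submission script I have in mind for the re-run on a Titan node with more walltime (a sketch; the training script name and the numbers are placeholders):

```bash
#!/bin/bash
# Sketch of the longer "real" run: a Titan node requested via the gtxtitan property,
# more walltime, and checkpointing left to the training script itself.
#PBS -N myjob
#PBS -q gpu
#PBS -l nodes=1:ppn=1:gpus=1:gtxtitan:docker
#PBS -l walltime=48:00:00

cd "$PBS_O_WORKDIR"
./run_basset_training.sh   # placeholder; saves model checkpoints every few epochs
```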

lzamparo commented 8 years ago

FWIW, grepping the pbsnodes output was really useful:

`pbsnodes | grep -B 4 gtxtitan | less` returns the list of Titan-enabled nodes:

    gpu-1-16
         state = job-exclusive
         power_state = Running
         np = 32
         properties = batch,gtxtitan,nv352,docker

    gpu-2-5
         state = free
         power_state = Running
         np = 32
         properties = batch,gtxtitanx,nv352,docker

    gpu-2-10
         state = job-exclusive
         power_state = Running
         np = 32
         properties = batch,gtxtitan,nv352,docker

    gpu-2-11
         state = free
         power_state = Running
         np = 32
         properties = batch,gtxtitan,nv352,docker

    gpu-2-12
         state = job-exclusive
         power_state = Running
         np = 32
         properties = batch,gtxtitan,nv352,docker

    gpu-2-13
         state = job-exclusive
         power_state = Running
         np = 32
         properties = batch,gtxtitan,nv352,docker

    gpu-2-14
         state = job-exclusive
         power_state = Running
         np = 32
         properties = batch,gtxtitan,nv352,docker

    gpu-2-15
         state = job-exclusive
         power_state = Running
         np = 32
         properties = batch,gtxtitan,nv352,docker

    gpu-2-16
         state = free
         power_state = Running
         np = 32
         properties = batch,gtxtitan,nv352,nv352,docker

    gpu-2-17
         state = free
         power_state = Running
         np = 32
         properties = batch,gtxtitan,nv352,docker
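And if I only want the node names themselves, a small variant of the same pipe does it (this leans on the node names starting with "gpu-" and on the attribute ordering above):

```bash
# Just the names of the Titan-bearing nodes
pbsnodes | grep -B 4 gtxtitan | grep '^gpu-'
```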

tatarsky commented 8 years ago

Yep. Those come right out of the nodes file, and I've been the one adding that "docker" property for many moons.

lzamparo commented 8 years ago

Ok, I've run two parallel experiments: one on a gtx680 node and another on a gtxtitan node. The Titan job seems to be running as expected, while the gtx680 job has failed in the same way I pasted above.

So it's either a problem with generated CUDA code living somewhere in the image that runs on a Titan but not on a GTX 680, or some other difference between the two cards.

jchodera commented 8 years ago

GTX 680 is CUDA Compute Capability 3.0, while the Titan is 3.5: https://en.wikipedia.org/wiki/CUDA

But I thought the only difference was dynamic parallelism, so maybe it does have to do with CUDA code cached in the image somewhere?
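One way to check (assuming `cuobjdump` is available, and noting that the library path below is only a guess at where Torch puts the compiled THC code) would be to list which GPU architectures the built library actually contains:

```bash
# List the cubin/PTX architectures embedded in the compiled THC library.
# The image name and library path are guesses; adjust to the actual image.
docker run --rm my-basset-image \
  cuobjdump --list-elf /root/torch/install/lib/libTHC.so
# If only sm_35 entries show up (no sm_30 cubin and no compute_30 PTX),
# an "invalid device function" on the GTX 680 (compute capability 3.0) makes sense.
```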

lzamparo commented 8 years ago

Hrm. Probably some cached CUDA code in the image; I'll have to figure out how to purge it somehow.
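If it does come down to arch-specific kernels, the simplest "purge" I can think of is just rebuilding the CUDA rocks inside the image while it's running on the target card (this assumes luarocks and the CUDA toolkit are available in the image):

```bash
# Rebuild the CUDA-dependent Torch packages so their kernels are compiled
# against whichever GPU (and compute capability) the build machine exposes.
luarocks install cutorch
luarocks install cunn
```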

tatarsky commented 8 years ago

I believe the original question here is answered (yes, you can select specific GPU nodes). If additional items are needed, let's open a new issue specific to them. Closing this one.