cBio / cbio-cluster

MSKCC cBio cluster documentation

docker + gpus: cannot find /dev/nvidia-uvm #389

Closed lzamparo closed 8 years ago

lzamparo commented 8 years ago

I'm now experiencing problems launching docker + gpu based jobs, either via the gpu queue or in active (interactive) jobs.

When I try to run a docker gpu job thusly:

[zamparol@gpu-2-17 ~]$ cat $PBS_GPUFILE
gpu-2-17-gpu0
[zamparol@gpu-2-17 ~]$ devices="--device /dev/nvidiactl:/dev/nvidiactl --device /dev/nvidia-uvm:/dev/nvidia-uvm --device /dev/nvidia0:/dev/nvidia0"
[zamparol@gpu-2-17 ~]$ docker run -i -t -v /cbio/cllab/nobackup/zamparol/Basset/data/:/root/Basset/data $devices lzamparo/basset:latest /bin/bash

I get rudely rebuked by the docker daemon, complaining it cannot find the /dev/nvidia-uvm device:

Error response from daemon: Cannot start container 23e6442e787ba326729d9fb8a3214f4f1f5dd200e31c78a891885e14aff91670: error gathering device information while adding custom device "/dev/nvidia-uvm": lstat /dev/nvidia-uvm: no such file or directory

Indeed, there is no such exposed device on the active node:

[zamparol@gpu-2-17 ~]$ ls /dev/nv
nvidia0    nvidia1    nvidia2    nvidia3    nvidia4    nvidia5    nvidia6    nvidia7    nvidiactl  nvram

Is this a result of the recent upgrade? My docker + gpu jobs always succeeded with this mapping prior to the upgrade. Has the uvm module been discontinued in CUDA 7.0? Should I just omit this device mapping?
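(For what it's worth, one stopgap, sketched under the assumption that only the device files present on the node need to be passed through, would be to build the --device flags from whatever /dev/nvidia* nodes actually exist, so the docker run doesn't fail outright when /dev/nvidia-uvm is missing:

# sketch: assemble --device flags only for the nvidia device files that exist on this node
devices=""
for dev in /dev/nvidiactl /dev/nvidia-uvm /dev/nvidia0; do
  if [ -e "$dev" ]; then
    devices="$devices --device $dev:$dev"
  fi
done
docker run -i -t -v /cbio/cllab/nobackup/zamparol/Basset/data/:/root/Basset/data $devices lzamparo/basset:latest /bin/bash

Though if the code in the container actually needs unified memory, dropping /dev/nvidia-uvm would presumably just move the failure inside the container.)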

tatarsky commented 8 years ago

Unsure as to cause and will look.

tatarsky commented 8 years ago

Some nodes have the device. Some do not. Trying to determine why that would be.

lzamparo commented 8 years ago

Ok, thanks. Will monitor this thread.

tatarsky commented 8 years ago

Well, according to the CUDA manual those device files are supposed to appear at boot when the modules are loaded, but the manual also provides a script for cases where they do not.

Nothing changed in that regard, but I do see cases where that mknod did not fire.

Can you verify on gpu-2-17 that my manual mknod works properly?

lzamparo commented 8 years ago

[zamparol@gpu-2-17 ~]$ ls -lh /dev/nvidia*
crw-rw-rw- 1 root root 242,   0 Mar 17 12:46 /dev/nvidia-uvm
crw-rw-rw- 1 root root 195,   0 Mar  6 11:14 /dev/nvidia0
crw-rw-rw- 1 root root 195,   1 Mar  6 11:14 /dev/nvidia1
crw-rw-rw- 1 root root 195,   2 Mar  6 11:14 /dev/nvidia2
crw-rw-rw- 1 root root 195,   3 Mar  6 11:14 /dev/nvidia3
crw-rw-rw- 1 root root 195,   4 Mar  6 11:14 /dev/nvidia4
crw-rw-rw- 1 root root 195,   5 Mar  6 11:14 /dev/nvidia5
crw-rw-rw- 1 root root 195,   6 Mar  6 11:14 /dev/nvidia6
crw-rw-rw- 1 root root 195,   7 Mar  6 11:14 /dev/nvidia7
crw-rw-rw- 1 root root 195, 255 Mar  6 11:14 /dev/nvidiactl

It seems to. I'll try to run the job with the device line as I had it before.

tatarsky commented 8 years ago

Yes, I mostly want to verify that, if I need to add something to the boot script for the nvidia driver, it actually WORKS with your code. This may simply have been something that was done manually many moons ago.

tatarsky commented 8 years ago

I have made the same manual fix on six other nodes. I do not know why it would be needed but will investigate.

lzamparo commented 8 years ago

So far it seems to be working; my job has loaded a network and is computing as expected (I hope).

tatarsky commented 8 years ago

OK. Well, let me know. I see a stanza in the CUDA manual that differs from what SDSC had in /etc/init.d/nvidia, which I will add once you report success or failure. It appears the nvidia-uvm module may require this manual step, which does not appear anywhere in the system's boot process.

/sbin/modprobe nvidia-uvm
if [ "$?" -eq 0 ]; then
  # Find out the major device number used by the nvidia-uvm driver
  D=`grep nvidia-uvm /proc/devices | awk '{print $1}'`
  mknod -m 666 /dev/nvidia-uvm c $D 0
else
  exit 1
fi
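
For completeness, a quick way to sanity-check the result after running that stanza (the stat-based comparison is a sketch of my own, not something from the CUDA manual):

# sketch: confirm /dev/nvidia-uvm exists and its major number matches /proc/devices
expected=$(grep nvidia-uvm /proc/devices | awk '{print $1}')
actual_hex=$(stat -c '%t' /dev/nvidia-uvm 2>/dev/null)   # %t prints the major number in hex
if [ -n "$expected" ] && [ -n "$actual_hex" ] && [ "$((16#$actual_hex))" -eq "$expected" ]; then
  echo "/dev/nvidia-uvm present with expected major number ($expected)"
else
  echo "/dev/nvidia-uvm missing or major number mismatch" >&2
fi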

lzamparo commented 8 years ago

I'm unable to report success or failure, given that my job ran over time in the active session and was killed. I also can't access the stdout/stderr it produced using qpeek (or qpeek -e). I'm trying to slim the computation down to a reduced data set, just so I can verify it's running properly, and will report once I've got something definitive.

tatarsky commented 8 years ago

Fair enough.

tatarsky commented 8 years ago

Before I call it a day shortly: how are you doing, @lzamparo?

lzamparo commented 8 years ago

Still so far so good. Running interactive jobs on a number of different machines, and my results look sensible. So, I feel tentatively confident to report 'success'.

tatarsky commented 8 years ago

Sounds good. I will modify the boot script as noted next week. We'll leave this open to make sure I don't forget.

tatarsky commented 8 years ago

This has been added to the startup script. Believed resolved. Reopen if not.