Closed lzamparo closed 8 years ago
Unsure as to cause and will look.
Some nodes have the device. Some do not. Trying to determine why that would be.
Ok, thanks. Will monitor this thread.
Well, according to the CUDA manual, those device files are supposed to appear at boot when the modules are loaded, but the manual also provides a script for cases where they do not.
Nothing has changed in that regard, but I do see cases where that mknod did not fire.
Can you verify on gpu-2-17 that my manual mknod works properly?
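A quick way to check the manual fix is to confirm each expected node exists as a character device. This is a hedged sketch: the device paths are taken from the ls output in this thread, and the counters are just for summarizing; a node that is missing (or exists but is not a character device) is reported rather than created.

```shell
# Verify the NVIDIA device nodes exist as character devices.
# Paths are the ones discussed in this thread; on a machine without
# the driver loaded, this simply reports each node as missing.
present=0
missing=0
for d in /dev/nvidia-uvm /dev/nvidiactl /dev/nvidia0; do
    if [ -c "$d" ]; then
        echo "$d present"
        present=$((present + 1))
    else
        echo "$d missing"
        missing=$((missing + 1))
    fi
done
echo "checked $((present + missing)) nodes, $present present"
```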
[zamparol@gpu-2-17 ~]$ ls -lh /dev/nvidia*
crw-rw-rw- 1 root root 242, 0 Mar 17 12:46 /dev/nvidia-uvm
crw-rw-rw- 1 root root 195, 0 Mar 6 11:14 /dev/nvidia0
crw-rw-rw- 1 root root 195, 1 Mar 6 11:14 /dev/nvidia1
crw-rw-rw- 1 root root 195, 2 Mar 6 11:14 /dev/nvidia2
crw-rw-rw- 1 root root 195, 3 Mar 6 11:14 /dev/nvidia3
crw-rw-rw- 1 root root 195, 4 Mar 6 11:14 /dev/nvidia4
crw-rw-rw- 1 root root 195, 5 Mar 6 11:14 /dev/nvidia5
crw-rw-rw- 1 root root 195, 6 Mar 6 11:14 /dev/nvidia6
crw-rw-rw- 1 root root 195, 7 Mar 6 11:14 /dev/nvidia7
crw-rw-rw- 1 root root 195, 255 Mar 6 11:14 /dev/nvidiactl
It seems to. I'll try to run the job with the device line as I had it before.
Yes, I mostly want to verify that if I need to add something to the boot script for the nvidia driver, it actually WORKS with your code. This may simply have been something that was done manually many moons ago.
I have made the same manual fix on six other nodes. I do not know why it would be needed but will investigate.
So far it seems to be working: my job has loaded a network and is computing as expected (I hope).
OK, well let me know. I see a stanza in the CUDA manual that differs from what SDSC had in /etc/init.d/nvidia, which I will add once you report success or failure. It appears the nvidia-uvm module may require this manual step, which does not appear anywhere in the system's boot process.
/sbin/modprobe nvidia-uvm
if [ "$?" -eq 0 ]; then
    # Find out the major device number used by the nvidia-uvm driver
    D=`grep nvidia-uvm /proc/devices | awk '{print $1}'`
    mknod -m 666 /dev/nvidia-uvm c $D 0
else
    exit 1
fi
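The interesting part of that stanza is the major-number lookup: nvidia-uvm gets a dynamic major number, so the script has to read it from /proc/devices rather than hard-code it. A minimal sketch of that lookup, run against a sample /proc/devices line (the value 242 matches the ls output earlier in this thread; the real script greps /proc/devices itself, and the mknod needs root):

```shell
# Hypothetical sample line, as it would appear in /proc/devices:
sample='242 nvidia-uvm'

# Extract the major number the same way the stanza does:
major=$(printf '%s\n' "$sample" | awk '/nvidia-uvm/ {print $1}')
echo "nvidia-uvm major: $major"

# The stanza then creates the node with that major number (root only):
# mknod -m 666 /dev/nvidia-uvm c "$major" 0
```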
I'm unable to report success or failure: my job ran over the time limit in the active session and was killed. I also can't access the stdout and stderr it produced using qpeek (or qpeek -e). I'm trying to slim down the computation to a reduced data set, just so I can verify it's running properly, and will report back once I have something definitive.
Fair enough.
Before I call it a day shortly: how are you doing, @lzamparo?
Still so far so good. I'm running interactive jobs on a number of different machines, and my results look sensible, so I feel tentatively confident reporting 'success'.
Sounds good. I will modify the boot script as noted next week. We'll leave this open to make sure I don't forget.
This has been added to the startup script. Believed resolved. Reopen if not.
I'm now experiencing problems launching docker + gpu based jobs, either via the gpu queue or active jobs.
When I try to run a docker gpu job thusly:
I get rudely rebuked by the docker daemon, which complains that it cannot find the /dev/nvidia-uvm device:
Indeed, there is no such exposed device on the active node:
Is this a result of the recent upgrade? My docker + gpu jobs always succeeded with this mapping prior to the upgrade. Has the uvm module been discontinued in CUDA 7.0? Should I just omit this setting?
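For reference, the kind of device mapping being discussed passes each NVIDIA node into the container explicitly with docker's --device flag. A hedged sketch assembling such a command (the image name and job script are placeholders, not from this thread, and the command is only echoed here rather than run):

```shell
# Build the --device flags for each NVIDIA node the job needs.
devices=""
for d in /dev/nvidia-uvm /dev/nvidiactl /dev/nvidia0; do
    devices="$devices --device=$d"
done

# Placeholder image and job script; substitute your own.
cmd="docker run$devices cuda-image ./job.sh"
echo "$cmd"
```

If /dev/nvidia-uvm is missing on the node, the daemon will fail on exactly that flag, which matches the error described above.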