Don't worry, we are already working on the plugin :)
I figured, but you know I couldn't help but ask :P
It's being worked on as we speak :) I should have a working implementation fairly soon for you to play with.
The wrapper script was my biggest concern with this project, but the Docker plugin sounds like the ideal solution. Once this is ready (alongside https://github.com/NVIDIA/nvidia-docker/issues/7 and hopefully https://github.com/NVIDIA/nvidia-docker/issues/10), I'll be happy to port over the many DL images built on top of kaixhin/cuda. I'll keep old versions around for legacy purposes, but it'll be good to have CUDA on Docker looked after officially.
I just pushed an initial implementation of the plugin in the v1 branch.
This certainly needs additional work but people can start experimenting with it now.
The plugin has two REST endpoints that one can query to get GPU information:
localhost:3476/gpu/info
localhost:3476/gpu/status
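For example, both endpoints can be queried directly with curl (invocations only; the exact response payloads aren't reproduced in this thread):
curl -s http://localhost:3476/gpu/info
curl -s http://localhost:3476/gpu/status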
In addition, it provides /docker/cli, which generates the proper Docker arguments given volume names and device numbers (it will probably be called from within nvidia-docker).
Example of running CUDA runtime with two GPUs 0 and 1:
make runtime
cd plugin && make
sudo sh -c "./bin/nvidia-docker-plugin &"
gpu(){ curl -s http://localhost:3476/docker/cli?dev=$1\&vol=bin+cuda; }
docker run -ti `gpu 0+1` cuda:runtime
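For clarity, the gpu helper simply splices the plugin's response into the docker command line; an equivalent without the nested command substitution would look roughly like this (same endpoint and arguments as above):
args=$(curl -s http://localhost:3476/docker/cli?dev=0+1\&vol=bin+cuda)
docker run -ti $args cuda:runtime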
@3XX0, I've tried running the plugin using the snippet you posted above, but it looks like I'm having some issues with the ld.so.cache file:
~/git/NVIDIA/nvidia-docker/plugin$ sudo sh -c "./bin/plugin &"
nvidia-docker-plugin | 2015/12/08 11:29:07 Loading NVIDIA management library
nvidia-docker-plugin | 2015/12/08 11:29:07 Loading NVIDIA unified memory module
nvidia-docker-plugin | 2015/12/08 11:29:07 Discovering GPU devices
nvidia-docker-plugin | 2015/12/08 11:29:07 Creating volumes
nvidia-docker-plugin | 2015/12/08 11:29:07 Error: invalid ld.so.cache file
I've reproduced the same error on two different systems: GPU Docker Plugin Debugging
Let me know of any more specifics or logs you'd need.
Weird, can you give the output of
strings /etc/ld.so.cache | head -n 2
hexdump -C -n 256 /etc/ld.so.cache
hexdump -C /etc/ld.so.cache | grep -A2 glibc
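For context, these commands just inspect the cache header; on a typical glibc system of that era the first strings are the format magic, so the output of the first command should look something like this (exact versions vary):
strings /etc/ld.so.cache | head -n 2
ld.so-1.7.0
glibc-ld.so.cache1.1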
@3XX0 , I've amended my gist above with the added output.
My bad ... Thanks for the report, it should be fixed now
Success using the new nvidia-docker-plugin executable!
$ sudo sh -c "./bin/nvidia-docker-plugin &"
./bin/nvidia-docker-plugin | 2015/12/08 15:27:37 Loading NVIDIA management library
./bin/nvidia-docker-plugin | 2015/12/08 15:27:37 Loading NVIDIA unified memory
./bin/nvidia-docker-plugin | 2015/12/08 15:27:37 Discovering GPU devices
./bin/nvidia-docker-plugin | 2015/12/08 15:27:37 Creating volumes at /tmp/nvidia-volumes-355599703
./bin/nvidia-docker-plugin | 2015/12/08 15:27:37 Serving plugin API at /run/docker/plugins
./bin/nvidia-docker-plugin | 2015/12/08 15:27:37 Serving remote API at localhost:3476
$ gpu(){ curl -s http://localhost:3476/docker/cli?dev=$1;vol=bin+cuda; }
$ docker run -ti `gpu 0` cuda:runtime
root@f4a4da5d68b1:/# nvidia-smi
Tue Dec 8 20:28:20 2015
+------------------------------------------------------+
| NVIDIA-SMI 352.63 Driver Version: 352.63 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX TIT... Off | 0000:01:00.0 On | N/A |
| 22% 38C P8 17W / 250W | 659MiB / 12287MiB | 1% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
So correct me if I'm wrong about what I see going on so far:
1) vol=bin+cuda doesn't seem to modify the REST-returned string.
2) I like the REST endpoints; it's kind of handy on its own just to be able to point a browser at x.x.x.x:3476/gpu/status to check on GPU usage. How would I use this remotely, i.e. when I can't rely on gpu 0+1 to execute the nested shell command on the remote host (like with docker-machine)?
The query string separator is wrong, that's why the vol argument is not taken into account. It should be:
gpu(){ curl -s http://localhost:3476/docker/cli?dev=$1\&vol=bin+cuda; }
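Without the backslash, the shell treats the & as a command terminator and backgrounds curl, so the vol=bin+cuda part never reaches the plugin. Quoting the URL works just as well, for example:
gpu(){ curl -s "http://localhost:3476/docker/cli?dev=$1&vol=bin+cuda"; }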
@ruffsl Correct, to use it remotely you would do something like:
# On the docker-machine host
sudo ./bin/nvidia-docker-plugin -l :3476
# On the docker client
gpu(){
host="$( docker-machine url $1 | sed 's;tcp://\(.*\):[0-9]\+;http://\1:3476;' );"
curl -s $host/docker/cli?dev=$2\&vol=bin+cuda;
}
eval "$(docker-machine env <MACHINE>)"
docker run -ti `gpu <MACHINE> 0+1` cuda:runtime
Note that if your docker-machine is backed by a VM, you will need to enable GPU passthrough.
Eventually everything should be abstracted within nvidia-docker, so stay tuned.
@flx42, what I'm seeing is that both
gpu(){ curl -s http://localhost:3476/docker/cli?dev=$1\&vol=bin+cuda; }
and
gpu(){ curl -s http://localhost:3476/docker/cli?dev=$1; }
return the same arguments:
--device=/dev/nvidiactl --device=/dev/nvidia-uvm --device=/dev/nvidia0 --volume-driver=nvidia --volume=bin:/usr/local/nvidia/bin --volume=cuda:/usr/local/nvidia
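For reference, my rough reading of what the generated arguments map to (not an official breakdown):
# --device=/dev/nvidiactl, /dev/nvidia-uvm : control and unified-memory device nodes
# --device=/dev/nvidia0                    : GPU 0, matching dev=0 in the query
# --volume-driver=nvidia                   : volumes are served by nvidia-docker-plugin
# --volume=bin:/usr/local/nvidia/bin       : driver utilities such as nvidia-smi
# --volume=cuda:/usr/local/nvidia          : driver/CUDA libraries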
It's not likely that someone would want to use this without mounting the bin and cuda volumes, but I was thinking vol would behave like a parameter (omit it, and the volumes won't be included)?
@3XX0, wouldn't that assume you'd need to expose port 3476 of the remote machine to the world? If I'm recalling correctly, docker-machine works by having the daemon bind to a TCP port secured with a key exchange, plus some ssh. How would the request reach the remote REST endpoint from the local client's shell session?
Yes, if you don't specify vol, by default it will take all volumes. But try vol=bin, it should be different.
I see, that does work. Is there a way to specify none?
@ruffsl no, we didn't implement none because it doesn't really make sense to ask for no volumes.
Regarding the REST API, if you want remote access, you need to expose it.
There has been some discussion of adding SSL to handle unsafe environments, but in practice you rarely need it.
You can always tunnel it through ssh, and in fact, I was thinking about adding a similar feature to the future nvidia-docker.
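For instance, a minimal sketch of tunneling the remote API over SSH (user and gpu-host are placeholders for your setup):
# forward local port 3476 to the plugin running on the remote host
ssh -N -L 3476:localhost:3476 user@gpu-host &
gpu(){ curl -s http://localhost:3476/docker/cli?dev=$1\&vol=bin+cuda; }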
I'm not really sure how docker-machine works, but from my understanding it relies on DOCKER_HOST, hence your Docker daemon needs to be exposed as well (I might be wrong though).
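For reference, docker-machine env typically emits something like the following, which is how the client ends up pointing at the remote daemon over TCP (values here are placeholders):
docker-machine env <MACHINE>
# export DOCKER_TLS_VERIFY="1"
# export DOCKER_HOST="tcp://192.168.99.100:2376"
# export DOCKER_CERT_PATH="/home/user/.docker/machine/machines/<MACHINE>"
# export DOCKER_MACHINE_NAME="<MACHINE>"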
Well, it'd be tedious for people to override this while still leveraging the device detection and the rest of the mounting machinery the plugin has to offer. Niche, I know, but this would be useful for those who'd like to use select NVIDIA devices but don't need CUDA. Remember, people like me still need to bake the drivers into the container for some apps to get things like OpenGL working; I think these volumes may blow away some files we'd need to preserve at runtime in that scenario.
Yeah, regarding remote access, I feel like there might be a better way to go about this. I'm wondering if there is something better than just port forwarding with docker-machine ssh. Let's ask @psftw, maybe he'd know about this topic or know who to ask.
I don't get it, why would you want no volumes? GPU devices are unusable without at least one NVIDIA volume, and if you really need that then you don't need the /docker/cli endpoint, just use --device directly.
Speaking of which, I'm wondering if the current volume separation (i.e. cuda, bin, ...) is worth the trouble. I might change it to a single driver volume unless there is a reason not to do so.
Is it now possible to use docker-compose along with nvidia-docker? If so, how?
I too would be interested in using docker-compose along with nvidia-docker.
See #39 ;)
I've been looking at ways to use CUDA containers at my workplace, as our lab shares a common NVIDIA workstation, and I'd like to interact with this server in a more abstract manner so that 1) I can more readily port my robotics work to any NVIDIA workstation, and 2) I minimize the impact of changes affecting others using the shared research workstation.
One gap I'm wrestling with is how to incorporate the current NVIDIA Docker wrapper with the rest of the existing Docker ecosystem: Docker Compose, Machine, and Swarm. The current drop-in replacement for the docker run|create CLI is awesome, but it only gets us so far. The moment we need additional tooling for abstracting or scaling up our apps, or want to avoid interacting with the host directly, it's hard to get to that last step.
So I'm thinking this might be a case for making a relevant Docker plugin, harkening back to a recent post on the Docker blog, Extending Docker with Plugins. That post was perhaps geared more towards networking and storage drivers, but perhaps our issue here could be treated as custom volume management. I feel the same level of integration for GPU device options may be called for to achieve the desired user experience in cloud development or cluster computing with NVIDIA. This will most likely call for something more demanding than shell scripts to extend the needed interfaces, so I'd like to hear the rest of the community's and the NVIDIA devs' take on this.