NVIDIA / nvidia-docker

Build and run Docker containers leveraging NVIDIA GPUs
Apache License 2.0

GPU Docker Plugin #8

Closed ruffsl closed 8 years ago

ruffsl commented 9 years ago

I've been looking at ways to use CUDA containers at my workplace, as our lab shares a common Nvidia workstation, and I'd like to interact with this server in a more abstract manner so that 1) I can more readily port my robotics work to any Nvidia workstation, and 2) I can minimize the impact of changes affecting others using the shared research workstation.

One gap I'm wrestling with is how to incorporate the current NVIDIA Docker wrapper with the rest of the existing Docker ecosystem: Docker Compose, Machine, and Swarm. The current drop-in replacement for the docker run|create CLI is awesome, but it only gets us so far. The moment we need any additional tooling to abstract or scale up our apps, or to avoid interacting with the host directly, it's hard to take that last step.

So I'm thinking this might be a case for writing a relevant Docker plugin, harkening back to a recent post on the Docker blog, Extending Docker with Plugins. That post was perhaps geared more towards networking and storage drivers, but perhaps our issue here could be treated as custom volume management. I feel the same level of integration for GPU device options may be called for to achieve the desired user experience in cloud development or cluster computing with Nvidia. This will most likely call for something more demanding than shell scripts to extend the needed interfaces, so I'd like to hear the rest of the community's and the Nvidia devs' take on this.

flx42 commented 9 years ago

Don't worry, we are already working on the plugin :)

ruffsl commented 9 years ago

I figured, but you know I couldn't help but ask :P

3XX0 commented 9 years ago

It's being worked on as we speak :) I should have a working implementation fairly soon for you to play with.

Kaixhin commented 9 years ago

The wrapper script was my biggest concern with this project, but the Docker plugin sounds like the ideal solution. Once this is ready (alongside https://github.com/NVIDIA/nvidia-docker/issues/7 and hopefully https://github.com/NVIDIA/nvidia-docker/issues/10) I'll be happy to port over the many DL images built on top of kaixhin/cuda. I'll keep old versions around for legacy purposes but it'll be good to have CUDA on Docker looked after officially.

3XX0 commented 8 years ago

I just pushed an initial implementation of the plugin in the v1 branch. This certainly needs additional work, but people can start experimenting with it now. The plugin has two REST endpoints that one can query to get GPU information.

In addition, it provides /docker/cli, which generates proper Docker arguments given volume names and device numbers (it will probably be called from within nvidia-docker).

Example of running CUDA runtime with two GPUs 0 and 1:

make runtime
cd plugin && make
sudo sh -c "./bin/nvidia-docker-plugin &"

gpu(){ curl -s http://localhost:3476/docker/cli?dev=$1\&vol=bin+cuda; }
docker run -ti `gpu 0+1` cuda:runtime

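For reference, the backtick form just word-splits whatever /docker/cli returns into docker run arguments. A stand-in sketch of that mechanism (the sample response string below is hypothetical, no plugin needed):

```shell
# Stand-in for a /docker/cli response: one line of plain docker arguments.
args='--device=/dev/nvidiactl --device=/dev/nvidia-uvm --device=/dev/nvidia0'

# The backtick substitution relies on shell word splitting, exactly like this:
set -- $args
# docker run -ti "$@" cuda:runtime   # would splice the arguments back in
```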
ruffsl commented 8 years ago

@3XX0 , I've tried running the plugin using the snippet you posted above, but it looks like I'm having some issues with the ld.so.cache file:

~/git/NVIDIA/nvidia-docker/plugin$ sudo sh -c "./bin/plugin &"
nvidia-docker-plugin | 2015/12/08 11:29:07 Loading NVIDIA management library
nvidia-docker-plugin | 2015/12/08 11:29:07 Loading NVIDIA unified memory module
nvidia-docker-plugin | 2015/12/08 11:29:07 Discovering GPU devices
nvidia-docker-plugin | 2015/12/08 11:29:07 Creating volumes
nvidia-docker-plugin | 2015/12/08 11:29:07 Error: invalid ld.so.cache file

I've reproduced the same error on two different systems: GPU Docker Plugin Debugging

Let me know of any more specifics or logs you'd need.

3XX0 commented 8 years ago

Weird, can you give me the output of:

strings /etc/ld.so.cache | head -n 2
hexdump -C -n 256  /etc/ld.so.cache
hexdump -C /etc/ld.so.cache | grep -A2 glibc

ruffsl commented 8 years ago

@3XX0 , I've amended my gist above with the added output.

3XX0 commented 8 years ago

My bad... Thanks for the report; it should be fixed now.

ruffsl commented 8 years ago

Success using the new nvidia-docker-plugin executable!

$ sudo sh -c "./bin/nvidia-docker-plugin &"
./bin/nvidia-docker-plugin | 2015/12/08 15:27:37 Loading NVIDIA management library
./bin/nvidia-docker-plugin | 2015/12/08 15:27:37 Loading NVIDIA unified memory
./bin/nvidia-docker-plugin | 2015/12/08 15:27:37 Discovering GPU devices
./bin/nvidia-docker-plugin | 2015/12/08 15:27:37 Creating volumes at /tmp/nvidia-volumes-355599703
./bin/nvidia-docker-plugin | 2015/12/08 15:27:37 Serving plugin API at /run/docker/plugins
./bin/nvidia-docker-plugin | 2015/12/08 15:27:37 Serving remote API at localhost:3476

$ gpu(){ curl -s http://localhost:3476/docker/cli?dev=$1;vol=bin+cuda; }

$ docker run -ti `gpu 0` cuda:runtime

root@f4a4da5d68b1:/# nvidia-smi 
Tue Dec  8 20:28:20 2015       
+------------------------------------------------------+                       
| NVIDIA-SMI 352.63     Driver Version: 352.63         |                       
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX TIT...  Off  | 0000:01:00.0      On |                  N/A |
| 22%   38C    P8    17W / 250W |    659MiB / 12287MiB |      1%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

So correct me on what I'm seeing so far:

I like the REST endpoints; it's handy on its own just to be able to point a browser at x.x.x.x:3476/gpu/status to check on GPU usage. How would I use this remotely, i.e. when I can't rely on `gpu 0+1` to execute the nested shell command on the remote host (as with docker-machine)?

flx42 commented 8 years ago

The query string separator is wrong, that's why the vol argument is not taken into account. It should be:

gpu(){ curl -s http://localhost:3476/docker/cli?dev=$1\&vol=bin+cuda; }

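The escaping matters because an unquoted `&` is a shell control operator: it backgrounds `curl`, and everything after it is parsed as a separate command, so the `vol` parameter never reaches the server. A small sketch (hypothetical `gpu_url` helper, no network involved) of building the query string safely:

```shell
# Hypothetical helper: builds the query URL without issuing the request,
# so the escaping issue is visible in isolation.
gpu_url() { printf 'http://localhost:3476/docker/cli?dev=%s&vol=bin+cuda' "$1"; }

url=$(gpu_url 0+1)
# curl -s "$url"   # quoting the URL also keeps '&' out of the shell's hands
```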
3XX0 commented 8 years ago

@ruffsl Correct, to use it remotely you would do something like:

# On the docker-machine host
sudo ./bin/nvidia-docker-plugin -l :3476

# On the docker client
gpu(){
  host="$(docker-machine url $1 | sed 's;tcp://\(.*\):[0-9]\+;http://\1:3476;')"
  curl -s $host/docker/cli?dev=$2\&vol=bin+cuda;
}

eval "$(docker-machine env <MACHINE>)"
docker run -ti `gpu <MACHINE> 0+1` cuda:runtime

Note that if your docker-machine is backed by a VM, you will need to enable GPU passthrough.

Eventually everything should be abstracted within nvidia-docker, so stay tuned.
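The sed rewrite in the snippet above can be sanity-checked on its own, without docker-machine; the machine URL below is a made-up example value:

```shell
# Stand-in for `docker-machine url <MACHINE>` output (hypothetical value).
machine_url='tcp://192.168.99.100:2376'

# Same rewrite as in the gpu() helper: swap the scheme and replace the
# daemon port with the plugin's default remote API port, 3476.
plugin_url=$(printf '%s' "$machine_url" | sed 's;tcp://\(.*\):[0-9]\+;http://\1:3476;')
```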

ruffsl commented 8 years ago

@flx42 , what I'm seeing is this:

--device=/dev/nvidiactl --device=/dev/nvidia-uvm --device=/dev/nvidia0 --volume-driver=nvidia --volume=bin:/usr/local/nvidia/bin --volume=cuda:/usr/local/nvidia 

It's not likely that someone would want to use this without mounting the bin and cuda volumes, but I was thinking it'd behave as a parameter (omit it, and it won't be included)?

@3XX0 , wouldn't that assume you'd need to expose port 3476 of the remote machine to the world? If I'm recalling correctly, docker-machine works by binding the daemon to a TCP port, secured with a key exchange, plus some ssh. How would the request reach the remote REST endpoint from the local client's shell session?

flx42 commented 8 years ago

Yes, if you don't specify vol, by default it will take all volumes. But try vol=bin; it should be different.

ruffsl commented 8 years ago

I see, that does work. Is there a way to specify none?

3XX0 commented 8 years ago

@ruffsl no, we didn't implement none because it doesn't really make sense to ask for no volumes.

Regarding the REST API, if you want remote access, you need to expose it. There has been some discussion of adding SSL to handle unsafe environments, but in practice you rarely need it. You can always tunnel it through ssh, and in fact I was thinking about adding a similar feature to the future nvidia-docker. I'm not really sure how docker-machine works, but from my understanding it uses DOCKER_HOST, hence your Docker daemon needs to be exposed as well (I might be wrong though).
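A minimal sketch of the ssh-tunnel option mentioned above (the host name `gpu-host` is hypothetical; assumes ssh access to the workstation running the plugin):

```shell
# Forward the plugin's remote API over ssh instead of exposing port 3476.
# While the tunnel runs, http://localhost:3476/gpu/status is reachable
# from the client machine.
tunnel='ssh -N -L 3476:localhost:3476 gpu-host'
# $tunnel &   # run in the background; kill it when done
```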

ruffsl commented 8 years ago

Well, it'd be tedious for people to override this while still leveraging the device detection and mounting machinery the rest of the plugin has to offer. Niche, I know, but this would be useful for those who'd like to use select Nvidia devices but don't need CUDA. Remember, people like me still need to bake the drivers into the container for some apps to get things like OpenGL working; I think these volumes may blow away some files we'd need to preserve at runtime in that scenario.

Yeah, regarding remote access, I feel like there might be a better way to go about this. I'm wondering if there is something better than just port forwarding with docker-machine ssh. Let's ask @psftw; maybe he'd know about this topic or know who to ask.

3XX0 commented 8 years ago

I don't get it; why would you want no volumes? GPU devices are unusable without at least one NVIDIA volume, and if you really need that then you don't need the /docker/cli endpoint: just use --device directly.

Speaking of which, I'm wondering if the current volume separation (i.e. cuda, bin, ...) is worth the trouble. I might change it to a single driver volume unless there is a reason not to.

cancan101 commented 8 years ago

Is it now possible to use docker-compose along with nvidia-docker? If so, how?

lsb commented 8 years ago

I too would be interested in using docker-compose along with nvidia-docker.

3XX0 commented 8 years ago

See #39 ;)