docker / compose

Define and run multi-container applications with Docker
https://docs.docker.com/compose/
Apache License 2.0

Support for NVIDIA GPUs under Docker Compose #6691

Closed · collabnix closed this issue 3 years ago

collabnix commented 5 years ago

Under Docker 19.03.0 Beta 2, support for NVIDIA GPUs has been introduced in the form of the new CLI option --gpus. https://github.com/docker/cli/pull/1714 talks about this enablement.

Now one can simply pass the --gpus option for GPU-accelerated Docker-based applications.

$ docker run -it --rm --gpus all ubuntu nvidia-smi
Unable to find image 'ubuntu:latest' locally
latest: Pulling from library/ubuntu
f476d66f5408: Pull complete 
8882c27f669e: Pull complete 
d9af21273955: Pull complete 
f5029279ec12: Pull complete 
Digest: sha256:d26d529daa4d8567167181d9d569f2a85da3c5ecaf539cace2c6223355d69981
Status: Downloaded newer image for ubuntu:latest
Tue May  7 15:52:15 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.116                Driver Version: 390.116                   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   39C    P0    22W /  75W |      0MiB /  7611MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
:~$ 

As of today, Compose doesn't support this. This is a feature request to enable Compose support for NVIDIA GPUs.
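
To make the ask concrete, a hypothetical Compose syntax mirroring the CLI flag might look like the sketch below (the gpus key is an illustration only, not an implemented option at the time of this request):

version: '3.7'
services:
    cuda:
        image: 'nvidia/cuda:9.0-base'
        command: 'nvidia-smi'
        gpus: all   # hypothetical: mirrors `docker run --gpus all`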

AgingChan commented 4 years ago

To be frank, this may not be the best practice, but somehow we made it work.

The tricky part is that we have to stick with docker-compose v3.x since we use Docker Swarm; meanwhile, we want to use the Nvidia runtime to support GPU/CUDA in the containers.

To avoid explicitly specifying the Nvidia runtime inside the docker-compose file, we set Nvidia as the default runtime in /etc/docker/daemon.json, which looks like:

{
    "default-runtime":"nvidia",
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}

That way, all containers running on the GPU machines enable the Nvidia runtime by default.
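
For example, a minimal sketch of this setup, assuming nvidia-container-runtime is installed (the service name and image tag are placeholders, not from the comment above):

# Restart the daemon so the new default runtime takes effect
sudo systemctl restart docker

# docker-compose.yml: no GPU-specific keys needed once nvidia is the default runtime
version: '3.7'
services:
    smi:
        image: 'nvidia/cuda:9.0-base'
        command: 'nvidia-smi'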

Hope this helps someone facing a similar blocker.

DoktorMike commented 4 years ago

> To be frank, this may not be the best practice, but somehow we made it work. […] That way, all containers running on the GPU machines enable the Nvidia runtime by default.

This is indeed what we do as well. It works for now, but it feels a little hacky to me. Hoping for full compose-v3 support soon. :)

opptimus commented 4 years ago

Is it intended that users manually populate /etc/docker/daemon.json after migrating to Docker >= 19.03 and removing nvidia-docker2 in favor of nvidia-container-toolkit?

It seems that this breaks a lot of installations, especially since --gpus is not available in Compose.

Because --gpus is not available in Compose, I cannot use PyCharm with Docker to run tensorflow-gpu.

qraleq commented 4 years ago

Any updates on this issue? Is there a chance that --gpus will be supported in docker-compose soon?

yoanisgil commented 4 years ago

For those of you looking for a workaround, this is what we ended up doing:

And then run COMPOSE_API_VERSION=auto docker-compose run gpu with the following file:

version: '3.7'

services:
    gpu:
        image: 'nvidia/cuda:9.0-base'
        command: 'nvidia-smi'
        device_requests:
            - capabilities:
               - "gpu"
Briwisdom commented 4 years ago

> Under Docker 19.03.0 Beta 2, support for NVIDIA GPUs has been introduced in the form of the new CLI option --gpus. […] As of today, Compose doesn't support this. This is a feature request to enable Compose support for NVIDIA GPUs.

I have solved this problem; you can try the following. My CSDN blog post: https://blog.csdn.net/u010420283/article/details/104055046

~$ sudo apt-get install nvidia-container-runtime
~$ sudo vim /etc/docker/daemon.json

Then, in this daemon.json file, add this content:

{ "default-runtime": "nvidia" "runtimes": { "nvidia": { "path": "/usr/bin/nvidia-container-runtime", "runtimeArgs": [] } } }

~$ sudo systemctl daemon-reload
~$ sudo systemctl restart docker
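
To confirm the default runtime took effect, a quick check (not from the original post):

~$ docker info | grep -i 'default runtime'
 Default Runtime: nvidia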

ggregoire commented 4 years ago

For the Ansible users who want to set up the workaround described above, there is a role to install nvidia-container-runtime and configure /etc/docker/daemon.json to use runtime: nvidia:

https://github.com/NVIDIA/ansible-role-nvidia-docker

(For some reason it runs only on Ubuntu and RHEL, but it's quite easy to modify; I run it on Debian.)

Then in your docker-compose.yml:

version: "2.4"
services:
  test:
    image: "nvidia/cuda:10.2-runtime-ubuntu18.04"
    command: "nvidia-smi"
dottgonzo commented 4 years ago

Any update on an official 3.x version with GPU support? We need it on Swarm :)

GuillemGSubies commented 4 years ago

Is there any plan to add this feature?

Lucidiot commented 4 years ago

This feature depends on docker-py implementing the device_requests parameter, which is what --gpus translates to. There have been multiple pull requests to add this feature (https://github.com/docker/docker-py/pull/2419, https://github.com/docker/docker-py/pull/2465, https://github.com/docker/docker-py/pull/2471), but there has been no reaction from any maintainer. #7124 uses https://github.com/docker/docker-py/pull/2471 to provide it in Compose, but still no reply from anyone.
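
For context, at the Engine API level the --gpus flag becomes a DeviceRequests entry in the container's HostConfig. Roughly, for --gpus all, the request body contains something like this (shown for illustration, per Engine API v1.40):

"HostConfig": {
    "DeviceRequests": [
        {
            "Driver": "",
            "Count": -1,
            "DeviceIDs": [],
            "Capabilities": [["gpu"]],
            "Options": {}
        }
    ]
}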

yoanisgil commented 4 years ago

As I mentioned in #7124, I'm more than happy to make the PR more compliant, but since it's gotten very little attention, I don't want to waste my time on something that's not going to be merged...

BruneXX commented 4 years ago

Please add this feature, it will be awesome!

wilderrodrigues commented 4 years ago

Please, add this feature! I was more than happy with the old nvidia-docker2, which allowed me to change the runtime in daemon.json. It would be extremely nice to have this back.

sebastianfelipe commented 4 years ago

Need it, please. Really need it :/

digitaldavenyc commented 4 years ago

I'd like to pile on as well... we need this feature!

vk1z commented 4 years ago

I need to run both CPU and GPU containers on the same machine, so the default-runtime hack doesn't work for me. Do we have any idea when this will work in Compose? Given that we don't have the runtime flag in Compose, this represents a serious functionality regression, does it not? I'm having to write scripts in order to make this work - yuck!

dottgonzo commented 4 years ago

> I need to run both CPU and GPU containers on the same machine, so the default-runtime hack doesn't work for me. […]

You can do it via the docker CLI (docker run --gpus ...); I use this kind of trick (adding a proxy to be able to communicate with other containers running on other nodes in the swarm). We are all waiting for the ability to run it on Swarm, because as far as I know it works neither with the docker service command nor with Compose.

vk1z commented 4 years ago

@dottgonzo. Well, yes ;-). I am aware of this, hence the reference to scripts. But this is a pretty awful and non-portable way of doing it, so I'd like to do it in a more dynamic way. As I said, I think this represents a regression, not a feature ask.

daddydrac commented 4 years ago

> COMPOSE_API_VERSION=auto docker-compose run gpu

@ggregoire where do we run COMPOSE_API_VERSION=auto docker-compose run gpu?

yoanisgil commented 4 years ago

@joehoeller from your shell, just as you would for any other command.

Mithrandir2k18 commented 4 years ago

Right now we decide for every project whether we need 3.x features or whether we can use docker-compose 2.x, where the GPU option is still supported. Features like running multi-stage targets from a Dockerfile sadly cannot be used if a GPU is necessary. Please add this back!

I'd like to recommend something like an "additional options" field for docker-compose where we can just add flags like --gpus=all to the docker start/run command that are not yet (or no longer) supported in docker-compose but are in the latest docker version. This way, compose users won't have to wait for docker-compose to catch up when they need a new, not-yet-supported docker feature. See the sketch below.
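
A sketch of what such a pass-through field could look like (the key name extra_run_args is hypothetical, not an existing Compose option):

version: '3.7'
services:
    train:
        image: 'nvidia/cuda:9.0-base'
        command: 'nvidia-smi'
        extra_run_args: ['--gpus=all']   # hypothetical pass-through to `docker run`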

sebastianfelipe commented 4 years ago

It is still necessary to run this on Docker Swarm for production environments. Will this be useful for Docker Swarm?

Mithrandir2k18 commented 4 years ago

@sebastianfelipe It's very useful if you want to deploy to your swarm using Compose. Compare:

docker service create --generic-resource "gpu=1" --replicas 10 \
  --name sparkWorker <image_name> \
  "service ssh start && /opt/spark/bin/spark-class org.apache.spark.deploy.worker.Worker spark://<spark_master_ip>:7077"

to something like this

docker stack deploy --compose-file docker-compose.yml stackdemo
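
For the swarm case specifically, later 3.x compose file formats can already express the --generic-resource reservation shown above; a sketch (it assumes each node advertises the gpu generic resource in its daemon configuration, and is typically combined with the default-runtime workaround discussed earlier):

version: "3.8"
services:
  sparkWorker:
    image: <image_name>
    deploy:
      replicas: 10
      resources:
        reservations:
          generic_resources:
            - discrete_resource_spec:
                kind: "gpu"
                value: 1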

sebastianfelipe commented 4 years ago

> @sebastianfelipe It's very useful if you want to deploy to your swarm using Compose. Compare: docker service create --generic-resource "gpu=1" […] to something like this: docker stack deploy --compose-file docker-compose.yml stackdemo

Sorry, so is it already working with Docker Swarm using the docker-compose yaml file? Just to be sure :O. Thanks!

Mithrandir2k18 commented 4 years ago

Only for docker-compose 2.x.

The entire point of this issue is to request nvidia-docker GPU support for docker-compose 3+.

miriaford commented 4 years ago

It's been almost a year since the original request!! Why the delay?? Can we move this forward ??

VanDavv commented 4 years ago

ping @KlaasH @ulyssessouza @Goryudyuma @chris-crone . Any update on this?

miriaford commented 4 years ago

> For those of you looking for a workaround, this is what we ended up doing: […] run COMPOSE_API_VERSION=auto docker-compose run gpu with the device_requests file above.

For those of you who are as impatient as I am, here's an easy pip install version of the above workaround:

pip install git+https://github.com/docker/docker-py.git@refs/pull/2471/merge
pip install git+https://github.com/docker/compose.git@refs/pull/7124/merge
pip install python-dotenv

Huge kudos to @yoanisgil! Still anxiously waiting for an official patch. With all the PRs in place, it doesn't seem difficult by any standard.

Goryudyuma commented 4 years ago

> ping @KlaasH @ulyssessouza @Goryudyuma @chris-crone. Any update on this?

No, I don't know why I was pinged. Could you tell me what you want me to do?

ugurkanates commented 4 years ago

I hope there is an update on this.

aviallon commented 4 years ago

Yeah, it's been more than a year now... why are they not merging it in docker-py?

chris-crone commented 4 years ago

I'm not sure that the proposed implementations are the right ones for the Compose format. The good news is that we've opened up the Compose format specification with the intention of adding things like this. You can find the spec at https://github.com/compose-spec.

What I'd suggest we do is add an issue on the spec and then discuss it at one of the upcoming Compose community meetings (link to invite at the bottom of this page).

deniswal commented 4 years ago

This works: docker run --gpus all nvidia/cudagl:9.2-runtime-centos7 nvidia-smi
This does not: docker run --runtime=nvidia nvidia/cudagl:9.2-runtime-centos7 nvidia-smi

You need to have

{
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}

in your /etc/docker/daemon.json for --runtime=nvidia to continue working. More info here.

Dockerd doesn't start with this daemon.json.
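
If dockerd won't start after editing daemon.json, the usual culprit is a JSON syntax error (a missing comma, for instance). Two quick checks, offered as general troubleshooting rather than anything specific to this thread:

~$ python3 -m json.tool /etc/docker/daemon.json   # validates the JSON syntax
~$ sudo journalctl -u docker.service --no-pager | tail -n 20   # shows dockerd's startup error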

ghost commented 4 years ago

Christ, this is going to take years :@

vk1z commented 4 years ago

> This works: docker run --gpus all nvidia/cudagl:9.2-runtime-centos7 nvidia-smi

@deniswal: Yes, we know this, but we are asking about Compose functionality.

@chris-crone: I'm confused: this represents a regression from former behavior, so why does it need a new feature specification? Isn't it reasonable to run containers, some of which use the GPU and some of which use the CPU, on the same physical box?

Thanks for the consideration.

chris-crone commented 4 years ago

@vk1z AFAIK Docker Compose has never had GPU support, so this is not a regression. The part that needs design is how to declare a service's need for a GPU (or other device) in the Compose format, specifically changes like this. After that, it should just be plumbing to the backend.
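
For a concrete picture of that kind of declaration: the Compose specification later settled on device reservations along these lines (shown for reference; this syntax postdates the comment above, and the image tag is a placeholder):

services:
  test:
    image: nvidia/cuda:10.2-base
    command: nvidia-smi
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]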

BruneXX commented 4 years ago

Hi guys, I've tried some of the solutions proposed here and nothing worked for me; for example, @miriaford's did not work in my case. Also, is there some way to use the GPU to run my existing docker containers? I have an i7 with 16GB of RAM, but the build for some projects takes too long to complete. My goal is to also use GPU power to speed up the process. Is that possible? Thanks!

vk1z commented 4 years ago

@chris-crone: Again, I'm willing to be corrected, but wasn't that because the runtime: parameter disappeared from compose after the 2.4 config? That is why I felt it was a regression. But no matter now, since we all should be on 3.x anyway.

I'd be glad to file an issue; we do that against the spec in the spec repo, correct?

tgpfeiffer commented 4 years ago

> but wasn't that because the runtime: parameter disappeared from compose after the 2.4 config? That is why I felt it was a regression.

Yes, exactly. I have a couple of projects where we rely on using runtime: nvidia in our docker-compose files, and this issue blocks us from upgrading to 3.x because we haven't found a way to use GPUs there.

Motophan commented 4 years ago

Hi, please, please, please fix this. This should be marked mission-critical, priority -20.

chris-crone commented 4 years ago

> Again, I'm willing to be corrected, but wasn't that because the runtime: parameter disappeared from compose after the 2.4 config? That is why I felt it was a regression. But no matter now, since we all should be on 3.x anyway.

I wasn't here when the change was made, so I'm not 100% sure why it was dropped. I know that you no longer need the NVIDIA runtime to use GPUs, and that we are evolving the Compose v3 spec in the open here, with the intention of making a single version of the spec. This may mean moving some v2 functionality into v3.

In terms of the runtime field, I don't think this is how it should be added to the Compose spec, as it is very specific to running on a single node. Ideally we'd want something that'd allow you to specify that your workload has a device need (e.g., GPU, TPU, whatever comes next) and then let the orchestrator assign the workload to a node that provides that capability.

This discussion should be had on the specification, though, as it's not specific to the Python Docker Compose implementation.

vk1z commented 4 years ago

@chris-crone: I mostly concur with your statement. Adding short-term hacks is probably the wrong way to do this, since we have a proliferation of edge devices, each with its own runtime: for example, as you point out, TPU (Google), VPU (Intel), and the ARM GPU on the Pi. So we do need a more complete story.

I'll file an issue against the specification today and update this thread once I have done so. However, I do think the orchestrator should be independent: if I want to use Kube, I should be able to do so. I'm assuming that will be in scope.

I do, however, disagree with the "using GPUs" statement, since that doesn't work with Compose, which is what this is all about. But I think we all understand what problem we would like solved.

vk1z commented 4 years ago

@chris-crone : Please see the docker-compose spec issue filed. I'll follow updates against that issue from now on.

miriaford commented 4 years ago

Can we simply add an option (something like extra_docker_run_args) to pass arguments directly to the underlying docker run? This would not only solve the current problem but also be future-proof: what if docker adds support for some "XPU", "YPU", or any other new feature that might come along?

If we need a long back-and-forth discussion every time docker adds a new feature, it will be extremely inefficient and cause inevitable delays (and unnecessary confusion) between docker-compose and docker updates. Supporting argument delegation could provide temporary relief for this recurring issue for all future features.

vk1z commented 4 years ago

@miriaford I'm not sure that passing an uninterpreted blob supports the compose notion of being declarative. The old runtime tag at least indicated that it had something to do with the runtime. Given the direction in which docker is trending (docker-apps), it seems to me that doing this would make declarative deployment harder, since an orchestrator would have to parse arbitrary blobs.

But I agree that compose and docker should be synchronized, and zapping working features that people depend on (even though it was a major release) isn't quite kosher.

miriaford commented 4 years ago

@vk1z I agree, there should be a much better sync mechanism between compose and docker. However, I don't expect such a mechanism to be designed any time soon. Meanwhile, we also need a temporary way to do our own synchronization without hacking deep into the source code.

If the argument-delegation proposal isn't an option, what do we suggest we do? I agree it isn't a pretty solution, but it's at least much better than this workaround, isn't it? https://github.com/docker/compose/issues/6691#issuecomment-616984053

andyneff commented 4 years ago

@miriaford docker-compose does not call the docker executable with arguments; it actually uses docker_py, which uses the HTTP API to talk to the docker daemon. So there is no "underlying docker run" command. The docker CLI is not the API; the socket connection is the API point of contact. This is why it is not always that easy.

To oversimplify things: in the process of running a container there are two main calls, one that creates the container and one that starts it. Each ingests different pieces of information, and knowing which is which takes someone with API knowledge, which most of us don't have the way we know the docker CLI. I do not think being able to add extra args to docker_py calls is going to be as useful as you think, except in select use cases.
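
A minimal sketch of that two-call flow using docker-py's low-level APIClient (the image and command are placeholders):

import docker

# Low-level client speaking to the daemon's socket API
api = docker.APIClient(base_url="unix://var/run/docker.sock")

# Call 1: create the container; most `docker run` flags land in the HostConfig here
container = api.create_container(
    image="ubuntu",
    command="echo hello",
    host_config=api.create_host_config(),
)

# Call 2: start it, a separate API call with its own parameters
api.start(container=container.get("Id"))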

To make things even more difficult, sometimes the docker_py library is behind the API and doesn't have everything you need right away, so you have to wait for it to be updated. All that being said, extra_docker_run_args isn't a simple solution.

miriaford commented 4 years ago

@andyneff Thanks for your explanation. Indeed, I'm not too familiar with the inner workings of Docker. If I understand correctly, there are four APIs that need to be manually synced for any new feature:

  1. Docker socket API
  2. docker_py that provides python frontend to the socket API
  3. Docker CLI (our familiar entry point to docker toolchain)
  4. Docker-compose interface that calls the docker socket API (via docker_py)

This begs the question: why is there no automatic (or at least semi-automatic) syncing mechanism? Manually propagating new feature updates across four APIs seems doomed to be error-prone, delay-prone, and confusing...

miriaford commented 4 years ago

P.S. I'm not saying it's a simple task to build automatic syncing, but I really think there should be one to make life easier in the future.

andyneff commented 4 years ago

I'm kinda getting into pedantics now... but I would describe it as:

So yes, it goes: compose → docker_py → docker socket API → dockerd, while the docker CLI talks to the same socket API directly.

I can't speak for docker_py or compose, but I imagine they have limited man-hours contributed to them, so it's harder to keep up with ALL the crazy insane features that docker is CONSTANTLY adding. And docker is a Go library, while my understanding is that Python support is not (currently) a first-class citizen. Although it is nice that both projects are under the docker umbrella, at least from a GitHub organization standpoint.


So, that all being said... I too am waiting for equivalent --gpus support, and I have to use the old runtime: nvidia method instead, which at least gives me "a" path to move forward in docker-compose 2.x.