docker / compose

Define and run multi-container applications with Docker
https://docs.docker.com/compose/
Apache License 2.0

Support for NVIDIA GPUs under Docker Compose #6691

Closed: collabnix closed this 3 years ago

collabnix commented 5 years ago

Under Docker 19.03.0 Beta 2, support for NVIDIA GPUs has been introduced in the form of a new CLI option, --gpus. https://github.com/docker/cli/pull/1714 discusses this enablement.

Now one can simply pass the --gpus option for GPU-accelerated, Docker-based applications:

$ docker run -it --rm --gpus all ubuntu nvidia-smi
Unable to find image 'ubuntu:latest' locally
latest: Pulling from library/ubuntu
f476d66f5408: Pull complete 
8882c27f669e: Pull complete 
d9af21273955: Pull complete 
f5029279ec12: Pull complete 
Digest: sha256:d26d529daa4d8567167181d9d569f2a85da3c5ecaf539cace2c6223355d69981
Status: Downloaded newer image for ubuntu:latest
Tue May  7 15:52:15 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.116                Driver Version: 390.116                   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   39C    P0    22W /  75W |      0MiB /  7611MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

As of today, Compose doesn't support this. This is a feature request to enable Compose support for NVIDIA GPUs.
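
As a point of reference, the Compose specification later standardized device reservations under the deploy section; a minimal sketch (image and service name taken for illustration, requires a Compose version with device-reservation support):

```yaml
services:
  gpu-test:
    image: nvidia/cuda:10.2-runtime
    command: nvidia-smi
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
```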

lig commented 4 years ago

@andyneff FYI there is Docker CLI support in the latest docker-compose. It allows using buildkit for instance. https://www.docker.com/blog/faster-builds-in-compose-thanks-to-buildkit-support/

miriaford commented 4 years ago

@andyneff this is a very helpful overview! Thanks again

andyneff commented 4 years ago

@lig awesome! Thanks for the correction! I was actually thinking "How will buildkit fit into all this" as I was writing that up

vk1z commented 4 years ago

What I am a bit surprised by is that docker-compose is a pretty intrinsic part of the new docker-app framework, and I'd imagine they'd want to sync up docker-compose and docker for at least that reason. I wonder what the blocker really is: not enough Python bandwidth? Seems a bit unbelievable.

carlfischerjba commented 4 years ago

So how does Docker Swarm fit into the structure that @andyneff just described? Swarm uses the compose file format version 3 (defined by the "compose" project?) but is developed as part of docker?

Apologies if that's off-topic for this particular issue. I've rather lost track of which issue is which, but I started following this because I'd like to be able to tell a service running on a swarm that it needs to use a particular runtime. We can only do that with v2 of the compose-file spec, which means we can't do it with Swarm, which requires v3. In other words, I'm not really interested in what the docker-compose CLI does, only in the spec defined for docker-compose.yml files that are consumed by Docker Swarm.

andyneff commented 4 years ago

Oh Swarm, the one that got away... (from me). Unfortunately that is #6239, which got closed by a bot. :( Someone tried in #6240 but was told that...

@miriaford, it looks like there is a PR for syncing them! #6642?! (Is this just for v3???)


Because of the nature of Swarm, there are certain things you do and don't do on swarm nodes. So the Docker API doesn't always allow the same options on a swarm run as on a normal run. I don't know offhand whether runtime is one of these things, but that is often why you can't do things in v3 (the swarm-compatible version) that you can in v2 (the non-swarm-compatible version).

Motophan commented 4 years ago

No one reading this knows what you guys are talking about. We are all just trying to deploy Jellyfin with hardware acceleration. Until you fix this back to the way it's supposed to be, 3.x is no good where services are concerned. Don't use it.

You need to put 2.4 for the service. Then you can use hardware acceleration for Jellyfin, ez
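
A minimal sketch of that suggestion: a version 2.4 file using the runtime key (the jellyfin image name is illustrative; runtime: nvidia assumes nvidia-docker2 is installed):

```yaml
version: '2.4'

services:
  jellyfin:
    image: jellyfin/jellyfin   # illustrative image name
    runtime: nvidia            # 'runtime' is supported in compose file format 2.3+
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
```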

ghost commented 4 years ago

So come on guys, what's the ETA on this? 1 year? 2 years?

fakabbir commented 4 years ago

@KlaasH @ulyssessouza @Goryudyuma @chris-crone Hi, I'm working on this issue. I found that the support was missing in docker-py and have worked on that part. Now, to get it working, I need to pass the configs via the docker-compose.yml file. Can you help me with the schema? i.e. in order to add it, should I add a new schema, or is there a place where the configs could be passed?

lig commented 4 years ago

@fakabbir I would assume it is OK to just use COMPOSE_DOCKER_CLI_BUILD for this. Adding the ability to provide an arbitrary list of docker run arguments could even help avoid similar issues in the future.

hadim commented 4 years ago

@lig how do you deal when only one service requires access to a GPU?

ben-z commented 4 years ago

@lig AFAICS compose uses docker-py instead of the docker run cli. So adding an arbitrary docker run arguments wouldn't work unless docker-py supports it as well.

ref: https://github.com/docker/compose/issues/6691#issuecomment-585199425

inquam commented 4 years ago

This single thing hugely reduces the usefulness of docker-compose for many people. That it hasn't seen much attention or desire to fix it, especially when it worked in older docker-compose, is quite astonishing. Wouldn't one way to go be to allow arbitrary docker run arguments to be given in a docker-compose file? Then --gpus all, for instance, could be passed to docker.

I understand there can be philosophical or technical reasons why one might want to do it in a particular way. But not getting hands-on and doing it in ANY way staggers the mind.

inquam commented 4 years ago

@lig how do you deal when only one service requires access to a GPU?

Well, the environment variable NVIDIA_VISIBLE_DEVICES will allow you to control that, no?
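
With the 2.x runtime approach, per-service GPU visibility can be sketched like this (service names and the device index 0 are illustrative):

```yaml
version: '2.4'

services:
  gpu-worker:
    image: nvidia/cuda:10.2-runtime
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=0   # this service sees only GPU 0
  web:
    image: ubuntu
    # no runtime set, so no GPU access for this service
```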

hadim commented 4 years ago

This single thing hugely reduces the usefulness of docker-compose for many people. That it hasn't seen much attention or desire to fix it, especially when it worked in older docker-compose, is quite astonishing. Wouldn't one way to go be to allow arbitrary docker run arguments to be given in a docker-compose file? Then --gpus all, for instance, could be passed to docker.

I don't think allowing arbitrary docker run args to be passed through is the way to go; compose does not really call docker by itself but instead uses docker-py.

I understand there can be philosophical or technical reasons why one might want to do it in a particular way. But not getting hands-on and doing it in ANY way staggers the mind.

A PR is open about it: https://github.com/docker/compose/pull/7124. Please feel free to "get your hands on it".

vk1z commented 4 years ago

I believe that, per the change in the Compose spec, we should soon be back to the earlier compatibility as of compose 2.4, and the nvidia runtime will work. It obviously won't work for TPUs or other accelerators, which is very unfortunate, but for those who want to run (expensive) NVIDIA GPUs, it will work.

ghost commented 4 years ago

So just waiting on a green PR in docker-py to be merged https://github.com/docker/docker-py/pull/2471

ghost commented 4 years ago

YEAH! The PR over at docker-py has been approved! https://github.com/docker/docker-py/pull/2471 What's the next step here?

nicolas-b12 commented 4 years ago

What's up here? It would be cool to be able to support the nvidia runtime in docker-compose.

hadim commented 4 years ago

https://github.com/docker/docker-py/pull/2471 has been merged.

bkakilli commented 4 years ago

Now that docker/docker-py#2471 has been merged, we can install docker-py from master. But since docker-compose has changed since @yoanisgil's cool [PR](https://github.com/docker/compose/pull/7124) (kudos!), that PR is unlikely to be merged as-is. So at this point, docker-compose can be installed from that PR to save the day.

For those who ended up here without seeing the previous comments:

pip install git+https://github.com/docker/docker-py.git
pip install git+https://github.com/yoanisgil/compose.git@device-requests

Then run COMPOSE_API_VERSION=auto docker-compose run gpu with the following template in your compose file (source: comment):

version: '3.7'

services:
    gpu:
        image: 'nvidia/cuda:9.0-base'
        command: 'nvidia-smi'
        device_requests:
            - capabilities:
               - "gpu"

I confirm that this worked on my local machine. Don't know if it works with Swarm.

ghost commented 4 years ago

Can't have a particular commit of docker-compose in production. Does #7124 need to be rebased, or is there another PR that's going to incorporate the new docker-py?

frgfm commented 4 years ago

Hi there @bkakilli,

Thanks for the help! I just tried your suggestion, but I get an error running my docker-compose

ERROR: The Compose file './docker-compose.yml' is invalid because:
Unsupported config option for services.analysis: 'device_requests'

analysis being my container's name

I changed my docker-compose.yml from:

version: '2.3'

services:
    analysis:
        container_name: analysis
        image: analysis:${TAG}
        runtime: nvidia
        restart: always
        ports:
            - "8000:80"

to:

version: '3.7'

services:
    analysis:
        container_name: analysis
        image: analysis:${TAG}
        device_requests:
          - capabilities:
            - "gpu"
        restart: always
        ports:
            - "8000:80"

Is there anything else, apart from the two pip install git+ commands, needed to correctly set this up? Or perhaps I edited the configuration file badly?

bkakilli commented 4 years ago

@frgfm make sure you're installing compose and docker-py from the correct links. You may have used docker-compose's own repo instead of yoanisgil's fork (and branch). Check that you're using the following link:

pip install git+https://github.com/yoanisgil/compose.git@device-requests

You may try adding the --upgrade flag to pip install. Otherwise I would suspect the virtual environment settings. Maybe you have another docker-compose installation which is being used by default? E.g. you may have installed it system-wide with the "Linux" instructions here: https://docs.docker.com/compose/install/. I suggest you take a look at "Alternative Install Options" and install via pip in a virtual environment (but use the pip install command above; don't install the default docker-compose from PyPI).

ghost commented 4 years ago

Hi! Thanks for all the info. I was trying your approach, @bkakilli: docker-compose build worked, but when running docker-compose up I got the error docker.errors.InvalidVersion: device_requests param is not supported in API versions < 1.40.

My docker-compose.yml looks like this:

version: '3.7'

networks:
  isolation-network:
    driver: bridge

services:
  li_t5_service:
    build: .
    ports:
      - "${GRAPH_QL_API_PORT}:5001"
    device_requests:
      - capabilities:
          - "gpu"
    environment:
      - SSH_PRIVATE_KEY=${SSH_PRIVATE_KEY}
      - PYTHONUNBUFFERED=${PYTHONUNBUFFERED}
    networks:
      - isolation-network

Thanks in advance!

EpicWink commented 4 years ago

@ugmSorcero Set the environment variable COMPOSE_API_VERSION=1.40 then re-run your commands

jjrugui commented 4 years ago

@ugmSorcero did you manage to fix that error? @EpicWink @bkakilli I'm running the version stated in the pip install, but I still get the device_requests param is not supported in API versions < 1.40 error even if I export that variable set to 1.40.

EpicWink commented 4 years ago

For the given compose file

version: "3.7"
services:
  spam:
    image: nvidia/cuda:10.1-cudnn7-runtime
    command: nvidia-smi
    device_requests:
      - capabilities:
          - gpu

Using the version of docker-compose installed as above, in Bash on Linux, the following command succeeds:

COMPOSE_API_VERSION=1.40 docker-compose up

The following command fails:

docker-compose up

This has error output:

ERROR: for tmp_spam_1  device_requests param is not supported in API versions < 1.40
...
docker.errors.InvalidVersion: device_requests param is not supported in API versions < 1.40

jjrugui commented 4 years ago

@EpicWink thank you very much. I didn't realize that docker-compose up had to be executed that way; I took it as two steps, first exporting COMPOSE_API_VERSION separately. Running them together seems to work :)

I have another issue, though. If I run COMPOSE_API_VERSION=1.40 docker-compose run nvidiatest then nvidia-smi is not found in the path, while if I run directly from the image there is no issue.

Here's how I'm reproducing it.

My docker-compose.local.yml file contains:

nvidiatest:
    image: nvidia/cuda:10.0-base
    device_requests:
      - capabilities:
        - gpu
    command: nvidia-smi

If I run my current setup (both api version auto and 1.40) I get the following error:

COMPOSE_API_VERSION=auto docker-compose -f docker-compose.yml -f docker-compose.local.yml run nvidiatest
Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "exec: \"nvidia-smi\": executable file not found in $PATH": unknown

Is it possible that it has to do with using override files? If I just run the cuda base image with Docker there's no problem with getting output from nvidia-smi:

docker run --gpus all nvidia/cuda:10.0-base nvidia-smi
Mon Aug 24 11:40:04 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.100      Driver Version: 440.100      CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 2070    Off  | 00000000:29:00.0  On |                  N/A |
|  0%   46C    P8    19W / 175W |    427MiB /  7974MiB |      2%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

I installed docker-compose following the instructions above from git after uninstalling the version installed from the official docs. Here's the info of the version installed:

pip3 show --verbose docker-compose
Name: docker-compose
Version: 1.26.0.dev0
Summary: Multi-container orchestration for Docker
Home-page: https://www.docker.com/
Author: Docker, Inc.
Author-email: None
License: Apache License 2.0
Location: /home/jurugu/.local/lib/python3.8/site-packages
Requires: docopt, docker, requests, PyYAML, texttable, websocket-client, six, dockerpty, jsonschema, cached-property
Required-by:
Metadata-Version: 2.1
Installer: pip
Classifiers:
  Development Status :: 5 - Production/Stable
  Environment :: Console
  Intended Audience :: Developers
  License :: OSI Approved :: Apache Software License
  Programming Language :: Python :: 2
  Programming Language :: Python :: 2.7
  Programming Language :: Python :: 3
  Programming Language :: Python :: 3.4
  Programming Language :: Python :: 3.6
  Programming Language :: Python :: 3.7
Entry-points:
  [console_scripts]
  docker-compose = compose.cli.main:main

Am I missing anything? Thanks for the help!

EpicWink commented 4 years ago

@jjrugui this is becoming off-topic, and I'm not able to replicate your issue. Sorry for not being able to help

jjrugui commented 4 years ago

@EpicWink not a problem, and sorry for deviating from the topic :). If I figure out my particular issue I'll post it here if it's relevant.

ghost commented 4 years ago

Is someone working on another PR or are we debugging the device-requests branch in order to get ready for a PR?

visheratin commented 4 years ago

While the PR is stuck, I ported the changes from #7124 to the latest version of the master branch to match dependencies, etc.: https://github.com/beehiveai/compose. You can install it with pip install git+https://github.com/beehiveai/compose.git and change the version in docker-compose.yml to 3.8:

version: "3.8"
services:
  gpu-test:
    image: nvidia/cuda:10.2-runtime
    command: nvidia-smi
    device_requests:
      - capabilities:
          - gpu

In this setting, everything works as expected.

ndeloof commented 4 years ago

As discussed yesterday in the compose-spec governance meeting, we will start working on a proposal to adopt something comparable to #7124, which could be close to generic_resources, already available in the deploy section.

visheratin commented 4 years ago

@ndeloof That is great! If it is possible, please post the link to the proposal here. I think many people would be happy to contribute to this since GPU support is critical for deep learning deployments.

awhillas commented 4 years ago

@ndeloof historically, how long does it take the steering committee to make a decision, 6 months, a year?

cvlvxi commented 4 years ago

+1

jaxs-ribs commented 4 years ago

+1

proximous commented 4 years ago

@visheratin Any chance you can improve your fix so that it works when using multiple compose yml files? I have a base docker-compose.yml that uses a non-NVIDIA container, which I want to override with an NVIDIA container when there is a GPU. However, it seems that with your fix, if I specify multiple compose yml files with -f, the device_requests field drops out of the config.

visheratin commented 4 years ago

@proximous What do you mean by "drops out of the config"? Do all compose files have version 3.8? Can you share the example so it would be easier to reproduce?

jlaule commented 4 years ago

Having a problem with the code in compose/service.py when trying to use the --scale option with docker-compose up. Is this not supported?

Traceback (most recent call last):
  File "/usr/local/bin/docker-compose", line 11, in <module>
    load_entry_point('docker-compose==1.27.0.dev0', 'console_scripts', 'docker-compose')()
  File "/usr/local/lib/python3.6/site-packages/compose/cli/main.py", line 67, in main
    command()
  File "/usr/local/lib/python3.6/site-packages/compose/cli/main.py", line 123, in perform_command
    handler(command, command_options)
  File "/usr/local/lib/python3.6/site-packages/compose/cli/main.py", line 1067, in up
    to_attach = up(False)
  File "/usr/local/lib/python3.6/site-packages/compose/cli/main.py", line 1063, in up
    cli=native_builder,
  File "/usr/local/lib/python3.6/site-packages/compose/project.py", line 648, in up
    get_deps,
  File "/usr/local/lib/python3.6/site-packages/compose/parallel.py", line 108, in parallel_execute
    raise error_to_reraise
  File "/usr/local/lib/python3.6/site-packages/compose/parallel.py", line 206, in producer
    result = func(obj)
  File "/usr/local/lib/python3.6/site-packages/compose/project.py", line 634, in do
    override_options=override_options,
  File "/usr/local/lib/python3.6/site-packages/compose/service.py", line 579, in execute_convergence_plan
    renew_anonymous_volumes,
  File "/usr/local/lib/python3.6/site-packages/compose/service.py", line 509, in _execute_convergence_recreate
    scale - len(containers), detached, start
  File "/usr/local/lib/python3.6/site-packages/compose/service.py", line 479, in _execute_convergence_create
    "Creating"
  File "/usr/local/lib/python3.6/site-packages/compose/parallel.py", line 108, in parallel_execute
    raise error_to_reraise
  File "/usr/local/lib/python3.6/site-packages/compose/parallel.py", line 206, in producer
    result = func(obj)
  File "/usr/local/lib/python3.6/site-packages/compose/service.py", line 477, in <lambda>
    lambda service_name: create_and_start(self, service_name.number),
  File "/usr/local/lib/python3.6/site-packages/compose/service.py", line 456, in create_and_start
    container = service.create_container(number=n, quiet=True)
  File "/usr/local/lib/python3.6/site-packages/compose/service.py", line 333, in create_container
    previous_container=previous_container,
  File "/usr/local/lib/python3.6/site-packages/compose/service.py", line 936, in _get_container_create_options
    one_off=one_off)
  File "/usr/local/lib/python3.6/site-packages/compose/service.py", line 1014, in _get_container_host_config
    element.split(',') for element in device_request['capabilities']]
  File "/usr/local/lib/python3.6/site-packages/compose/service.py", line 1014, in <listcomp>
    element.split(',') for element in device_request['capabilities']]
AttributeError: 'list' object has no attribute 'split'

After further debugging, I found that when using --scale, for some reason one instance has device_request['capabilities'] as ['gpu'], but for all other containers to be started, device_request['capabilities'] instead looks like [['gpu']].

I made a temporary fix locally to get around this issue just to get my containers up and running starting at line 1010 in compose/service.py:

        for device_request in device_requests:
            if 'capabilities' not in device_request:
                continue
            # Handle both the flat ['gpu'] and the nested [['gpu']] shapes;
            # split on ',' to match the upstream code at line 1014.
            if isinstance(device_request['capabilities'][0], list):
                device_request['capabilities'] = [
                    element.split(',') for element in device_request['capabilities'][0]]
            else:
                device_request['capabilities'] = [
                    element.split(',') for element in device_request['capabilities']]
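
The symptom above, capabilities arriving sometimes as ['gpu'] and sometimes as [['gpu']], can be guarded against with a small normalization helper. This is only a sketch; normalize_capabilities is a hypothetical name, not a compose API:

```python
def normalize_capabilities(capabilities):
    """Normalize a capabilities value that may arrive flat or nested.

    Returns the list-of-lists shape the Docker API expects, e.g. [["gpu"]],
    whether the input is ["gpu"], ["gpu,utility"], or already [["gpu"]].
    """
    if capabilities and isinstance(capabilities[0], list):
        # Already nested: keep the inner lists as-is.
        return [list(inner) for inner in capabilities]
    # Flat list: wrap each entry, splitting comma-joined capability strings.
    return [element.split(',') for element in capabilities]
```

For example, normalize_capabilities(['gpu']) and normalize_capabilities([['gpu']]) both yield [['gpu']], so it is safe to call regardless of which shape --scale produced.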

proximous commented 4 years ago

@proximous What do you mean by "drops out of the config"? Do all compose files have version 3.8? Can you share the example so it would be easier to reproduce?

@visheratin see this example, am I wrong to expect a different result?

docker-compose.nogpu.yml:

version: '3.8'

services:
  df:
    build: miniconda-image.Dockerfile

docker-compose.gpu.yml:

version: '3.8'

services:
  df:
    build: nvidia-image.Dockerfile
    device_requests:
      - capabilities:
          - gpu

use only the nogpu.yml:

$ docker-compose -f docker-compose.nogpu.yml config
services:
  df:
    build:
      context: /home/jerry/gpu-test/miniconda-image.Dockerfile
version: '3'

use only the gpu.yml:

$ docker-compose -f docker-compose.gpu.yml config
services:
  df:
    build:
      context: /home/jerry/gpu-test/nvidia-image.Dockerfile
    device_requests:
    - capabilities:
      - gpu
version: '3'

chain config ymls starting with a non-gpu yml (note: device_requests is missing):

$ docker-compose -f docker-compose.nogpu.yml -f docker-compose.gpu.yml config
services:
  df:
    build:
      context: /home/jerry/gpu-test/nvidia-image.Dockerfile
version: '3'

expected output:

$ docker-compose -f docker-compose.nogpu.yml -f docker-compose.gpu.yml config
services:
  df:
    build:
      context: /home/jerry/gpu-test/nvidia-image.Dockerfile
    device_requests:
      - capabilities:
          - gpu
version: '3'

(Obviously I'm trying to do something more elaborate; this is just a simplified case to highlight the unexpected behavior.)
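
A plausible explanation for the dropped field (an assumption, not confirmed from the compose source) is that the override merger copies only keys it recognizes. A minimal sketch of that failure mode, where merge_service and KNOWN_KEYS are hypothetical illustrations:

```python
# Hypothetical whitelist-based service merge, illustrating how an
# unrecognized key like 'device_requests' can silently drop during
# `docker-compose -f base.yml -f override.yml` merging.
KNOWN_KEYS = {"build", "image", "ports", "environment"}  # no 'device_requests'

def merge_service(base: dict, override: dict) -> dict:
    """Merge an override service definition onto a base one."""
    merged = dict(base)
    for key, value in override.items():
        if key in KNOWN_KEYS:      # unknown keys are skipped...
            merged[key] = value    # ...so 'device_requests' never lands
    return merged
```

If the real merger works this way, the fix would be teaching it about the new key rather than changing the yml files.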

visheratin commented 4 years ago

@jlaule @proximous In order to keep this thread on topic, please create issues in the forked repo, I will look into them when I have time.

ghost commented 4 years ago

For those who need something while waiting, I just set up K3s (the edge version of Kubernetes) with GPU support in 30 minutes, using Docker as the container runtime (i.e. pass the --docker option to the install script). Follow https://github.com/NVIDIA/k8s-device-plugin to get the NVIDIA device plugin working. Hope that helps!

tranv94 commented 4 years ago

@EpicWink not a problem, and sorry for deviating from the topic :). If I figure out my particular issue I'll post it here if it's relevant.

Did you ever resolve this?

NazgulLee commented 4 years ago

There is no such thing as "/usr/bin/nvidia-container-runtime" anymore. The issue is still critical.

Install nvidia-docker2 as instructed here

haviduck commented 4 years ago

I've been tackling this lately and thought I'd share my approach. My problem was that I needed docker stack deploy, and it didn't want to listen. I had docker-compose working with the Docker API version hack, but it didn't feel right, and stack deploy wouldn't work regardless.

So, without setting any runtime or device requests in my compose file, I added this to my daemon configuration (/etc/docker/daemon.json):

{
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    },
    "default-runtime": "nvidia",
    "node-generic-resources": [
        "NVIDIA-GPU=0"
    ]
}

You can also use GPU-{first part of the GPU GUID}, but this was easier. I didn't have to install anything via pip or the like, except the NVIDIA Container Toolkit. It deploys and works like a charm.

pommedeterresautee commented 4 years ago

Thanks a lot @haviduck, just tried it on my own machine (Ubuntu 20.04, Docker CE 19.03.8) and it worked like a charm. For others: don't forget to restart your docker daemon.

haviduck commented 4 years ago

@pommedeterresautee ah, I'm so glad it worked for others! Should have mentioned the reload.

Gotta say, after 3 weeks of non-stop dockering I'm pretty baffled how nothing ever seems to just work...

inquam commented 4 years ago

@haviduck: Thank you! Finally a simple solution that just works. I had spent so much time trying to add devices etc. that I gave up. Then this came along; I tried it, and after a couple of minutes I had hardware transcoding working in Plex.