aws / containers-roadmap

This is the public roadmap for AWS container services (ECS, ECR, Fargate, and EKS).
https://aws.amazon.com/about-aws/whats-new/containers/

[ECS] Add support for GPU with Docker 19.03 #457

Open mfortin opened 5 years ago

mfortin commented 5 years ago

Summary

Docker 19.03 now has built-in support for GPUs, so there is no longer a need to specify an alternate runtime. However, --gpus all (or a specific set of GPUs) must be passed as an argument at run time, and this cannot be done through the dockerd config.

Description

Without --gpus all

$ docker run --rm nvidia/cuda:10.1-base nvidia-smi
docker: Error response from daemon: OCI runtime create failed: container_linux.go:345: starting container process caused "exec: \"nvidia-smi\": executable file not found in $PATH": unknown.

With --gpus all

$ docker run --gpus all --runtime runc --rm nvidia/cuda:10.1-base nvidia-smi
Thu Aug 29 13:54:36 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.00    Driver Version: 418.87.00    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 00000000:00:0F.0 Off |                    0 |
| N/A   35C    P8    26W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           On   | 00000000:00:10.0 Off |                    0 |
| N/A   32C    P8    29W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K80           On   | 00000000:00:11.0 Off |                    0 |
| N/A   40C    P8    27W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K80           On   | 00000000:00:12.0 Off |                    0 |
| N/A   35C    P8    29W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla K80           On   | 00000000:00:13.0 Off |                    0 |
| N/A   36C    P8    26W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla K80           On   | 00000000:00:14.0 Off |                    0 |
| N/A   34C    P8    30W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla K80           On   | 00000000:00:15.0 Off |                    0 |
| N/A   40C    P8    26W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Tesla K80           On   | 00000000:00:16.0 Off |                    0 |
| N/A   33C    P8    29W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   8  Tesla K80           On   | 00000000:00:17.0 Off |                    0 |
| N/A   36C    P8    26W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   9  Tesla K80           On   | 00000000:00:18.0 Off |                    0 |
| N/A   31C    P8    30W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|  10  Tesla K80           On   | 00000000:00:19.0 Off |                    0 |
| N/A   37C    P8    26W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|  11  Tesla K80           On   | 00000000:00:1A.0 Off |                    0 |
| N/A   33C    P8    29W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|  12  Tesla K80           On   | 00000000:00:1B.0 Off |                    0 |
| N/A   38C    P8    26W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|  13  Tesla K80           On   | 00000000:00:1C.0 Off |                    0 |
| N/A   34C    P8    32W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|  14  Tesla K80           On   | 00000000:00:1D.0 Off |                    0 |
| N/A   39C    P8    27W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|  15  Tesla K80           On   | 00000000:00:1E.0 Off |                    0 |
| N/A   34C    P8    30W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Expected Behavior

Containers with GPU requirements should start

Observed Behavior

Environment Details

$ docker info
Client:
 Debug Mode: false

Server:
 Containers: 2
  Running: 1
  Paused: 0
  Stopped: 1
 Images: 4
 Server Version: 19.03.1
 Storage Driver: overlay2
  Backing Filesystem: xfs
  Supports d_type: true
  Native Overlay Diff: true
 Logging Driver: splunk
 Cgroup Driver: cgroupfs
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 894b81a4b802e4eb2a91d1ce216b8817763c29fb
 runc version: 425e105d5a03fabd737a126ad93d62a9eeede87f
 init version: fec3683
 Security Options:
  seccomp
   Profile: default
 Kernel Version: 3.10.0-957.27.2.el7.x86_64
 Operating System: CentOS Linux 7 (Core)
 OSType: linux
 Architecture: x86_64
 CPUs: 64
 Total Memory: 720.3GiB
 Name: ip-10-45-8-153.us-west-2.compute.internal
 ID: GA4Z:BCED:2FQG:AUKO:KUAX:7X5W:SBAR:NWB3:IHCH:6HQN:TIFW:PLOB
 Docker Root Dir: /var/lib/docker
 Debug Mode: true
  File Descriptors: 31
  Goroutines: 51
  System Time: 2019-08-29T13:55:51.172917082Z
  EventsListeners: 0
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: true
$ curl http://localhost:51678/v1/metadata
{"Cluster":"BATCHCLUSTER_Batch_0d2792a4-22a0-37e9-8e8a-5e8b68c1be17","ContainerInstanceArn":"arn:aws:ecs:us-west-2::container-instance/BATCHCLUSTER_Batch_0d2792a4-22a0-37e9-8e8a-5e8b68c1be17/50e649e34b83423189684b82669a1cea","Version":"Amazon ECS Agent - v1.30.0 (02ff320c)"}

Supporting Log Snippets

Logs can be provided on request.

sharanyad commented 5 years ago

ECS already has support for running workloads that leverage GPUs - https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-gpu.html

Blog post: https://aws.amazon.com/blogs/compute/scheduling-gpus-for-deep-learning-tasks-on-amazon-ecs/

Are you looking for something else that's not provided by the feature?
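
(For reference, the linked feature works by declaring the GPU requirement in the task definition rather than passing --gpus to Docker directly; a minimal container-definition fragment, with a placeholder name and image, might look like this:)

"containerDefinitions": [
  {
    "name": "gpu-test",
    "image": "nvidia/cuda:10.1-base",
    "command": ["nvidia-smi"],
    "memory": 512,
    "essential": true,
    "resourceRequirements": [
      { "type": "GPU", "value": "1" }
    ]
  }
]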

mfortin commented 5 years ago

I know, but it does not work with Docker 19.03, only with 18.09.

mfortin commented 5 years ago

With Docker 18.09, you had to specify --runtime nvidia to use GPUs. With 19.03 that is no longer required; however, you have to pass the --gpus argument to a container at runtime to expose the GPUs.
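
(Roughly, the difference between the two versions looks like this, using the same CUDA image as in the example above:)

# Docker 18.09: GPUs are exposed by selecting the nvidia runtime
docker run --runtime nvidia --rm nvidia/cuda:10.1-base nvidia-smi

# Docker 19.03: GPUs are requested with the built-in --gpus flag
docker run --gpus all --rm nvidia/cuda:10.1-base nvidia-smi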

alecks3474 commented 4 years ago

Here is a workaround: change the default runtime for Docker on GPU instances:
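
(One way to do that is to set "default-runtime" in /etc/docker/daemon.json; a minimal sketch, assuming nvidia-container-runtime is installed at its default path:)

{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}

(Restart the Docker daemon afterwards, e.g. sudo systemctl restart docker.)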

tedivm commented 4 years ago

This would be really nice. I recently spent some time getting GPU support working on my own Ubuntu AMIs, and one thing I ran into was that the agent still tries to force the nvidia runtime even though it is no longer required. Switching to the new --gpus argument instead of the runtime argument would simplify provisioning machines.

tiagomatic commented 4 years ago

We ran into this issue ourselves. This bug of using the wrong runtime with Docker 19.03 is blocking us from using a dynamic ECS instance directly from a task via an Auto Scaling group. There is no way to control the Docker runtime, Docker version, or AMI for an ECS Auto Scaling group.

As mentioned, the Docker runtime (nvidia) and version (19.03) do currently work together on GPU instances run entirely without ECS and with no modifications to the instance.

kevinclark commented 3 years ago

Just checking to see if there are any plans to do something about this issue. I don't see anybody assigned, but this seems like a major shortcoming of ECS's GPU support. It would be nice to let ECS start containers that need a GPU without having to hack in device support.

Edit: I'm not able to get the workaround working. Hardcoding the nvidia runtime as the default makes the ecs-agent fail to come up (it needs runc). Leaving runc as the default runtime fails to bring up the nvidia runtime for my GPU-related container.

alecks3474 commented 3 years ago

Edit: I'm not able to get the workaround working. Hardcoding the nvidia runtime as the default makes the ecs-agent fail to come up (it needs runc). Leaving runc as the default runtime fails to bring up the nvidia runtime for my GPU-related container.

For the ECS agent, you have to build your own AMI and run the ECS agent as a systemd service with the runtime option set explicitly:

[Unit]
Description=AWS ECS Agent

[Service]
...
ExecStart=/usr/bin/docker run \
    --runtime=runc \
    --name=ecs-agent \

I use Packer with Ansible to easily build my own AMI.

But you're right: it is just a workaround, and proper ECS GPU support would be appreciated!
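
(For the curious, a fuller sketch of such a unit, filling in the usual flags from AWS's manual agent-install instructions rather than reproducing the exact unit above; paths and the image tag are assumptions:)

[Unit]
Description=AWS ECS Agent
Requires=docker.service
After=docker.service

[Service]
Restart=always
# Remove any stale agent container before starting
ExecStartPre=-/usr/bin/docker rm -f ecs-agent
# Force the runc runtime so the agent itself never goes through nvidia
ExecStart=/usr/bin/docker run \
    --runtime=runc \
    --name=ecs-agent \
    --net=host \
    --env-file=/etc/ecs/ecs.config \
    --env=ECS_LOGFILE=/log/ecs-agent.log \
    --env=ECS_DATADIR=/data \
    --volume=/var/run/docker.sock:/var/run/docker.sock \
    --volume=/var/log/ecs:/log \
    --volume=/var/lib/ecs/data:/data \
    amazon/amazon-ecs-agent:latest
ExecStop=/usr/bin/docker stop ecs-agent

[Install]
WantedBy=multi-user.target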

sparrc commented 3 years ago

Hello. As I understand it, the issue you are all experiencing is that you are not using the ECS-optimized AMI, and you would like to run GPU-enabled ECS tasks without having to configure nvidia-container-runtime?

As a workaround you can essentially do what alecks3474 has suggested in https://github.com/aws/containers-roadmap/issues/457#issuecomment-576606795.

The one caveat is that you should not set nvidia as the default runtime. The ECS agent handles setting nvidia as the runtime for GPU containers when a GPU is present in the task definition, so setting it as the default is not necessary and, as @kevinclark found, will cause issues for the ecs-agent container.

Below is output from a GPU ECS-optimized AMI, which shows that the nvidia runtime is enabled but the default runtime is set to runc, so that the ECS agent and other non-GPU containers can still run properly.

% docker -D info | grep Runtime
 Runtimes: nvidia runc
 Default Runtime: runc
% docker -D info | grep "Server Version"
 Server Version: 19.03.6-ce

meet2amit commented 1 year ago

If we run a task/service from the AWS ECS console (the new console), where do we pass --gpus all so that it works with an Auto Scaling group? Right now it works if we run the image manually on EC2 (after logging in over SSH), but if I want to manage this from the ECS console, I am not sure how or where I can pass --gpus all.

I already tried https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-gpu.html but am still not able to run a GPU-based container. If I pass --gpus all manually (after logging in to EC2) it works, but it should work from the ECS console alone.

Any help would be appreciated.

saad-cp commented 1 year ago

In case anyone stumbles on this issue, the documentation at https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#getting-started provides steps on how to configure Docker to run a container with NVIDIA GPUs.

These are the steps I followed, and I was able to run the Docker image with the GPU.

Setting up the NVIDIA Container Toolkit

Set up the package repository and the GPG key:

distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
      && curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
      && curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
            sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
            sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

sudo apt-get update

sudo apt-get install -y nvidia-container-toolkit

sudo nvidia-ctk runtime configure --runtime=docker

sudo systemctl restart docker

Run a container with the GPU runtime:

sudo docker run --rm --runtime=nvidia --gpus all nvidia/cuda:11.6.2-base-ubuntu20.04 nvidia-smi

msciancalepore98 commented 3 months ago

Is there any way to enable GPU sharing based on time slicing in ECS, like there is in EKS?