mfortin opened 5 years ago
ECS already has support for running workloads that leverage GPU - https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-gpu.html
Blog post: https://aws.amazon.com/blogs/compute/scheduling-gpus-for-deep-learning-tasks-on-amazon-ecs/
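For reference, the docs surface GPU requirements through the resourceRequirements field of a container definition; a minimal sketch (family, container name, and image are illustrative):
{
  "family": "gpu-example",
  "containerDefinitions": [
    {
      "name": "gpu-test",
      "image": "nvidia/cuda:11.6.2-base-ubuntu20.04",
      "memory": 128,
      "essential": true,
      "command": ["nvidia-smi"],
      "resourceRequirements": [
        { "type": "GPU", "value": "1" }
      ]
    }
  ]
}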
Are you looking for something else that's not provided by the feature?
I know, but it does not work with Docker 19.03, only with 18.09.
With Docker 18.09, you had to specify --runtime nvidia to run GPU workloads. With 19.03, that is no longer required; instead, you have to pass the --gpus argument to a container at run time to expose GPUs.
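For illustration, the difference between the two invocations (the CUDA image tag is just an example):
# Docker 18.09: select the NVIDIA runtime explicitly
docker run --rm --runtime=nvidia nvidia/cuda:11.6.2-base-ubuntu20.04 nvidia-smi
# Docker 19.03: request GPUs with --gpus instead
docker run --rm --gpus all nvidia/cuda:11.6.2-base-ubuntu20.04 nvidia-smi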
Here is a workaround: change the default runtime for Docker on GPU instances.
Override the systemd configuration for Docker:
Create the file /etc/systemd/system/docker.service.d/override.conf with:
[Service]
ExecStart=
ExecStart=/usr/bin/dockerd --host=fd://
Set nvidia as the default runtime in the Docker daemon, in /etc/docker/daemon.json:
{
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  },
  "default-runtime": "nvidia"
}
Restart Docker:
systemctl daemon-reload
systemctl restart docker
Check the Docker service:
systemctl status docker
● docker.service - Docker Application Container Engine
   Loaded: loaded (/lib/systemd/system/docker.service; enabled; vendor preset: enabled)
  Drop-In: /etc/systemd/system/docker.service.d
           └─override.conf
   Active: active (running) since Mon 2020-01-20 17:35:47 CET; 17h ago
     Docs: https://docs.docker.com
 Main PID: 9065 (dockerd)
    Tasks: 14
   Memory: 100.1M
   CGroup: /system.slice/docker.service
           └─9065 /usr/bin/dockerd --host=fd://
Check the default runtime:
docker -D info | grep Runtime
Runtimes: nvidia runc
Default Runtime: nvidia
Now you can launch your "gpu" containers without the --gpus all option, and so they work on ECS. Voilà!
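For instance, the following should print the GPU table even though no --gpus flag is passed (the CUDA image tag is just an example):
docker run --rm nvidia/cuda:11.6.2-base-ubuntu20.04 nvidia-smi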
WARNING: if you have other Docker containers on the same GPU instance, you have to launch them with the --runtime=runc option.
Example with the ecs-agent systemd definition:
[Unit]
Description=AWS ECS Agent
...
[Service]
...
ExecStart=/usr/bin/docker run \
--runtime=runc \
--name=ecs-agent \
....
This would be really nice. I recently spent some time getting GPU support working on my own Ubuntu AMIs, and one thing I ran into was that the agent still tries to force the nvidia runtime even though it is no longer required. Switching to the new --gpus argument instead of the runtime argument would simplify provisioning machines.
We ran into this issue ourselves. This bug of using the wrong runtime with Docker 19.03 is blocking us from using a dynamic ECS instance directly from a task via an Auto Scaling group. There is no way to control the Docker runtime, Docker version, or AMI for an ECS Auto Scaling group.
As mentioned, the nvidia Docker runtime and version 19.03 currently do work together on GPU instances run purely without ECS and with no modifications to the instance.
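Until that changes, one stopgap is to apply the daemon.json workaround above from the Auto Scaling group's launch template user data; a minimal sketch, assuming the NVIDIA container runtime is already installed on the AMI:
#!/bin/bash
# Register the NVIDIA runtime and make it the default, as in the
# workaround above (note the caveat: non-GPU containers such as
# ecs-agent then need --runtime=runc).
cat > /etc/docker/daemon.json <<'EOF'
{
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  },
  "default-runtime": "nvidia"
}
EOF
systemctl restart docker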
Just checking to see if there are any plans to do something about this issue. I don't see anybody assigned, but this seems like a major shortcoming of ECS GPU support. It'd be nice to let ECS start up containers that need a GPU without having to hack in device support.
Edit: I'm not able to get the workaround working. Hardcoding the nvidia runtime as the default makes ecs-agent fail to come up (it needs runc). Leaving runc as the default fails to bring up the nvidia runtime for my GPU-related container.
For the ECS agent, you have to build your own AMI and run the agent as a systemd service with the runtime option set explicitly:
[Unit]
Description=AWS ECS Agent
[Service]
...
ExecStart=/usr/bin/docker run \
--runtime=runc \
--name=ecs-agent \
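Filled out a little more, a hypothetical unit might look like the following (the mounts, cluster name, and restart policy are illustrative; the real agent needs more volumes and environment variables than shown):
[Unit]
Description=AWS ECS Agent
After=docker.service
Requires=docker.service

[Service]
Restart=on-failure
# Remove any stale container from a previous boot before starting.
ExecStartPre=-/usr/bin/docker rm -f ecs-agent
ExecStart=/usr/bin/docker run \
    --runtime=runc \
    --name=ecs-agent \
    --volume=/var/run/docker.sock:/var/run/docker.sock \
    --env=ECS_CLUSTER=my-cluster \
    amazon/amazon-ecs-agent:latest
ExecStop=/usr/bin/docker stop ecs-agent

[Install]
WantedBy=multi-user.target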
I use Packer with Ansible to build my own AMI easily.
But you're right: it is just a workaround, and native ECS GPU support would be appreciated!
Hello, so as I understand it, the issue that you are all experiencing is that you are not using the ECS-optimized AMI, and you would like to run GPU-enabled ECS tasks without having to configure nvidia-container-runtime?
As a workaround you can essentially do what alecks3474 has suggested in https://github.com/aws/containers-roadmap/issues/457#issuecomment-576606795.
The one caveat being that you shouldn't set nvidia as the default runtime. The ECS Agent handles setting nvidia as the runtime for GPU containers when a GPU is present in the task definition, so setting it as the default is not necessary and, as @kevinclark found, will cause issues for the ecs-agent container.
Below is output from a GPU ECS-optimized AMI, which shows the nvidia runtime enabled but the default runtime set to runc, so that the ECS agent and other non-GPU containers can still run properly.
% docker -D info | grep Runtime
Runtimes: nvidia runc
Default Runtime: runc
% docker -D info | grep "Server Version"
Server Version: 19.03.6-ce
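In daemon.json terms, that corresponds to registering the runtime without a "default-runtime" entry, roughly as follows (the exact file shipped on the ECS-optimized AMI may differ):
{
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}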
If we run a task/service from the AWS ECS console (the new console), where do we pass --gpus all so it works with an Auto Scaling group? Right now it works if we manually run the image on EC2 (after logging in through SSH), but if I want to manage this from the ECS console, I'm not sure how or where I can pass --gpus all.
I already tried https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-gpu.html, but I'm still not able to run a GPU-based container. If I run it with --gpus all manually (after logging in to the EC2 instance) it works, but it should work from the ECS console alone.
Any help would be appreciated.
In case anyone stumbles on this issue, the documentation at https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#getting-started provides steps for configuring Docker to run containers with NVIDIA GPUs.
These are the steps I followed to run a Docker image with the GPU.
Setting up the NVIDIA Container Toolkit. Set up the package repository and the GPG key:
distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
&& curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
&& curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
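After the restart, you can confirm that the nvidia runtime registered (same check as earlier in the thread):
sudo docker -D info | grep Runtime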
Run the container with the GPU runtime:
sudo docker run --rm --runtime=nvidia --gpus all nvidia/cuda:11.6.2-base-ubuntu20.04 nvidia-smi
Is there any way to enable GPU sharing based on time slicing in ECS, like in EKS?
Summary
Docker 19.03 now has built-in support for GPUs, so there is no need to specify an alternate runtime. However, at run time, --gpus all (or a specific set of GPUs) needs to be passed as an argument, and that can't be done through the dockerd config.
Description
Without --gpus all
With --gpus all
Expected Behavior
Containers with GPU requirements should start
Observed Behavior
Environment Details
Supporting Log Snippets
Logs can be provided on request.