aws / containers-roadmap

This is the public roadmap for AWS container services (ECS, ECR, Fargate, and EKS).
https://aws.amazon.com/about-aws/whats-new/containers/

[ECS] How to share a single GPU with multiple containers #327

Open robvanderleek opened 5 years ago

robvanderleek commented 5 years ago

Summary

I'd like to share the single GPU of a p3.2xlarge instance with multiple containers in the same task.

Description

In the ECS task definition it's not possible to indicate that a single GPU can be shared between containers (or to distribute the GPU resource over multiple containers, as you can with CPU units).

I have multiple containers that require a GPU, but not at the same time. Is there a way to run them in a single task on the same instance? I've tried leaving the GPU unit resource blank, but then the GPU device is not visible to the container.
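For context, the per-container GPU reservation in a task definition takes a whole number of physical GPUs, so a fractional share cannot be expressed there. A rough sketch of what the reservation looks like (the family name, container name, and image URI below are placeholders):

    # Hypothetical example: one container reserving the instance's single GPU
    aws ecs register-task-definition \
      --family gpu-inference \
      --container-definitions '[
        {
          "name": "inference",
          "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/inference:latest",
          "memory": 2048,
          "resourceRequirements": [{ "type": "GPU", "value": "1" }]
        }
      ]'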

shubham2892 commented 5 years ago

Hey, we don't have support for sharing a single GPU with multiple containers right now. We have marked it as a feature request.

robvanderleek commented 5 years ago

For future reference, my current workaround to have multiple containers share a single GPU:

  1. On a running ECS GPU-optimized instance, make nvidia-runtime the default runtime for dockerd by adding --default-runtime nvidia to the OPTIONS variable in /etc/sysconfig/docker (see the sketch after this list)
  2. Save the instance to a new AMI
  3. In CloudFormation, go to the Stack created by the ECS cluster wizard and update the EcsAmiId field in the initial template
  4. Restart your services
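
For reference, after step 1 the file should look roughly like this on the ECS GPU-optimized AMI (the other settings shown are the AMI's defaults and may vary between AMI versions):

    # /etc/sysconfig/docker (settings other than --default-runtime may differ by AMI version)
    DAEMON_MAXFILES=1048576
    OPTIONS="--default-ulimit nofile=32768:65536 --default-runtime nvidia"
    DAEMON_PIDFILE_TIMEOUT=10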

Since the default runtime is now nvidia, all containers can access the GPU. You can leave the GPU field empty in the task definition wizard (or set it to 1 for just one container, to make sure the task is placed on a GPU instance).

The major drawback of this workaround is, of course, having to fork the standard AMI.

adnxn commented 5 years ago

@robvanderleek: thanks for outlining this workaround for now =]

Jeffwan commented 4 years ago

@robvanderleek We have a solution for EKS now. Please let us know if you are interested in it

robvanderleek commented 4 years ago

Hi @Jeffwan

Thanks for the notification but we are happy with what ECS offers in general. Our inference cluster is running fine on ECS, although we have a custom AMI with the nvidia-docker hack.

Do you expect this solution to also become available for ECS?

Jeffwan commented 4 years ago

@robvanderleek This is implemented as a device plugin in Kubernetes, so I doubt it can be used in ECS directly. But the overall GPU-sharing approach is similar, and I think ECS could adopt a similar solution.

vbhakta8 commented 3 years ago

Hi, just checking in to see if this has been made available for ECS yet, or if we should continue with the AMI workaround?

nacitar commented 3 years ago

Same question. Is there any expectation for when this might happen, or is this just an unfulfilled feature request with no slated plans for a fix at present?

nacitar commented 3 years ago

Thank you @robvanderleek !

We were able to take your suggestion and work it into our setup without having to fork the standard ECS GPU AMI.

We have our EC2 autoscaling group, which serves as a capacity provider for our cluster, provisioned via CloudFormation. As such, we modified the UserData script being passed to the Launch Template that the ASG leverages in order to make the default runtime changes that you suggested.

Here is a working snippet:

    (grep -q ^OPTIONS=\"--default-runtime /etc/sysconfig/docker && echo '/etc/sysconfig/docker needs no changes') || (sed -i 's/^OPTIONS="/OPTIONS="--default-runtime nvidia /' /etc/sysconfig/docker && echo '/etc/sysconfig/docker updated to have nvidia runtime as default' && systemctl restart docker && echo 'Restarted docker')

Thought this was worth sharing because although this isn't as ideal as having a real fix, this improves the maintenance impact significantly when compared to forking the AMI. You still have to keep in mind to omit the GPU constraint from your tasks and ensure that GPU instances are used through other means.
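
A minimal sketch of how such a UserData script can look on the ECS GPU-optimized AMI, assuming the standard /etc/sysconfig/docker layout (the cluster name below is a placeholder):

    #!/bin/bash
    # Register the instance with the ECS cluster (placeholder cluster name)
    echo "ECS_CLUSTER=my-gpu-cluster" >> /etc/ecs/ecs.config
    # Make nvidia the default Docker runtime unless it already is, then restart
    # dockerd so the change takes effect (depending on the AMI version you may
    # also need to restart the ecs agent afterwards)
    grep -q 'default-runtime nvidia' /etc/sysconfig/docker || \
      sed -i 's/^OPTIONS="/OPTIONS="--default-runtime nvidia /' /etc/sysconfig/docker
    systemctl restart docker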

adesgautam commented 3 years ago

Hi @nacitar, I was also facing this issue of assigning multiple containers the same GPU on an ECS g4dn.xlarge instance. Is this feature available now through the task definition, or is this AMI hack the only available choice as of now?

nacitar commented 3 years ago

> Hi @nacitar, I was also facing this issue of assigning multiple containers the same GPU on an ECS g4dn.xlarge instance. Is this feature available now through the task definition, or is this AMI hack the only available choice as of now?

@adesgautam I know of no changes done on the AWS side to improve this and am still relying upon the workaround I mentioned above... which has been working without issue since it was implemented.

spg commented 2 years ago

With Docker version 20.10.7, I also had to pass the NVIDIA_VISIBLE_DEVICES=0 environment variable in order for my container to pick up the GPU
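
As a quick sanity check on the instance itself (the CUDA image tag is just an example and may need updating), something like this should list the GPU once nvidia is the default runtime:

    # NVIDIA_VISIBLE_DEVICES=0 exposes GPU index 0, the only GPU on p3.2xlarge / g4dn.xlarge
    docker run --rm -e NVIDIA_VISIBLE_DEVICES=0 nvidia/cuda:11.0-base nvidia-smi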

Shurbeski commented 2 years ago

> @robvanderleek We have a solution for EKS now. Please let us know if you are interested in it

Hi, we are interested in the EKS solution but couldn't find anything in the AWS documentation. Can you please share links to any documentation you have regarding the EKS solution?

NikiBase commented 1 year ago

Hi, we are experiencing the same issue, but with a hybrid environment that has GPUs on-premises. Do you have any suggestions, or does the issue persist in this case?

makr11 commented 1 year ago

If there are any drawbacks to this solution, let me know. I modified the Docker sysconfig file in the EC2 user data section like this:

    #!/bin/bash
    sudo rm /etc/sysconfig/docker
    echo DAEMON_MAXFILES=1048576 | sudo tee -a /etc/sysconfig/docker
    echo OPTIONS="--default-ulimit nofile=32768:65536 --default-runtime nvidia" | sudo tee -a /etc/sysconfig/docker
    echo DAEMON_PIDFILE_TIMEOUT=10 | sudo tee -a /etc/sysconfig/docker
    sudo systemctl restart docker

It does not require creating a new AMI, and for now it seems to work.
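
To confirm the change took effect after the restart, you can check the daemon's reported default runtime (the exact output wording may vary by Docker version):

    # Should print "Default Runtime: nvidia" once dockerd has picked up the new OPTIONS
    sudo docker info | grep -i 'default runtime'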

NayamAmarshe commented 3 months ago

Any updates on this?

NayamAmarshe commented 3 weeks ago

> Thank you @robvanderleek !
>
> We were able to take your suggestion and work it into our setup without having to fork the standard ECS GPU AMI.
>
> We have our EC2 autoscaling group, which serves as a capacity provider for our cluster, provisioned via CloudFormation. As such, we modified the UserData script being passed to the Launch Template that the ASG leverages in order to make the default runtime changes that you suggested.
>
> Here is a working snippet:
>
>     (grep -q ^OPTIONS=\"--default-runtime /etc/sysconfig/docker && echo '/etc/sysconfig/docker needs no changes') || (sed -i 's/^OPTIONS="/OPTIONS="--default-runtime nvidia /' /etc/sysconfig/docker && echo '/etc/sysconfig/docker updated to have nvidia runtime as default' && systemctl restart docker && echo 'Restarted docker')
>
> Thought this was worth sharing because although this isn't as ideal as having a real fix, this improves the maintenance impact significantly when compared to forking the AMI. You still have to keep in mind to omit the GPU constraint from your tasks and ensure that GPU instances are used through other means.

This script didn't work for me, so I had to do this instead:

 sudo bash -c 'grep -q "^OPTIONS=\"--default-runtime nvidia " /etc/sysconfig/docker && echo "/etc/sysconfig/docker needs no changes" || (sed -i "s/^OPTIONS=\"/OPTIONS=\"--default-runtime nvidia /" /etc/sysconfig/docker && echo "/etc/sysconfig/docker updated to have nvidia runtime as default" && systemctl restart docker && echo "Restarted docker")'