robvanderleek opened this issue 5 years ago
Hey, we don't have support for sharing a single GPU with multiple containers right now. We have marked it as a feature request.
For future reference, my current workaround to have multiple containers share a single GPU: configure dockerd by adding --default-runtime nvidia to the OPTIONS variable in /etc/sysconfig/docker
Since the default runtime is now nvidia, all containers can access the GPU. You can leave the GPU field empty in the task definition wizard (or set it to 1 for only 1 container to make sure the task is put on a GPU instance).
The major drawback of this workaround is, of course, that it requires forking the standard AMI.
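The edit described above can be sketched as a short shell snippet. This is illustrative only and assumes the ECS GPU AMI's /etc/sysconfig/docker layout; it operates on a local copy of the file so it can be tried without root:

```shell
# Sketch of the workaround above, run against a local copy of the
# sysconfig file. On a real instance you would edit
# /etc/sysconfig/docker itself and restart the docker service.
cp /etc/sysconfig/docker ./docker.conf 2>/dev/null || \
  printf 'OPTIONS="--default-ulimit nofile=32768:65536"\n' > ./docker.conf

# Prepend the nvidia default runtime to the existing OPTIONS line.
sed -i 's/^OPTIONS="/OPTIONS="--default-runtime nvidia /' ./docker.conf

grep '^OPTIONS=' ./docker.conf
# On the instance itself, follow with: sudo systemctl restart docker
```

With nvidia as the default runtime, every container gets the runtime without any per-task GPU reservation.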
@robvanderleek: thanks for outlining this workaround for now =]
@robvanderleek We have a solution for EKS now. Please let us know if you are interested in it
Hi @Jeffwan
Thanks for the notification, but we are happy with what ECS offers in general. Our inference cluster is running fine on ECS, although we have a custom AMI with the nvidia-docker hack.
Do you expect this solution to also become available for ECS?
@robvanderleek This is implemented as a device plugin in Kubernetes. I doubt it can be used in ECS directly, but the underlying GPU-sharing approach is similar and I think ECS could adopt a similar solution.
Hi just checking in to see if this has been made available for ECS yet, or should we continue with the AMI workaround?
Same question. Is there any expectation for when this might happen, or is this just an unfulfilled feature request with no slated plans for a fix at present?
Thank you @robvanderleek !
We were able to take your suggestion and work it into our setup without having to fork the standard ECS GPU AMI.
We have our EC2 autoscaling group, which serves as a capacity provider for our cluster, provisioned via CloudFormation. As such, we modified the UserData script being passed to the Launch Template that the ASG leverages in order to make the default runtime changes that you suggested.
Here is a working snippet:
(grep -q ^OPTIONS=\"--default-runtime /etc/sysconfig/docker && echo '/etc/sysconfig/docker needs no changes') || (sed -i 's/^OPTIONS="/OPTIONS="--default-runtime nvidia /' /etc/sysconfig/docker && echo '/etc/sysconfig/docker updated to have nvidia runtime as default' && systemctl restart docker && echo 'Restarted docker')
Thought this was worth sharing: although it isn't as ideal as a real fix, it reduces the maintenance burden significantly compared to forking the AMI. You still have to remember to omit the GPU constraint from your tasks and ensure GPU instances are used through other means.
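For anyone replicating this, the snippet above would typically sit in the Launch Template's UserData so it runs before the ECS agent starts. A hedged sketch of how such a user-data script might be assembled (file names here are illustrative; this only writes the script locally rather than provisioning anything):

```shell
# Illustrative only: compose a user-data script like the one described
# above. In CloudFormation this content would become the Launch
# Template's UserData (wrapped in Fn::Base64).
cat > user-data.sh <<'EOF'
#!/bin/bash
# Make the nvidia runtime the docker default, idempotently.
grep -q -- '--default-runtime nvidia' /etc/sysconfig/docker || {
  sed -i 's/^OPTIONS="/OPTIONS="--default-runtime nvidia /' /etc/sysconfig/docker
  systemctl restart docker
}
EOF
chmod +x user-data.sh
```

Making the edit idempotent matters here because user data may be re-run on instance reboot depending on cloud-init configuration.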
Hi @nacitar I was also facing this issue of assigning multiple containers the same GPU on an ECS g4dn.xlarge instance. Is this feature available now through the task definition, or is this AMI hack still the only available choice?
@adesgautam I know of no changes done on the AWS side to improve this and am still relying upon the workaround I mentioned above... which has been working without issue since it was implemented.
With Docker version 20.10.7, I also had to pass the NVIDIA_VISIBLE_DEVICES=0 environment variable in order for my container to pick up the GPU.
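For reference, one way to set that variable without baking it into the image is through the container's environment in the task definition. A minimal sketch of the relevant JSON fragment (the container name and image are placeholders, not from this thread):

```shell
# Hypothetical container-definition fragment showing the environment
# entry; "inference" and the image name are made-up placeholders.
cat > container-def.json <<'EOF'
{
  "name": "inference",
  "image": "my-gpu-image:latest",
  "environment": [
    { "name": "NVIDIA_VISIBLE_DEVICES", "value": "0" }
  ]
}
EOF

# Sanity-check that the fragment is valid JSON.
python3 -m json.tool container-def.json > /dev/null && echo "valid JSON"
```

NVIDIA_VISIBLE_DEVICES=0 exposes only the first GPU; with a single-GPU instance like g4dn.xlarge that is the only device anyway.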
Hi, we are interested in the EKS solution but couldn't find anything in the AWS documentation. Can you please share some links to any kind of documentation you have regarding the EKS solution?
Hi, we are experiencing the same issue but in a hybrid environment with GPUs on-premise. Do you have any suggestions, or does the issue persist in this case?
If there are any drawbacks to this solution, let me know. I modified the Docker sysconfig file in the EC2 user data section like this:
sudo rm /etc/sysconfig/docker
echo DAEMON_MAXFILES=1048576 | sudo tee -a /etc/sysconfig/docker
echo 'OPTIONS="--default-ulimit nofile=32768:65536 --default-runtime nvidia"' | sudo tee -a /etc/sysconfig/docker
echo DAEMON_PIDFILE_TIMEOUT=10 | sudo tee -a /etc/sysconfig/docker
sudo systemctl restart docker
It does not require creating a new AMI, and for now it seems to work.
Any updates on this?
The script shared above didn't work for me, so I had to do this instead:
sudo bash -c 'grep -q "^OPTIONS=\"--default-runtime nvidia " /etc/sysconfig/docker && echo "/etc/sysconfig/docker needs no changes" || (sed -i "s/^OPTIONS=\"/OPTIONS=\"--default-runtime nvidia /" /etc/sysconfig/docker && echo "/etc/sysconfig/docker updated to have nvidia runtime as default" && systemctl restart docker && echo "Restarted docker")'
Summary
I'd like to share the single GPU of a p3.2xlarge instance with multiple containers in the same task.
Description
In the ECS task definition it's not possible to indicate a single GPU can be shared between containers (or to distribute the GPU resource over multiple containers like with CPU units).
I have multiple containers that require a GPU, but not at the same time. Is there a way to run them in a single task on the same instance? I've tried leaving the GPU unit resource blank, but then the GPU device is not visible to the container.
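For context, GPUs are requested per container via resourceRequirements in the task definition, and the GPU type only accepts whole integer counts, which is why fractional sharing between containers cannot be expressed there. A sketch of the relevant fragment (container name and image are illustrative):

```shell
# Illustrative task-definition container fragment: a GPU
# resourceRequirement takes an integer count, so a single GPU cannot
# be split across containers. Name and image are placeholders.
cat > gpu-container.json <<'EOF'
{
  "name": "gpu-worker",
  "image": "my-gpu-image:latest",
  "resourceRequirements": [
    { "type": "GPU", "value": "1" }
  ]
}
EOF

# Sanity-check that the fragment parses as JSON.
python3 -m json.tool gpu-container.json > /dev/null && echo "valid JSON"
```

Omitting this block entirely is what the workarounds in this thread rely on: with the nvidia runtime set as the Docker default, containers see the GPU even without a reservation.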