aws / containers-roadmap

This is the public roadmap for AWS container services (ECS, ECR, Fargate, and EKS).
https://aws.amazon.com/about-aws/whats-new/containers/

[ECS] [GPU]: Gracefully handle GPU failure from ECS Agent to mark instance as unhealthy in ASG and terminate running instance. #2432

Open bennydee-aws opened 1 month ago

bennydee-aws commented 1 month ago


Tell us about your request For large-scale ECS clusters using NVIDIA GPUs for inference, customers can experience infrequent GPU failures due to memory issues, general GPU errors, or failing hardware, which leave the GPU unavailable on the running instance. In these cases the ECS Agent should detect the kernel errors reported by the NVIDIA driver (XID errors); see https://docs.nvidia.com/deploy/xid-errors/index.html.
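For illustration, XID events are written to the kernel log by the NVIDIA driver, so a minimal host-level detection check (not something the ECS Agent does today; the sample log line below is illustrative) could be as simple as:

    # XID events appear in the kernel log, e.g.
    # "NVRM: Xid (PCI:0000:00:1e): 79, pid=1234, GPU has fallen off the bus."
    dmesg | grep "NVRM: Xid"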

Common NVIDIA XID GPU errors include double-bit ECC errors (XID 48), the GPU falling off the bus (XID 79), and contained or uncontained ECC errors (XID 94/95).

In these cases, since the GPU will no longer be usable, the ECS Agent could gracefully detect the failure by watching for XID events in the kernel log and running a diagnostic check against the affected GPU.

If a kernel XID error is received and the L1 or L2 diagnostic check fails, the instance could be configured via an ENV variable to mark itself as unhealthy in the specified ASG using the AWS SDK and terminate, since the GPU is in a degraded state.
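To make the proposed opt-in concrete, the configuration could look something like the sketch below in /etc/ecs/ecs.config. These ECS_* variable names are hypothetical and are not existing ECS Agent settings:

    # /etc/ecs/ecs.config -- hypothetical settings for this feature request
    # Hypothetical: watch for NVIDIA XID errors and run a diagnostic on failure
    ECS_ENABLE_GPU_FAILURE_DETECTION=true
    # Hypothetical: on a confirmed failure, mark the instance Unhealthy in its ASG
    ECS_GPU_FAILURE_ACTION=SET_ASG_UNHEALTHY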

e.g., the CLI equivalent of the SDK call to mark the instance as unhealthy:

aws autoscaling set-instance-health \
    --instance-id $INSTANCE_ID \
    --health-status Unhealthy

The instance would then be gracefully shut down after the GPU has failed.
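To make the shutdown graceful, the container instance could first be drained so ECS stops placing tasks and reschedules the running ones before the ASG replaces the host. A sketch using existing AWS CLI commands ($CLUSTER and $INSTANCE_ID are placeholders):

    # Resolve the container instance ARN for this EC2 instance
    CONTAINER_INSTANCE_ARN=$(aws ecs list-container-instances \
        --cluster "$CLUSTER" \
        --filter "ec2InstanceId == '$INSTANCE_ID'" \
        --query 'containerInstanceArns[0]' --output text)

    # Drain the instance so tasks are stopped/rescheduled before termination
    aws ecs update-container-instances-state \
        --cluster "$CLUSTER" \
        --container-instances "$CONTAINER_INSTANCE_ARN" \
        --status DRAINING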

Which service(s) is this request for? ECS - When a GPU is present.

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard? Currently, when the GPU on an EC2 instance in an ECS cluster fails (due to XID errors), the instance continues to run but is left in a degraded state with the GPU unusable.

Customers must detect this state themselves, manually terminate the instance, and provision a replacement. At scale this is a time-consuming process that impacts production workloads.

An option to automatically mark the instance as unhealthy and terminate it would help automate replacing degraded hosts, letting the customer's ASG allocate a new instance.

Are you currently working around this issue? No; however, we could provide code changes demonstrating a workaround to help accelerate this new feature.
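For reference, a rough host-level sketch of such a workaround (combining the detection and ASG steps above, run from cron or a systemd timer on each GPU instance) might look like the following. It is only a sketch: it assumes IMDSv2 is reachable, that the instance role allows autoscaling:SetInstanceHealth, and it uses a simple nvidia-smi check as the diagnostic:

    #!/usr/bin/env bash
    # Sketch: mark this instance Unhealthy in its ASG if the NVIDIA driver
    # has reported an XID error and a basic GPU health check fails.
    set -eu

    # Any XID event logged by the NVIDIA kernel driver?
    if ! dmesg | grep -q "NVRM: Xid"; then
        exit 0  # no XID errors seen
    fi

    # Basic diagnostic: nvidia-smi typically errors out or hangs when the GPU is unusable
    if timeout 30 nvidia-smi > /dev/null 2>&1; then
        exit 0  # GPU still responds; take no action
    fi

    # Fetch the instance ID via IMDSv2
    TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
        -H "X-aws-ec2-metadata-token-ttl-seconds: 300")
    INSTANCE_ID=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
        http://169.254.169.254/latest/meta-data/instance-id)

    # Mark the instance Unhealthy so its Auto Scaling group terminates and replaces it
    aws autoscaling set-instance-health \
        --instance-id "$INSTANCE_ID" \
        --health-status Unhealthy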


amorphic commented 2 days ago

Hey team... :wave: This would be extremely useful to us as the providers of ECS clusters upon which application teams deploy their GPU workloads.

As things stand, these teams detect GPU errors at the application level and alert us. We'd love to detect and handle these errors more proactively at the system level (automatically where possible) so that the application teams don't have to take action.