bennydee-aws opened 1 month ago
Hey team... :wave: This would be extremely useful to us as the providers of ECS clusters upon which application teams deploy their GPU workloads.
As things stand, these teams detect GPU errors at the application level and alert us. We'd love to detect these errors more proactively at the system level so that we can deal with them (automatically if possible) without the application teams having to take action.
**Tell us about your request**
For large-scale ECS clusters using NVIDIA GPUs for inference, customers can experience infrequent GPU errors due to memory issues, general GPU errors, or failing hardware, which make the GPU unavailable on the running instance. In these instances the ECS Agent should detect the kernel errors reported by the NVIDIA driver (XID errors); see https://docs.nvidia.com/deploy/xid-errors/index.html
Common NVIDIA XID GPU errors include (see the NVIDIA documentation linked above for the full list):

- XID 48: double-bit ECC error
- XID 63 / 64: ECC page retirement or row remapping events
- XID 79: GPU has fallen off the bus
- XID 94 / 95: contained / uncontained ECC errors
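As a rough illustration of how such detection might work (this is not existing agent behaviour; the log source and message format are assumptions based on typical NVIDIA driver output), a watcher could follow the kernel ring buffer and match `NVRM: Xid` lines:

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"regexp"
	"strconv"
)

// xidPattern matches kernel messages such as:
//   NVRM: Xid (PCI:0000:00:1e): 79, pid=1234, GPU has fallen off the bus.
// The exact message format is an assumption based on typical driver output.
var xidPattern = regexp.MustCompile(`NVRM: Xid \(([^)]+)\): (\d+)`)

func main() {
	// Follow the kernel ring buffer; on most Linux hosts /dev/kmsg streams
	// kernel messages (requires appropriate privileges).
	f, err := os.Open("/dev/kmsg")
	if err != nil {
		fmt.Fprintln(os.Stderr, "open /dev/kmsg:", err)
		os.Exit(1)
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		if m := xidPattern.FindStringSubmatch(scanner.Text()); m != nil {
			xid, _ := strconv.Atoi(m[2])
			// A real implementation would map the XID to a severity and
			// trigger the L1/L2 diagnostics described below.
			fmt.Printf("detected XID %d on device %s\n", xid, m[1])
		}
	}
}
```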
In these cases, since the GPU will no longer be usable, the ECS Agent could gracefully detect the failure by:

- Using the `nvml` Go bindings in the ECS Agent to run a level 1 (L1) diagnostic verifying the GPU is accessible (sample: https://github.com/NVIDIA/go-nvml/blob/main/examples/compute-processes/main.go); see the sketch after this list.
- Using the NVIDIA DCGM utility for a level 2 (L2) diagnostic, configurable via an ENV variable, similar to how AWS ParallelCluster >= 3.6 can run these tests for each job run (https://aws.amazon.com/blogs/hpc/introducing-gpu-health-checks-in-aws-parallelcluster-3-6/).
- If a kernel XID error is received and the L1 or L2 diagnostic check fails, the instance can be configured via an ENV variable to mark itself as unhealthy in the specified ASG using the AWS SDK and terminate due to its degraded GPU state.
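For reference, a minimal sketch of what the L1/L2 checks could look like, assuming the go-nvml bindings linked above and a DCGM installation that provides the `dcgmi` CLI (function names and structure here are illustrative, not an existing ECS Agent API):

```go
package main

import (
	"fmt"
	"os/exec"

	"github.com/NVIDIA/go-nvml/pkg/nvml"
)

// l1Check verifies that NVML can initialise and enumerate every GPU.
// Any NVML error is treated as a failed health check.
func l1Check() error {
	if ret := nvml.Init(); ret != nvml.SUCCESS {
		return fmt.Errorf("nvml init failed: %v", nvml.ErrorString(ret))
	}
	defer nvml.Shutdown()

	count, ret := nvml.DeviceGetCount()
	if ret != nvml.SUCCESS {
		return fmt.Errorf("device count failed: %v", nvml.ErrorString(ret))
	}
	for i := 0; i < count; i++ {
		device, ret := nvml.DeviceGetHandleByIndex(i)
		if ret != nvml.SUCCESS {
			return fmt.Errorf("device %d unreachable: %v", i, nvml.ErrorString(ret))
		}
		if _, ret := device.GetUUID(); ret != nvml.SUCCESS {
			return fmt.Errorf("device %d query failed: %v", i, nvml.ErrorString(ret))
		}
	}
	return nil
}

// l2Check runs a DCGM level 2 diagnostic via the dcgmi CLI.
// This assumes DCGM is installed on the GPU AMI.
func l2Check() error {
	out, err := exec.Command("dcgmi", "diag", "-r", "2").CombinedOutput()
	if err != nil {
		return fmt.Errorf("dcgmi diag failed: %v\n%s", err, out)
	}
	return nil
}

func main() {
	if err := l1Check(); err != nil {
		fmt.Println("L1 GPU check failed:", err)
		return
	}
	if err := l2Check(); err != nil {
		fmt.Println("L2 GPU check failed:", err)
		return
	}
	fmt.Println("GPU health checks passed")
}
```

The L2 run is considerably slower than the NVML probe, so gating it behind an ENV variable as proposed above seems sensible.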
For example, the agent could mark the instance as Unhealthy in its ASG so that it is replaced, and gracefully shut down the instance after the GPU has failed.
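A sketch of the ASG side using the AWS SDK for Go v2: the instance looks up its own ID via IMDS and marks itself Unhealthy so the Auto Scaling group replaces it (again, an illustration of the proposal, not current agent behaviour):

```go
package main

import (
	"context"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/feature/ec2/imds"
	"github.com/aws/aws-sdk-go-v2/service/autoscaling"
)

func main() {
	ctx := context.Background()

	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		log.Fatalf("load AWS config: %v", err)
	}

	// Look up this instance's ID from the EC2 instance metadata service.
	meta := imds.NewFromConfig(cfg)
	idDoc, err := meta.GetInstanceIdentityDocument(ctx, &imds.GetInstanceIdentityDocumentInput{})
	if err != nil {
		log.Fatalf("read instance metadata: %v", err)
	}

	// Mark the instance Unhealthy in its Auto Scaling group; the ASG then
	// terminates and replaces it according to its health check settings.
	asg := autoscaling.NewFromConfig(cfg)
	_, err = asg.SetInstanceHealth(ctx, &autoscaling.SetInstanceHealthInput{
		InstanceId:               aws.String(idDoc.InstanceID),
		HealthStatus:             aws.String("Unhealthy"),
		ShouldRespectGracePeriod: aws.Bool(false),
	})
	if err != nil {
		log.Fatalf("set instance health: %v", err)
	}
	log.Printf("instance %s marked Unhealthy for replacement", idDoc.InstanceID)
}
```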
**Which service(s) is this request for?**
ECS, when a GPU is present.
**Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?**
Currently, when an EC2 instance in an ECS cluster with a GPU fails (due to XID errors), the instance continues to run but is in a degraded state, with the GPU unavailable for use.
Customers must detect this state, manually terminate the instance, and manage a new instance to replace it. At scale this is a time-consuming process and impacts production workloads.
With an option to automatically mark the instance as unhealthy and terminate it, degraded hosts would be replaced automatically and a new instance allocated within the customer's ASG.
**Are you currently working around this issue?**
No, however we could provide code changes demonstrating a workaround to help accelerate this new feature.