bennydee-aws opened 1 month ago
Hey team... :wave: This would be extremely useful to us as the providers of ECS clusters upon which application teams deploy their GPU workloads.
As things stand, these teams detect GPU errors at the application level and alert us. We'd love to detect these errors more proactively at the system level so that we can deal with them (automatically if possible) without the application teams having to take action.
**Tell us about your request**
For large-scale ECS clusters using NVIDIA GPUs for inference, customers can experience infrequent GPU errors due to memory issues, general GPU errors, or failing hardware, which make the GPU unavailable on the running instance. In these instances the ECS Agent should detect the kernel errors reported by the NVIDIA driver (XID errors); see https://docs.nvidia.com/deploy/xid-errors/index.html
Common NVIDIA XID GPU errors include (see the NVIDIA documentation linked above for the full list):

- XID 48: double-bit ECC error
- XID 63 / 64: ECC page retirement or row remapping events
- XID 79: GPU has fallen off the bus
- XID 94 / 95: contained / uncontained ECC errors
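As a rough illustration of how such detection might work (this is not existing agent behaviour; the log source and message format are assumptions based on typical NVIDIA driver output), a watcher could follow the kernel ring buffer and match `NVRM: Xid` lines:

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"regexp"
	"strconv"
)

// xidPattern matches kernel messages such as:
//   NVRM: Xid (PCI:0000:00:1e): 79, pid=1234, GPU has fallen off the bus.
// The exact message format is an assumption based on typical driver output.
var xidPattern = regexp.MustCompile(`NVRM: Xid \(([^)]+)\): (\d+)`)

func main() {
	// Follow the kernel ring buffer; on most Linux hosts /dev/kmsg streams
	// kernel messages (requires appropriate privileges).
	f, err := os.Open("/dev/kmsg")
	if err != nil {
		fmt.Fprintln(os.Stderr, "open /dev/kmsg:", err)
		os.Exit(1)
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		if m := xidPattern.FindStringSubmatch(scanner.Text()); m != nil {
			xid, _ := strconv.Atoi(m[2])
			// A real implementation would map the XID to a severity and
			// trigger the L1/L2 diagnostics described below.
			fmt.Printf("detected XID %d on device %s\n", xid, m[1])
		}
	}
}
```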
In these cases, since the GPU will no longer be usable, the ECS Agent could gracefully detect the failure by:

- Using the `nvml` Go bindings in the ECS Agent to run a level 1 (L1) diagnostic verifying the GPU is accessible (sample: https://github.com/NVIDIA/go-nvml/blob/main/examples/compute-processes/main.go); see the sketch after this list.
- Using the NVIDIA DCGM utility for a level 2 (L2) diagnostic, configurable via an ENV variable, similar to how AWS ParallelCluster >= 3.6 can run these tests for each job run (https://aws.amazon.com/blogs/hpc/introducing-gpu-health-checks-in-aws-parallelcluster-3-6/).
- If a kernel XID error is received and the L1 or L2 diagnostic check fails, the instance can be configured via an ENV variable to mark itself as unhealthy in the specified ASG using the AWS SDK and terminate due to its degraded GPU state.
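For reference, a minimal sketch of what the L1/L2 checks could look like, assuming the go-nvml bindings linked above and a DCGM installation that provides the `dcgmi` CLI (function names and structure here are illustrative, not an existing ECS Agent API):

```go
package main

import (
	"fmt"
	"os/exec"

	"github.com/NVIDIA/go-nvml/pkg/nvml"
)

// l1Check verifies that NVML can initialise and enumerate every GPU.
// Any NVML error is treated as a failed health check.
func l1Check() error {
	if ret := nvml.Init(); ret != nvml.SUCCESS {
		return fmt.Errorf("nvml init failed: %v", nvml.ErrorString(ret))
	}
	defer nvml.Shutdown()

	count, ret := nvml.DeviceGetCount()
	if ret != nvml.SUCCESS {
		return fmt.Errorf("device count failed: %v", nvml.ErrorString(ret))
	}
	for i := 0; i < count; i++ {
		device, ret := nvml.DeviceGetHandleByIndex(i)
		if ret != nvml.SUCCESS {
			return fmt.Errorf("device %d unreachable: %v", i, nvml.ErrorString(ret))
		}
		if _, ret := device.GetUUID(); ret != nvml.SUCCESS {
			return fmt.Errorf("device %d query failed: %v", i, nvml.ErrorString(ret))
		}
	}
	return nil
}

// l2Check runs a DCGM level 2 diagnostic via the dcgmi CLI.
// This assumes DCGM is installed on the GPU AMI.
func l2Check() error {
	out, err := exec.Command("dcgmi", "diag", "-r", "2").CombinedOutput()
	if err != nil {
		return fmt.Errorf("dcgmi diag failed: %v\n%s", err, out)
	}
	return nil
}

func main() {
	if err := l1Check(); err != nil {
		fmt.Println("L1 GPU check failed:", err)
		return
	}
	if err := l2Check(); err != nil {
		fmt.Println("L2 GPU check failed:", err)
		return
	}
	fmt.Println("GPU health checks passed")
}
```

The L2 run is considerably slower than the NVML probe, so gating it behind an ENV variable as proposed above seems sensible.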
For example, the agent could mark the instance as Unhealthy in its ASG so that it is replaced, and gracefully shut down the instance after the GPU has failed.
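A sketch of the ASG side using the AWS SDK for Go v2: the instance looks up its own ID via IMDS and marks itself Unhealthy so the Auto Scaling group replaces it (again, an illustration of the proposal, not current agent behaviour):

```go
package main

import (
	"context"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/feature/ec2/imds"
	"github.com/aws/aws-sdk-go-v2/service/autoscaling"
)

func main() {
	ctx := context.Background()

	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		log.Fatalf("load AWS config: %v", err)
	}

	// Look up this instance's ID from the EC2 instance metadata service.
	meta := imds.NewFromConfig(cfg)
	idDoc, err := meta.GetInstanceIdentityDocument(ctx, &imds.GetInstanceIdentityDocumentInput{})
	if err != nil {
		log.Fatalf("read instance metadata: %v", err)
	}

	// Mark the instance Unhealthy in its Auto Scaling group; the ASG then
	// terminates and replaces it according to its health check settings.
	asg := autoscaling.NewFromConfig(cfg)
	_, err = asg.SetInstanceHealth(ctx, &autoscaling.SetInstanceHealthInput{
		InstanceId:               aws.String(idDoc.InstanceID),
		HealthStatus:             aws.String("Unhealthy"),
		ShouldRespectGracePeriod: aws.Bool(false),
	})
	if err != nil {
		log.Fatalf("set instance health: %v", err)
	}
	log.Printf("instance %s marked Unhealthy for replacement", idDoc.InstanceID)
}
```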
**Which service(s) is this request for?**
ECS, when a GPU is present.
**Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?**
Currently, when an EC2 instance in an ECS cluster with a GPU fails (due to XID errors), the instance continues to run but is in a degraded state, with the GPU unavailable for use.
Customers must detect this state, manually terminate the instance, and manage a new instance to replace it. At scale this is a time-consuming process and impacts production workloads.
With an option to automatically mark the instance as unhealthy and terminate it, degraded hosts would be replaced automatically and a new instance allocated within the customer's ASG.
**Are you currently working around this issue?**
No, however we could provide code changes demonstrating a workaround to help accelerate this new feature.