Closed truenorth8 closed 5 months ago
Hi @truenorth8, thanks for reporting the issue. Which AMI is your deployment using? As a sanity check, I just ran some tests with the 20240610 Kernel 5.10 GPU AMI, and the agent and GPUs seem to work OK. Could you verify that your deployment uses this AMI?
Also, ECS AMIs should not normally see a version mismatch between the NVIDIA driver and client, and these packages should not be updated during instance launch, so I would not expect NVIDIA driver related logs in cloud-init-output. Could you check whether you have NVIDIA updates enabled? We recommend keeping them disabled (they are disabled in the base Amazon Linux AMIs and ECS AMIs), as upstream Amazon Linux updates might trigger failures.
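One way to check for the kind of driver mismatch described above is to compare the version of the NVIDIA kernel module that is currently loaded with the version of the module file installed on disk. The sketch below is only an illustration: the helper name `nvidia_versions_match` is hypothetical, and reading the versions from `/sys/module/nvidia/version` and `modinfo -F version nvidia` is an assumption about how the driver is packaged on the instance, not something confirmed in this thread.

```shell
#!/bin/sh
# Hypothetical helper (not part of any AWS or NVIDIA tooling):
# compares two version strings and reports whether they agree.
#   $1: version of the loaded kernel module
#   $2: version of the module file installed on disk
nvidia_versions_match() {
    loaded="$1"
    on_disk="$2"
    if [ -z "$loaded" ] || [ -z "$on_disk" ]; then
        echo "unknown: could not read one of the versions"
        return 2
    fi
    if [ "$loaded" = "$on_disk" ]; then
        echo "ok: $loaded"
        return 0
    fi
    echo "mismatch: loaded=$loaded on_disk=$on_disk"
    return 1
}

# Possible usage on a GPU instance (paths are assumptions, not run here):
#   nvidia_versions_match "$(cat /sys/module/nvidia/version 2>/dev/null)" \
#                         "$(modinfo -F version nvidia 2>/dev/null)"
```

A "mismatch" result would suggest that something replaced the driver files after the module was loaded, for example a package update during boot.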
@prateekchaudhry yum update -y was the culprit. Removing it made the issue go away.
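To keep a blanket update from creeping back into user data, a deploy pipeline could lint the script before it ships. The sketch below is a minimal example under stated assumptions: the function name `find_blanket_yum_updates` is hypothetical, and it only flags `yum update` or `yum upgrade` lines that carry no `--exclude` option. Whether excluding specific packages would be enough to avoid the failure is not established in this thread; the recommendation above is to leave these updates disabled.

```shell
#!/bin/sh
# Hypothetical lint (illustration only): print user-data lines that run a
# blanket "yum update" / "yum upgrade" with no --exclude option.
#   $1: path to the user-data script to inspect
# Returns 0 if at least one such line was found, non-zero otherwise.
find_blanket_yum_updates() {
    grep -nE '^[[:space:]]*(sudo[[:space:]]+)?yum[[:space:]]+(-y[[:space:]]+)?(update|upgrade)([[:space:]]|$)' "$1" \
        | grep -v -- '--exclude'
}

# Possible usage in a pre-deploy check (file name is an assumption):
#   if find_blanket_yum_updates user-data.sh; then
#       echo "user data runs an unrestricted yum update" >&2
#       exit 1
#   fi
```

The check is deliberately narrow: it matches line by line and does not follow sourced scripts or other package managers.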
Summary
The "amazon-linux-2/kernel-5.10/gpu" image encounters errors in the NVidia driver on startup, causing the ECS agent to fail.
Description
I'm running ECS on EC2 with g4dn instances. I use amazon-linux-2/kernel-5.10/gpu/recommended, which is deployed using cdk. At the time of writing, this resolved to kernel 5.10.217-205.860.amzn2.x86_64 (see docker stats below for more details).
On 11 June 2024 at ~7am UTC I deployed a new version of my app. This deploy terminated the existing instances and replaced them with new ones (intended). However, the new EC2 instances show nvidia errors in the System log that didn't appear before, causing the ECS agent to fail: the agent does not register itself with ECS and does not launch containers.
It's clear from the logs that the error is related to the NVidia drivers. The last working deploy was 1 day earlier, on 10 June 2024 at ~8am UTC.
I also run a userdata script on instance startup, though the error seems to occur before this script runs. And I'm not modifying the NVidia drivers, at least not intentionally. The lines above "install aws-cli" were added automatically by cdk.
Expected Behavior
The instance launches without driver errors.
Observed Behavior
The instance logs errors related to the NVidia driver, and the ECS-agent doesn't function normally.
Environment Details
Kernel module versions
Supporting Log Snippets
Please let me know if there's anything I can do to prevent this issue from happening in the future.