aws / amazon-ecs-agent

Amazon Elastic Container Service Agent
http://aws.amazon.com/ecs/
Apache License 2.0

nvidia-gpu-info.json not being generated since v1.82.4 #4254

Open GlassBil opened 1 month ago

GlassBil commented 1 month ago

Summary

/var/lib/ecs/gpu/nvidia-gpu-info.json isn't automatically being generated since ECS v1.83.0.

Description

Running the Amazon Linux 2023 (ECS Optimized) AMI on an EC2 g4dn.xlarge instance. On version 2023.4.20240528 (Amazon ECS Agent - v1.82.4), /var/lib/ecs/gpu/nvidia-gpu-info.json is automatically generated when ECS GPU support is enabled. However, since version 2023.4.20240611 (Amazon ECS Agent - v1.83.0) this no longer happens: the gpu directory is created inside /var/lib/ecs, but no file is generated inside it.
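A quick way to confirm the symptom on an affected host is to check for the file and, if present, summarize its contents. This is a minimal sketch, assuming the file (when generated) is JSON with `DriverVersion` and `GPUIDs` keys, which is the shape the agent normally writes:

```python
import json
import os

# Path the ECS agent is expected to populate (from the issue report)
GPU_DIR = "/var/lib/ecs/gpu"
INFO_FILE = os.path.join(GPU_DIR, "nvidia-gpu-info.json")

def check_gpu_info(path: str) -> str:
    """Return a short status string for the GPU info file."""
    if not os.path.exists(path):
        return "missing"
    with open(path) as f:
        data = json.load(f)
    # Assumed keys: DriverVersion (string) and GPUIDs (list of GPU UUIDs)
    return f"present: driver {data.get('DriverVersion')}, {len(data.get('GPUIDs', []))} GPU(s)"

if __name__ == "__main__":
    print(check_gpu_info(INFO_FILE))
```

On a broken v1.83.0 host this should print `missing`, matching the "Observed Behavior" below.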

Expected Behavior

Expect nvidia-gpu-info.json to be generated so that Docker can use the GPU. Without the file, Docker is unable to access the GPU.

Observed Behavior

gpu directory is created, but the file is not.

Environment Details

Instance type: g4dn.xlarge
AMI: Amazon Linux 2023 (ECS Optimized), 2023.4.20240611 or newer
GPU drivers: NVIDIA-Linux-x86_64-550.54.14.run

Supporting Log Snippets

ecs-agent docker logs:

level=error time=2024-07-24T06:07:07Z msg="Config for GPU support is enabled, but GPU information is not found; continuing without it" module=nvidia_gpu_manager_unix.go

system /var/log/ecs/ecs-init.log

level=info time=2024-07-24T06:07:11Z msg="pre-start"
level=info time=2024-07-24T06:07:11Z msg="Successfully created docker client with API version 1.25"
level=info time=2024-07-24T06:07:11Z msg="pre-start: setting up GPUs"
level=info time=2024-07-24T06:07:11Z msg="By using the GPU Optimized AMI, you agree to Nvidia’s End User License Agreement: https://www.nvidia.com/en-us/about-nvidia/eula-agreement/"
level=info time=2024-07-24T06:07:11Z msg="post-stop"
level=info time=2024-07-24T06:07:11Z msg="Cleaning up the credentials endpoint setup for Amazon Elastic Container Service Agent"
level=error time=2024-07-24T06:07:11Z msg="Error performing action 'delete' for iptables route: exit status 1; raw output: iptables: Bad rule (does a matching rule exist in that chain?).\n"
level=error time=2024-07-24T06:07:11Z msg="Error performing action 'delete' for iptables route: exit status 1; raw output: iptables: Bad rule (does a matching rule exist in that chain?).\n"
level=error time=2024-07-24T06:07:11Z msg="Error performing action 'delete' for iptables route: exit status 1; raw output: iptables: Bad rule (does a matching rule exist in that chain?).\n"
level=error time=2024-07-24T06:07:11Z msg="Error performing action 'delete' for iptables route: exit status 1; raw output: iptables: Bad rule (does a matching rule exist in that chain?).\n"

Can email complete logs if needed.

singholt commented 4 days ago

Hi @GlassBil, we currently do not offer an ECS Optimized AL2023 GPU AMI. Have you tried using the ECS Optimized AL2 (or AL2 kernel 5.10) GPU AMI?

https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-optimized_AMI.html

ethangeralt commented 4 days ago

Hello @GlassBil ,

I tested this in my lab and found a workaround. It may be risky to apply to any critical workload, but if you want to try it until an official fix comes out, let me know.
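The commenter's workaround is not shown above. For illustration only, here is a rough sketch of one way to hand-generate the missing file from `nvidia-smi` output. This is an assumption-laden sketch, not the commenter's method: it assumes the agent expects JSON of the shape `{"DriverVersion": "...", "GPUIDs": ["GPU-...", ...]}`, so verify the exact format against a working v1.82.4 host before using anything like this:

```python
import json
import shutil
import subprocess

# Path from the issue; the file the agent should have generated
GPU_INFO_PATH = "/var/lib/ecs/gpu/nvidia-gpu-info.json"

def build_gpu_info(smi_output: str) -> dict:
    """Parse output of `nvidia-smi --query-gpu=driver_version,uuid
    --format=csv,noheader` (one "driver, uuid" line per GPU) into the
    assumed structure {"DriverVersion": ..., "GPUIDs": [...]}."""
    driver = None
    gpu_ids = []
    for line in smi_output.strip().splitlines():
        version, uuid = [field.strip() for field in line.split(",")]
        driver = version          # same driver version reported per GPU
        gpu_ids.append(uuid)
    return {"DriverVersion": driver, "GPUIDs": gpu_ids}

if __name__ == "__main__":
    if shutil.which("nvidia-smi"):
        out = subprocess.check_output(
            ["nvidia-smi", "--query-gpu=driver_version,uuid",
             "--format=csv,noheader"],
            text=True,
        )
        with open(GPU_INFO_PATH, "w") as f:
            json.dump(build_gpu_info(out), f)
    else:
        print("nvidia-smi not found; nothing to do")
```

Restarting the ECS agent after writing the file would be needed for it to take effect; again, treat this as a stopgap experiment, not a supported fix.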