aws / amazon-ecs-ami

Packer recipes for building the official ECS-optimized Amazon Linux AMIs
https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-optimized_AMI.html
Apache License 2.0

`amzn2-ami-ecs-gpu-hvm` missing NVIDIA driver #277

Closed: jocampbell-exelixis closed this issue 4 months ago

jocampbell-exelixis commented 4 months ago

Summary

Maybe I am doing something wrong, but I thought the point of these Amazon Linux 2 GPU images was that they have the NVIDIA drivers pre-installed.

ami-088a209fd7cd0aaf9 amzn2-ami-ecs-gpu-hvm-2.0.20240424-x86_64-ebs

Description

The instance I am running on is part of an AWS Batch compute environment. I was working on matching my CUDA install version with the PyTorch version I am using, and my job never started. When I checked the instance, it seemed that either the driver had failed to finish setting up or it was never installed in the first place.

[root@ip-10-99-169-192 ~]# nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
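
One way to tell those two cases apart (module present but not loaded versus no module at all) is to compare `lsmod` with `modinfo nvidia`. The following is only a rough sketch and not something shipped with the AMI; the function name `classify_driver` and its three result labels are made up for illustration. It works on text passed in as arguments, so it can be tried without a GPU.

```shell
# Hypothetical helper (not part of the AMI): classify the NVIDIA driver state.
#   $1: text output of `lsmod`
#   $2: "yes" if `modinfo nvidia` succeeded for the running kernel, else "no"
classify_driver() {
  if printf '%s\n' "$1" | grep -Eq '^nvidia[[:space:]]'; then
    echo "loaded"                # core nvidia module is active
  elif [ "$2" = "yes" ]; then
    echo "available-not-loaded"  # module exists on disk but is not loaded
  else
    echo "no-module"             # no nvidia module found for this kernel
  fi
}

# Example usage on a live instance:
#   if modinfo nvidia >/dev/null 2>&1; then avail=yes; else avail=no; fi
#   classify_driver "$(lsmod)" "$avail"
```

A result of `no-module` does not by itself prove that no driver package is installed; it only says that no `nvidia` module is available for the kernel that is currently running.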

Expected Behavior

~ nvidia-smi
Wed Jul 24 14:43:37 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.78                 Driver Version: 550.78         CUDA Version: 12.4     |
...

Observed Behavior

[root@ip-10-99-169-192 ~]# nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

Environment Details

[root@ip-10-99-169-192 ~]# lspci
00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02)
00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
00:01.1 IDE interface: Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II]
00:01.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 01)
00:02.0 VGA compatible controller: Cirrus Logic GD 5446
00:03.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA)
00:1e.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
00:1f.0 Unassigned class [ff80]: XenSource, Inc. Xen Platform Device (rev 01)
[root@ip-10-99-169-138 ~]# docker info
Client:
 Context:    default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc., v0.0.0+unknown)

Server:
 Containers: 1
  Running: 1
  Paused: 0
  Stopped: 0
 Images: 4
 Server Version: 20.10.25
 Storage Driver: overlay2
  Backing Filesystem: xfs
  Supports d_type: true
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 1
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: nvidia runc io.containerd.runc.v2 io.containerd.runtime.v1.linux
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 64b8a811b07ba6288238eefc14d898ee0b5b99ba
 runc version: 4bccb38cc9cf198d52bebf2b3a90cd14e7af8c06
 init version: de40ad0
 Security Options:
  seccomp
   Profile: default
 Kernel Version: 4.14.336-257.568.amzn2.x86_64
 Operating System: Amazon Linux 2
 OSType: linux
 Architecture: x86_64
 CPUs: 4
 Total Memory: 59.95GiB
 Name: ip-10-99-169-138.us-west-2.compute.internal
 ID: G5P3:CJUG:ZNH6:HUL5:UOFS:VHHQ:DCGT:TRJC:3FOR:VUXA:F6TR:GKF2
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false
[root@ip-10-99-169-138 ~]# df -h
Filesystem                   Size  Used Avail Use% Mounted on
devtmpfs                      30G     0   30G   0% /dev
tmpfs                         30G     0   30G   0% /dev/shm
tmpfs                         30G  516K   30G   1% /run
tmpfs                         30G     0   30G   0% /sys/fs/cgroup
/dev/xvda1                    30G   11G   20G  36% /
10.99.170.177@tcp:/mhsljbev   12T  3.3T  7.9T  30% /fsx
overlay                       30G   11G   20G  36% /var/lib/docker/overlay2/8b15828628b306ac5e158f631c5234cfa5b94db128d3c69fb0a9a8a1966db4f5/merged
shm                           64M     0   64M   0% /var/lib/docker/containers/233c283c9bfe19a245465c591e82838b99a90c46788b7344b386c7b474dc80d1/mounts/shm
tmpfs                        6.0G     0  6.0G   0% /run/user/0
[root@ip-10-99-169-138 ~]# curl http://localhost:51678/v1/metadata
{"Cluster":"gpu-chemprop-environment_Batch_4c0c00c8-d83e-39bb-aa98-251139d32816","ContainerInstanceArn":"arn:aws:ecs:us-west-2:**************:container-instance/gpu-chemprop-environment_Batch_4c0c00c8-d83e-39bb-aa98-251139d32816/47caddf24cf044ca85c4adda896d3e0b","Version":"Amazon ECS Agent - v1.82.3 (*b702281f)"}

Supporting Log Snippets

[root@ip-10-99-169-138 ~]# nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

[root@ip-10-99-169-138 ~]# curl -O https://raw.githubusercontent.com/aws/amazon-ecs-logs-collector/master/ecs-logs-collector.sh
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 20906  100 20906    0     0   506k      0 --:--:-- --:--:-- --:--:--  510k
[root@ip-10-99-169-138 ~]# bash ecs-logs-collector.sh
Trying to check if the script is running as root ... ok
Trying to resolve instance-id ... getting instance id from ec2 metadata endpoint
ok
Trying to collect system information ... ok
Trying to check disk space usage ... ok
Trying to collect common operating system logs ... ok
Trying to collect kernel logs ... ok
Trying to get mount points and volume information ... ok
Trying to check SELinux status ... ok
Trying to get iptables list ... ok
Trying to detect installed packages ... ok
Trying to detect active system services list ... ok
Trying to gather Docker daemon information ... ok
Trying to inspect all Docker containers ... ok
Trying to collect Docker and containerd daemon logs ... ok
Trying to collect Docker systemd unit file ... ok
Trying to collect containerd systemd unit file ... ok
Trying to collect Docker sysconfig ... ok
Trying to collect Docker storage sysconfig ... ok
Trying to collect Docker daemon.json ... /etc/docker/daemon.json not found
Trying to collect Amazon ECS Container Agent logs ... ok
Trying to collect Amazon ECS Container Agent state and config ... Trying to collect Amazon ECS Container Agent engine data ... ok
Trying to get open files list ... ok
Trying to collect /etc/os-release ... ok
Trying to get uname kernel info ... ok
Trying to get dmidecode info ... ok
Trying to get lsmod info ... ok
Trying to collect systemd slice info ... ok
Trying to get veth info ... ok
Trying to get gpu info ... ok
Trying to archive gathered log information ... ok

jocampbell-exelixis commented 4 months ago

The problem was that P2 instances require updated user data - https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-gpu.html#p2-instance
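
For anyone hitting the same thing: the `lspci` output above already shows a Tesla K80, which is the GPU used by the P2 family. A small check along these lines could flag that case early. This is a hedged sketch only; the function name `is_p2_gpu` is hypothetical, and the match string is taken from the `lspci` line in this issue rather than from any AWS documentation.

```shell
# Hypothetical helper: succeed if `lspci` output lists an NVIDIA Tesla K80,
# the GPU model reported on the P2 instance in this issue.
#   $1: text output of `lspci`
is_p2_gpu() {
  printf '%s\n' "$1" | grep -q 'NVIDIA.*Tesla K80'
}

# Example usage on a live instance:
#   if is_p2_gpu "$(lspci)"; then
#     echo "Tesla K80 found: check the P2 section of the ECS GPU docs"
#   fi
```

The check only identifies the hardware; the user data that P2 needs is described at the link above.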