The instance I am running on is part of AWS Batch. I was working on matching my CUDA install version with the PyTorch version I am using, and my job never started. I went on to check, and it seems either the driver failed to finish setting up or was never installed in the first place.
[root@ip-10-99-169-192 ~]# nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
[root@ip-10-99-169-138 ~]# nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
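The nvidia-smi error above cannot tell on its own whether the driver was never installed or was installed but failed to load. A minimal sketch of extra checks that separate the two cases, using standard Linux/NVIDIA tooling (these are generic diagnostics, not output captured from this instance):
# Hedged sketch: distinguish "driver never installed" from "driver failed to load".
lsmod | grep -i nvidia                        # is the nvidia kernel module currently loaded?
modinfo nvidia 2>&1 | head -n 5               # is the module even present on disk for this kernel?
cat /proc/driver/nvidia/version 2>/dev/null   # only populated when the driver is loaded
dmesg | grep -iE "nvidia|nvrm" | tail -n 20   # kernel messages from a failed module load, if any
If modinfo finds no module at all, the driver was likely never installed for the running kernel; if the module exists but lsmod shows nothing, dmesg usually explains why the load failed.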
[root@ip-10-99-169-138 ~]# curl -O https://raw.githubusercontent.com/aws/amazon-ecs-logs-collector/master/ecs-logs-collector.sh
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 20906  100 20906    0     0   506k      0 --:--:-- --:--:-- --:--:--  510k
[root@ip-10-99-169-138 ~]# bash ecs-logs-collector.sh
Trying to check if the script is running as root ... ok
Trying to resolve instance-id ... getting instance id from ec2 metadata endpoint
ok
Trying to collect system information ... ok
Trying to check disk space usage ... ok
Trying to collect common operating system logs ... ok
Trying to collect kernel logs ... ok
Trying to get mount points and volume information ... ok
Trying to check SELinux status ... ok
Trying to get iptables list ... ok
Trying to detect installed packages ... ok
Trying to detect active system services list ... ok
Trying to gather Docker daemon information ... ok
Trying to inspect all Docker containers ... ok
Trying to collect Docker and containerd daemon logs ... ok
Trying to collect Docker systemd unit file ... ok
Trying to collect containerd systemd unit file ... ok
Trying to collect Docker sysconfig ... ok
Trying to collect Docker storage sysconfig ... ok
Trying to collect Docker daemon.json ... /etc/docker/daemon.json not found
Trying to collect Amazon ECS Container Agent logs ... ok
Trying to collect Amazon ECS Container Agent state and config ...
Trying to collect Amazon ECS Container Agent engine data ... ok
Trying to get open files list ... ok
Trying to collect /etc/os-release ... ok
Trying to get uname kernel info ... ok
Trying to get dmidecode info ... ok
Trying to get lsmod info ... ok
Trying to collect systemd slice info ... ok
Trying to get veth info ... ok
Trying to get gpu info ... ok
Trying to archive gathered log information ... ok
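For reference, the collector bundles everything it gathered (including the gpu info step above) into an archive next to the script. A sketch of how to inspect it for driver-related entries; the archive and directory names (collect.tgz, collect/) are assumptions and may differ between script versions:
ls -lh collect*                   # locate the archive the collector produced (name assumed)
tar -xzf collect.tgz              # unpack it alongside the script
grep -ril nvidia collect/ | head  # list gathered files that mention the driver (lsmod, gpu info, etc.)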
Summary
Maybe I am doing something wrong, but I thought the point of these Amazon Linux 2 ECS GPU images was that they had the NVIDIA drivers pre-installed; a quick check for that is sketched after the AMI details below.
AMI ID: ami-088a209fd7cd0aaf9
AMI name: amzn2-ami-ecs-gpu-hvm-2.0.20240424-x86_64-ebs
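A minimal sketch of that check on an instance launched from this AMI, assuming the driver ships as RPM packages (the packaging is an assumption, not confirmed from this AMI):
rpm -qa | grep -i nvidia   # any NVIDIA driver/userspace packages installed via RPM (assumed packaging)
lsmod | grep -i nvidia     # whether the kernel module is loaded at all
nvidia-smi                 # should list the GPU once the driver is loaded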
Description
Expected Behavior
Observed Behavior
Environment Details
Supporting Log Snippets