aws / amazon-ecs-ami

Packer recipes for building the official ECS-optimized Amazon Linux AMIs
https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-optimized_AMI.html
Apache License 2.0

Amazon Linux 2 GPU-optimized AMI - nvidia-persistenced.service fails to start and logs error "NVRM: API mismatch", breaking the ECS-agent #256

Closed · truenorth8 closed this issue 5 months ago

truenorth8 commented 5 months ago

Summary

The "amazon-linux-2/kernel-5.10/gpu" image encounters errors in the NVidia driver on startup, causing the ECS agent to fail.

[  145.441350] cloud-init[3620]: Jun 11 08:11:03 ip-10-0-19-159.ec2.internal systemd[1]: Unit nvidia-persistenced.service entered failed state.
[  145.441505] cloud-init[3620]: Jun 11 08:11:03 ip-10-0-19-159.ec2.internal systemd[1]: nvidia-persistenced.service failed.
[  145.441660] cloud-init[3620]: Jun 11 08:11:03 ip-10-0-19-159.ec2.internal kernel: NVRM: API mismatch: the client has the version 535.183.01, but
[  145.441820] cloud-init[3620]: NVRM: this kernel module has the version 535.161.07.  Please
[  145.441978] cloud-init[3620]: NVRM: make sure that this kernel module and all NVIDIA driver
[  145.442130] cloud-init[3620]: NVRM: components have the same version.

Description

I'm running ECS on EC2 with g4dn instances. I use amazon-linux-2/kernel-5.10/gpu/recommended, which is deployed using CDK. At the time of writing, this resolved to kernel 5.10.217-205.860.amzn2.x86_64 (see docker info below for more details).

    const amiId = ssm.StringParameter.valueForStringParameter(this,
      '/aws/service/ecs/optimized-ami/amazon-linux-2/kernel-5.10/gpu/recommended/image_id');
    const machineImage = ec2.MachineImage.genericLinux({
      'us-east-1': amiId,
    });

    // ECS service with EC2 capacity provider and AutoScalingGroup
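
For reference, the AMI ID that this lookup resolves to can also be checked directly with the AWS CLI (shown here only as a sanity check against the same public SSM parameter; adjust the region to match your deployment):

    # Resolve the currently recommended GPU-optimized AMI ID from SSM
    aws ssm get-parameter \
      --name /aws/service/ecs/optimized-ami/amazon-linux-2/kernel-5.10/gpu/recommended/image_id \
      --region us-east-1 \
      --query 'Parameter.Value' \
      --output text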

On 11 June 2024, ~7am UTC, I deployed a new version of my app. This deploy terminated existing instances and replaced them with new ones (intended). However, the new EC2 instances show NVIDIA errors in the system log that didn't appear before, causing the ECS agent to fail: the agent does not register itself with ECS and does not launch containers.

It's clear from the logs that the error is related to the NVIDIA drivers. The last working deploy was 1 day earlier, on 10 June 2024, ~8am UTC.

I also run a userdata script on instance startup, though the error seems to occur before this script runs, and I'm not modifying the NVIDIA drivers, at least not intentionally. The lines above "install aws-cli" were added automatically by CDK.

#!/bin/bash
echo ECS_CLUSTER=<redacted> >> /etc/ecs/ecs.config
sudo iptables --insert FORWARD 1 --in-interface docker+ --destination 169.254.169.254/32 --jump DROP
sudo service iptables save
echo ECS_AWSVPC_BLOCK_IMDS=true >> /etc/ecs/ecs.config

# install aws-cli
yum update -y && yum install -y aws-cli

echo ECS_ENABLE_CONTAINER_METADATA=true >> /etc/ecs/ecs.config
echo ECS_RESERVED_MEMORY=256 >> /etc/ecs/ecs.config

# make nvidia runtime available to all containers by default
sed -i 's/^OPTIONS="/OPTIONS="--default-runtime nvidia /' /etc/sysconfig/docker

# print docker settings
echo "---docker settings:\n\n"
cat /etc/sysconfig/docker
echo "\n\n end of docker settings---"

# apply settings
systemctl restart docker
echo "docker OK"

Expected Behavior

The instance launches without driver errors.

Observed Behavior

The instance logs errors related to the NVIDIA driver, and the ECS agent doesn't function normally.

Environment Details

Kernel module versions

$ cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module  535.161.07  Sat Feb 17 22:55:48 UTC 2024
GCC version:  gcc version 10.5.0 20230707 (Red Hat 10.5.0-1) (GCC)

$ cat /sys/module/nvidia/version
535.161.07

$ dkms status
nvidia, 535.183.01: added

docker info
Client:
 Context:    default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc., v0.0.0+unknown)

Server:
 Containers: 0
  Running: 0
  Paused: 0
  Stopped: 0
 Images: 0
 Server Version: 20.10.25
 Storage Driver: overlay2
  Backing Filesystem: xfs
  Supports d_type: true
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 1
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux nvidia runc
 Default Runtime: nvidia
 Init Binary: docker-init
 containerd version: 64b8a811b07ba6288238eefc14d898ee0b5b99ba
 runc version: 4bccb38cc9cf198d52bebf2b3a90cd14e7af8c06
 init version: de40ad0
 Security Options:
  seccomp
   Profile: default
 Kernel Version: 5.10.217-205.860.amzn2.x86_64
 Operating System: Amazon Linux 2
 OSType: linux
 Architecture: x86_64
 CPUs: 8
 Total Memory: 30.89GiB
 Name: ip-10-0-19-159.ec2.internal
 ID: JYX3:CPYG:DNMS:CABR:ASOG:EQXO:WWQ6:5UWL:4MVO:77EB:42ZT:JU7Z
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

Supporting Log Snippets

[  145.441350] cloud-init[3620]: Jun 11 08:11:03 ip-10-0-19-159.ec2.internal systemd[1]: Unit nvidia-persistenced.service entered failed state.
[  145.441505] cloud-init[3620]: Jun 11 08:11:03 ip-10-0-19-159.ec2.internal systemd[1]: nvidia-persistenced.service failed.
[  145.441660] cloud-init[3620]: Jun 11 08:11:03 ip-10-0-19-159.ec2.internal kernel: NVRM: API mismatch: the client has the version 535.183.01, but
[  145.441820] cloud-init[3620]: NVRM: this kernel module has the version 535.161.07.  Please
[  145.441978] cloud-init[3620]: NVRM: make sure that this kernel module and all NVIDIA driver
[  145.442130] cloud-init[3620]: NVRM: components have the same version.
[  145.442286] cloud-init[3620]: Jun 11 08:11:03 ip-10-0-19-159.ec2.internal cloud-init[3620]: Cleanup    : 3:nvidia-persistenced-latest-dkms-535.161.07-1.el7.x86_6   24/32
[  145.442442] cloud-init[3620]: Jun 11 08:11:03 ip-10-0-19-159.ec2.internal cloud-init[3620]: Cleanup    : 3:nvidia-driver-latest-dkms-cuda-535.161.07-1.el7.x86_64   25/32
[  145.480428] cloud-init[3620]: Jun 11 08:11:03 ip-10-0-19-159.ec2.internal cloud-init[3620]: Cleanup    : 3:nvidia-driver-latest-dkms-cuda-libs-535.161.07-1.el7.x   26/32
[  145.480573] cloud-init[3620]: Jun 11 08:11:03 ip-10-0-19-159.ec2.internal cloud-init[3620]: Cleanup    : 3:nvidia-modprobe-latest-dkms-535.161.07-1.el7.x86_64      27/32
[  145.480723] cloud-init[3620]: Jun 11 08:11:03 ip-10-0-19-159.ec2.internal cloud-init[3620]: Cleanup    : 3:nvidia-xconfig-latest-dkms-535.161.07-1.el7.x86_64       28/32
[  145.480889] cloud-init[3620]: Jun 11 08:11:03 ip-10-0-19-159.ec2.internal cloud-init[3620]: Cleanup    : 3:nvidia-driver-latest-dkms-535.161.07-1.el7.x86_64        29/32
[  145.481041] cloud-init[3620]: Jun 11 08:11:03 ip-10-0-19-159.ec2.internal cloud-init[3620]: Cleanup    : 3:nvidia-driver-latest-dkms-devel-535.161.07-1.el7.x86_6   30/32
[  145.481219] cloud-init[3620]: Jun 11 08:11:04 ip-10-0-19-159.ec2.internal cloud-init[3620]: Cleanup    : 3:nvidia-driver-latest-dkms-libs-535.161.07-1.el7.x86_64   31/32
[  145.481352] cloud-init[3620]: Jun 11 08:11:04 ip-10-0-19-159.ec2.internal cloud-init[3620]: Cleanup    : nvidia-fabric-manager-535.161.07-1.x86_64                  32/32
[  145.481511] cloud-init[3620]: Jun 11 08:11:04 ip-10-0-19-159.ec2.internal systemd[1]: nvidia-persistenced.service holdoff time over, scheduling restart.
[  145.481668] cloud-init[3620]: Jun 11 08:11:04 ip-10-0-19-159.ec2.internal systemd[1]: Stopped NVIDIA Persistence Daemon.
[  145.481827] cloud-init[3620]: -- Subject: Unit nvidia-persistenced.service has finished shutting down
[  145.481990] cloud-init[3620]: -- Defined-By: systemd
[  145.482144] cloud-init[3620]: -- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
[  145.482300] cloud-init[3620]: --
[  145.482458] cloud-init[3620]: -- Unit nvidia-persistenced.service has finished shutting down.
[  145.482627] cloud-init[3620]: Jun 11 08:11:04 ip-10-0-19-159.ec2.internal systemd[1]: Starting NVIDIA Persistence Daemon...
[  145.482780] cloud-init[3620]: -- Subject: Unit nvidia-persistenced.service has begun start-up
[  145.482945] cloud-init[3620]: -- Defined-By: systemd
[  145.483104] cloud-init[3620]: -- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
[  145.483266] cloud-init[3620]: --
[  145.483424] cloud-init[3620]: -- Unit nvidia-persistenced.service has begun starting up.
[  145.483582] cloud-init[3620]: Jun 11 08:11:04 ip-10-0-19-159.ec2.internal systemd[1]: Stopped Dynamically Generate Message Of The Day.
[  145.483753] cloud-init[3620]: -- Subject: Unit update-motd.service has finished shutting down
[  145.483909] cloud-init[3620]: -- Defined-By: systemd
[  145.484221] cloud-init[3620]: -- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
[  145.484386] cloud-init[3620]: --
[  145.484547] cloud-init[3620]: -- Unit update-motd.service has finished shutting down.
[  145.484712] cloud-init[3620]: Jun 11 08:11:04 ip-10-0-19-159.ec2.internal nvidia-persistenced[25461]: Verbose syslog connection opened
[  145.484867] cloud-init[3620]: Jun 11 08:11:04 ip-10-0-19-159.ec2.internal systemd[1]: Stopping Dynamically Generate Message Of The Day...
[  145.485023] cloud-init[3620]: -- Subject: Unit update-motd.service has begun shutting down
[  145.485177] cloud-init[3620]: -- Defined-By: systemd
[  145.485338] cloud-init[3620]: -- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
[  145.485487] cloud-init[3620]: --
[  145.485644] cloud-init[3620]: -- Unit update-motd.service has begun shutting down.
[  145.485800] cloud-init[3620]: Jun 11 08:11:04 ip-10-0-19-159.ec2.internal nvidia-persistenced[25461]: Started (25461)
[  145.485955] cloud-init[3620]: Jun 11 08:11:04 ip-10-0-19-159.ec2.internal systemd[1]: Starting Dynamically Generate Message Of The Day...
[  145.486108] cloud-init[3620]: -- Subject: Unit update-motd.service has begun start-up
[  145.486283] cloud-init[3620]: -- Defined-By: systemd
[  145.486419] cloud-init[3620]: -- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
[  145.486578] cloud-init[3620]: --
[  145.486735] cloud-init[3620]: -- Unit update-motd.service has begun starting up.
[  145.486894] cloud-init[3620]: Jun 11 08:11:04 ip-10-0-19-159.ec2.internal cloud-init[3620]: Verifying  : 3:nvidia-driver-latest-dkms-devel-535.183.01-1.el7.x86_6    1/32
[  145.487048] cloud-init[3620]: Jun 11 08:11:04 ip-10-0-19-159.ec2.internal nvidia-persistenced[25461]: Failed to query NVIDIA devices. Please ensure that the NVIDIA device files (/dev/nvidia*) exist, and that user 0 has read and write permissions for those files.
[FAILED] Failed to start Amazon Elastic Container Service - container agent.
[  145.525034] cloud-init[3620]: Jun 11 08:11:04 ip-10-0-19-159.ec2.internal systemd[1]: nvidia-persistenced.service: control process exited, code=exited status=1
[  145.525160] cloud-init[3620]: Jun 11 08:11:04 ip-10-0-19-159.ec2.internal nvidia-persistenced[25460]: nvidia-persistenced failed to initialize. Check syslog for more details.
See 'systemctl status ecs.service' for details.
[  145.525315] cloud-init[3620]: Jun 11 08:11:04 ip-10-0-19-159.ec2.internal nvidia-persistenced[25461]: PID file unlocked.
[  145.525470] cloud-init[3620]: Jun 11 08:11:04 ip-10-0-19-159.ec2.internal nvidia-persistenced[25461]: PID file closed.
[  145.525624] cloud-init[3620]: Jun 11 08:11:04 ip-10-0-19-159.ec2.internal nvidia-persistenced[25461]: Shutdown (25461)
[  145.525785] cloud-init[3620]: Jun 11 08:11:04 ip-10-0-19-159.ec2.internal systemd[1]: Failed to start NVIDIA Persistence Daemon.
[  145.525938] cloud-init[3620]: -- Subject: Unit nvidia-persistenced.service has failed
[  145.526091] cloud-init[3620]: -- Defined-By: systemd
[  145.526255] cloud-init[3620]: -- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
[  145.526406] cloud-init[3620]: --
[  145.526566] cloud-init[3620]: -- Unit nvidia-persistenced.service has failed.
[  145.526727] cloud-init[3620]: --
[  145.526887] cloud-init[3620]: -- The result is failed.
[  145.527040] cloud-init[3620]: Jun 11 08:11:04 ip-10-0-19-159.ec2.internal systemd[1]: Unit nvidia-persistenced.service entered failed state.
[  145.527202] cloud-init[3620]: Jun 11 08:11:04 ip-10-0-19-159.ec2.internal systemd[1]: nvidia-persistenced.service failed.
[  145.527358] cloud-init[3620]: Jun 11 08:11:04 ip-10-0-19-159.ec2.internal kernel: NVRM: API mismatch: the client has the version 535.183.01, but
[  145.527517] cloud-init[3620]: NVRM: this kernel module has the version 535.161.07.  Please
[  145.527668] cloud-init[3620]: NVRM: make sure that this kernel module and all NVIDIA driver
[  145.527833] cloud-init[3620]: NVRM: components have the same version.
[  145.528251] cloud-init[3620]: Jun 11 08:11:04 ip-10-0-19-159.ec2.internal cloud-init[3620]: Verifying  : cuda-drivers-535.183.01-1.x86_64                            2/32
[  145.528459] cloud-init[3620]: Jun 11 08:11:04 ip-10-0-19-159.ec2.internal cloud-init[3620]: Verifying  : 3:nvidia-driver-latest-dkms-535.183.01-1.el7.x86_64         3/32
[  145.528613] cloud-init[3620]: Jun 11 08:11:04 ip-10-0-19-159.ec2.internal cloud-init[3620]: Verifying  : 3:kmod-nvidia-latest-dkms-535.183.01-1.el7.x86_64           4/32
[  145.528770] cloud-init[3620]: Jun 11 08:11:04 ip-10-0-19-159.ec2.internal cloud-init[3620]: Verifying  : 3:nvidia-modprobe-latest-dkms-535.183.01-1.el7.x86_64       5/32
[  145.528928] cloud-init[3620]: Jun 11 08:11:04 ip-10-0-19-159.ec2.internal cloud-init[3620]: Verifying  : 3:nvidia-settings-535.183.01-2.el7.x86_64                   6/32
[  145.529081] cloud-init[3620]: Jun 11 08:11:04 ip-10-0-19-159.ec2.internal cloud-init[3620]: Verifying  : 3:nvidia-libXNVCtrl-535.183.01-2.el7.x86_64                 7/32
[  145.529249] cloud-init[3620]: Jun 11 08:11:04 ip-10-0-19-159.ec2.internal cloud-init[3620]: Verifying  : 3:nvidia-driver-latest-dkms-cuda-535.183.01-1.el7.x86_64    8/32
[  145.529420] cloud-init[3620]: Jun 11 08:11:04 ip-10-0-19-159.ec2.internal cloud-init[3620]: Verifying  : 3:nvidia-driver-latest-dkms-libs-535.183.01-1.el7.x86_64    9/32
[  145.529574] cloud-init[3620]: Jun 11 08:11:04 ip-10-0-19-159.ec2.internal cloud-init[3620]: Verifying  : 3:nvidia-xconfig-latest-dkms-535.183.01-1.el7.x86_64       10/32
[  145.529727] cloud-init[3620]: Jun 11 08:11:04 ip-10-0-19-159.ec2.internal cloud-init[3620]: Verifying  : 3:nvidia-driver-latest-dkms-NVML-535.183.01-1.el7.x86_64   11/32
[  145.529883] cloud-init[3620]: Jun 11 08:11:04 ip-10-0-19-159.ec2.internal cloud-init[3620]: Verifying  : 3:nvidia-libXNVCtrl-devel-535.183.01-2.el7.x86_64          12/32
[  145.530039] cloud-init[3620]: Jun 11 08:11:04 ip-10-0-19-159.ec2.internal cloud-init[3620]: Verifying  : 3:nvidia-persistenced-latest-dkms-535.183.01-1.el7.x86_6   13/32
[  145.568060] cloud-init[3620]: Jun 11 08:11:04 ip-10-0-19-159.ec2.internal cloud-init[3620]: Verifying  : nvidia-fabric-manager-535.183.01-1.x86_64                  14/32
[  145.568202] cloud-init[3620]: Jun 11 08:11:04 ip-10-0-19-159.ec2.internal cloud-init[3620]: Verifying  : 3:nvidia-driver-latest-dkms-cuda-libs-535.183.01-1.el7.x   15/32
[  145.568539] cloud-init[3620]: Jun 11 08:11:04 ip-10-0-19-159.ec2.internal cloud-init[3620]: Verifying  : 3:nvidia-driver-latest-dkms-NvFBCOpenGL-535.183.01-1.el7   16/32
[  145.568670] cloud-init[3620]: Jun 11 08:11:04 ip-10-0-19-159.ec2.internal cloud-init[3620]: Verifying  : 3:nvidia-settings-535.161.07-1.el7.x86_64                  17/32
[  145.568830] cloud-init[3620]: Jun 11 08:11:04 ip-10-0-19-159.ec2.internal cloud-init[3620]: Verifying  : nvidia-fabric-manager-535.161.07-1.x86_64                  18/32
[  145.568979] cloud-init[3620]: Jun 11 08:11:04 ip-10-0-19-159.ec2.internal cloud-init[3620]: Verifying  : 3:nvidia-persistenced-latest-dkms-535.161.07-1.el7.x86_6   19/32
[  145.569135] cloud-init[3620]: Jun 11 08:11:04 ip-10-0-19-159.ec2.internal cloud-init[3620]: Verifying  : 3:kmod-nvidia-latest-dkms-535.161.07-1.el7.x86_64          20/32
[  145.569314] cloud-init[3620]: Jun 11 08:11:04 ip-10-0-19-159.ec2.internal cloud-init[3620]: Verifying  : 3:nvidia-xconfig-latest-dkms-535.161.07-1.el7.x86_64       21/32
[  145.569468] cloud-init[3620]: Jun 11 08:11:04 ip-10-0-19-159.ec2.internal cloud-init[3620]: Verifying  : 3:nvidia-modprobe-latest-dkms-535.161.07-1.el7.x86_64      22/32
[  145.569622] cloud-init[3620]: Jun 11 08:11:04 ip-10-0-19-159.ec2.internal cloud-init[3620]: Verifying  : 3:nvidia-libXNVCtrl-devel-535.161.07-1.el7.x86_64          23/32
[  145.569776] cloud-init[3620]: Jun 11 08:11:04 ip-10-0-19-159.ec2.internal cloud-init[3620]: Verifying  : cuda-drivers-535.161.07-1.x86_64                           24/32
[  145.569938] cloud-init[3620]: Jun 11 08:11:04 ip-10-0-19-159.ec2.internal cloud-init[3620]: Verifying  : 3:nvidia-driver-latest-dkms-cuda-535.161.07-1.el7.x86_64   25/32
[  145.570088] cloud-init[3620]: Jun 11 08:11:04 ip-10-0-19-159.ec2.internal cloud-init[3620]: Verifying  : 3:nvidia-driver-latest-dkms-devel-535.161.07-1.el7.x86_6   26/32
[  145.570248] cloud-init[3620]: Jun 11 08:11:04 ip-10-0-19-159.ec2.internal cloud-init[3620]: Verifying  : 3:nvidia-libXNVCtrl-535.161.07-1.el7.x86_64                27/32
[  145.570411] cloud-init[3620]: Jun 11 08:11:04 ip-10-0-19-159.ec2.internal cloud-init[3620]: Verifying  : 3:nvidia-driver-latest-dkms-535.161.07-1.el7.x86_64        28/32
[  145.570561] cloud-init[3620]: Jun 11 08:11:04 ip-10-0-19-159.ec2.internal cloud-init[3620]: Verifying  : 3:nvidia-driver-latest-dkms-libs-535.161.07-1.el7.x86_64   29/32
[  145.570883] cloud-init[3620]: Jun 11 08:11:04 ip-10-0-19-159.ec2.internal cloud-init[3620]: Verifying  : 3:nvidia-driver-latest-dkms-NvFBCOpenGL-535.161.07-1.el7   30/32
[  145.571034] cloud-init[3620]: Jun 11 08:11:04 ip-10-0-19-159.ec2.internal cloud-init[3620]: Verifying  : 3:nvidia-driver-latest-dkms-cuda-libs-535.161.07-1.el7.x   31/32
[  145.571185] cloud-init[3620]: Jun 11 08:11:04 ip-10-0-19-159.ec2.internal cloud-init[3620]: Verifying  : 3:nvidia-driver-latest-dkms-NVML-535.161.07-1.el7.x86_64   32/32
[  145.571345] cloud-init[3620]: Jun 11 08:11:04 ip-10-0-19-159.ec2.internal cloud-init[3620]: Installed:
[  145.571499] cloud-init[3620]: Jun 11 08:11:04 ip-10-0-19-159.ec2.internal cloud-init[3620]: nvidia-libXNVCtrl.x86_64 3:535.183.01-2.el7
[  145.571652] cloud-init[3620]: Jun 11 08:11:04 ip-10-0-19-159.ec2.internal cloud-init[3620]: nvidia-libXNVCtrl-devel.x86_64 3:535.183.01-2.el7
[  145.571809] cloud-init[3620]: Jun 11 08:11:04 ip-10-0-19-159.ec2.internal cloud-init[3620]: nvidia-settings.x86_64 3:535.183.01-2.el7
[  145.571972] cloud-init[3620]: Jun 11 08:11:04 ip-10-0-19-159.ec2.internal cloud-init[3620]: Updated:
[  145.572280] cloud-init[3620]: Jun 11 08:11:04 ip-10-0-19-159.ec2.internal cloud-init[3620]: cuda-drivers.x86_64 0:535.183.01-1
[  145.572437] cloud-init[3620]: Jun 11 08:11:04 ip-10-0-19-159.ec2.internal cloud-init[3620]: kmod-nvidia-latest-dkms.x86_64 3:535.183.01-1.el7
[  145.572593] cloud-init[3620]: Jun 11 08:11:04 ip-10-0-19-159.ec2.internal cloud-init[3620]: nvidia-driver-latest-dkms.x86_64 3:535.183.01-1.el7
[  145.610418] cloud-init[3620]: Jun 11 08:11:04 ip-10-0-19-159.ec2.internal cloud-init[3620]: nvidia-driver-latest-dkms-NVML.x86_64 3:535.183.01-1.el7
[  145.610573] cloud-init[3620]: Jun 11 08:11:04 ip-10-0-19-159.ec2.internal cloud-init[3620]: nvidia-driver-latest-dkms-NvFBCOpenGL.x86_64 3:535.183.01-1.el7
[  145.610726] cloud-init[3620]: Jun 11 08:11:04 ip-10-0-19-159.ec2.internal cloud-init[3620]: nvidia-driver-latest-dkms-cuda.x86_64 3:535.183.01-1.el7
[  145.610882] cloud-init[3620]: Jun 11 08:11:04 ip-10-0-19-159.ec2.internal cloud-init[3620]: nvidia-driver-latest-dkms-cuda-libs.x86_64 3:535.183.01-1.el7
[  145.611042] cloud-init[3620]: Jun 11 08:11:04 ip-10-0-19-159.ec2.internal cloud-init[3620]: nvidia-driver-latest-dkms-devel.x86_64 3:535.183.01-1.el7
[  145.611200] cloud-init[3620]: Jun 11 08:11:04 ip-10-0-19-159.ec2.internal cloud-init[3620]: nvidia-driver-latest-dkms-libs.x86_64 3:535.183.01-1.el7
[  145.611355] cloud-init[3620]: Jun 11 08:11:04 ip-10-0-19-159.ec2.internal cloud-init[3620]: nvidia-fabric-manager.x86_64 0:535.183.01-1
[  145.611533] cloud-init[3620]: Jun 11 08:11:04 ip-10-0-19-159.ec2.internal cloud-init[3620]: nvidia-modprobe-latest-dkms.x86_64 3:535.183.01-1.el7
[  145.611664] cloud-init[3620]: Jun 11 08:11:04 ip-10-0-19-159.ec2.internal cloud-init[3620]: nvidia-persistenced-latest-dkms.x86_64 3:535.183.01-1.el7
[  145.611819] cloud-init[3620]: Jun 11 08:11:04 ip-10-0-19-159.ec2.internal cloud-init[3620]: nvidia-xconfig-latest-dkms.x86_64 3:535.183.01-1.el7
[  145.611974] cloud-init[3620]: Jun 11 08:11:04 ip-10-0-19-159.ec2.internal cloud-init[3620]: Replaced:
[  145.612292] cloud-init[3620]: Jun 11 08:11:04 ip-10-0-19-159.ec2.internal cloud-init[3620]: nvidia-libXNVCtrl.x86_64 3:535.161.07-1.el7
[  145.612445] cloud-init[3620]: Jun 11 08:11:04 ip-10-0-19-159.ec2.internal cloud-init[3620]: nvidia-libXNVCtrl-devel.x86_64 3:535.161.07-1.el7
[  145.612604] cloud-init[3620]: Jun 11 08:11:04 ip-10-0-19-159.ec2.internal cloud-init[3620]: nvidia-settings.x86_64 3:535.161.07-1.el7
[  145.612764] cloud-init[3620]: Jun 11 08:11:04 ip-10-0-19-159.ec2.internal cloud-init[3620]: Complete!
[  145.612916] cloud-init[3620]: Jun 11 08:11:04 ip-10-0-19-159.ec2.internal systemd[1]: nvidia-persistenced.service holdoff time over, scheduling restart.
[  145.613076] cloud-init[3620]: Jun 11 08:11:04 ip-10-0-19-159.ec2.internal systemd[1]: Stopped NVIDIA Persistence Daemon.
[  145.613226] cloud-init[3620]: -- Subject: Unit nvidia-persistenced.service has finished shutting down
[  145.613386] cloud-init[3620]: -- Defined-By: systemd
[  145.613539] cloud-init[3620]: -- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
[  145.613697] cloud-init[3620]: --
[  145.613851] cloud-init[3620]: -- Unit nvidia-persistenced.service has finished shutting down.
[  145.614009] cloud-init[3620]: Jun 11 08:11:04 ip-10-0-19-159.ec2.internal systemd[1]: Starting NVIDIA Persistence Daemon...
[  145.614164] cloud-init[3620]: -- Subject: Unit nvidia-persistenced.service has begun start-up
[  145.614323] cloud-init[3620]: -- Defined-By: systemd
[  145.614489] cloud-init[3620]: -- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
[  145.614650] cloud-init[3620]: --
[  145.614800] cloud-init[3620]: -- Unit nvidia-persistenced.service has begun starting up.
[  145.614961] cloud-init[3620]: Jun 11 08:11:04 ip-10-0-19-159.ec2.internal nvidia-persistenced[25495]: Verbose syslog connection opened
[  145.615115] cloud-init[3620]: Jun 11 08:11:04 ip-10-0-19-159.ec2.internal nvidia-persistenced[25495]: Started (25495)
[  145.615274] cloud-init[3620]: Jun 11 08:11:04 ip-10-0-19-159.ec2.internal nvidia-persistenced[25495]: Failed to query NVIDIA devices. Please ensure that the NVIDIA device files (/dev/nvidia*) exist, and that user 0 has read and write permissions for those files.
[  145.615450] cloud-init[3620]: Jun 11 08:11:04 ip-10-0-19-159.ec2.internal nvidia-persistenced[25495]: PID file unlocked.
[  145.615585] cloud-init[3620]: Jun 11 08:11:04 ip-10-0-19-159.ec2.internal nvidia-persistenced[25494]: nvidia-persistenced failed to initialize. Check syslog for more details.
[  145.615740] cloud-init[3620]: Jun 11 08:11:04 ip-10-0-19-159.ec2.internal kernel: NVRM: API mismatch: the client has the version 535.183.01, but
[  145.615895] cloud-init[3620]: NVRM: this kernel module has the version 535.161.07.  Please
[  145.654035] cloud-init[3620]: NVRM: make sure that this kernel module and all NVIDIA driver
[  145.654191] cloud-init[3620]: NVRM: components have the same version.
[  145.654343] cloud-init[3620]: Jun 11 08:11:04 ip-10-0-19-159.ec2.internal nvidia-persistenced[25495]: PID file closed.
[  145.654511] cloud-init[3620]: Jun 11 08:11:04 ip-10-0-19-159.ec2.internal systemd[1]: nvidia-persistenced.service: control process exited, code=exited status=1
[  145.654664] cloud-init[3620]: Jun 11 08:11:04 ip-10-0-19-159.ec2.internal nvidia-persistenced[25495]: Shutdown (25495)
[  145.654822] cloud-init[3620]: Jun 11 08:11:04 ip-10-0-19-159.ec2.internal systemd[1]: Failed to start NVIDIA Persistence Daemon.
[  145.654980] cloud-init[3620]: -- Subject: Unit nvidia-persistenced.service has failed
[  145.655134] cloud-init[3620]: -- Defined-By: systemd
[  145.655294] cloud-init[3620]: -- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
[  145.655449] cloud-init[3620]: --
[  145.655609] cloud-init[3620]: -- Unit nvidia-persistenced.service has failed.
[  145.655769] cloud-init[3620]: --
[  145.655924] cloud-init[3620]: -- The result is failed.
[  145.656235] cloud-init[3620]: Jun 11 08:11:04 ip-10-0-19-159.ec2.internal systemd[1]: Unit nvidia-persistenced.service entered failed state.
[  145.656394] cloud-init[3620]: Jun 11 08:11:04 ip-10-0-19-159.ec2.internal systemd[1]: nvidia-persistenced.service failed.
[  145.656554] cloud-init[3620]: Jun 11 08:11:04 ip-10-0-19-159.ec2.internal cloud-init[3620]: Loaded plugins: dkms-build-requires, nvidia, priorities, update-motd, upgrade-
[  145.656707] cloud-init[3620]: Jun 11 08:11:04 ip-10-0-19-159.ec2.internal cloud-init[3620]: : helper, versionlock
[  145.656862] cloud-init[3620]: Jun 11 08:11:04 ip-10-0-19-159.ec2.internal systemd[1]: nvidia-persistenced.service holdoff time over, scheduling restart.
[  145.657020] cloud-init[3620]: Jun 11 08:11:04 ip-10-0-19-159.ec2.internal systemd[1]: Stopped NVIDIA Persistence Daemon.
[  145.657177] cloud-init[3620]: -- Subject: Unit nvidia-persistenced.service has finished shutting down
[  145.657331] cloud-init[3620]: -- Defined-By: systemd
[  145.657515] cloud-init[3620]: -- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
[  145.657647] cloud-init[3620]: --
[  145.657802] cloud-init[3620]: -- Unit nvidia-persistenced.service has finished shutting down.
[  145.657956] cloud-init[3620]: Jun 11 08:11:04 ip-10-0-19-159.ec2.internal systemd[1]: Starting NVIDIA Persistence Daemon...
[  145.658114] cloud-init[3620]: -- Subject: Unit nvidia-persistenced.service has begun start-up
[  145.658270] cloud-init[3620]: -- Defined-By: systemd
[  145.658426] cloud-init[3620]: -- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
[  145.658595] cloud-init[3620]: --
[  145.658749] cloud-init[3620]: -- Unit nvidia-persistenced.service has begun starting up.
[  145.658902] cloud-init[3620]: Jun 11 08:11:04 ip-10-0-19-159.ec2.internal nvidia-persistenced[25501]: Verbose syslog connection opened
[  145.659065] cloud-init[3620]: Jun 11 08:11:04 ip-10-0-19-159.ec2.internal nvidia-persistenced[25501]: Started (25501)
[  145.659219] cloud-init[3620]: Jun 11 08:11:04 ip-10-0-19-159.ec2.internal nvidia-persistenced[25501]: Failed to query NVIDIA devices. Please ensure that the NVIDIA device files (/dev/nvidia*) exist, and that user 0 has read and write permissions for those files.
[  145.659375] cloud-init[3620]: Jun 11 08:11:04 ip-10-0-19-159.ec2.internal nvidia-persistenced[25501]: PID file unlocked.
[  145.659530] cloud-init[3620]: Jun 11 08:11:04 ip-10-0-19-159.ec2.internal nvidia-persistenced[25500]: nvidia-persistenced failed to initialize. Check syslog for more details.
[  145.659685] cloud-init[3620]: Jun 11 08:11:04 ip-10-0-19-159.ec2.internal nvidia-persistenced[25501]: PID file closed.
[  145.659840] cloud-init[3620]: Jun 11 08:11:04 ip-10-0-19-159.ec2.internal nvidia-persistenced[25501]: Shutdown (25501)
[  145.660012] cloud-init[3620]: Jun 11 08:11:04 ip-10-0-19-159.ec2.internal systemd[1]: nvidia-persistenced.service: control process exited, code=exited status=1
[  145.660156] cloud-init[3620]: Jun 11 08:11:04 ip-10-0-19-159.ec2.internal systemd[1]: Failed to start NVIDIA Persistence Daemon.
[  145.660309] cloud-init[3620]: -- Subject: Unit nvidia-persistenced.service has failed
[  145.660465] cloud-init[3620]: -- Defined-By: systemd
[  145.660626] cloud-init[3620]: -- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
[  145.698713] cloud-init[3620]: --
[  145.698863] cloud-init[3620]: -- Unit nvidia-persistenced.service has failed.
[  145.699023] cloud-init[3620]: --
[  145.699179] cloud-init[3620]: -- The result is failed.
[  145.699337] cloud-init[3620]: Jun 11 08:11:04 ip-10-0-19-159.ec2.internal kernel: NVRM: API mismatch: the client has the version 535.183.01, but
[  145.699497] cloud-init[3620]: NVRM: this kernel module has the version 535.161.07.  Please
[  145.699678] cloud-init[3620]: NVRM: make sure that this kernel module and all NVIDIA driver
[  145.699813] cloud-init[3620]: NVRM: components have the same version.
[  145.699972] cloud-init[3620]: Jun 11 08:11:04 ip-10-0-19-159.ec2.internal systemd[1]: Unit nvidia-persistenced.service entered failed state.
[  145.700292] cloud-init[3620]: Jun 11 08:11:04 ip-10-0-19-159.ec2.internal systemd[1]: nvidia-persistenced.service failed.
[  145.700448] cloud-init[3620]: Jun 11 08:11:04 ip-10-0-19-159.ec2.internal systemd[1]: nvidia-persistenced.service holdoff time over, scheduling restart.
[  145.700603] cloud-init[3620]: Jun 11 08:11:04 ip-10-0-19-159.ec2.internal systemd[1]: Stopped NVIDIA Persistence Daemon.
[  145.700768] cloud-init[3620]: -- Subject: Unit nvidia-persistenced.service has finished shutting down
[  145.700918] cloud-init[3620]: -- Defined-By: systemd
[  145.701076] cloud-init[3620]: -- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
[  145.701232] cloud-init[3620]: --
[  145.701393] cloud-init[3620]: -- Unit nvidia-persistenced.service has finished shutting down.
[  145.701551] cloud-init[3620]: Jun 11 08:11:04 ip-10-0-19-159.ec2.internal systemd[1]: Starting NVIDIA Persistence Daemon...
[  145.701709] cloud-init[3620]: -- Subject: Unit nvidia-persistenced.service has begun start-up
[  145.701876] cloud-init[3620]: -- Defined-By: systemd
[  145.702035] cloud-init[3620]: -- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
[  145.702192] cloud-init[3620]: --
[  145.702346] cloud-init[3620]: -- Unit nvidia-persistenced.service has begun starting up.
[  145.702527] cloud-init[3620]: Jun 11 08:11:04 ip-10-0-19-159.ec2.internal nvidia-persistenced[25515]: Verbose syslog connection opened
[  145.702693] cloud-init[3620]: Jun 11 08:11:04 ip-10-0-19-159.ec2.internal nvidia-persistenced[25515]: Started (25515)
[  145.702859] cloud-init[3620]: Jun 11 08:11:04 ip-10-0-19-159.ec2.internal nvidia-persistenced[25515]: Failed to query NVIDIA devices. Please ensure that the NVIDIA device files (/dev/nvidia*) exist, and that user 0 has read and write permissions for those files.
[  145.703021] cloud-init[3620]: Jun 11 08:11:04 ip-10-0-19-159.ec2.internal nvidia-persistenced[25515]: PID file unlocked.
[  145.703185] cloud-init[3620]: Jun 11 08:11:04 ip-10-0-19-159.ec2.internal nvidia-persistenced[25514]: nvidia-persistenced failed to initialize. Check syslog for more details.
[  145.703347] cloud-init[3620]: Jun 11 08:11:04 ip-10-0-19-159.ec2.internal nvidia-persistenced[25515]: PID file closed.
[  145.703543] cloud-init[3620]: Jun 11 08:11:04 ip-10-0-19-159.ec2.internal systemd[1]: nvidia-persistenced.service: control process exited, code=exited status=1
[  145.703688] cloud-init[3620]: Jun 11 08:11:04 ip-10-0-19-159.ec2.internal nvidia-persistenced[25515]: Shutdown (25515)
[  145.703848] cloud-init[3620]: Jun 11 08:11:04 ip-10-0-19-159.ec2.internal systemd[1]: Failed to start NVIDIA Persistence Daemon.
[  145.704022] cloud-init[3620]: -- Subject: Unit nvidia-persistenced.service has failed
[  145.704180] cloud-init[3620]: -- Defined-By: systemd
[  145.704337] cloud-init[3620]: -- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
[  145.704502] cloud-init[3620]: --
[  145.704666] cloud-init[3620]: -- Unit nvidia-persistenced.service has failed.
[  145.704852] cloud-init[3620]: --
[  145.704980] cloud-init[3620]: -- The result is failed.
[  145.705142] cloud-init[3620]: Jun 11 08:11:04 ip-10-0-19-159.ec2.internal kernel: NVRM: API mismatch: the client has the version 535.183.01, but
[  145.705295] cloud-init[3620]: NVRM: this kernel module has the version 535.161.07.  Please
[  145.705454] cloud-init[3620]: NVRM: make sure that this kernel module and all NVIDIA driver
[  145.705612] cloud-init[3620]: NVRM: components have the same version.
[  145.705770] cloud-init[3620]: Jun 11 08:11:04 ip-10-0-19-159.ec2.internal systemd[1]: Unit nvidia-persistenced.service entered failed state.
[  145.705931] cloud-init[3620]: Jun 11 08:11:04 ip-10-0-19-159.ec2.internal systemd[1]: nvidia-persistenced.service failed.
[  145.743869] cloud-init[3620]: Jun 11 08:11:05 ip-10-0-19-159.ec2.internal systemd[1]: nvidia-persistenced.service holdoff time over, scheduling restart.
[  145.744039] cloud-init[3620]: Jun 11 08:11:05 ip-10-0-19-159.ec2.internal systemd[1]: Stopped NVIDIA Persistence Daemon.
[  145.744179] cloud-init[3620]: -- Subject: Unit nvidia-persistenced.service has finished shutting down
[  145.744340] cloud-init[3620]: -- Defined-By: systemd
[  145.744501] cloud-init[3620]: -- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
[  145.744658] cloud-init[3620]: --
[  145.744809] cloud-init[3620]: -- Unit nvidia-persistenced.service has finished shutting down.
[  145.744968] cloud-init[3620]: Jun 11 08:11:05 ip-10-0-19-159.ec2.internal systemd[1]: start request repeated too quickly for nvidia-persistenced.service
[  145.745123] cloud-init[3620]: Jun 11 08:11:05 ip-10-0-19-159.ec2.internal systemd[1]: Failed to start NVIDIA Persistence Daemon.
[  145.745284] cloud-init[3620]: -- Subject: Unit nvidia-persistenced.service has failed
[  145.745438] cloud-init[3620]: -- Defined-By: systemd
[  145.745595] cloud-init[3620]: -- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
[  145.745751] cloud-init[3620]: --
[  145.745932] cloud-init[3620]: -- Unit nvidia-persistenced.service has failed.
[  145.746072] cloud-init[3620]: --
[  145.746224] cloud-init[3620]: -- The result is failed.
[  145.746379] cloud-init[3620]: Jun 11 08:11:05 ip-10-0-19-159.ec2.internal systemd[1]: Unit nvidia-persistenced.service entered failed state.
[  145.746541] cloud-init[3620]: Jun 11 08:11:05 ip-10-0-19-159.ec2.internal systemd[1]: nvidia-persistenced.service failed.

...
(some logs related to the userdata script omitted because they contain sensitive information)
...

[  146.047052] cloud-init[3620]: invalid_payloadCloud-init v. 19.3-46.amzn2.0.2 finished at Tue, 11 Jun 2024 08:11:11 +0000. Datasource DataSourceEc2.  Up 144.29 seconds
[  OK  ] Started Dynamically Generate Message Of The Day.
[  OK  ] Reached target Multi-User System.
[  OK  ] Reached target Cloud-init target.
[  OK  ] Reached target Graphical Interface.
         Starting Update UTMP about System Runlevel Changes...
[  OK  ] Started Update UTMP about System Runlevel Changes.

[  156.380043] NVRM: API mismatch: the client has the version 535.183.01, but
[  156.380043] NVRM: this kernel module has the version 535.161.07.  Please
[  156.380043] NVRM: make sure that this kernel module and all NVIDIA driver
[  156.380043] NVRM: components have the same version.
[  166.624266] NVRM: API mismatch: the client has the version 535.183.01, but
[  166.624266] NVRM: this kernel module has the version 535.161.07.  Please
[  166.624266] NVRM: make sure that this kernel module and all NVIDIA driver
[  166.624266] NVRM: components have the same version.
[  176.872003] NVRM: API mismatch: the client has the version 535.183.01, but
[  176.872003] NVRM: this kernel module has the version 535.161.07.  Please
[  176.872003] NVRM: make sure that this kernel module and all NVIDIA driver
[  176.872003] NVRM: components have the same version.
[  187.128036] NVRM: API mismatch: the client has the version 535.183.01, but
[  187.128036] NVRM: this kernel module has the version 535.161.07.  Please
[  187.128036] NVRM: make sure that this kernel module and all NVIDIA driver
[  187.128036] NVRM: components have the same version.
[  197.380133] NVRM: API mismatch: the client has the version 535.183.01, but
[  197.380133] NVRM: this kernel module has the version 535.161.07.  Please
[  197.380133] NVRM: make sure that this kernel module and all NVIDIA driver
[  197.380133] NVRM: components have the same version.
[  207.628066] NVRM: API mismatch: the client has the version 535.183.01, but
[  207.628066] NVRM: this kernel module has the version 535.161.07.  Please
[  207.628066] NVRM: make sure that this kernel module and all NVIDIA driver
[  207.628066] NVRM: components have the same version.
[  217.879910] NVRM: API mismatch: the client has the version 535.183.01, but
[  217.879910] NVRM: this kernel module has the version 535.161.07.  Please
[  217.879910] NVRM: make sure that this kernel module and all NVIDIA driver
[  217.879910] NVRM: components have the same version.
[  228.124050] NVRM: API mismatch: the client has the version 535.183.01, but
[  228.124050] NVRM: this kernel module has the version 535.161.07.  Please
[  228.124050] NVRM: make sure that this kernel module and all NVIDIA driver
[  228.124050] NVRM: components have the same version.
[  238.376258] NVRM: API mismatch: the client has the version 535.183.01, but
[  238.376258] NVRM: this kernel module has the version 535.161.07.  Please
[  238.376258] NVRM: make sure that this kernel module and all NVIDIA driver
[  238.376258] NVRM: components have the same version.
[  248.628033] NVRM: API mismatch: the client has the version 535.183.01, but
[  248.628033] NVRM: this kernel module has the version 535.161.07.  Please
[  248.628033] NVRM: make sure that this kernel module and all NVIDIA driver
[  248.628033] NVRM: components have the same version.
[  258.876087] NVRM: API mismatch: the client has the version 535.183.01, but
[  258.876087] NVRM: this kernel module has the version 535.161.07.  Please
[  258.876087] NVRM: make sure that this kernel module and all NVIDIA driver
[  258.876087] NVRM: components have the same version.
[  269.128070] NVRM: API mismatch: the client has the version 535.183.01, but
[  269.128070] NVRM: this kernel module has the version 535.161.07.  Please
[  269.128070] NVRM: make sure that this kernel module and all NVIDIA driver
[  269.128070] NVRM: components have the same version.
[  279.380126] NVRM: API mismatch: the client has the version 535.183.01, but
[  279.380126] NVRM: this kernel module has the version 535.161.07.  Please
[  279.380126] NVRM: make sure that this kernel module and all NVIDIA driver
[  279.380126] NVRM: components have the same version.
[  289.624215] NVRM: API mismatch: the client has the version 535.183.01, but
[  289.624215] NVRM: this kernel module has the version 535.161.07.  Please
[  289.624215] NVRM: make sure that this kernel module and all NVIDIA driver
[  289.624215] NVRM: components have the same version.
[  299.880090] NVRM: API mismatch: the client has the version 535.183.01, but
[  299.880090] NVRM: this kernel module has the version 535.161.07.  Please
[  299.880090] NVRM: make sure that this kernel module and all NVIDIA driver
[  299.880090] NVRM: components have the same version.
[  310.124053] NVRM: API mismatch: the client has the version 535.183.01, but
[  310.124053] NVRM: this kernel module has the version 535.161.07.  Please
[  310.124053] NVRM: make sure that this kernel module and all NVIDIA driver
[  310.124053] NVRM: components have the same version.
[  320.380096] NVRM: API mismatch: the client has the version 535.183.01, but
[  320.380096] NVRM: this kernel module has the version 535.161.07.  Please
[  320.380096] NVRM: make sure that this kernel module and all NVIDIA driver
[  320.380096] NVRM: components have the same version.
[  330.628039] NVRM: API mismatch: the client has the version 535.183.01, but
[  330.628039] NVRM: this kernel module has the version 535.161.07.  Please
[  330.628039] NVRM: make sure that this kernel module and all NVIDIA driver
[  330.628039] NVRM: components have the same version.
[  340.880010] NVRM: API mismatch: the client has the version 535.183.01, but
[  340.880010] NVRM: this kernel module has the version 535.161.07.  Please
[  340.880010] NVRM: make sure that this kernel module and all NVIDIA driver
[  340.880010] NVRM: components have the same version.
[  351.124126] NVRM: API mismatch: the client has the version 535.183.01, but
[  351.124126] NVRM: this kernel module has the version 535.161.07.  Please
[  351.124126] NVRM: make sure that this kernel module and all NVIDIA driver
[  351.124126] NVRM: components have the same version.

Please let me know if there's anything I can do to prevent this issue from happening in the future.

prateekchaudhry commented 5 months ago

Hi @truenorth8, thanks for reporting the issue. May I know which AMI your deployment is using? As a verification, I just ran some sanity tests with the 20240610 Kernel 5.10 GPU AMI, and the agent and GPUs seem to work OK. Could you verify that you have this AMI in your deployment?

Also, normally ECS AMIs should not see a version mismatch between the drivers and the client, and these should not be updated during instance launches, so I do not expect NVIDIA driver related logs in cloud-init-output. Could you verify whether you might have NVIDIA updates enabled? We recommend keeping them disabled (they are disabled in base Amazon Linux AMIs and ECS AMIs), as upstream Amazon Linux updates might trigger failures.
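
One rough way to check this on the instance, assuming the yum versionlock plugin that the cloud-init log above shows as loaded (exact package names may differ per AMI release):

    # List any NVIDIA/CUDA packages currently locked to a specific version
    yum versionlock list | grep -iE 'nvidia|cuda'

    # Show whether newer NVIDIA driver packages are available from the enabled repos
    yum check-update 'kmod-nvidia*' 'nvidia-driver*' 'cuda-drivers*'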

truenorth8 commented 5 months ago

@prateekchaudhry yum update -y was the culprit; removing it from the userdata script made the issue go away.
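
For anyone else hitting this: the simplest fix is dropping the blanket yum update -y from the userdata. A sketch of the alternative, if a general update is still wanted, is to exclude the driver packages (untested here beyond removing the update entirely):

    # Install only what is needed, without a blanket update
    yum install -y aws-cli

    # Or, if a general update is required, keep the NVIDIA/CUDA driver packages pinned
    yum update -y --exclude='*nvidia*' --exclude='cuda*'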