amazonlinux / amazon-linux-2023

Amazon Linux 2023
https://aws.amazon.com/linux/amazon-linux-2023/

[Bug] - Cannot install nvidia driver with most recent ami #538

Closed ksebby closed 11 months ago

ksebby commented 11 months ago

Describe the bug: I used to have no problem installing NVIDIA drivers on G4dn instances with Amazon Linux 2023 (ami-0df435f331839b2d6, al2023-ami-2023.2.20231016.0-kernel-6.1-x86_64), but with the newer AMI (ami-01bc990364452ab3e, al2023-ami-2023.2.20231026.0-kernel-6.1-x86_64) I cannot get the NVIDIA drivers to install.

To Reproduce: These are the commands I use to install the NVIDIA drivers. They work with the 20231016.0 AMI but not with the 20231026.0 AMI. As root:

- yum install -y cmake gcc docker kernel-devel-$(uname -r)
- BASE_URL=https://us.download.nvidia.com/tesla
- DRIVER_VERSION=515.105.01
- curl -fSsl -O $BASE_URL/$DRIVER_VERSION/NVIDIA-Linux-x86_64-$DRIVER_VERSION.run
- chmod +x NVIDIA-Linux-x86_64-$DRIVER_VERSION.run
- ./NVIDIA-Linux-x86_64-$DRIVER_VERSION.run
- curl -s -L https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo | tee /etc/yum.repos.d/nvidia-container-toolkit.repo
- yum install -y nvidia-container-toolkit
- nvidia-ctk runtime configure --runtime=docker
- systemctl restart docker

Expected behavior: I expect ./NVIDIA-Linux-x86_64-$DRIVER_VERSION.run to run successfully, but it fails with the error:

ERROR: Unable to load the kernel module 'nvidia.ko'. This happens most frequently when this kernel module was built against the wrong or improperly configured kernel sources, with a version of gcc that differs from the one used to build the target kernel, or if another driver, such as nouveau, is present and prevents the NVIDIA kernel module from obtaining ownership of the NVIDIA device(s), or no NVIDIA device installed in this system is supported by this NVIDIA Linux graphics driver release.

Please see the log entries 'Kernel module load error' and 'Kernel messages' at the end of the file '/var/log/nvidia-installer.log' for more information.

The logs do contain the same information.

Additional information: I have ensured that there are no other GPU drivers installed and the NVIDIA device is recognized by the system.
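
The two checks mentioned above (no competing GPU driver, NVIDIA device visible) could be sketched in shell roughly as follows. This is an illustration only, not the reporter's actual procedure: the helper name `module_listed` is hypothetical, and it assumes `lsmod` prints a header line followed by one module per line with the module name in the first column.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the original report): reads lsmod-style
# output on stdin and succeeds if the named module appears in column 1.
module_listed() {
  awk -v m="$1" 'NR > 1 && $1 == m { found = 1 } END { exit !found }'
}

# Example use on a live system, guarded so the sketch also runs where
# lsmod or lspci is unavailable.
if command -v lsmod >/dev/null 2>&1; then
  if lsmod | module_listed nouveau; then
    echo "nouveau is loaded and may conflict with the NVIDIA module"
  else
    echo "nouveau is not loaded"
  fi
fi
if command -v lspci >/dev/null 2>&1; then
  # Prints any PCI devices whose description mentions NVIDIA.
  lspci | grep -i nvidia || echo "no NVIDIA device reported by lspci"
fi
```

If both checks pass and the installer still fails, the end of /var/log/nvidia-installer.log (named in the error message) is the next place to look.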

lucafrost commented 11 months ago

Also having this issue, error log is identical to your comment.

ozbenh commented 11 months ago

Try sudo dnf install kernel-modules-extra and try again. We moved some kernel drivers that aren't normally needed on EC2 into that optional package (in part to save space on AMIs, in part because this was the opportunity to add things we don't want in the base setup, such as the EFI framebuffer, which increase boot/launch time but are wanted by some customers). We didn't realize that newer NVIDIA drivers depended on DRM/GEM.
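
A minimal sketch of the suggested fix, reusing the installer file name and `DRIVER_VERSION` value from the original report. The `run` wrapper, the `retry_driver_install` function, and the `DRY_RUN` variable are illustrative additions, not part of this thread. By default the script only prints the commands; setting `DRY_RUN=0` as root would actually execute them.

```shell
#!/usr/bin/env bash
# Sketch only: install the optional module package, then retry the installer.
# DRIVER_VERSION defaults to the value in the original report; adjust as needed.
DRIVER_VERSION="${DRIVER_VERSION:-515.105.01}"

# Illustrative wrapper: prints each command unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "0" ]; then
    "$@"
  else
    echo "+ $*"
  fi
}

retry_driver_install() {
  # Optional package that now carries the drivers moved out of the base setup.
  run dnf install -y kernel-modules-extra
  # Re-run the installer that previously failed to load nvidia.ko.
  run "./NVIDIA-Linux-x86_64-${DRIVER_VERSION}.run"
}

retry_driver_install
```

The dry-run default is a deliberate choice here so that the commands can be reviewed before anything is installed.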

chrd5273 commented 11 months ago

I had the same exact issue, and what @ozbenh suggested solved the issue.

ksebby commented 11 months ago

Thanks. sudo dnf install kernel-modules-extra did it for me.

97amarnathk commented 9 months ago

This worked for me as well.

Aadil-5122 commented 8 months ago

This worked for me as well! Thanks a lot!

elkay commented 7 months ago

Try sudo dnf install kernel-modules-extra and try again. We moved some kernel drivers that aren't normally needed on EC2 into that optional package (in part to save space on AMIs, in part because this was the opportunity to add things we don't want in the base setup, such as the EFI framebuffer, which increase boot/launch time but are wanted by some customers). We didn't realize that newer NVIDIA drivers depended on DRM/GEM.

How is this so buried? (and praise Google for magically pulling this thread up) Just setting up a new g5g instance and was getting the dreaded "Unable to load the kernel module 'nvidia.ko'" during driver installation. Without that command I would have probably given up, as nothing else was working.

PriyaranjanMarathe commented 7 months ago

ozbenh's suggestion worked for me as well. Thanks for sharing!

PriyaranjanMarathe commented 7 months ago

https://ranjanmarathe.wordpress.com/2024/03/03/unable-to-load-the-kernel-module-nvidia-ko/

bostrt commented 6 months ago

sudo dnf install kernel-modules-extra also helped me with Amazon Linux 2023 on g4dn.xlarge. Can the docs at https://docs.aws.amazon.com/en_us/AWSEC2/latest/UserGuide/install-nvidia-driver.html be updated?

Kontinuation commented 5 months ago

sudo dnf install kernel-modules-extra also helped me install NVIDIA-SMI 535.129.03 on Amazon Linux 2023 on g5.xlarge.

I also hope that https://docs.aws.amazon.com/en_us/AWSEC2/latest/UserGuide/install-nvidia-driver.html could be updated, so that newcomers won't have to spend time searching for solutions.

dantaninecz commented 3 months ago

sudo dnf install kernel-modules-extra fixed the issue for me as well. Great piece of info that could probably be more prominently conveyed to users.

limmike commented 3 months ago

NVIDIA has added AL2023 support

Refer to "How do I install NVIDIA GPU driver, CUDA toolkit and optionally NVIDIA Container Toolkit in Amazon Linux 2023 (AL2023)?"