aws-samples / pixel-streaming-on-eks

MIT No Attribution

Packer unable to install nvidia drivers #2

Open AMGI-Pipeline opened 5 months ago

AMGI-Pipeline commented 5 months ago

Hi,

I tried deploying your demo by following the steps in the /docs folder. The Packer script threw an error when attempting to install the NVIDIA drivers. I'm including a section of the console output.

amazon-ebs: --2024-06-26 00:32:17-- https://developer.download.nvidia.com/compute/cuda/12.0.1/local_installers/cuda_12.0.1_525.85.12_linux.run
amazon-ebs: Resolving developer.download.nvidia.com (developer.download.nvidia.com)... 152.199.39.144
amazon-ebs: Connecting to developer.download.nvidia.com (developer.download.nvidia.com)|152.199.39.144|:443... connected.
amazon-ebs: HTTP request sent, awaiting response... 200 OK
amazon-ebs: Length: 4207617207 (3.9G) [application/octet-stream]
amazon-ebs: Saving to: ‘cuda_12.0.1_525.85.12_linux.run’
amazon-ebs:
amazon-ebs: 100%[====================================>] 4,207,617,207 122MB/s in 25s
amazon-ebs:
amazon-ebs: 2024-06-26 00:32:41 (163 MB/s) - ‘cuda_12.0.1_525.85.12_linux.run’ saved [4207617207/4207617207]
amazon-ebs:
amazon-ebs: Installation failed. See log at /var/log/cuda-installer.log for details.

Here is the command I executed: packer build eks-gpu-node.json

I'm not sure whether I did something wrong, whether the problem is out-of-date information on AWS, or something else. Any assistance is greatly appreciated!

AMGI-Pipeline commented 5 months ago

I disabled termination on the EC2 instance so that I could see the cuda-installer log. Here is the log output:

[root@ip-172-31-0-179 log]# cat cuda-installer.log
INFO: Setting silent=true
INFO: Silent install of all components
INFO: Driver not installed.
INFO: Checking compiler version...
INFO: gcc location: /bin/gcc
INFO: gcc version: gcc version 7.3.1 20180712 (Red Hat 7.3.1-17) (GCC)
INFO: Initializing menu
INFO: Components to install:
INFO: Executing NVIDIA-Linux-x86_64-525.85.12.run --ui=none --no-questions --accept-license --disable-nouveau --no-cc-version-check --install-libglvnd 2>&1
INFO: Finished with code: 256
[ERROR]: Install of driver component failed. Consult the driver log at /var/log/nvidia-installer.log for more details.
[ERROR]: Install of 525.85.12 failed, quitting
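
For anyone following along, the failure lines in a log like this can be filtered out with a small helper. This is only a sketch: `installer_errors` is a hypothetical name, and matching on `ERROR` assumes the log marks failures the way the excerpt above does.

```shell
#!/bin/sh
# Hypothetical helper: print only the error lines from an installer log.
# Assumes failure lines contain the word "ERROR", as in the excerpt above.
installer_errors() {
    log_file="$1"
    if [ ! -r "$log_file" ]; then
        echo "cannot read $log_file" >&2
        return 1
    fi
    grep 'ERROR' "$log_file"
}

# Example (paths taken from the messages above):
#   installer_errors /var/log/cuda-installer.log
#   installer_errors /var/log/nvidia-installer.log
```

The second path is the driver log that the CUDA installer points to, which is likely where the underlying cause is recorded.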

hsuzuki-gcs commented 3 weeks ago

I encountered the same issue. It seems to be an error related to the version of GCC. I was able to resolve it by modifying setup_gpu.sh as follows:

-sudo apt-get install -y gcc make linux-headers-$(uname -r)
+sudo apt-get install -y gcc gcc-12 make linux-headers-$(uname -r)
+sudo ln -s -f /usr/bin/gcc-12 /usr/bin/gcc
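
To confirm whether the GCC version is the likely culprit before starting the installer, a check along these lines might help. It is a minimal sketch: `gcc_major` and `gcc_is_new_enough` are hypothetical helper names, and the default minimum of 12 is an assumption taken from the change above, not a documented requirement of the NVIDIA installer.

```shell
#!/bin/sh
# Hypothetical helper: extract the major version from a string such as "7.3.1".
gcc_major() {
    echo "${1%%.*}"
}

# Hypothetical check: succeed only if the given version meets the minimum.
# The default minimum of 12 is an assumption based on the fix above.
gcc_is_new_enough() {
    version="$1"
    minimum="${2:-12}"
    [ "$(gcc_major "$version")" -ge "$minimum" ]
}

# Example usage before running the CUDA installer:
#   if ! gcc_is_new_enough "$(gcc -dumpversion)"; then
#       echo "gcc $(gcc -dumpversion) may be too old for the driver build" >&2
#   fi
```

With the gcc 7.3.1 reported in the log earlier in this thread, the check would fail and print the warning.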

AMGI-Pipeline commented 3 weeks ago

@hsuzuki-gcs, thank you for your reply. I will try making those modifications.