NVIDIA / nvidia-installer

NVIDIA driver installer
GNU General Public License v2.0

"Building kernel modules" very slow / never completes #28

Closed jfirebaugh closed 3 months ago

jfirebaugh commented 1 year ago

I'm trying to run nvidia-installer on an AWS g4dn.xlarge instance. The script I'm using is:

sudo yum install -y gcc kernel-devel-$(uname -r)
BASE_URL=https://us.download.nvidia.com/tesla
DRIVER_VERSION=525.85.12
curl -fSsl -O $BASE_URL/$DRIVER_VERSION/NVIDIA-Linux-x86_64-$DRIVER_VERSION.run
sudo sh NVIDIA-Linux-x86_64-$DRIVER_VERSION.run --ui=none

The "Building kernel modules" step of the install reaches 50% quickly, then gets progressively slower and slower, before seeming to hang at 100%. This was reproduced by an AWS support technician, who was using ami-02771ba9f2783d0a in us-west-2. For him, the install did finally complete after approximately 1 hour. I have not been successful (got disconnected from the instance after approximately 40 minutes).

Throughout this time, the nvidia-installer process consistently consumes 100% of one CPU. Here is a sample of strace output from the process.

Is it expected that installation takes this long? It's way more than what I would expect for building some kernel modules.

aritger commented 1 year ago

Hi. Here are a few ideas:

I hope that helps.

jfirebaugh commented 1 year ago

I got further with this by trying older versions. One of them (I forget which) told me that my GCC version was mismatched with my kernel version. 525.85.12 didn't give me that warning (even with the same kernel and same GCC), but when I added CC=gcc10-cc it did build in a reasonable amount of time. My guess is that building with the default GCC spews a lot of errors and/or warnings to the output, and nvidia-installer doesn't deal with a large amount of output efficiently (from looking at the source code I see pipes are involved).

dadap commented 1 year ago

Do you observe any difference between building with one compiler or the other if you build the kernel modules outside of nvidia-installer? You can do this by following the below procedure:

  1. Extract the contents of the .run file by executing it with the -x option on its command line. This will create a new directory with the same name as the .run file (minus the .run extension) and extract the installer package files to it.
  2. Change to the “kernel” directory inside the newly extracted package directory.
  3. Run make. You can override the compiler by setting CC in the environment or on make’s command line.
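
The three steps above can be sketched as a pair of shell functions. This is only an illustration under assumptions: the function names `extracted_dir` and `build_outside_installer` are made up for this example, the fallback compiler `cc` is a guess at what make would use by default, and `gcc10-cc` is simply the compiler mentioned earlier in this thread.

```shell
# Hypothetical helpers; the names are not part of nvidia-installer.

# Print the directory that "-x" is described as creating:
# the .run file's name minus the .run extension.
extracted_dir() {
    basename "$1" .run
}

# Extract the package and build the kernel modules without the installer.
#   $1: path to the .run file
#   $2: compiler to pass as CC (assumed fallback: cc)
build_outside_installer() {
    run_file=$1
    compiler=${2:-cc}
    # Step 1: extract the installer package files.
    sh "$run_file" -x || return 1
    # Steps 2 and 3: enter the "kernel" directory and run make.
    # The subshell keeps the caller's working directory unchanged.
    ( cd "$(extracted_dir "$run_file")/kernel" && make CC="$compiler" )
}
```

A run with the compiler that worked above might look like `build_outside_installer NVIDIA-Linux-x86_64-525.85.12.run gcc10-cc`. Repeating it in a clean directory with the default compiler would give the comparison being asked for.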

I wouldn’t expect copious build output to slow the installer down to the point where it doesn’t complete after an hour. I have seen occasional cases where nvidia-installer appears to hang due to a command it runs waiting for interactive user input, but I haven’t seen that happen for the kernel module build in particular. It would be interesting to see if the observed behavior can be reproduced outside of the installer.

jfirebaugh commented 1 year ago

tail -f /var/log/nvidia-installer.log

It's a continuous stream of compiler warnings and errors. Here's a sample: https://gist.github.com/jfirebaugh/fa166d3c4431a4d2fb0a5af21f0ec08b

Running make directly

Similar errors, but they eventually come to an end, unlike tailing the log above.
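
To put a number on how much output each build produces, the output of a direct make run could be saved to a file and counted. The sketch below assumes the build output was captured with something like `make CC=gcc10-cc > build.log 2>&1`. The file name `build.log` and both function names are hypothetical, and the `warning:` / `error:` patterns are an assumption about how GCC labels its diagnostics in this log.

```shell
# Hypothetical helpers for sizing up a captured build log.

# Print the total number of lines in the log file given as $1.
log_line_count() {
    wc -l < "$1" | tr -d ' '
}

# Print how many lines contain a compiler warning or error marker.
log_diagnostic_count() {
    grep -c -E 'warning:|error:' "$1"
}
```

Comparing `log_line_count build.log` between the default compiler and gcc10-cc would show whether the slow case really does produce far more output.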

aaronp24 commented 3 months ago

Is this still a problem? There was an O(n^2) algorithm in the string handling code that caused run_command to take time quadratic in the number of lines of the kernel build output. That was fixed in commit 251ec51c59e8593f6ece68190070b6d63aca2275. Closing as fixed, but please let me know if you still see this.
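
For a sense of scale, here is a rough worked estimate of an O(n^2) cost. It assumes, purely for illustration and not from reading the commit, that handling the k-th line of output costs work proportional to the k lines accumulated so far, with c an arbitrary constant per line.

```latex
\[
T(n) \;=\; \sum_{k=1}^{n} c\,k \;=\; c\,\frac{n(n+1)}{2} \;=\; O(n^2)
\]
% Example: n = 10{,}000 lines of build output gives
% n(n+1)/2 = 50{,}005{,}000 units of work, versus 10{,}000 for a linear pass.
```

Under that assumption, doubling the number of output lines roughly quadruples the time, which would fit a build that slows down progressively as warnings pile up.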