Closed jfirebaugh closed 3 months ago
Hi. Here are a few ideas:
There are a lot of TIOCGWINSZs in that strace. Does the behavior differ if you use "--silent" instead of "--ui=none"?
From another shell, can you run tail -f /var/log/nvidia-installer.log while the installation is going? That should give a higher-level view of what is happening.
Building nvidia-uvm.ko can be slow (especially single-threaded), but not as slow as you describe. Does -j$(nproc) on the .run line help? The installer should auto-detect the number of CPUs and perform the build in parallel, but maybe something about the AWS environment is confusing it.
I hope that helps.
I got further with this by trying older versions. One of them (I forget which) told me that my GCC version was mismatched with my kernel version. 525.85.12 didn't give me that warning (even with the same kernel and same GCC), but when I added CC=gcc10-cc it did build in a reasonable amount of time. My guess is that building with the default GCC spews a lot of errors and/or warnings to the output, and nvidia-installer doesn't deal with a large amount of output efficiently (from looking at the source code I see pipes are involved).
Do you observe any difference between building with one compiler or the other if you build the kernel modules outside of nvidia-installer? You can do this by following the below procedure:
make: you can override the compiler by setting CC in the environment or on make’s command line.
I wouldn’t expect copious build output to slow the installer down to the point where it doesn’t complete after an hour. I have seen occasional cases where nvidia-installer appears to hang due to a command it runs waiting for interactive user input, but I haven’t seen that happen for the kernel module build in particular. It would be interesting to see if the observed behavior can be reproduced outside of the installer.
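To compare the two compilers outside of nvidia-installer, the build could be timed with a small wrapper like the one below. This is a minimal sketch and not part of the NVIDIA tooling: time_command is a hypothetical helper name, and it assumes it is run from the directory that contains the kernel module sources. It runs a command, reads its combined stdout and stderr from a pipe, and reports the exit code, wall-clock time, and number of output lines, so the volume of warnings from each compiler is visible too.

```python
import subprocess
import time


def time_command(argv, cwd=None):
    """Run argv and return (exit_code, seconds, output_line_count).

    stdout and stderr are merged and consumed line by line from a pipe,
    so the child never stalls on a full pipe and nothing is accumulated
    in memory.
    """
    start = time.monotonic()
    line_count = 0
    with subprocess.Popen(
        argv,
        cwd=cwd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
        errors="replace",  # tolerate odd bytes in compiler output
    ) as proc:
        for _line in proc.stdout:
            line_count += 1
        exit_code = proc.wait()
    return exit_code, time.monotonic() - start, line_count
```

For example, calling time_command(["make"]) and then time_command(["make", "CC=gcc10-cc"]), each from a clean tree, would give a direct comparison; gcc10-cc is simply the compiler name mentioned earlier in this thread.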
tail -f /var/log/nvidia-installer.log
It's a continuous stream of compiler warnings and errors. Here's a sample: https://gist.github.com/jfirebaugh/fa166d3c4431a4d2fb0a5af21f0ec08b
Running make directly
Similar errors, but they eventually come to an end, unlike tailing the log above.
Is this still a problem? There was an O(n^2) algorithm in the string handling code that caused run_command to take time quadratic in the number of lines of kernel build output. That was fixed in commit 251ec51c59e8593f6ece68190070b6d63aca2275. Closing as fixed, but please let me know if you still see this.
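For readers wondering how output handling can become quadratic, here is a small cost model. It is an illustration only and is not taken from the nvidia-installer source: it assumes, hypothetically, that each new line of output is appended to a buffer by first rescanning everything stored so far (as a C-style strlen followed by strcat would), and compares that with remembering the buffer length between appends. The function names are made up for this sketch.

```python
def rescanning_append_cost(line_lengths):
    """Characters touched when every append rescans the existing buffer.

    Models a hypothetical strlen()+strcat()-style append: to add a line,
    walk the whole buffer to find its end, then copy the new line.
    """
    touched = 0
    buffer_len = 0
    for n in line_lengths:
        touched += buffer_len  # rescan what is already stored
        touched += n           # copy the new line
        buffer_len += n
    return touched


def tracked_append_cost(line_lengths):
    """Characters touched when the buffer length is remembered."""
    return sum(line_lengths)  # each character is copied exactly once


if __name__ == "__main__":
    for lines in (1_000, 10_000, 100_000):
        lengths = [80] * lines  # assumption: 80-character output lines
        print(lines,
              rescanning_append_cost(lengths),
              tracked_append_cost(lengths))
```

Under these assumptions, going from 1,000 to 10,000 lines multiplies the rescanning cost by roughly 100 while the tracked cost grows only by 10. That is consistent with a build that starts quickly and then gets progressively slower as output piles up.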
I'm trying to run nvidia-installer on an AWS g4dn.xlarge instance. The script I'm using is:
The "Building kernel modules" step of the install reaches 50% quickly, then gets progressively slower, before seeming to hang at 100%. This was reproduced by an AWS support technician, who was using ami-02771ba9f2783d0a in us-west-2. For him, the install did finally complete after approximately 1 hour. I have not been able to complete it myself (I got disconnected from the instance after approximately 40 minutes).
Throughout this time, the nvidia-installer process consistently consumes 100% of one CPU. Here is a sample of strace output from the process.
Is it expected that installation takes this long? It's way more than what I would expect for building some kernel modules.