NVIDIA / open-gpu-kernel-modules

NVIDIA Linux open GPU kernel module source

[Installation] Fails to Build/Install for LLVM Thin-LTO Running Kernels #214

Open ThisNekoGuy opened 2 years ago

ThisNekoGuy commented 2 years ago

NVIDIA Driver Version

515.43.04 (to install)

GPU

RTX 2080 Ti

Describe the bug

Attempting to install the kernel modules with Clang using an LLVM thin-LTO compiled kernel fails: build.log

Related to:

For reference, my kernel is: Linux 5.17.2-256-tkg-pds-llvm

To Reproduce

## *Should* be sufficient to reproduce, though I had a few extra flags
make modules -j`nproc`          \
    TARGET_ARCH=x86_64         \
    CC=clang    \
    CXX=clang++   \
    LD="/usr/bin/ld.lld"     \
    AR="/usr/bin/llvm-ar"     \
    OBJCOPY="/usr/bin/llvm-objcopy"

## My flags:
make modules -j`nproc`          \
    TARGET_ARCH=x86_64         \
    CC=clang    \
    CXX=clang++   \
    LD="/usr/bin/ld.lld"     \
    AR="/usr/bin/llvm-ar"     \
    NM="/usr/bin/llvm-nm"     \
    RANLIB="/usr/bin/llvm-ranlib"     \
    STRIP="/usr/bin/llvm-strip"     \
    OBJCOPY="/usr/bin/llvm-objcopy"     \
    CFLAGS="-march=znver2 -mtune=znver2 -O3 -pipe -fno-plt -minline-all-stringops -fexceptions -Wall  -Wp,-D_FORTIFY_SOURCE=2 -Wformat -Werror=format-security -fstack-clash-protection -fcf-protection"     \
    CXXFLAGS="$CFLAGS -Wp,-D_GLIBCXX_ASSERTIONS"     \
    LDFLAGS="-Wl,-O3,--sort-common,--as-needed,-z,relro,-z,now"

Expected behavior

The toolchain and LTO status of the running kernel should be irrelevant to the ability to link and install the open GPU kernel modules. (This isn't an issue with the proprietary modules.)
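To confirm the LTO status of a kernel, one option is to look for `CONFIG_LTO_CLANG_THIN` in its configuration. This is a minimal sketch, assuming the config is available as plain text under `/lib/modules/$(uname -r)/build/.config` (some distributions only provide `/proc/config.gz`); the function name `kernel_uses_thinlto` is made up for illustration.

```shell
# Succeeds if the kernel config file given as "$1" enables Clang ThinLTO.
kernel_uses_thinlto() {
    grep -q '^CONFIG_LTO_CLANG_THIN=y' "$1"
}

# Assumed location of the running kernel's config; adjust if your
# distribution stores it elsewhere.
cfg="/lib/modules/$(uname -r)/build/.config"
if [ -r "$cfg" ] && kernel_uses_thinlto "$cfg"; then
    echo "running kernel was built with Clang ThinLTO"
fi
```

If nothing is printed, either the kernel does not use ThinLTO or the config file is not at the assumed path.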

aritger commented 2 years ago

From the attached build log, I see:

Error: Permission denied
LLVM ERROR: ThinLTO: Can't get a temporary file
PLEASE submit a bug report to https://bugs.llvm.org/ and include the crash backtrace.
Error: Permission denied
LLVM ERROR: ThinLTO: Can't get a temporary file
  LTO [M] /home/neko-san/nvidia/nvidia-all/src/open-gpu-kernel-modules-515.43.04/kernel-open/nvidia-modeset.lto.o
Error: Permission denied
LLVM ERROR: ThinLTO: Can't get a temporary file
PLEASE submit a bug report to https://bugs.llvm.org/ and include the crash backtrace.
/home/neko-san/nvidia/nvidia-all/src/open-gpu-kernel-modules-515.43.04/kernel-open/nvidia-drm/nvidia-drm-gem-nvkms-memory.c:510:43: warning: variable 'nv_nvkms_memory_src' set but not used [-Wunused-but-set-variable]
    const struct nv_drm_gem_nvkms_memory *nv_nvkms_memory_src;
                                          ^
 #0 0x00007fb3519032d5 (/usr/lib/libLLVM-13.so+0xba32d5)
 #1 0x00007fb351900ab6 (/usr/lib/libLLVM-13.so+0xba0ab6)
 #2 0x00007fb3509538e0 (/usr/lib/libc.so.6+0x3e8e0)
 #3 0x00007fb3509a336c (/usr/lib/libc.so.6+0x8e36c)
 #4 0x00007fb350953838 gsignal (/usr/lib/libc.so.6+0x3e838)
 #5 0x00007fb35093d535 abort (/usr/lib/libc.so.6+0x28535)
 #6 0x00007fb35181a408 llvm::report_fatal_error(llvm::Twine const&, bool) (/usr/lib/libLLVM-13.so+0xaba408)
 #7 0x00007fb35181a5be (/usr/lib/libLLVM-13.so+0xaba5be)
 #8 0x00007fb3533139bb (/usr/lib/libLLVM-13.so+0x25b39bb)
 #9 0x00007fb353313a47 (/usr/lib/libLLVM-13.so+0x25b3a47)
#10 0x00007fb353337c1d (/usr/lib/libLLVM-13.so+0x25d7c1d)
#11 0x00007fb353339033 llvm::lto::thinBackend(llvm::lto::Config const&, unsigned int, std::function<std::unique_ptr<llvm::lto::NativeObjectStream, std::default_delete<llvm::lto::NativeObjectStream> > (unsigned int)>, llvm::Module&, llvm::ModuleSummaryIndex const&, llvm::StringMap<std::unordered_set<unsigned long, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<unsigned long> >, llvm::MallocAllocator> const&, llvm::DenseMap<unsigned long, llvm::GlobalValueSummary*, llvm::DenseMapInfo<unsigned long>, llvm::detail::DenseMapPair<unsigned long, llvm::GlobalValueSummary*> > const&, llvm::MapVector<llvm::StringRef, llvm::BitcodeModule, llvm::DenseMap<llvm::StringRef, unsigned int, llvm::DenseMapInfo<llvm::StringRef>, llvm::detail::DenseMapPair<llvm::StringRef, unsigned int> >, std::vector<std::pair<llvm::StringRef, llvm::BitcodeModule>, std::allocator<std::pair<llvm::StringRef, llvm::BitcodeModule> > > >*, std::vector<unsigned char, std::allocator<unsigned char> > const&) (/usr/lib/libLLVM-13.so+0x25d9033)
#12 0x00007fb3533212e1 (/usr/lib/libLLVM-13.so+0x25c12e1)
#13 0x00007fb351882d76 (/usr/lib/libLLVM-13.so+0xb22d76)
#14 0x00007fb35185abbd (/usr/lib/libLLVM-13.so+0xafabbd)
#15 0x00007fb3509a6567 (/usr/lib/libc.so.6+0x91567)
#16 0x00007fb351884b74 (/usr/lib/libLLVM-13.so+0xb24b74)
#17 0x00007fb3509a154d (/usr/lib/libc.so.6+0x8c54d)
#18 0x00007fb350a26b14 __clone (/usr/lib/libc.so.6+0x111b14)
1 warning generated.

I'm not too familiar with LLVM Thin-LTO. Does it normally print the "Can't get a temporary file" error message? It would probably be best to start by investigating that, which doesn't seem like an open-gpu-kernel-modules bug.

Beyond that, let's start with a simpler reproduction: can you confirm whether the minimal reproduction steps you listed trigger the problem? If not, can you incrementally add to CFLAGS, et al., to determine the minimal reproduction configuration?
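The incremental approach suggested here could be scripted roughly as follows. This is only a sketch: `first_failing_flag` and `build_modules` are hypothetical helper names, and whether the module makefiles honor `CFLAGS` passed this way is an assumption carried over from the report.

```shell
# Add flags one at a time and print the first flag whose addition makes the
# build fail. "$1" names a command or function that receives the accumulated
# flag string as its only argument; the remaining arguments are the flags.
first_failing_flag() {
    build=$1; shift
    acc=""
    for flag in "$@"; do
        acc="${acc:+$acc }$flag"
        if ! "$build" "$acc" >/dev/null 2>&1; then
            printf '%s\n' "$flag"
            return 0
        fi
    done
    return 1   # every accumulated combination built successfully
}

# Hypothetical wrapper around the minimal command from the report.
build_modules() {
    make modules -j"$(nproc)" TARGET_ARCH=x86_64 CC=clang CXX=clang++ \
        LD=/usr/bin/ld.lld AR=/usr/bin/llvm-ar \
        OBJCOPY=/usr/bin/llvm-objcopy CFLAGS="$1"
}

# Example invocation (not run here):
#   first_failing_flag build_modules -O3 -pipe -fno-plt -fexceptions
```

Because flags accumulate, the printed flag is the first one at which the build breaks, which may reflect an interaction with earlier flags rather than that flag alone.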

ThisNekoGuy commented 2 years ago

I did so, and the result doesn't seem much different at all (no CFLAGS and the fewest variables, by the way):

. . .
  AR [M]  /home/neko-san/nvidia/nvidia-all/src/open-gpu-kernel-modules-515.43.04/kernel-open/nvidia-drm.o
  LTO [M] /home/neko-san/nvidia/nvidia-all/src/open-gpu-kernel-modules-515.43.04/kernel-open/nvidia-drm.lto.o
Error: Permission denied
LLVM ERROR: ThinLTO: Can't get a temporary file
PLEASE submit a bug report to https://bugs.llvm.org/ and include the crash backtrace.
 #0 0x00007f82ea6602d5 (/usr/lib/libLLVM-13.so+0xba32d5)
 #1 0x00007f82ea65dab6 (/usr/lib/libLLVM-13.so+0xba0ab6)
 #2 0x00007f82e96b08e0 (/usr/lib/libc.so.6+0x3e8e0)
 #3 0x00007f82e970036c (/usr/lib/libc.so.6+0x8e36c)
 #4 0x00007f82e96b0838 gsignal (/usr/lib/libc.so.6+0x3e838)
 #5 0x00007f82e969a535 abort (/usr/lib/libc.so.6+0x28535)
 #6 0x00007f82ea577408 llvm::report_fatal_error(llvm::Twine const&, bool) (/usr/lib/libLLVM-13.so+0xaba408)
 #7 0x00007f82ea5775be (/usr/lib/libLLVM-13.so+0xaba5be)
 #8 0x00007f82ec0709bb (/usr/lib/libLLVM-13.so+0x25b39bb)
 #9 0x00007f82ec070a47 (/usr/lib/libLLVM-13.so+0x25b3a47)
#10 0x00007f82ec094c1d (/usr/lib/libLLVM-13.so+0x25d7c1d)
#11 0x00007f82ec096033 llvm::lto::thinBackend(llvm::lto::Config const&, unsigned int, std::function<std::unique_ptr<llvm::lto::NativeObjectStream, std::default_delete<llvm::lto::NativeObjectStream> > (unsigned int)>, llvm::Module&, llvm::ModuleSummaryIndex const&, llvm::StringMap<std::unordered_set<unsigned long, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<unsigned long> >, llvm::MallocAllocator> const&, llvm::DenseMap<unsigned long, llvm::GlobalValueSummary*, llvm::DenseMapInfo<unsigned long>, llvm::detail::DenseMapPair<unsigned long, llvm::GlobalValueSummary*> > const&, llvm::MapVector<llvm::StringRef, llvm::BitcodeModule, llvm::DenseMap<llvm::StringRef, unsigned int, llvm::DenseMapInfo<llvm::StringRef>, llvm::detail::DenseMapPair<llvm::StringRef, unsigned int> >, std::vector<std::pair<llvm::StringRef, llvm::BitcodeModule>, std::allocator<std::pair<llvm::StringRef, llvm::BitcodeModule> > > >*, std::vector<unsigned char, std::allocator<unsigned char> > const&) (/usr/lib/libLLVM-13.so+0x25d9033)
#12 0x00007f82ec07e2e1 (/usr/lib/libLLVM-13.so+0x25c12e1)
#13 0x00007f82ea5dfd76 (/usr/lib/libLLVM-13.so+0xb22d76)
#14 0x00007f82ea5b7bbd (/usr/lib/libLLVM-13.so+0xafabbd)
#15 0x00007f82e9703567 (/usr/lib/libc.so.6+0x91567)
#16 0x00007f82ea5e1b74 (/usr/lib/libLLVM-13.so+0xb24b74)
#17 0x00007f82e96fe54d (/usr/lib/libc.so.6+0x8c54d)
#18 0x00007f82e9783b14 __clone (/usr/lib/libc.so.6+0x111b14)
  AR [M]  /home/neko-san/nvidia/nvidia-all/src/open-gpu-kernel-modules-515.43.04/kernel-open/nvidia-peermem.o
Error: Permission denied
LLVM ERROR: ThinLTO: Can't get a temporary file
PLEASE submit a bug report to https://bugs.llvm.org/ and include the crash backtrace.
Error: Permission denied
LLVM ERROR: ThinLTO: Can't get a temporary file
  LTO [M] /home/neko-san/nvidia/nvidia-all/src/open-gpu-kernel-modules-515.43.04/kernel-open/nvidia-peermem.lto.o
Error: Permission denied
LLVM ERROR: ThinLTO: Can't get a temporary file
PLEASE submit a bug report to https://bugs.llvm.org/ and include the crash backtrace.
 #0 0x00007f62c80a52d5 (/usr/lib/libLLVM-13.so+0xba32d5)
 #1 0x00007f62c80a2ab6 (/usr/lib/libLLVM-13.so+0xba0ab6)
 #2 0x00007f62c70f58e0 (/usr/lib/libc.so.6+0x3e8e0)
 #3 0x00007f62c714536c (/usr/lib/libc.so.6+0x8e36c)
 #4 0x00007f62c70f5838 gsignal (/usr/lib/libc.so.6+0x3e838)
 #5 0x00007f62c70df535 abort (/usr/lib/libc.so.6+0x28535)
 #6 0x00007f62c7fbc408 llvm::report_fatal_error(llvm::Twine const&, bool) (/usr/lib/libLLVM-13.so+0xaba408)
 #7 0x00007f62c7fbc5be (/usr/lib/libLLVM-13.so+0xaba5be)
 #8 0x00007f62c9ab59bb (/usr/lib/libLLVM-13.so+0x25b39bb)
 #9 0x00007f62c9ab5a47 (/usr/lib/libLLVM-13.so+0x25b3a47)
#10 0x00007f62c9ad9c1d (/usr/lib/libLLVM-13.so+0x25d7c1d)
#11 0x00007f62c9adb033 llvm::lto::thinBackend(llvm::lto::Config const&, unsigned int, std::function<std::unique_ptr<llvm::lto::NativeObjectStream, std::default_delete<llvm::lto::NativeObjectStream> > (unsigned int)>, llvm::Module&, llvm::ModuleSummaryIndex const&, llvm::StringMap<std::unordered_set<unsigned long, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<unsigned long> >, llvm::MallocAllocator> const&, llvm::DenseMap<unsigned long, llvm::GlobalValueSummary*, llvm::DenseMapInfo<unsigned long>, llvm::detail::DenseMapPair<unsigned long, llvm::GlobalValueSummary*> > const&, llvm::MapVector<llvm::StringRef, llvm::BitcodeModule, llvm::DenseMap<llvm::StringRef, unsigned int, llvm::DenseMapInfo<llvm::StringRef>, llvm::detail::DenseMapPair<llvm::StringRef, unsigned int> >, std::vector<std::pair<llvm::StringRef, llvm::BitcodeModule>, std::allocator<std::pair<llvm::StringRef, llvm::BitcodeModule> > > >*, std::vector<unsigned char, std::allocator<unsigned char> > const&) (/usr/lib/libLLVM-13.so+0x25d9033)
#12 0x00007f62c9ac32e1 (/usr/lib/libLLVM-13.so+0x25c12e1)
#13 0x00007f62c8024d76 (/usr/lib/libLLVM-13.so+0xb22d76)
#14 0x00007f62c7ffcbbd (/usr/lib/libLLVM-13.so+0xafabbd)
#15 0x00007f62c7148567 (/usr/lib/libc.so.6+0x91567)
#16 0x00007f62c8026b74 (/usr/lib/libLLVM-13.so+0xb24b74)
#17 0x00007f62c714354d (/usr/lib/libc.so.6+0x8c54d)
#18 0x00007f62c71c8b14 __clone (/usr/lib/libc.so.6+0x111b14)
make[3]: *** [scripts/Makefile.build:308: /home/neko-san/nvidia/nvidia-all/src/open-gpu-kernel-modules-515.43.04/kernel-open/nvidia-peermem.lto.o] Error 134
make[3]: *** Waiting for unfinished jobs....
make[3]: *** [scripts/Makefile.build:308: /home/neko-san/nvidia/nvidia-all/src/open-gpu-kernel-modules-515.43.04/kernel-open/nvidia-modeset.lto.o] Error 134
make[3]: *** [scripts/Makefile.build:308: /home/neko-san/nvidia/nvidia-all/src/open-gpu-kernel-modules-515.43.04/kernel-open/nvidia-drm.lto.o] Error 134
make[3]: *** [scripts/Makefile.build:308: /home/neko-san/nvidia/nvidia-all/src/open-gpu-kernel-modules-515.43.04/kernel-open/nvidia.lto.o] Error 134
make[3]: *** [scripts/Makefile.build:308: /home/neko-san/nvidia/nvidia-all/src/open-gpu-kernel-modules-515.43.04/kernel-open/nvidia-uvm.lto.o] Error 134
make[2]: *** [Makefile:1831: /home/neko-san/nvidia/nvidia-all/src/open-gpu-kernel-modules-515.43.04/kernel-open] Error 2
make[2]: Leaving directory '/usr/lib/modules/5.17.2-256-tkg-pds-llvm/build'
make[1]: *** [Makefile:82: modules] Error 2
make[1]: Leaving directory '/home/neko-san/nvidia/nvidia-all/src/open-gpu-kernel-modules-515.43.04/kernel-open'
make: *** [Makefile:50: modules] Error 2

I'm not sure what to make of either the "Can't get a temporary file" or the "Permission denied" messages.
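As far as I can tell, LLVM prints this pair of messages when its ThinLTO cache code cannot create a temporary file in the cache directory, so one thing worth checking is whether the build user can create files in the directories involved. The sketch below probes two candidate directories; which directory the linker actually uses as its cache is an assumption here, and `can_create_file_in` is a hypothetical helper name.

```shell
# Succeeds if the current user can create a file in directory "$1".
can_create_file_in() {
    probe=$(mktemp "$1/probe.XXXXXX" 2>/dev/null) || return 1
    rm -f "$probe"
}

# Candidate directories (assumptions): the current module source tree and
# the build tree of the running kernel.
for dir in "$PWD" "/lib/modules/$(uname -r)/build"; do
    if can_create_file_in "$dir"; then
        echo "writable:     $dir"
    else
        echo "NOT writable: $dir"
    fi
done
```

A "NOT writable" result for a directory the linker writes into would be consistent with the "Permission denied" lines in the log, though it does not prove that is the cause.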

xin201501 commented 2 years ago

Try adding sudo and running it again. I'm not sure this will work, but I had a permission-denied issue as well, though I can't remember the detailed LLVM bug info. I solved the problem using this method.

ptr1337 commented 2 years ago

You can try changing the ThinLTO cache directory with the following patch, which also works for 5.17: https://github.com/ptr1337/kernel-patches/blob/master/5.18/0001-thinlto-cachdir.patch

For me, this fixes compiling the zfs modules with ThinLTO.

ThisNekoGuy commented 2 years ago

> Try adding sudo and running it again. I'm not sure this will work, but I had a permission-denied issue as well, though I can't remember the detailed LLVM bug info. I solved the problem using this method.

Building the modules by hand with sudo (instead of using https://github.com/frogging-family/nvidia-all) got seemingly halfway, then somehow failed the toolchain check: it picked GCC instead of Clang (which I had specified) and failed to build for that reason instead. That is definitely a separate problem, but it is also a blocker for this issue...
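When a toolchain check complains about a mismatch, it can help to confirm which compiler built the running kernel. The sketch below classifies the string in `/proc/version`; the matching rules assume the usual "clang version ..." or "gcc ..." wording in that file, and `kernel_compiler` is a hypothetical helper name.

```shell
# Print "clang", "gcc", or "unknown" for a /proc/version-style string in "$1".
kernel_compiler() {
    case "$1" in
        *"clang version"*) echo clang ;;
        *gcc*|*GCC*)       echo gcc ;;
        *)                 echo unknown ;;
    esac
}

# Classify the running kernel (prints "unknown" if /proc/version is missing).
kernel_compiler "$(cat /proc/version 2>/dev/null)"
```

If this prints "clang" while the module build falls back to GCC, the compiler selection is being lost somewhere between the command line and the kernel's build system.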

This is quite a painful issue :/

ThisNekoGuy commented 2 years ago

> You can try changing the ThinLTO cache directory with the following patch, which also works for 5.17: https://github.com/ptr1337/kernel-patches/blob/master/5.18/0001-thinlto-cachdir.patch
>
> For me, this fixes compiling the zfs modules with ThinLTO.

@ptr1337 I tried this and it made little difference; it seems to progress further(?) but it ultimately fails anyway: nvidia-open-dkms.log

xin201501 commented 2 years ago

> > You can try changing the ThinLTO cache directory with the following patch, which also works for 5.17: https://github.com/ptr1337/kernel-patches/blob/master/5.18/0001-thinlto-cachdir.patch For me, this fixes compiling the zfs modules with ThinLTO.
>
> @ptr1337 I tried this and it made little difference; it seems to progress further(?) but it ultimately fails anyway: nvidia-open-dkms.log

Judging by the bottom lines of the log, it looks much like what I ran into last week, and I have no solution yet: https://bugs.archlinux.org/task/74714

ptr1337 commented 2 years ago

@aritger Actually, llvm-15 fails completely even with the NVIDIA proprietary driver. I opened an issue with LLVM, and several checks need to be adjusted: https://github.com/llvm/llvm-project/issues/55820