NVlabs / instant-ngp

Instant neural graphics primitives: lightning fast NeRF and more
https://nvlabs.github.io/instant-ngp

SIGILL (illegal instruction) on Fedora 35 with NVIDIA GTX 970 #418

Open Vertecedoc4545 opened 2 years ago

Vertecedoc4545 commented 2 years ago

I have Fedora 35 with the NVIDIA drivers from the RPM Fusion repositories (v510). I installed the CUDA toolkit from the official NVIDIA page, added /usr/local/ ...etc /targets/linux/lib to the file /etc/ld.conf.d/cuda.conf, and added the bin folder to the PATH with a script in the profile.d folder. I then followed the build instructions. It built fine, but I got this warning when configuring:

$ cmake . -B build
CMake Warning at dependencies/tiny-cuda-nn/CMakeLists.txt:112 (message):
  Fully fused MLPs do not support GPU architectures of 70 or less.  Falling
  back to CUTLASS MLPs.  Remove GPU architectures 70 and lower to allow
  maximum performance
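
For context on the CMake warning above: it depends on the GPU's compute capability. The GTX 970 is a Maxwell card with compute capability 5.2 (architecture 52), so it falls under the "architectures of 70 or less" case, and the warning text describes a performance fallback rather than a build error. The sketch below only restates the rule as worded in the warning. The function name and the lookup table are hypothetical and are not tiny-cuda-nn's actual CMake logic.

```python
# Hypothetical helper restating the rule from the CMake warning text.
# This is an illustration only, not code from tiny-cuda-nn.

# Assumed lookup table: GPU name -> architecture number (compute capability x 10).
KNOWN_ARCHS = {"NVIDIA GeForce GTX 970": 52}  # compute capability 5.2


def uses_fully_fused_mlp(arch):
    """Per the warning, architectures of 70 or less fall back to CUTLASS MLPs."""
    return arch > 70


if __name__ == "__main__":
    arch = KNOWN_ARCHS["NVIDIA GeForce GTX 970"]
    mode = "fully fused MLPs" if uses_fully_fused_mlp(arch) else "CUTLASS fallback"
    print(f"arch {arch}: {mode}")
```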

and these warnings when compiling:

$ cmake  --build build --config RelWithDebInfo -j 4

Warning #20014-D: calling a __host__ function from a __host__ __device__ function is not allowed
  detected during:
    instantiation of "void Eigen::internal::triangular_solver_selector<Lhs, Rhs, Side, Mode, 0, -1>::run(const Lhs &, Rhs &) [with Lhs=Eigen::Ref<Eigen::Matrix<float, -1, -1, 0, -1, -1>, 0, Eigen::OuterStride<-1>>, Rhs=Eigen::Ref<Eigen::Matrix<float, -1, -1, 0, -1, -1>, 0, Eigen::OuterStride<-1>>, Side=1, Mode=5]"

Then, when I run the program, I get SIGILL.

Debugging with gdb, I get this:

Program received signal SIGILL, Illegal instruction.

0x000000000043e86d in args::ArgumentParser::ArgumentParser(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
Vertecedoc4545 commented 2 years ago

elcatos@192.168.1.9

OS: Fedora Linux 35 (Workstation Edition) x86_64
Host: Precision WorkStation T7500
Kernel: 5.16.18-200.fc35.x86_64
Uptime: 54 mins
Packages: 2158 (rpm), 17 (flatpak)
Shell: fish 3.3.1
Resolution: 1920x1080
DE: GNOME 41.4
WM: Mutter
WM Theme: Adwaita
Theme: Adwaita [GTK2/3]
Icons: Adwaita [GTK2/3]
Terminal: kitty
CPU: Intel Xeon X5560 (4) @ 2.986GHz
GPU: NVIDIA GeForce GTX 970
Memory: 2412MiB / 15970MiB
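
Since the gdb output earlier in this thread points at host code (args::ArgumentParser) rather than a CUDA kernel, one thing that may be worth ruling out is a mismatch between the instructions in the compiled binary and what the CPU supports. The following is a minimal sketch, assuming an x86 Linux system where /proc/cpuinfo contains a "flags" line. The helper names and the list of extensions to check are illustrative assumptions, not something the project prescribes, and a missing flag is only a hint, not proof of the cause.

```python
from pathlib import Path


def parse_cpu_flags(cpuinfo_text):
    """Return the set of CPU feature flags from /proc/cpuinfo-style text (x86 layout)."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def missing_extensions(flags, wanted=("sse4_2", "avx", "avx2", "fma", "f16c")):
    """List the extensions from `wanted` (an assumed, illustrative list) the CPU lacks."""
    return [ext for ext in wanted if ext not in flags]


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")
    if cpuinfo.exists():
        flags = parse_cpu_flags(cpuinfo.read_text())
        print("extensions not reported by this CPU:", missing_extensions(flags))
    else:
        print("/proc/cpuinfo not found; this sketch assumes Linux")
```

If an extension shows up as missing, comparing it against the compiler flags used for the build (for example anything derived from the build machine's native architecture) would be a reasonable next step.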

Vertecedoc4545 commented 2 years ago

cmake version 3.22.2

hradec commented 2 years ago

I'm seeing the same on Arch Linux, with a custom-built gcc 9.3.1 and Python 3. Running testbed under strace, I can see it crashes right after opening libcuda.so:

openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=480823, ...}) = 0
mmap(NULL, 480823, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f27f1cc1000
close(3)                                = 0
openat(AT_FDCWD, "/usr/lib/libcuda.so", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\320~\r\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=22851528, ...}) = 0
close(3)                                = 0
sched_get_priority_max(SCHED_RR)        = 99
sched_get_priority_min(SCHED_RR)        = 1
munmap(0x7f27f1cc1000, 480823)          = 0
brk(0x21f8000)                          = 0x21f8000
openat(AT_FDCWD, "/sys/devices/system/cpu", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 3
fstat(3, {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0
getdents64(3, 0x21d8a00 /* 29 entries */, 32768) = 832
getdents64(3, 0x21d8a00 /* 0 entries */, 32768) = 0
close(3)                                = 0
sched_getaffinity(2847762, 8, [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]) = 8
futex(0x7f27f0359464, FUTEX_WAKE_PRIVATE, 2147483647) = 0
brk(0x2219000)                          = 0x2219000
brk(0x223a000)                          = 0x223a000
brk(0x225b000)                          = 0x225b000
brk(0x2288000)                          = 0x2288000
futex(0x11b0604, FUTEX_WAKE_PRIVATE, 2147483647) = 0
futex(0x7f27f23266fc, FUTEX_WAKE_PRIVATE, 2147483647) = 0
futex(0x7f27f2326708, FUTEX_WAKE_PRIVATE, 2147483647) = 0
brk(0x22a9000)                          = 0x22a9000
--- SIGILL {si_signo=SIGILL, si_code=ILL_ILLOPN, si_addr=0x4407bb} ---
+++ killed by SIGILL (core dumped) +++
Illegal instruction (core dumped)

I'm running NVIDIA driver 510.54 and building with CUDA 11.6.1_510.47.03.

Does the 510.47.03 in the CUDA version 11.6.1_510.47.03 mean it is for driver 510.47.03? So do I have to download a new CUDA SDK every time I update the NVIDIA driver?

This whole separation of driver and sdks that nvidia does is just so very confusing...
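
A note on the version question: in NVIDIA's runfile installer naming, the suffix after the toolkit version is the driver version bundled with that toolkit release, and a newer installed driver is generally expected to keep working with an older toolkit. The sketch below simply splits the label quoted above and compares the two driver versions. The function names are hypothetical, and the rule "installed driver at least as new as the bundled one" is a simplifying assumption, so NVIDIA's CUDA compatibility documentation remains the authority.

```python
def split_cuda_label(label):
    """Split a label like '11.6.1_510.47.03' into (toolkit_version, bundled_driver)."""
    toolkit, _, driver = label.partition("_")
    return toolkit, driver


def version_tuple(version):
    """Turn '510.47.03' into (510, 47, 3) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def driver_at_least(installed, required):
    """True if the installed driver version is >= the required one (assumed rule)."""
    a, b = version_tuple(installed), version_tuple(required)
    width = max(len(a), len(b))
    # Pad with zeros so '510.54' is compared as 510.54.0.
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


if __name__ == "__main__":
    toolkit, bundled = split_cuda_label("11.6.1_510.47.03")
    print("toolkit:", toolkit, "bundled driver:", bundled)
    print("driver 510.54 new enough:", driver_at_least("510.54", bundled))
```

Under that assumption, the 510.54 driver mentioned above would be treated as new enough for this toolkit, so a driver update alone would not call for a new SDK download.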