NVIDIA / open-gpu-kernel-modules

NVIDIA Linux open GPU kernel module source

Linux 5.18 NVIDIA module won't load: ` Missing ENDBR: _portMemAllocatorAllocNonPagedWrapper+0x0/0x10` #256

Closed rnd-ash closed 2 years ago

rnd-ash commented 2 years ago

NVIDIA Open GPU Kernel Modules Version

515.43.04

Does this happen with the proprietary driver (of the same version) as well?

Yes

Operating System and Version

Arch Linux

Kernel Release

5.18.0-arch1-1

Hardware: GPU

RTX 3070 laptop (System 76 Oryx 8)

Describe the bug

Since upgrading to kernel 5.18, loading the open nvidia module (or the proprietary one) fails with the same kernel log:

[    5.429675] nvidia-nvlink: Nvlink Core is being initialized, major device number 510

[    5.429718] traps: Missing ENDBR: _portMemAllocatorAllocNonPagedWrapper+0x0/0x10 [nvidia]
[    5.429816] ------------[ cut here ]------------
[    5.429817] kernel BUG at arch/x86/kernel/traps.c:252!
[    5.429828] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
[    5.429830] CPU: 9 PID: 948 Comm: modprobe Tainted: G           OE     5.18.0-arch1-1 #1 b71a70fe104889aac2f32556bc52f649da2881d2
[    5.429832] Hardware name: System76 Oryx Pro/Oryx Pro, BIOS 2021-09-23_b9b0e89 09/23/2021
[    5.429833] RIP: 0010:exc_control_protection+0xc2/0xd0
[    5.429837] Code: 8b 93 80 00 00 00 be f9 00 00 00 48 c7 c7 d3 ab 66 b5 e8 d1 01 50 ff e9 72 ff ff ff 48 c7 c7 ba ab 66 b5 e8 c7 31 fb ff 0f 0b <0f> 0b 66 66 2e 0f 1f 84 00 00 00 00 00 90 66 0f 1f 00 55 53 48 89
[    5.429838] RSP: 0018:ffffa9c3413b3bb8 EFLAGS: 00010002
[    5.429839] RAX: 000000000000004d RBX: ffffa9c3413b3bd8 RCX: 0000000000000027
[    5.429840] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff9d195fa616a0
[    5.429841] RBP: 0000000000000003 R08: 0000000000000000 R09: ffffa9c3413b39d8
[    5.429842] R10: 0000000000000003 R11: ffffffffb5ecaa08 R12: 0000000000000000
[    5.429842] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[    5.429843] FS:  00007f0aa9bbe740(0000) GS:ffff9d195fa40000(0000) knlGS:0000000000000000
[    5.429844] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    5.429845] CR2: 00007f0aa8382000 CR3: 00000001063ce002 CR4: 0000000000f70ee0
[    5.429846] PKRU: 55555554
[    5.429847] Call Trace:
[    5.429848]  <TASK>
[    5.429849]  asm_exc_control_protection+0x22/0x30
[    5.429852] RIP: 0010:_portMemAllocatorAllocNonPagedWrapper+0x0/0x10 [nvidia]
[    5.429920] Code: 08 48 89 d0 48 89 0f 48 c1 e0 17 48 31 c2 48 89 c8 48 c1 e8 05 48 31 c8 48 31 d0 48 c1 ea 12 48 31 d0 48 89 47 08 01 c8 c3 90 <48> 89 f7 e9 38 0f 00 00 0f 1f 84 00 00 00 00 00 48 89 f7 e9 88 0f
[    5.429921] RSP: 0018:ffffa9c3413b3c80 EFLAGS: 00010202
[    5.429922] RAX: ffffffffc1eae5f0 RBX: 0000000000000010 RCX: 0000000000000000
[    5.429923] RDX: 0000000000000000 RSI: 000000000000002c RDI: ffffffffc20f7b70
[    5.429923] RBP: ffffa9c3413b3c98 R08: 0000000000000020 R09: ffffffffc20f7bf0
[    5.429924] R10: ffffffffc20f55d0 R11: 0000000000000000 R12: ffffffffc20f7b70
[    5.429925] R13: 00007f0aa8382dc0 R14: 000055916224ef30 R15: ffffa9c3413b3e20
[    5.429926]  ? portCryptoPseudoRandomGeneratorGetU32+0x30/0x30 [nvidia 5737a4bc014c2c47af46ebdec30e9ee078e09f14]
[    5.429991]  _portMemAllocatorAlloc+0x2e/0x170 [nvidia 5737a4bc014c2c47af46ebdec30e9ee078e09f14]
[    5.430054]  portCryptoPseudoRandomGeneratorCreate+0x16/0xb0 [nvidia 5737a4bc014c2c47af46ebdec30e9ee078e09f14]
[    5.430117]  portCryptoInitialize+0x2a/0x40 [nvidia 5737a4bc014c2c47af46ebdec30e9ee078e09f14]
[    5.430182]  portInitialize+0x2b/0x40 [nvidia 5737a4bc014c2c47af46ebdec30e9ee078e09f14]
[    5.430246]  coreInitializeRm+0x24/0x90 [nvidia 5737a4bc014c2c47af46ebdec30e9ee078e09f14]
[    5.430324]  RmInitRm+0x9/0x20 [nvidia 5737a4bc014c2c47af46ebdec30e9ee078e09f14]
[    5.430399]  rm_init_rm+0x9/0x10 [nvidia 5737a4bc014c2c47af46ebdec30e9ee078e09f14]
[    5.430472]  nvidia_init_module+0x22e/0x5b0 [nvidia 5737a4bc014c2c47af46ebdec30e9ee078e09f14]
[    5.430517]  ? nvidia_init_module+0x5b0/0x5b0 [nvidia 5737a4bc014c2c47af46ebdec30e9ee078e09f14]
[    5.430565]  nvidia_frontend_init_module+0x50/0x91 [nvidia 5737a4bc014c2c47af46ebdec30e9ee078e09f14]
[    5.430616]  ? nvidia_init_module+0x5b0/0x5b0 [nvidia 5737a4bc014c2c47af46ebdec30e9ee078e09f14]
[    5.430663]  do_one_initcall+0x5a/0x220
[    5.430667]  do_init_module+0x4a/0x240
[    5.430670]  __do_sys_init_module+0x138/0x1b0
[    5.430672]  do_syscall_64+0x5c/0x90
[    5.430674]  ? exc_page_fault+0x74/0x170
[    5.430676]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[    5.430677] RIP: 0033:0x7f0aa9512c3e
[    5.430679] Code: 48 8b 0d 5d b1 0e 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 af 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 2a b1 0e 00 f7 d8 64 89 01 48
[    5.430680] RSP: 002b:00007fff39f3cc58 EFLAGS: 00000246 ORIG_RAX: 00000000000000af
[    5.430681] RAX: ffffffffffffffda RBX: 000055916224ebd0 RCX: 00007f0aa9512c3e
[    5.430682] RDX: 000055916224ef30 RSI: 00000000008f1db0 RDI: 00007f0aa7a91010
[    5.430682] RBP: 00007f0aa7a91010 R08: 000055916224eae0 R09: 0000000000000000
[    5.430683] R10: 0000000000000005 R11: 0000000000000246 R12: 000055916224ef30
[    5.430684] R13: 000055916224ed00 R14: 000055916224ebd0 R15: 000055916224ef60
[    5.430685]  </TASK>
[    5.430685] Modules linked in: pcc_cpufreq(-) nvidia(OE+) acpi_cpufreq(-) bnep bridge stp llc btusb btrtl btbcm btintel uvcvideo btmtk videobuf2_vmalloc bluetooth videobuf2_memops videobuf2_v4l2 videobuf2_common ecdh_generic videodev mc snd_sof_pci_intel_tgl snd_sof_intel_hda_common soundwire_intel soundwire_generic_allocation soundwire_cadence snd_hda_codec_realtek snd_sof_intel_hda snd_hda_codec_generic snd_sof_pci snd_sof_xtensa_dsp snd_sof snd_sof_utils snd_soc_hdac_hda iwlmvm snd_hda_ext_core snd_soc_acpi_intel_match snd_soc_acpi joydev intel_tcc_cooling soundwire_bus mousedev ledtrig_audio mac80211 x86_pkg_temp_thermal intel_powerclamp snd_soc_core coretemp snd_compress ac97_bus kvm_intel libarc4 hid_multitouch snd_hda_codec_hdmi 8250_dw spi_nor mei_pxp snd_pcm_dmaengine mei_hdcp ee1004 mtd i915 iTCO_wdt snd_hda_intel kvm intel_pmc_bxt snd_intel_dspcfg iTCO_vendor_support intel_rapl_msr iwlwifi irqbypass snd_intel_sdw_acpi snd_hda_codec crct10dif_pclmul crc32_pclmul
[    5.430709]  ghash_clmulni_intel snd_hda_core iwlmei vfat aesni_intel processor_thermal_device_pci_legacy processor_thermal_device pmt_telemetry snd_hwdep crypto_simd pmt_class cryptd fat intel_cstate r8169 drm_buddy cfg80211 intel_uncore snd_pcm processor_thermal_rfim realtek psmouse ttm processor_thermal_mbox mei_me snd_timer rfkill pcspkr i2c_i801 mdio_devres processor_thermal_rapl intel_lpss_pci spi_intel_pci intel_rapl_common snd libphy intel_lpss drm_dp_helper spi_intel i2c_smbus soundcore int340x_thermal_zone thunderbolt mei i2c_hid_acpi idma64 intel_gtt intel_vsec intel_soc_dts_iosf i2c_hid intel_hid video intel_scu_pltdrv sparse_keymap system76_acpi mac_hid coreboot_table dm_multipath dm_mod ipmi_devintf ipmi_msghandler crypto_user acpi_call(OE) fuse bpf_preload ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 serio_raw atkbd uas libps2 usb_storage usbhid vivaldi_fmap nvme xhci_pci nvme_core crc32c_intel i8042 xhci_pci_renesas serio
[    5.430736] ---[ end trace 0000000000000000 ]---

To Reproduce

  1. Upgrade to kernel 5.18
  2. Reboot
  3. Observe nvidia module won't load and check kernel logs for the same error

Bug Incidence

Always

nvidia-bug-report.log.gz

nvidia-bug-report.log.gz

More Info

Originally I thought this issue was caused by optimus-manager (I have a hybrid setup and use that utility to switch between Intel and NVIDIA modes), but the same issue occurs after uninstalling optimus-manager.

gauravjuvekar commented 2 years ago

Hi, I couldn't repro this with 5.18.0-arch1-1 from testing. Can you try with nvidia-open-dkms-515.43.04-8 just to be sure that the kernel module was rebuilt with the matching kernel headers?

rnd-ash commented 2 years ago

I was using the nvidia-open-dkms package, had run mkinitcpio multiple times and DKMS said it was installing the open modules for 5.18, so I assume I had matching headers

Downgraded to kernel 5.17.9-arch1-1 and everything works for me

aritger commented 2 years ago

@rnd-ash: Could you experiment to see if the same ENDBR error happens with:

  1. the open kernel modules packaged with the NVIDIA .run file (i.e., install from the .run file with -m=kernel-open)
  2. the closed kernel modules packaged with the NVIDIA .run file (i.e., install from the .run file with -m=kernel)

I'm curious if the problem has something to do with how the open nvidia.ko was built by arch-linux (maybe something about the toolchain used). I think experiments (1) and (2) should help shake that out.

It looks like ENDBR is new in 5.18. I wonder if the problem here only manifests with certain kernel kconfigs. E.g., maybe it requires X86_KERNEL_IBT

rnd-ash commented 2 years ago

From archlinux's config file, I can see that on the problematic kernel version, X86_KERNEL_IBT is enabled here
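For anyone who wants to check their own kernel, here is a minimal sketch of that lookup. The function name `ibt_config_state` is made up for illustration, and it assumes you have the kernel config available as a file (plain text or gzip-compressed).

```shell
#!/bin/sh
# Report whether a kernel config file enables IBT.
# Usage: ibt_config_state <config-file>   (plain text, or gzip if it ends in .gz)
# ibt_config_state is a hypothetical helper name, not part of any package.
ibt_config_state() {
    cfg=$1
    if [ ! -r "$cfg" ]; then
        echo "unknown"
        return 1
    fi
    # Decompress only when the file name says it is gzip-compressed.
    case "$cfg" in
        *.gz) content=$(gzip -cd "$cfg") ;;
        *)    content=$(cat "$cfg") ;;
    esac
    if printf '%s\n' "$content" | grep -q '^CONFIG_X86_KERNEL_IBT=y$'; then
        echo "enabled"
    else
        echo "disabled"
    fi
}
```

On kernels that expose their config at /proc/config.gz (not every distribution does), `ibt_config_state /proc/config.gz` should print `enabled` for an affected kernel.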

I tried to download the .run file from https://us.download.nvidia.com/XFree86/Linux-x86_64/515.43.04/NVIDIA-Linux-x86_64-515.43.04.run, but every time I tried to run it I kept getting installation failed.

However, I switched over to try both the nvidia-open-dkms and nvidia-dkms packages from arch (PKGBUILDs can be seen here and here), and they all result in the same ENDBR error.

atiensivu commented 2 years ago

Does it work if you pass ibt=off on the kernel command line?

rnd-ash commented 2 years ago

> Does it work if you pass the kernel ' ibt=off' ?

Just tried it, it does!

danrbball1 commented 2 years ago

ibt=off also works for me from grub. I am running Arcolinux on a Dell XPS 9520 (NVIDIA 3050 and 16 GB ram). I am also running NVIDIA Prime. Any idea what the issue may be?

QuestionMark001 commented 2 years ago

GPU: NVIDIA RTX 3060 laptop. Driver version: closed NVIDIA driver 515.43.04. I also faced this problem. After updating the kernel to Linux 5.18, boot displays "Failed start to Linux Kernel".

QuestionMark001 commented 2 years ago

> Does it work if you pass the kernel ' ibt=off' ?

Perfect fix😉

Six6pounder commented 2 years ago

Same issue here. Kernel: 5.18 - GPU Driver: NVIDIA 515.43.04 - Rtx 3080 desktop

@atiensivu thank you, what does ibt=off do? It boots if I use it

edjubert commented 2 years ago

I also faced this problem. Kernel: 5.18.0-arch1-1 GPU Driver: NVIDIA 515.43.04 - RTX 3070 (laptop)

Optimus-manager failed. Adding ibt=off to bootloader (grub for me) fixed it
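For GRUB users, a rough sketch of that change could look like the following. It assumes the usual `GRUB_CMDLINE_LINUX_DEFAULT="..."` line with double quotes (as in a stock /etc/default/grub); `add_ibt_off` is a hypothetical helper name, and it is safest to run it on a copy of the file first.

```shell
#!/bin/sh
# Append ibt=off to GRUB_CMDLINE_LINUX_DEFAULT in the given file,
# unless the parameter is already present.
# add_ibt_off is a hypothetical helper; it assumes a double-quoted value.
add_ibt_off() {
    file=$1
    # Already there (preceded and followed by a quote or whitespace)? Do nothing.
    if grep -Eq '^GRUB_CMDLINE_LINUX_DEFAULT=.*["[:space:]]ibt=off["[:space:]]' "$file"; then
        return 0
    fi
    # Insert " ibt=off" just before the closing quote of the value.
    sed 's/^\(GRUB_CMDLINE_LINUX_DEFAULT="[^"]*\)"/\1 ibt=off"/' "$file" > "$file.new" &&
        mv "$file.new" "$file"
}
```

After editing the real /etc/default/grub (as root), the GRUB config still has to be regenerated, e.g. with `grub-mkconfig -o /boot/grub/grub.cfg`, before the parameter takes effect on the next boot.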

kv-y commented 2 years ago

> what does ibt=off do?

Indirect Branch Tracking

Add support for Intel CET-IBT (Indirect Branch Tracking), a hardware-supported, coarse-grained, forward-edge Control Flow Integrity protection. It enforces that all indirect calls must land on an ENDBR instruction; as such, the compiler will instrument the code with them to make this happen.

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=7001052160d172f6de06adeffde24dde9935ece8
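As a rough illustration of what the trap complains about: in an oops, the `Code:` line marks the byte at the faulting RIP with angle brackets, and `endbr64` encodes as `f3 0f 1e fa`. The sketch below (the function name `starts_with_endbr64` is made up) checks whether the bytes at RIP begin with that encoding.

```shell
#!/bin/sh
# Return 0 if the bytes at the faulting RIP in an oops "Code:" line start
# with endbr64 (f3 0f 1e fa), 1 otherwise (including when no byte is marked).
# starts_with_endbr64 is a hypothetical helper name.
starts_with_endbr64() {
    line=$1
    # Keep everything from the <xx>-marked byte onward and drop the brackets.
    at_rip=$(printf '%s\n' "$line" |
        sed -n 's/.*<\([0-9a-f][0-9a-f]\)>\(.*\)/\1\2/p')
    case "$at_rip" in
        "f3 0f 1e fa"*) return 0 ;;
        *) return 1 ;;
    esac
}
```

Fed the `Code:` line reported for `_portMemAllocatorAllocNonPagedWrapper+0x0` in the original report, this returns 1: the function starts with `48 89 f7 e9 ...` rather than `endbr64`, which is what the "Missing ENDBR" message points at.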

Tudmotu commented 2 years ago

Does anyone know how to add this parameter when using EFISTUB?

Edit: Downgraded my kernel for now. I found this issue after my boot was hanging on "start job is running for Load Kernel Modules". To downgrade:

ocelik94 commented 2 years ago

same goes for me

Kernel: 5.18.0-arch1-1 GPU Driver: NVIDIA 515.43.04 - RTX 2080

mahancoder commented 2 years ago

Happens to me too Kernel: 5.18.0-arch1-1 Driver: NVIDIA 515.43.04 GPU: MX450

Setting ibt=off fixes the issue temporarily, but cannot be considered a full solution.

CryptLabs commented 2 years ago

I can confirm, I have the same issue. Arch Linux 5.18.0-zen1-1-zen Nvidia RTX 5000

How can we fix this issue?

mahancoder commented 2 years ago

> How can we fix this issue?

@CryptLabs You can temporarily fix the issue by adding ibt=off to your kernel command line parameters

EugeneKorshenko commented 2 years ago

I can confirm the same issue on my laptop.

Arch Linux 5.18.0-arch1-1 Driver Version: 515.43.04

RTX 3070 Laptop

CryptLabs commented 2 years ago

@mahancoder I have used ibt=off. However, as you said, I feel that this is not a good solution.

domino14 commented 2 years ago

please fix

edjubert commented 2 years ago

I just installed latest nvidia dkms drivers (515.43.04-2) and the fix does not work anymore

m1guelperez commented 2 years ago

5.18.0-arch1-1 nvidia-dkms 515.43.04-2 Nvidia GTX1080

Same problem here: the latest NVIDIA driver broke my system. I was stuck on "Reached target Graphical Interface" and received several errors. The only way to interact with the system was CTRL+ALT+F2.

What fixed the issue: either uninstall all NVIDIA packages or pass the ibt=off flag in the kernel parameters.

> I just installed latest nvidia drivers (515.43.04-6) and the fix does not work anymore

Did you try to remove the ibt=off flag when using the latest Nvidia driver?

codicocodes commented 2 years ago

I had the same issue this morning when updating the drivers. It seems ibt=off was removed from my kernel options by my latest update; when I added it back, the drivers started working again.
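A quick way to confirm whether the parameter survived an update is to look at the command line the running kernel was booted with. A minimal sketch, with the hypothetical helper name `cmdline_has_ibt_off`:

```shell
#!/bin/sh
# Return 0 if a kernel command line string contains ibt=off as a whole
# parameter, 1 otherwise. cmdline_has_ibt_off is a hypothetical helper name.
cmdline_has_ibt_off() {
    # Pad with spaces so the match only hits complete, space-separated words.
    case " $1 " in
        *" ibt=off "*) return 0 ;;
        *) return 1 ;;
    esac
}
```

For example: `cmdline_has_ibt_off "$(cat /proc/cmdline)" && echo "booted with ibt=off"`.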

rnd-ash commented 2 years ago

There appears to be an open bug now on Archlinux about this issue https://bugs.archlinux.org/task/74891

edjubert commented 2 years ago

@m1guelperez yes, the first thing I did was remove it and reboot, but it is still not working.

Also, I'm not sure it's related, but even with the nvidia drivers loading properly, HDMI does not seem to work with the workaround (my HDMI is wired to my GPU)

mtijanic commented 2 years ago

The following patch will insert the necessary endbr64 instructions:

diff --git a/src/nvidia-modeset/Makefile b/src/nvidia-modeset/Makefile
index c63b86b..69490d0 100644
--- a/src/nvidia-modeset/Makefile
+++ b/src/nvidia-modeset/Makefile
@@ -95,7 +95,6 @@ CFLAGS += -ffunction-sections
 CFLAGS += -fdata-sections
 CFLAGS += -ffreestanding

-CONDITIONAL_CFLAGS := $(call TEST_CC_ARG, -fcf-protection=none)
 CONDITIONAL_CFLAGS += $(call TEST_CC_ARG, -Wformat-overflow=2)
 CONDITIONAL_CFLAGS += $(call TEST_CC_ARG, -Wformat-truncation=1)
 ifeq ($(TARGET_ARCH),x86_64)
diff --git a/src/nvidia/Makefile b/src/nvidia/Makefile
index 9bdb826..cc05ab7 100644
--- a/src/nvidia/Makefile
+++ b/src/nvidia/Makefile
@@ -119,8 +119,6 @@ CFLAGS += -fdata-sections
 NV_KERNEL_O_LDFLAGS += --gc-sections
 EXPORTS_LINK_COMMAND = exports_link_command.txt

-CONDITIONAL_CFLAGS += $(call TEST_CC_ARG, -fcf-protection=none)
-
 ifeq ($(TARGET_ARCH),x86_64)
   CONDITIONAL_CFLAGS += $(call TEST_CC_ARG, -mindirect-branch-register)
   CONDITIONAL_CFLAGS += $(call TEST_CC_ARG, -mindirect-branch=thunk-extern)

Is there anyone facing these problems that can try rebuilding the modules with the patch and report back?

I'm not sure why the -fcf-protection=none is there in the first place, but I expect it was an attempt to minimize the code size.

m1guelperez commented 2 years ago

> @m1guelperez yes, the first thing I've done is to remove and reboot but still not working.
>
> Also, I'm not sure it's related, but even with nvidia drivers loading properly, HDMI does not seems to work with the workaround (my HDMI is wired to my GPU)

Hmm, I can't help you there since I use DP. But I will definitely wait with any updates for now. 😄

TheBakerCat commented 2 years ago

I can confirm the issue.

i5-11400h + RTX 3050ti laptop nvidia-dkms 515.43.04-2 + 5.18.zen1-1

ibt=off fixes the issue

TheBakerCat commented 2 years ago

I tried rebuilding and installing the modules with the patch @mtijanic posted above (the one removing -fcf-protection=none from src/nvidia-modeset/Makefile and src/nvidia/Makefile), but the problem still exists.

ibt=off still allows you to boot normally

May 30 20:52:25 laptop kernel: ---[ end trace 0000000000000000 ]---
May 30 20:52:25 laptop kernel: Modules linked in: nvidia(OE+) i915 intel_gtt drm_buddy video drm_dp_helper ttm btrfs blake2b_generic libcrc32c crc32c_generic crc32c_intel xor raid6_pq
May 30 20:52:25 laptop kernel:  </TASK>
May 30 20:52:25 laptop kernel: R13: 000055a5562dab40 R14: 000055a5562dacb0 R15: 000055a5562dd860
May 30 20:52:25 laptop kernel: R10: 0000000000000003 R11: 0000000000000246 R12: 000055a5562db4d0
May 30 20:52:25 laptop kernel: RBP: 0000000000060000 R08: 0000000000000000 R09: 00007ffd6869c880
May 30 20:52:25 laptop kernel: RDX: 0000000000000000 RSI: 000055a5562db4d0 RDI: 0000000000000003
May 30 20:52:25 laptop kernel: RAX: ffffffffffffffda RBX: 000055a5562dacb0 RCX: 00007fae296df67d
May 30 20:52:25 laptop kernel: RSP: 002b:00007ffd6869c748 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
May 30 20:52:25 laptop kernel: Code: 5d c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d eb 26 0f >
May 30 20:52:25 laptop kernel: RIP: 0033:0x7fae296df67d
May 30 20:52:25 laptop kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xae
May 30 20:52:25 laptop kernel:  ? do_syscall_64+0x6b/0x90
May 30 20:52:25 laptop kernel:  ? syscall_exit_to_user_mode+0x26/0x50
May 30 20:52:25 laptop kernel:  ? __x64_sys_lseek+0x6d/0xc0
May 30 20:52:25 laptop kernel:  do_syscall_64+0x5c/0x90
May 30 20:52:25 laptop kernel:  __x64_sys_finit_module+0xc1/0x130
May 30 20:52:25 laptop kernel:  do_init_module+0x4a/0x240
May 30 20:52:25 laptop kernel:  do_one_initcall+0x118/0x2d0
May 30 20:52:25 laptop kernel:  nvidia_frontend_init_module+0x50/0x91 [nvidia 416207d86ba54fc0cbe32354e28a12d664d17d2d]
May 30 20:52:25 laptop kernel:  ? nvidia_init_module+0x627/0x627 [nvidia 416207d86ba54fc0cbe32354e28a12d664d17d2d]
May 30 20:52:25 laptop kernel:  nvidia_init_module+0x22e/0x627 [nvidia 416207d86ba54fc0cbe32354e28a12d664d17d2d]
May 30 20:52:25 laptop kernel:  rm_init_rm+0x9/0x10 [nvidia 416207d86ba54fc0cbe32354e28a12d664d17d2d]
May 30 20:52:25 laptop kernel:  RmInitRm+0x9/0x20 [nvidia 416207d86ba54fc0cbe32354e28a12d664d17d2d]
May 30 20:52:25 laptop kernel:  coreInitializeRm+0x24/0x90 [nvidia 416207d86ba54fc0cbe32354e28a12d664d17d2d]
May 30 20:52:25 laptop kernel:  portInitialize+0x2b/0x40 [nvidia 416207d86ba54fc0cbe32354e28a12d664d17d2d]
May 30 20:52:25 laptop kernel:  portCryptoInitialize+0x2a/0x40 [nvidia 416207d86ba54fc0cbe32354e28a12d664d17d2d]
May 30 20:52:25 laptop kernel:  portCryptoPseudoRandomGeneratorCreate+0x16/0xb0 [nvidia 416207d86ba54fc0cbe32354e28a12d664d17d2d]
May 30 20:52:25 laptop kernel:  ? nvidia_init_module+0x627/0x627 [nvidia 416207d86ba54fc0cbe32354e28a12d664d17d2d]
May 30 20:52:25 laptop kernel:  _portMemAllocatorAlloc+0x2e/0x170 [nvidia 416207d86ba54fc0cbe32354e28a12d664d17d2d]
May 30 20:52:25 laptop kernel:  ? portCryptoPseudoRandomGeneratorGetU32+0x30/0x30 [nvidia 416207d86ba54fc0cbe32354e28a12d664d17d2d]
May 30 20:52:25 laptop kernel: R13: 0000000000000000 R14: ffffb4d640687ca6 R15: 0000000000000000
May 30 20:52:25 laptop kernel: R10: ffffffffc0d8d610 R11: 0000000000000000 R12: ffffffffc0d8fbb0
May 30 20:52:25 laptop kernel: RBP: ffffb4d640687b70 R08: 0000000000000020 R09: ffffffffc0d8fc30
May 30 20:52:25 laptop kernel: RDX: 0000000000000000 RSI: 000000000000002c RDI: ffffffffc0d8fbb0
May 30 20:52:25 laptop kernel: RAX: ffffffffc0b46280 RBX: 0000000000000010 RCX: 0000000000000000
May 30 20:52:25 laptop kernel: RSP: 0018:ffffb4d640687b58 EFLAGS: 00010202
May 30 20:52:25 laptop kernel: Code: 08 48 89 d0 48 89 0f 48 c1 e0 17 48 31 c2 48 89 c8 48 c1 e8 05 48 31 c8 48 31 d0 48 c1 ea 12 48 31 d0 48 89 47 08 01 c8 c3 90 <48> 89 f7 e9 38 0f 00 00 0f 1f 84 00 00 00 00 >
May 30 20:52:25 laptop kernel: RIP: 0010:_portMemAllocatorAllocNonPagedWrapper+0x0/0x10 [nvidia]
May 30 20:52:25 laptop kernel:  asm_exc_control_protection+0x22/0x30
May 30 20:52:25 laptop kernel:  <TASK>
May 30 20:52:25 laptop kernel: Call Trace:
May 30 20:52:25 laptop kernel: PKRU: 55555554
May 30 20:52:25 laptop kernel: CR2: 000055a5562dd000 CR3: 00000001032da005 CR4: 0000000000f70ee0
May 30 20:52:25 laptop kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
May 30 20:52:25 laptop kernel: FS:  00007fae295d1740(0000) GS:ffff9e5d60300000(0000) knlGS:0000000000000000
May 30 20:52:25 laptop kernel: R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
May 30 20:52:25 laptop kernel: R10: ffffffffac25aa20 R11: ffff9e59c116f600 R12: 0000000000000000
May 30 20:52:25 laptop kernel: RBP: 0000000000000003 R08: 0000000000000001 R09: 00000000ffffffea
May 30 20:52:25 laptop kernel: RDX: 0000000000000000 RSI: 00000000ffffefff RDI: 0000000000000003
May 30 20:52:25 laptop kernel: RAX: 000000000000004d RBX: ffffb4d640687aa8 RCX: 0000000000000000
May 30 20:52:25 laptop kernel: RSP: 0018:ffffb4d640687a88 EFLAGS: 00010002
May 30 20:52:25 laptop kernel: Code: 8b 93 80 00 00 00 be f9 00 00 00 48 c7 c7 5e 8c a6 ab e8 71 80 30 ff e9 72 ff ff ff 48 c7 c7 45 8c a6 ab e8 35 2f fa ff 0f 0b <0f> 0b 66 66 2e 0f 1f 84 00 00 00 00 00 90 66 >
May 30 20:52:25 laptop kernel: RIP: 0010:exc_control_protection+0xc2/0xd0
May 30 20:52:25 laptop kernel: Hardware name: Acer Nitro AN515-57/Scala_TLS, BIOS V1.11 09/28/2021
May 30 20:52:25 laptop kernel: CPU: 4 PID: 191 Comm: modprobe Tainted: G        W  OE     5.18.0-zen1-1-zen #1 8c1b4772d057e8d6ef1ec6c49ac9700bcd2a2e4e
May 30 20:52:25 laptop kernel: invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
May 30 20:52:25 laptop kernel: kernel BUG at arch/x86/kernel/traps.c:252!
May 30 20:52:25 laptop kernel: ------------[ cut here ]------------
May 30 20:52:25 laptop kernel: traps: Missing ENDBR: _portMemAllocatorAllocNonPagedWrapper+0x0/0x10 [nvidia]
May 30 20:52:25 laptop kernel: 
May 30 20:52:25 laptop kernel: nvidia-nvlink: Nvlink Core is being initialized, major device number 236

loqs commented 2 years ago

@TheBakerCat please try the following patch from https://bugs.archlinux.org/task/74886#comment208651

diff --git a/src/nvidia-modeset/Makefile b/src/nvidia-modeset/Makefile
index c63b86b..1e92bb0 100644
--- a/src/nvidia-modeset/Makefile
+++ b/src/nvidia-modeset/Makefile
@@ -95,7 +95,8 @@ CFLAGS += -ffunction-sections
 CFLAGS += -fdata-sections
 CFLAGS += -ffreestanding

-CONDITIONAL_CFLAGS := $(call TEST_CC_ARG, -fcf-protection=none)
+CONDITIONAL_CFLAGS := $(call TEST_CC_ARG, -fcf-protection=branch -mindirect-branch-register)
+CONDITIONAL_CFLAGS += $(call TEST_CC_ARG, -mharden-sls=all)
 CONDITIONAL_CFLAGS += $(call TEST_CC_ARG, -Wformat-overflow=2)
 CONDITIONAL_CFLAGS += $(call TEST_CC_ARG, -Wformat-truncation=1)
 ifeq ($(TARGET_ARCH),x86_64)
diff --git a/src/nvidia/Makefile b/src/nvidia/Makefile
index 9bdb826..3f1e330 100644
--- a/src/nvidia/Makefile
+++ b/src/nvidia/Makefile
@@ -119,7 +119,8 @@ CFLAGS += -fdata-sections
 NV_KERNEL_O_LDFLAGS += --gc-sections
 EXPORTS_LINK_COMMAND = exports_link_command.txt

-CONDITIONAL_CFLAGS += $(call TEST_CC_ARG, -fcf-protection=none)
+CONDITIONAL_CFLAGS += $(call TEST_CC_ARG, -fcf-protection=branch -mindirect-branch-register)
+CONDITIONAL_CFLAGS += $(call TEST_CC_ARG, -mharden-sls=all)

 ifeq ($(TARGET_ARCH),x86_64)
   CONDITIONAL_CFLAGS += $(call TEST_CC_ARG, -mindirect-branch-register)
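
If the patch seems to have no effect, it may be worth confirming that it was really applied to the tree being built. A small sketch, assuming it is run from the root of the source tree; `cf_protection_patched` is a hypothetical helper name:

```shell
#!/bin/sh
# Return 0 if none of the given Makefiles still passes -fcf-protection=none;
# otherwise name the first offending file and return 1.
# cf_protection_patched is a hypothetical helper name.
cf_protection_patched() {
    for mk in "$@"; do
        # -e keeps grep from treating the leading dash as an option.
        if grep -q -e '-fcf-protection=none' "$mk"; then
            echo "still disabled in $mk"
            return 1
        fi
    done
    return 0
}
```

Usage, with the two paths touched by the diff: `cf_protection_patched src/nvidia/Makefile src/nvidia-modeset/Makefile`.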

jimbo2150 commented 2 years ago

Patch works for me without the ibt=off parameter (GTX 1650).

benthetechguy commented 2 years ago

This is the beauty of open source. The exact cause and fix of the issue was found by the community, and even if NVIDIA for some reason refused to merge it, distros could just include the patch in their packages. If this was still the proprietary driver only, we would all have to include ibt=off until someone at NVIDIA found the issue and incorporated it into the next release.

TheBakerCat commented 2 years ago

I tried the patch from https://bugs.archlinux.org/task/74886#comment208651 that @loqs posted above. Maybe I'm doing something wrong, but it still doesn't work for me.

marcSoda commented 2 years ago

ibt=off works for me, although my device still blocks on boot while loading nvidia drivers some of the time. Hopefully an update will come out soon.

xps15 9510

txangel commented 2 years ago

ibt=off also works for me, placed in this order relative to the IOMMU setup for PCI passthrough to VMs: BOOT_IMAGE=<XXXXX> root=<XXXXX> rw intel_iommu=on vfio-pci.ids=10de:0e1a nvidia-drm.modeset=1 ibt=off loglevel=3

I am not sure the order matters much, but I placed it right after all the passthrough flags.

This is running: Linux 5.18.1, 12th Gen Intel 12-129000K, GPU 1: NVIDIA GeForce GTX 780 (passed through to VMs), GPU 2: NVIDIA GeForce GTX 1080 (running the main system)

Redekian commented 2 years ago

The patch worked for me

I'm running: Linux 5.18.1 Intel Core i3-12100F NVIDIA GeForce RTX 3050

mtijanic commented 2 years ago

Tracking internally as bug 3665573

atiensivu commented 2 years ago

> Same issue here. Kernel: 5.18 - GPU Driver: NVIDIA 515.43.04 - Rtx 3080 desktop
>
> @atiensivu thank you, what does ibt=off do? It boots if I use it

Turns off IBT at 'run-time' for scenarios like this. Ideal fix is to make the driver work with IBT, but in the meantime, to get things working, this is a good workaround.

theGeekyLad commented 2 years ago

I see an update to linux-5.18.1.arch1-1 and nvidia-515.43.04-7. Can anyone confirm if it has been fixed in 5.18.1?

ekaradon commented 2 years ago

Not tested but it seems so, see here: https://bugs.archlinux.org/task/74886#comment208658

Redekian commented 2 years ago

Not working for me on 5.18.1

fpeterek commented 2 years ago

I have updated to the latest kernel and drivers, but the issue has not been fixed, at least on my machine. I still have to disable ibt when booting the system.

I use an RTX3070 and an i5-12600KF.

❯ pacman -Q nvidia
nvidia 515.43.04-7
❯ pacman -Q linux
linux 5.18.1.arch1-1

haselwarter commented 2 years ago

Issue persists on 5.18.1 (archlinux), with either the nvidia or nvidia-open drivers installed. Irritatingly, the module "fails to load" even when no Nvidia GPU is connected.

ekaradon commented 2 years ago

For Arch Linux: I misread the issue. It should be fixed in nvidia-515.43.04-8, which isn't published yet: https://archlinux.org/packages/extra/x86_64/nvidia/ You can still try it by downloading the package from the issue and installing it from there.

AbbasMZ commented 2 years ago

Assuming nvidia-open-515.43.04-9 includes the mentioned fix, I just tried it with linux 5.18.1 and the issue is still there for me.

I use Arch with a laptop that has i7-1165G7 and the nvidia driver is for an external gpu which is not currently connected.

Edit: Using the updated nvidia-open driver resolved the issue.

xsrvmy commented 2 years ago

Regarding the ibt=off workaround: Does setting this flag make the kernel less secure than 5.17? Or was IBT only added in 5.18?

JaroDevelop commented 2 years ago

> For Archlinux, I've misread the issue, it should be fixed in the nvidia-515.43.04-8 version. Which isn't yet published: https://archlinux.org/packages/extra/x86_64/nvidia/ But you can still try it by downloading the package on the issue and install it from there.

Are you referring to the NVIDIA LTS branch? I think I found the -8 version you were talking about. Also, -7, -8 and -9 just refer to the nvidia, nvidia-lts and nvidia-open repositories: 515.43.04-7 being the "nvidia" repository, 515.43.04-8 being nvidia-lts and 515.43.04-9 being nvidia-open. I noticed this on page 19 of the official repo listing: https://archlinux.org/packages/?page=19&repo=Extra

Edit: Just making sure my thinking is right here, since I got a thumbs down, but it seems to me that -7, -8 and -9 are just a naming scheme for the three different repositories.

Edit #2: It seems that the -8 (LTS) version is NOT the same as nvidia-open-515.43.04-8 and, correct me if I am wrong, I think -9 is not the updated version of that branch either. I think you have to install the exact nvidia-open-515.43.04-8 from https://bugs.archlinux.org/task/74886#comment208651 to get the fix.

TheBakerCat commented 2 years ago

> I see an update to linux-5.18.1.arch1-1 and nvidia-515.43.04-7. Can anyone confirm if it has been fixed in 5.18.1?

I can confirm that my problem was really solved with the patch from here https://github.com/NVIDIA/open-gpu-kernel-modules/issues/256#issuecomment-1141350315 and upgrade to 5.18.1-zen

latest version of nvidia-dkms still doesn't work without ibt=off

> Assuming nvidia-open-515.43.04-9 includes the mentioned fix, I just tried it with linux 5.18.1 and the issue is still there for me.
>
> I use Arch on Lenovo x1-carbon gen-9 and the nvidia driver is for an external gpu which is not currently connected.

I also tried updating to the latest nvidia-dkms(nvidia-dkms 515.48.07-1) and nvidia-utils packages available in the testing repository, but system still won't boot without ibt=off

JaroDevelop commented 2 years ago

~~I can confirm that my problem was completely solved with the patch from here https://github.com/NVIDIA/open-gpu-kernel-modules/issues/256#issuecomment-1141350315 Works with 5.18-arch1-1 !!!~~

Make sure you download the file at the VERY bottom of this page: https://bugs.archlinux.org/task/74886#comment208651 Extract it into your home folder, then in the terminal:

cd nvidia-open

makepkg -g >> PKGBUILD

makepkg -si

I posted the terminal commands because I had no idea how to apply the patch; they are for normies like me who haven't dabbled in patching packages. EDIT: I forgot the -si; that's what makes it install.

Edit 2: I think installing the patch this way should work; the ArchTitus script may just have messed up my installing nvidia-open?

Edit 3 - Is a common outcome with this issue your Nvidia card not recognizing nvidia-open as a compatible driver at all? Or is it just the ArchTitus script causing that? This Picture is the result after I installed the custom patch IMG_20220604_170048412.jpg

bxkx commented 2 years ago

the -x numbering is an Arch specific package version bump, see https://wiki.archlinux.org/title/Arch_package_guidelines#Package_versioning In the case of Nvidia drivers, this often happens whenever there's a new kernel version. You can check under "View changes", for example:

515.43.04-6: linux 5.18.arch1-1
515.43.04-5: linux 5.17.9.arch1-1
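
To make the split concrete, here is a small sketch that separates such a version string into the upstream driver version (pkgver) and the Arch release number (pkgrel). `split_arch_version` is a hypothetical helper name, and the sketch ignores the optional epoch prefix that Arch versions can carry.

```shell
#!/bin/sh
# Split an Arch package version such as "515.43.04-6" into the upstream
# version (pkgver) and the Arch-specific release number (pkgrel).
# split_arch_version is a hypothetical helper; epochs ("1:...") are not handled.
split_arch_version() {
    full=$1
    # ${full%-*} drops the last "-suffix"; ${full##*-} keeps only that suffix.
    printf 'pkgver=%s pkgrel=%s\n' "${full%-*}" "${full##*-}"
}
```

So `split_arch_version 515.43.04-6` and `split_arch_version 515.43.04-5` report the same pkgver and differ only in pkgrel, i.e. the same driver release rebuilt for a different kernel.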