NVIDIA / open-gpu-kernel-modules

NVIDIA Linux open GPU kernel module source

NULL pointer dereference in GrabOwnership+0x4/0x40 #265

Open YusufKhan-gamedev opened 2 years ago

YusufKhan-gamedev commented 2 years ago

NVIDIA Open GPU Kernel Modules Version

ce3d74ff6b49f7ec0e5e0aa44417f668b0f7189b

Does this happen with the proprietary driver (of the same version) as well?

I cannot test this

Operating System and Version

Description: Fedora release 36 (Thirty Six)

Kernel Release

Linux fedora 5.17.9-300.fc36.x86_64 #1 SMP PREEMPT Wed May 18 15:08:23 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

Hardware: GPU

It's an RTX 2060 from GIGABYTE. I am not going to install the proprietary tool that is suggested.

Describe the bug


[ 5.048788] ACPI: [Firmware Bug]: Invalid BIOS _PSS frequency found for processor 7: 0x80000000 MHz
[ 5.048789] ACPI: [Firmware Bug]: Invalid BIOS _PSS frequency found for processor 7: 0x80000000 MHz
[ 5.048789] ACPI: [Firmware Bug]: Invalid BIOS _PSS frequency found for processor 7: 0x80000000 MHz
[ 5.048790] ACPI: [Firmware Bug]: Invalid BIOS _PSS frequency found for processor 7: 0x80000000 MHz
[ 5.048790] ACPI: [Firmware Bug]: Invalid BIOS _PSS frequency found for processor 7: 0x80000000 MHz
[ 5.048791] ACPI: [Firmware Bug]: No valid BIOS _PSS frequency found for processor 7
[ 5.048791] ACPI: [Firmware Bug]: BIOS needs update for CPU frequency support
[ 5.696161] nvidia-gpu 0000:01:00.3: i2c timeout error e0000000
[ 5.696165] ucsi_ccg 0-0008: i2c_transfer failed -110
[ 5.696166] ucsi_ccg 0-0008: ucsi_ccg_init failed - -110
[ 5.696168] ucsi_ccg: probe of 0-0008 failed with error -110
[ 5.711771] kauditd_printk_skb: 136 callbacks suppressed
[ 5.711772] audit: type=1130 audit(1653611711.576:145): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=systemd-udev-settle comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[ 5.751815] audit: type=1130 audit(1653611711.616:146): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=systemd-fsck@dev-disk-by\x2duuid-cd5cf0c9\x2db7ce\x2d41da\x2dbcf1\x2dae0ccb7c629a comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[ 5.763793] audit: type=1130 audit(1653611711.628:147): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=systemd-fsck@dev-disk-by\x2duuid-5B81\x2d8B7D comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[ 5.767791] EXT4-fs (sda2): mounted filesystem with ordered data mode. Quota mode: none.
[ 5.797817] audit: type=1130 audit(1653611711.662:148): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=dracut-shutdown comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[ 5.819709] audit: type=1130 audit(1653611711.684:149): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=plymouth-read-write comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[ 5.826745] audit: type=1130 audit(1653611711.691:150): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=import-state comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[ 5.875769] audit: type=1130 audit(1653611711.740:151): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=systemd-tmpfiles-setup comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[ 5.878325] audit: type=1334 audit(1653611711.742:152): prog-id=60 op=LOAD
[ 5.878401] audit: type=1334 audit(1653611711.742:153): prog-id=61 op=LOAD
[ 5.878447] audit: type=1334 audit(1653611711.743:154): prog-id=62 op=LOAD
[ 5.911293] RPC: Registered named UNIX socket transport module.
[ 5.911296] RPC: Registered udp transport module.
[ 5.911296] RPC: Registered tcp transport module.
[ 5.911296] RPC: Registered tcp NFSv4.1 backchannel transport module.
[ 6.038002] Bluetooth: BNEP (Ethernet Emulation) ver 1.3
[ 6.038004] Bluetooth: BNEP filters: protocol multicast
[ 6.038007] Bluetooth: BNEP socket layer initialized
[ 6.223234] NET: Registered PF_QIPCRTR protocol family
[ 6.837424] iwlwifi 0000:00:14.3: Conflict between TLV & NVM regarding enabling LAR (TLV = enabled NVM =disabled)
[ 7.024052] iwlwifi 0000:00:14.3: Conflict between TLV & NVM regarding enabling LAR (TLV = enabled NVM =disabled)
[ 9.509038] e1000e 0000:00:1f.6 eno2: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
[ 9.509088] IPv6: ADDRCONF(NETDEV_CHANGE): eno2: link becomes ready
[ 10.035938] thermal cooling_device11: Setting cooling device state is deprecated
[ 11.744620] rfkill: input handler disabled
[ 12.420396] Bluetooth: RFCOMM TTY layer initialized
[ 12.420400] Bluetooth: RFCOMM socket layer initialized
[ 12.420422] Bluetooth: RFCOMM ver 1.11
[ 18.664787] rfkill: input handler enabled
[ 50.945360] logitech-hidpp-device 0003:046D:1025.0007: HID++ 1.0 device connected.
[ 463.244682] nvidia-modeset: Unloading
[ 463.262190] NVOC: __nvoc_objDelete: Child class OBJIOVASPACE not freed from parent class OBJVMM.Allocator 00000000ba323f72 released with memory allocations
[ 463.262212] [NvPort]
[ 463.262213] NvPort memory tracking information for allocator 00000000ba323f72:
[ 463.262213] ACTIVE: 1 allocations, 644 bytes allocated (616 useful, 28 meta)
[ 463.262214] TOTAL: 150 allocations, 512133 bytes allocated (507933 useful, 4200 meta)
[ 463.262215] PEAK: 148 allocations, 511980 bytes allocated (507836 useful, 4144 meta)
[ 463.262216] [NvPort]
[ 463.262230] nvidia-nvlink: Unregistered Nvlink Core, major device number 234
[ 463.281105] nvidia: unknown parameter 'modeset' ignored
[ 463.281759] nvidia-nvlink: Nvlink Core is being initialized, major device number 234

[ 463.282385] nvidia 0000:01:00.0: vgaarb: changed VGA decodes: olddecodes=none,decodes=none:owns=io+mem
[ 463.329634] NVRM: loading NVIDIA UNIX Open Kernel Module for x86_64 515.43.04 Release Build (yusufkhan@) Tue May 24 06:08:38 PM EDT 2022
[ 463.334441] nvidia-modeset: Loading NVIDIA UNIX Open Kernel Mode Setting Driver for x86_64 515.43.04 Release Build (yusufkhan@) Tue May 24 06:08:29 PM EDT 2022
[ 463.337283] [drm] [nvidia-drm] [GPU ID 0x00000100] Loading driver
[ 463.505116] NVRM kgspInitRm_IMPL: missing NVDEC0 engine, cannot initialize GSP-RM
[ 463.505120] NVRM RmInitAdapter: Cannot initialize GSP firmware RM
[ 463.505392] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x63:0x56:1689)
[ 463.506360] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[ 463.506437] [drm:nv_drm_load [nvidia_drm]] ERROR [nvidia-drm] [GPU ID 0x00000100] Failed to allocate NvKmsKapiDevice
[ 463.506568] [drm:nv_drm_probe_devices [nvidia_drm]] ERROR [nvidia-drm] [GPU ID 0x00000100] Failed to register device
[ 463.506574] BUG: kernel NULL pointer dereference, address: 0000000000000040
[ 463.506576] #PF: supervisor read access in kernel mode
[ 463.506578] #PF: error_code(0x0000) - not-present page
[ 463.506579] PGD 0 P4D 0
[ 463.506581] Oops: 0000 [#1] PREEMPT SMP NOPTI
[ 463.506582] CPU: 5 PID: 838 Comm: systemd-logind Tainted: G OE 5.17.9-300.fc36.x86_64 #1
[ 463.506584] Hardware name: Micro-Star International Co., Ltd. MS-7B17/MPG Z390 GAMING EDGE AC (MS-7B17), BIOS A.A0 08/14/2020
[ 463.506585] RIP: 0010:GrabOwnership+0x4/0x40 [nvidia_modeset]
[ 463.506613] Code: 48 89 de 31 d2 bf 06 00 00 00 e8 a7 48 04 00 b8 01 00 00 00 5b c3 31 c0 c3 00 00 00 00 00 00 00 00 00 00 00 00 00 48 83 ec 18 <8b> 57 40 b8 01 00 00 00 48 c7 44 24 08 00 00 00 00 85 d2 74 1c 48
[ 463.506615] RSP: 0018:ffffadab0114bbc8 EFLAGS: 00010292
[ 463.506616] RAX: ffffffffc19d3c30 RBX: ffff9ce2e5041000 RCX: 0000000000000000
[ 463.506617] RDX: 0000000000000001 RSI: ffff9ce24c08b400 RDI: 0000000000000000
[ 463.506618] RBP: ffff9ce2e5041000 R08: 00000000000000c0 R09: ffff9ce2f715db40
[ 463.506619] R10: 0000000000000001 R11: 0000000000000005 R12: ffff9ce2f715db40
[ 463.506619] R13: 0000000000000000 R14: ffff9ce24c08b410 R15: 00000000ed1ec828
[ 463.506620] FS: 00007fb052144bc0(0000) GS:ffff9ce98dd40000(0000) knlGS:0000000000000000
[ 463.506622] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 463.506623] CR2: 0000000000000040 CR3: 000000010bb08001 CR4: 00000000003706e0
[ 463.506624] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 463.506624] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 463.506625] Call Trace:
[ 463.506627]
[ 463.506628] ? preempt_count_add+0x64/0x90
[ 463.506632] ? nv_drm_master_set+0x1e/0x40 [nvidia_drm]
[ 463.506635] ? drm_new_set_master+0x90/0x110
[ 463.506638] ? drm_master_open+0x7c/0xa0
[ 463.506639] ? drm_open+0xf8/0x250
[ 463.506642] ? drm_stub_open+0xa2/0xe0
[ 463.506643] ? chrdev_open+0xb1/0x210
[ 463.506645] ? cdev_device_add+0x80/0x80
[ 463.506646] ? do_dentry_open+0x1c4/0x350
[ 463.506648] ? path_openat+0xacd/0x1210
[ 463.506651] ? path_lookupat+0x97/0x190
[ 463.506653] ? do_filp_open+0xa1/0x130
[ 463.506654] ? check_object_size+0x126/0x140
[ 463.506657] ? _raw_spin_unlock+0x16/0x30
[ 463.506660] ? alloc_fd+0xd1/0x170
[ 463.506661] ? do_sys_openat2+0x76/0x130
[ 463.506663] ? x64_sys_openat+0x5c/0x70
[ 463.506664] ? do_syscall_64+0x37/0x80
[ 463.506666] ? entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 463.506669]
[ 463.506670] Modules linked in: nvidia_drm(OE) nvidia_modeset(OE) nvidia(OE) rfcomm snd_seq_dummy snd_hrtimer nft_objref nf_conntrack_netbios_ns nf_conntrack_broadcast nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink qrtr bnep sunrpc vfat fat snd_sof_pci_intel_cnl snd_sof_intel_hda_common soundwire_intel soundwire_generic_allocation soundwire_cadence snd_sof_intel_hda snd_sof_pci intel_rapl_msr snd_sof_xtensa_dsp intel_rapl_common snd_sof soundwire_bus snd_soc_skl intel_tcc_cooling snd_soc_hdac_hda x86_pkg_temp_thermal intel_powerclamp snd_hda_ext_core coretemp mei_hdcp mei_pxp iTCO_wdt iwlmvm snd_soc_sst_ipc snd_soc_sst_dsp ucsi_ccg intel_pmc_bxt iTCO_vendor_support typec_ucsi ee1004 snd_soc_acpi_intel_match typec mac80211 snd_soc_acpi kvm_intel snd_soc_core libarc4 snd_compress kvm snd_hda_codec_realtek ac97_bus snd_hda_codec_generic
[ 463.506697] iwlwifi snd_pcm_dmaengine snd_hda_codec_hdmi ledtrig_audio irqbypass rapl snd_hda_intel intel_cstate iwlmei btusb snd_intel_dspcfg btrtl intel_uncore snd_intel_sdw_acpi btbcm cfg80211 snd_hda_codec btintel pcspkr snd_hda_core btmtk mei_me i2c_i801 intel_wmi_thunderbolt wmi_bmof snd_hwdep i2c_smbus mei bluetooth snd_seq snd_seq_device snd_pcm snd_timer ecdh_generic joydev rfkill snd intel_pch_thermal i2c_nvidia_gpu soundcore acpi_tad acpi_pad zram hid_logitech_hidpp hid_logitech_dj nouveau crct10dif_pclmul crc32_pclmul crc32c_intel e1000e ghash_clmulni_intel drm_ttm_helper ttm mxm_wmi wmi video ip6_tables ip_tables ipmi_devintf ipmi_msghandler fuse [last unloaded: nvidia]
[ 463.506720] CR2: 0000000000000040
[ 463.506722] ---[ end trace 0000000000000000 ]---
[ 463.506723] RIP: 0010:GrabOwnership+0x4/0x40 [nvidia_modeset]
[ 463.506741] Code: 48 89 de 31 d2 bf 06 00 00 00 e8 a7 48 04 00 b8 01 00 00 00 5b c3 31 c0 c3 00 00 00 00 00 00 00 00 00 00 00 00 00 48 83 ec 18 <8b> 57 40 b8 01 00 00 00 48 c7 44 24 08 00 00 00 00 85 d2 74 1c 48
[ 463.506742] RSP: 0018:ffffadab0114bbc8 EFLAGS: 00010292
[ 463.506743] RAX: ffffffffc19d3c30 RBX: ffff9ce2e5041000 RCX: 0000000000000000
[ 463.506744] RDX: 0000000000000001 RSI: ffff9ce24c08b400 RDI: 0000000000000000
[ 463.506745] RBP: ffff9ce2e5041000 R08: 00000000000000c0 R09: ffff9ce2f715db40
[ 463.506746] R10: 0000000000000001 R11: 0000000000000005 R12: ffff9ce2f715db40
[ 463.506746] R13: 0000000000000000 R14: ffff9ce24c08b410 R15: 00000000ed1ec828
[ 463.506747] FS: 00007fb052144bc0(0000) GS:ffff9ce98dd40000(0000) knlGS:0000000000000000
[ 463.506748] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 463.506749] CR2: 0000000000000040 CR3: 000000010bb08001 CR4: 00000000003706e0
[ 463.506750] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 463.506750] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 463.722840] show_signal: 7 callbacks suppressed
[ 463.722841] traps: xss-lock[1807] trap int3 ip:7f1767595df1 sp:7ffc84704890 error:0
[ 463.722845] fbcon: Taking over console
[ 463.722850] in libglib-2.0.so.0.7200.1[7f1767559000+91000]
[ 463.724632] Console: switching to colour frame buffer device 128x48
[ 464.801771] rfkill: input handler disabled
[ 471.179449] rfkill: input handler enabled
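A note on reading the oops above: RDI, which carries the first argument on x86-64 Linux, is 0. The bytes before the marked instruction, `48 83 ec 18`, are a 4-byte `sub $0x18,%rsp`, which matches the `+0x4` offset in the RIP line. The marked bytes `<8b> 57 40` decode to `mov 0x40(%rdi),%edx`, and CR2 is 0x40. All of this is consistent with GrabOwnership() receiving a NULL pointer and reading a 32-bit field at offset 0x40. The arithmetic can be sketched in C. Note that `struct fake_device` and its member names are hypothetical stand-ins, not the real nvidia-modeset layout; only the 0x40 offset comes from the log.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: only the 0x40 offset is taken from the oops. */
struct fake_device {
    uint8_t  padding[0x40];   /* whatever precedes the field (unknown) */
    uint32_t field_at_0x40;   /* 32-bit member read by the faulting mov */
};

/* Address the CPU touches when reading that member through `base`. */
static uintptr_t fault_address(uintptr_t base)
{
    return base + offsetof(struct fake_device, field_at_0x40);
}
```

Calling `fault_address(0)` gives 0x40, the address reported in CR2.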

To Reproduce

Reload nvidia drivers

Bug Incidence

Once

nvidia-bug-report.log.gz

I believe the dmesg output should be enough, since it includes a core dump, but here it is anyway: nvidia-bug-report.log.gz

More Info

No response

aritger commented 2 years ago

This looks like bad error handling in response to this failure:

[ 463.505116] NVRM kgspInitRm_IMPL: missing NVDEC0 engine, cannot initialize GSP-RM
[ 463.505120] NVRM RmInitAdapter: Cannot initialize GSP firmware RM
[ 463.505392] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x63:0x56:1689)
[ 463.506360] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0

The NVDEC0 problem is https://github.com/NVIDIA/open-gpu-kernel-modules/issues/116 which will be fixed in our next release.

YusufKhan-gamedev commented 2 years ago

@aritger Still an issue after the latest release:

[  136.672286] [drm:nv_drm_load [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Failed to allocate NvKmsKapiDevice
[  136.672456] [drm:nv_drm_probe_devices [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Failed to register device
[  136.672490] RIP: 0010:GrabOwnership+0x4/0x40 [nvidia_modeset]
[  136.672534]  ? nv_drm_master_set+0x1e/0x40 [nvidia_drm]
[  136.672573] Modules linked in: nvidia_drm(OE) nvidia_modeset(OE) nvidia(OE) rfcomm snd_seq_dummy snd_hrtimer nft_objref nf_conntrack_netbios_ns nf_conntrack_broadcast nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink qrtr bnep sunrpc vfat fat snd_sof_pci_intel_cnl snd_sof_intel_hda_common soundwire_intel soundwire_generic_allocation soundwire_cadence snd_sof_intel_hda snd_sof_pci snd_sof_xtensa_dsp snd_sof intel_rapl_msr intel_rapl_common soundwire_bus snd_soc_skl intel_tcc_cooling snd_soc_hdac_hda snd_hda_ext_core snd_soc_sst_ipc x86_pkg_temp_thermal iwlmvm snd_soc_sst_dsp intel_powerclamp snd_soc_acpi_intel_match coretemp ucsi_ccg iTCO_wdt snd_soc_acpi typec_ucsi intel_pmc_bxt typec iTCO_vendor_support mei_hdcp ee1004 mei_pxp mac80211 snd_soc_core kvm_intel snd_hda_codec_realtek snd_hda_codec_generic libarc4 kvm snd_compress
[  136.672599]  ledtrig_audio snd_hda_codec_hdmi ac97_bus snd_pcm_dmaengine irqbypass btusb iwlwifi rapl btrtl snd_hda_intel intel_cstate snd_intel_dspcfg btbcm snd_intel_sdw_acpi intel_uncore pcspkr snd_hda_codec iwlmei btintel btmtk intel_wmi_thunderbolt wmi_bmof snd_hda_core i2c_i801 cfg80211 snd_hwdep i2c_smbus bluetooth snd_seq mei_me snd_seq_device mei snd_pcm snd_timer ecdh_generic joydev rfkill snd i2c_nvidia_gpu soundcore intel_pch_thermal acpi_tad acpi_pad zram hid_logitech_hidpp hid_logitech_dj crct10dif_pclmul crc32_pclmul crc32c_intel mxm_wmi e1000e ghash_clmulni_intel wmi video ip6_tables ip_tables ipmi_devintf ipmi_msghandler fuse [last unloaded: nvidia]
[  136.672623] RIP: 0010:GrabOwnership+0x4/0x40 [nvidia_modeset]

PAR2020 commented 2 years ago

NVBug 3667921

qWici commented 1 year ago

Same error on an RTX 3070 with Ubuntu 22.04.1. I temporarily downgraded the driver to 510 and am waiting for a fix in an upcoming release.

aritger commented 10 months ago

@YusufKhan-gamedev, @qWici: if you still see this, I'd be curious to know (a) which driver version you are running, and (b) the kernel log leading up to the failure. Since the "missing NVDEC0 engine, cannot initialize GSP-RM" bug is fixed, if GSP is still failing to initialize for you, I'd like to see the reason.

We should still fix the error handling path that causes the NULL dereference in GrabOwnership(). If you still see that, could you test this trivial error check earlier in that path?

diff --git a/kernel-open/nvidia-drm/nvidia-drm-drv.c b/kernel-open/nvidia-drm/nvidia-drm-drv.c
index e0ddb6cb279b..a64ea7f6b75d 100644
--- a/kernel-open/nvidia-drm/nvidia-drm-drv.c
+++ b/kernel-open/nvidia-drm/nvidia-drm-drv.c
@@ -620,6 +620,10 @@ static int __nv_drm_master_set(struct drm_device *dev,
 {
     struct nv_drm_device *nv_dev = to_nv_device(dev);

+    if (!nv_dev || !nv_dev->pDevice) {
+        return -EINVAL;
+    }
+
     /*
      * If this device is driving a framebuffer, then nvidia-drm already has
      * modeset ownership. Otherwise, grab ownership now.
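For readers who want to see the effect of that check without building the module, here is a minimal user-space sketch of the same guard pattern. It assumes, based on the "Failed to allocate NvKmsKapiDevice" message in the log, that `pDevice` is left NULL when device setup fails. Every `fake_*` name is hypothetical; only the `pDevice` member name and the `-EINVAL` return come from the patch above, and this is not the driver's actual code.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for the driver structures. */
struct fake_kapi_device {
    int owns_modeset;
};

struct fake_nv_drm_device {
    struct fake_kapi_device *pDevice;  /* assumed NULL after a failed setup */
};

/* Plays the role of the callee that trusts its argument. */
static int fake_grab_ownership(struct fake_kapi_device *dev)
{
    dev->owns_modeset = 1;   /* would fault if dev were NULL */
    return 1;
}

/* Guarded caller: reject a half-initialized device before dereferencing. */
static int fake_master_set(struct fake_nv_drm_device *nv_dev)
{
    if (nv_dev == NULL || nv_dev->pDevice == NULL)
        return -EINVAL;
    return fake_grab_ownership(nv_dev->pDevice) ? 0 : -EINVAL;
}
```

With the guard in place, a device whose `pDevice` is NULL makes `fake_master_set()` return `-EINVAL`, so the caller gets an error instead of a crash.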

qWici commented 10 months ago

@aritger No, I'm already on a different computer.

YusufKhan-gamedev commented 8 months ago

@aritger The current 565.x release doesn't show this issue at first glance, but that might be because this particular code path isn't being triggered at all, as /dev/nvidia* doesn't exist.
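One way to confirm whether any /dev/nvidia* nodes exist before drawing conclusions is a small check with POSIX `glob()`. This is only a sketch; `count_matches` is a hypothetical helper name, not part of any NVIDIA tool.

```c
#include <glob.h>
#include <stddef.h>

/* Returns how many paths match `pattern`, or -1 on a glob error. */
static long count_matches(const char *pattern)
{
    glob_t results = {0};
    long count;
    int rc = glob(pattern, 0, NULL, &results);

    if (rc == 0)
        count = (long)results.gl_pathc;   /* at least one match */
    else if (rc == GLOB_NOMATCH)
        count = 0;                        /* nothing matched the pattern */
    else
        count = -1;                       /* GLOB_NOSPACE or GLOB_ABORTED */

    globfree(&results);
    return count;
}
```

If `count_matches("/dev/nvidia*")` returns 0, no device nodes are present, so the absence of the oops would not by itself show that the error path is fixed.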