NVIDIA / open-gpu-kernel-modules

NVIDIA Linux open GPU kernel module source

nvswitch_get_link_handlers: initialize ->read_discovery_token method by default #647

Open dmonakhov opened 4 months ago

dmonakhov commented 4 months ago

Otherwise this can result in a NULL pointer dereference if the hardware is not healthy. Example:

[ 66.499757] BUG: kernel NULL pointer dereference, address: 0000000000000000
[ 66.502446] #PF: supervisor instruction fetch in kernel mode
[ 66.504631] #PF: error_code(0x0010) - not-present page
[ 66.506625] PGD 800000011acc2067 P4D 800000011acc2067 PUD 10f48e067 PMD 0
[ 66.509260] Oops: 0010 [#1] SMP PTI
[ 66.510647] CPU: 12 PID: 15050 Comm: nv-fabricmanage Tainted: G OE 5.15.0-1058-aws #64~20.04.1-Ubuntu
[ 66.514613] Hardware name: Amazon EC2 p4d.24xlarge/, BIOS 1.0 10/16/2017
[ 66.517203] RIP: 0010:0x0
[ 66.518271] Code: Unable to access opcode bytes at RIP 0xffffffffffffffd6.
[ 66.520897] RSP: 0018:ffffac89a361bd20 EFLAGS: 00010246
[ 66.522914] RAX: 0000000000000000 RBX: ffff91ec630fef10 RCX: 000000000000001a
[ 66.525641] RDX: 0000000000000002 RSI: ffffac89a361bd28 RDI: ffff91ec4efb7000
[ 66.528366] RBP: ffffac89a361bd48 R08: 0000000000000000 R09: ffff915fbb4b1540
[ 66.531091] R10: 0000000000000000 R11: 0000000000000000 R12: ffff91ec4efb7000
[ 66.533821] R13: ffff91ec4efb6300 R14: 0000000000000000 R15: ffff91ec4efb7000
[ 66.536558] FS: 00007f43f1cf3700(0000) GS:ffff91e87d700000(0000) knlGS:0000000000000000
[ 66.539650] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 66.541885] CR2: ffffffffffffffd6 CR3: 00000004277fe005 CR4: 00000000007706e0
[ 66.544610] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 66.548862] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 66.553084] PKRU: 55555554
[ 66.555718] Call Trace:
[ 66.558238]
[ 66.560629] ? show_regs.cold+0x1a/0x1f
[ 66.563644] ? die_body+0x20/0x70
[ 66.566550] ? __die+0x2b/0x37
[ 66.569302] ? page_fault_oops+0x136/0x2c0
[ 66.572432] ? do_user_addr_fault+0x303/0x660
[ 66.575667] ? hrtimer_cancel+0x15/0x30
[ 66.578703] ? futex_wait+0x1e9/0x260
[ 66.581669] ? exc_page_fault+0x77/0x170
[ 66.584737] ? asm_exc_page_fault+0x27/0x30
[ 66.587902] nvlink_core_read_link_discovery_token+0x73/0x80 [nvidia]
[ 66.592041] nvlink_core_correlate_conn_by_token+0xb8/0xf0 [nvidia]
[ 66.596073] nvlink_lib_ioctl_ctrl+0x10e9/0x2680 [nvidia]
[ 66.599776] ? x64_sys_futex+0x7b/0x1c0
[ 66.602873] ? cond_resched+0x19/0x40
[ 66.605894] ? __kmalloc+0x1ab/0x4b0
[ 66.608815] nvlink_fops_unlocked_ioctl+0xd6/0x1c0 [nvidia]
[ 66.612543] x64_sys_ioctl+0x95/0xd0
[ 66.615549] do_syscall_64+0x5c/0xc0
[ 66.618476] ? do_syscall_64+0x69/0xc0
[ 66.621483] entry_SYSCALL_64_after_hwframe+0x62/0xcc
[ 66.624949] RIP: 0033:0x7f43f56115cb
[ 66.627869] Code: 0f 1e fa 48 8b 05 c5 78 0d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 95 78 0d 00 f7 d8 64 89 01 48
[ 66.639398] RSP: 002b:00007f43f1cf2b28 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 66.645327] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f43f56115cb
[ 66.649545] RDX: 00007f43f1cf2bbc RSI: 0000000080044e08 RDI: 0000000000000030
[ 66.653786] RBP: 0000000080044e08 R08: 00007f43dc000f30 R09: 00000000008db3f0
[ 66.658029] R10: 00007f43dc0008d0 R11: 0000000000000246 R12: 0000000001a253e0
[ 66.662271] R13: 00007f43f1cf2bbc R14: 0000000000000018 R15: 0000000001a2b598
[ 66.666508]
[ 66.668919] Modules linked in: nvidia_uvm(OE) xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c xt_addrtype iptable_filter bpfilter br_netfilter br\ drv(OE) overlay nls_iso8859_1 dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua sch_fq_codel msr nvidia(OE) ib_iser binfmt_misc ppdev rdma_cm iw_cm ib_cm libiscsi crct10dif_pclmul crc32_pclmul ghash_clmulni_intel sha256_ssse3 sha1_ssse3 scsi_transpo\ crypto_simd psmouse input_leds cryptd serio_raw ena parport_pc parport efa(OE) ib_uverbs ib_core drm sunrpc efi_pstore ip_tables x_tables autofs4
[ 66.704520] CR2: 0000000000000000
[ 66.707368] ---[ end trace 4652d4c22dfad5b0 ]---

This issue affects at least the 535 and 545 branches.
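The trace above shows an instruction fetch at RIP 0x0 reached from nvlink_core_read_link_discovery_token, which is what a call through a NULL function pointer looks like. The idea in the title, giving the method a default so the pointer is never NULL, can be illustrated with a minimal user-space sketch. All names below (`struct link_handlers`, `get_link_handlers`, `read_discovery_token_stub`, `core_read_discovery_token`, the status codes) are hypothetical simplifications for illustration; this is not the driver's code and not the actual patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins; NOT the driver's real types. */
typedef int status_t;
#define STATUS_OK            0
#define STATUS_NOT_SUPPORTED 1

struct link;

/* Table of per-link callbacks, reduced to the one method discussed here. */
struct link_handlers {
    status_t (*read_discovery_token)(struct link *link,
                                     unsigned long long *token);
};

struct link {
    const struct link_handlers *handlers;
};

/* Default method: reports "not supported" instead of leaving the slot NULL. */
static status_t read_discovery_token_stub(struct link *link,
                                          unsigned long long *token)
{
    (void)link;
    *token = 0;
    return STATUS_NOT_SUPPORTED;
}

/*
 * Fill the handler table. Every slot gets a default first; a real
 * implementation could then override slots that the hardware supports.
 */
static void get_link_handlers(struct link_handlers *h)
{
    h->read_discovery_token = read_discovery_token_stub;
}

/*
 * Caller side. The NULL check is an extra, assumed safeguard: with the
 * default in place the pointer is always callable, but checking costs
 * little and turns a crash into an error code.
 */
static status_t core_read_discovery_token(struct link *link,
                                          unsigned long long *token)
{
    assert(token != NULL);

    if (link == NULL || link->handlers == NULL ||
        link->handlers->read_discovery_token == NULL) {
        return STATUS_NOT_SUPPORTED;
    }
    return link->handlers->read_discovery_token(link, token);
}
```

With the default installed, a caller on an unhealthy link gets an error status back rather than jumping to address 0. Whether the real fix also adds a caller-side check is not stated here; the sketch includes one only to show the two approaches side by side.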

CLAassistant commented 3 months ago

CLA assistant check
All committers have signed the CLA.