ROCm / rocminfo

ROCm Application for Reporting System Info

rocminfo causes kernel oops after: Unable to open /dev/kfd read-write: Operation not permitted #37

Closed devurandom closed 1 month ago

devurandom commented 4 years ago

System information

❯ inxi -GSC -xx
System:    Host: ernie Kernel: 5.7.13 x86_64 bits: 64 compiler: gcc v: 10.2.0 Desktop: N/A wm: kwin_x11 dm: SDDM 
           Distro: Gentoo Base System release 2.7 
CPU:       Topology: Quad Core model: AMD Ryzen 5 2400G with Radeon Vega Graphics bits: 64 type: MT MCP arch: Zen 
           L2 cache: 2048 KiB 
           flags: avx avx2 lm nx pae sse sse2 sse3 sse4_1 sse4_2 sse4a ssse3 svm bogomips: 57494 
           Speed: 1676 MHz min/max: 1600/3600 MHz Core speeds (MHz): 1: 1676 2: 1600 3: 2026 4: 1682 5: 2061 6: 1567 7: 1975 
           8: 1620 
Graphics:  Device-1: Advanced Micro Devices [AMD/ATI] Baffin [Radeon RX 550 640SP / RX 560/560X] vendor: ASUSTeK 
           driver: amdgpu v: kernel bus ID: 01:00.0 chip ID: 1002:67ff 
           Device-2: AMD Raven Ridge [Radeon Vega Series / Radeon Vega Mobile Series] vendor: ASUSTeK driver: amdgpu v: kernel 
           bus ID: 0a:00.0 chip ID: 1002:15dd 
           Display: server: X.Org 1.20.8 driver: amdgpu compositor: kwin_x11 resolution: 2560x1080~60Hz 
           OpenGL: renderer: AMD RAVEN (DRM 3.37.0 5.7.13 LLVM 10.0.1) v: 4.6 Mesa 20.2.0-rc1 direct render: Yes

rocminfo is at version 3.5.0.

Problem

When I run rocminfo on my system, I see:

❯ rocminfo 
ROCk module is loaded
Unable to open /dev/kfd read-write: Operation not permitted
dschridde is member of render group
LoadLib(libhsa-ext-image64.so.1) failed: libhsa-ext-image64.so.1: cannot open shared object file: No such file or directory
[...hangs indefinitely...]

rocminfo is not SIGKILL-able at that point.

This is reproducible every time I run rocminfo.
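The error reported above is "Operation not permitted" (EPERM), not "Permission denied" (EACCES). The following is a minimal sketch, not part of rocminfo, that tries the same read-write open and reports which errno comes back. The function name `probe_device` and the wording of the returned messages are illustrative assumptions. The reading of EPERM as "not a plain file-mode problem" is a general rule of thumb, not a confirmed diagnosis for this system.

```python
import errno
import os


def probe_device(path="/dev/kfd"):
    """Try to open `path` read-write and describe the outcome as a short string."""
    try:
        fd = os.open(path, os.O_RDWR)
    except OSError as exc:
        if exc.errno == errno.EACCES:
            # Typical result when mode bits or group membership deny access.
            return "EACCES: denied by file permissions or group membership"
        if exc.errno == errno.EPERM:
            # Assumption: usually means the refusal came from the driver's
            # open handler or a policy layer rather than from mode bits.
            return "EPERM: refused by the driver or a policy layer"
        if exc.errno == errno.ENOENT:
            return "ENOENT: device node does not exist"
        name = errno.errorcode.get(exc.errno, str(exc.errno))
        return f"{name}: {exc.strerror}"
    # Only open and close; no ioctl is issued on the descriptor.
    os.close(fd)
    return "ok"


if __name__ == "__main__":
    print(probe_device())
```

Since the user is already a member of the render group, an EPERM result here would suggest that changing group membership alone is unlikely to help.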

Logs

dmesg prints during execution of rocminfo:

[  134.846268] Failure to set tba address. error -1.
[  134.855988] Alloc host visible vram on small bar is not allowed              
[  134.856451] BUG: unable to handle page fault for address: 0000000000001000                                                                                                                                                                                                                                                                                               
[  134.856454] #PF: supervisor write access in kernel mode                                                                                                                                                                                                                                                                                                                  
[  134.856456] #PF: error_code(0x0002) - not-present page                                                                                                                                                                                                                                                                                                                   
[  134.856457] PGD 26b5b1067 P4D 26b5b1067 PUD 304736067 PMD 0                                                                                                                                                                                                                                                                                                              
[  134.856460] Oops: 0002 [#1] PREEMPT SMP NOPTI                                                                                                                                                                                                                                                                                                                            
[  134.856463] CPU: 1 PID: 3991 Comm: rocminfo Tainted: G                T 5.7.13 #2                                                                                                                                                                                                                                                                                        
[  134.856464] Hardware name: System manufacturer System Product Name/ROG STRIX B350-F GAMING, BIOS 5406 11/13/2019                                                                                                                                                                                                                                                         
[  134.856532] RIP: 0010:set_trap_handler+0x19/0x40 [amdgpu]                                                                                                                                                                                                                                                                                                                
[  134.856534] Code: 2e c1 4c 89 e7 e8 77 53 a5 c8 e9 0e ff ff ff 66 90 0f 1f 44 00 00 48 8b 87 20 01 00 00 80 b8 00 02 00 00 00 74 15 48 8b 46 78 <48> 89 90 00 10 00 00 48 89 88 08 10 00 00 31 c0 c3 48 89 96 88 00
[  134.856535] RSP: 0018:ffffadeb09197df8 EFLAGS: 00010202                                                                                                                                                                                                                                                                                                                  
[  134.856537] RAX: 0000000000000000 RBX: ffffadeb09197e48 RCX: 00007fc1b8524000                                                                                                                                                                                                                                                                                            
[  134.856538] RDX: 00007fc1b8523000 RSI: ffff99e32e09c220 RDI: ffff99e41f54dc00                                                                                                                                                                                                                                                                                            
[  134.856539] RBP: ffff99e335adc800 R08: 0000000000000004 R09: ffff99e335adc800
[  134.856540] R10: 0000000000000001 R11: 0000000000000000 R12: ffff99e431b35c00
[  134.856541] R13: ffff99e335adc840 R14: ffff99e3bac78700 R15: 0000000040184b13
[  134.856542] FS:  00007fc1b7d04f00(0000) GS:ffff99e470640000(0000) knlGS:0000000000000000
[  134.856543] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  134.856544] CR2: 0000000000001000 CR3: 000000026b6fc000 CR4: 00000000003406e0
[  134.856545] Call Trace:
[  134.856600]  kfd_ioctl_set_trap_handler+0x5c/0x90 [amdgpu]
[  134.856654]  kfd_ioctl+0x2d2/0x3f0 [amdgpu]
[  134.856706]  ? kfd_ioctl_import_dmabuf+0x120/0x120 [amdgpu]
[  134.856710]  ksys_ioctl+0x88/0xc0
[  134.856712]  __x64_sys_ioctl+0x16/0x20
[  134.856714]  do_syscall_64+0x43/0x80
[  134.856717]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  134.856719] RIP: 0033:0x7fc1b7fca957
[  134.856721] Code: 41 5c c3 48 8b 05 39 b5 0c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 09 b5 0c 00 f7 d8 64 89 01 48
[  134.856722] RSP: 002b:00007ffe61801848 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[  134.856723] RAX: ffffffffffffffda RBX: 00007ffe61801890 RCX: 00007fc1b7fca957
[  134.856724] RDX: 00007ffe61801890 RSI: 0000000040184b13 RDI: 0000000000000004
[  134.856725] RBP: 0000000040184b13 R08: 0000000000002000 R09: 00007fc1b8523020
[  134.856726] R10: 0000000000000000 R11: 0000000000000246 R12: 00005622077a5090
[  134.856726] R13: 0000000000000004 R14: 0000000000000000 R15: 0000000000000000
[  134.856728] Modules linked in: snd_seq_dummy snd_hrtimer snd_seq fuse nft_masq nf_conntrack_netbios_ns nf_conntrack_broadcast xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT xt_tcpudp nf_nat_tftp nft_objref nf_conntrack_tftp nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat iptable_mangle iptable_raw iptable_security nf_conntrack amdgpu nf_defrag_ipv6 nf_defrag_ipv4 tun bridge stp llc ip_set cpufreq_conservative nf_tables nfnetlink ip6table_filter ip6_tables iptable_filter ip_tables x_tables bpfilter cfg80211 edac_mce_amd snd_usb_audio ccm kvm_amd uvcvideo amd_iommu_v2 algif_aead gpu_sched cbc snd_usbmidi_lib snd_rawmidi des_generic libdes ttm ecb snd_seq_device arc4 libarc4 algif_skcipher kvm bluetooth gspca_vc032x snd_hda_codec_realtek snd_hda_codec_generic gspca_main cmac ledtrig_audio snd_hda_codec_hdmi ecdh_generic
[  134.856759]  snd_hda_intel ecc crc16 md4 snd_intel_dspcfg snd_hda_codec eeepc_wmi videobuf2_vmalloc videobuf2_memops asus_wmi drm_kms_helper snd_hda_core videobuf2_v4l2 battery sparse_keymap videobuf2_common snd_hwdep syscopyarea sysfillrect videodev snd_pcm sysimgblt irqbypass snd_timer sp5100_tco rfkill mc mousedev fb_sys_fops nls_iso8859_1 wmi_bmof input_leds joydev k10temp cec pcspkr snd i2c_piix4 nls_cp437 vfat soundcore fat rc_core gpio_amdpt evdev acpi_cpufreq mac_hid sch_fq_codel sctp drm agpgart lm92 dm_cache_smq dm_cache dm_persistent_data dm_bio_prison dm_bufio sd_mod ax88179_178a usbnet mii hid_steam hid_generic usbhid hid crct10dif_pclmul crc32_pclmul ghash_clmulni_intel ahci libahci libata aesni_intel crypto_simd cryptd glue_helper ccp scsi_mod xhci_pci igb xhci_hcd i2c_algo_bit dca rng_core wmi video backlight pinctrl_amd btrfs blake2b_generic xor raid6_pq libcrc32c crc32c_generic crc32c_intel dm_mirror dm_region_hash dm_log dm_mod pkcs8_key_parser
[  134.856801] CR2: 0000000000001000
[  134.856803] ---[ end trace e7a932baccf300ce ]---
[  135.128514] RIP: 0010:set_trap_handler+0x19/0x40 [amdgpu]
[  135.128517] Code: 2e c1 4c 89 e7 e8 77 53 a5 c8 e9 0e ff ff ff 66 90 0f 1f 44 00 00 48 8b 87 20 01 00 00 80 b8 00 02 00 00 00 74 15 48 8b 46 78 <48> 89 90 00 10 00 00 48 89 88 08 10 00 00 31 c0 c3 48 89 96 88 00
[  135.128519] RSP: 0018:ffffadeb09197df8 EFLAGS: 00010202
[  135.128520] RAX: 0000000000000000 RBX: ffffadeb09197e48 RCX: 00007fc1b8524000
[  135.128521] RDX: 00007fc1b8523000 RSI: ffff99e32e09c220 RDI: ffff99e41f54dc00
[  135.128522] RBP: ffff99e335adc800 R08: 0000000000000004 R09: ffff99e335adc800
[  135.128523] R10: 0000000000000001 R11: 0000000000000000 R12: ffff99e431b35c00
[  135.128524] R13: ffff99e335adc840 R14: ffff99e3bac78700 R15: 0000000040184b13
[  135.128525] FS:  00007fc1b7d04f00(0000) GS:ffff99e470640000(0000) knlGS:0000000000000000
[  135.128526] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  135.128527] CR2: 0000000000001000 CR3: 000000026b6fc000 CR4: 00000000003406e0
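The faulting address can be cross-checked against the "Code:" line of the oops, where the instruction at RIP is marked with `<48>`. The sketch below hand-decodes only that one instruction form (REX.W `89 /r` with a 32-bit displacement). It is an illustration written for this log, not a general disassembler, and the helper name `decode_mov_store` is hypothetical.

```python
# Register names indexed by their 3-bit encoding (no REX.R/REX.B extension here).
REGS = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi"]


def decode_mov_store(code):
    """Decode `REX.W 89 /r` with disp32, i.e. mov %reg, disp32(%base)."""
    rex, opcode, modrm = code[0], code[1], code[2]
    if rex != 0x48 or opcode != 0x89:
        raise ValueError("only handles REX.W + MOV r/m64, r64")
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    if mod != 0b10 or rm == 0b100:
        raise ValueError("only handles disp32 addressing without a SIB byte")
    disp = int.from_bytes(code[3:7], "little", signed=True)
    return REGS[reg], REGS[rm], disp


# Bytes at the faulting RIP, taken from the "Code:" line of the oops.
faulting = bytes.fromhex("48 89 90 00 10 00 00")
src, base, disp = decode_mov_store(faulting)

rax = 0x0000000000000000  # RAX from the register dump
fault_addr = rax + disp

print(f"mov %{src}, {disp:#x}(%{base})")   # mov %rdx, 0x1000(%rax)
print(f"fault address = {fault_addr:#x}")  # 0x1000, matching CR2
```

With RAX equal to zero, the store lands at 0x1000, which matches CR2 and the "unable to handle page fault for address" line. That is consistent with set_trap_handler writing through a base pointer that was never set up, although the log alone does not show why it was left NULL.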

Other information

I also see exceptions and segfaults in Clover and ROCm's OpenCL implementation when executing clinfo:

Until recently rocminfo would segfault and eventually bring the whole kernel down with it:

Previously, HIP would also freeze the system, possibly because it invokes rocminfo in the background: https://github.com/ROCm-Developer-Tools/HIP/issues/2132

fxkamd commented 3 years ago

I'm using my vacation to catch up with a long backlog of things I didn't get to. Sorry about the (very) late response.

Is this still an issue? I see you were using a 5.7 kernel, which wasn't supported by our DKMS driver at the time. Were you trying to backport it? Or were you using the KFD version included in the 5.7 kernel? I see the error "Failure to set tba address. error -1." At that point the process creation in KFD should have failed and any further ioctl call should have been impossible. I think we had some bugs handling error returns from kfd_create_process at some point, but those should have been fixed by now. That wouldn't fix the underlying TBA allocation error, but it would cause all ROCm apps to fail during initialization and prevent the kernel oops from happening.

Your system is also "interesting" because you have two different GPUs in it: a GFXv9 integrated GPU and a GFXv8 discrete GPU. We have some improvements for this situation in the current Thunk. Since ROCm 3.9, the Thunk has gotten better at supporting such mixed configurations by treating APUs like dGPUs when both are present. But I'm not sure if all the kinks are worked out yet. It would be good to hear an update.
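To see how KFD presents a mixed configuration like this one, the topology it exports through sysfs can be listed. The sketch below assumes the usual layout of `/sys/class/kfd/kfd/topology/nodes/<N>/properties` with one `key value` pair per line, and assumes that CPU-only nodes report `simd_count 0`. Field names can differ between kernel versions, so treat them as assumptions to verify on the system in question. The function names are illustrative.

```python
from pathlib import Path

# Assumed sysfs location of the KFD topology; verify on the target kernel.
TOPOLOGY = Path("/sys/class/kfd/kfd/topology/nodes")


def parse_properties(text):
    """Parse 'key value' lines from a properties file into a dict of ints."""
    props = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            props[parts[0]] = int(parts[1])
    return props


def gpu_nodes(root=TOPOLOGY):
    """Yield (node name, properties) for nodes that report at least one SIMD."""
    for prop_file in sorted(root.glob("*/properties")):
        props = parse_properties(prop_file.read_text())
        if props.get("simd_count", 0) > 0:
            yield prop_file.parent.name, props


if __name__ == "__main__":
    for name, props in gpu_nodes():
        # device_id is the PCI device ID, printed in hex for comparison with lspci.
        print(f"node {name}: device_id={props.get('device_id', 0):#06x} "
              f"simd_count={props['simd_count']}")
```

On the reporter's machine one would expect two GPU nodes, with device IDs 0x67ff (Baffin) and 0x15dd (Raven Ridge) as listed in the inxi output, provided both are exposed by KFD.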

Thanks, and happy holidays.

ppanchad-amd commented 1 month ago

Closing as there is no update from user. Thanks