ROCm / ROCK-Kernel-Driver

AMDGPU Driver with KFD used by the ROCm project. Also contains the current Linux Kernel that matches this base driver

NULL pointer dereference in kfd_dbgmgr_wave_control #70

Closed misos1 closed 1 week ago

misos1 commented 5 years ago

Calling hsaKmtDbgWavefrontControl triggers a kernel NULL pointer dereference. After the oops, ROCm appears to be "blocked" and the system cannot be soft-rebooted, so a mutex was probably left locked.

main.cpp:

#include <hc.hpp>
#include <hsa.h>
#include <hsakmt.h>

int main()
{
    // Look up the KFD node ID of the default accelerator.
    hc::accelerator_view view = hc::accelerator().get_default_view();
    hsa_agent_t agent = *static_cast<hsa_agent_t*>(view.get_hsa_agent());
    unsigned int node;
    hsa_agent_get_info(agent, HSA_AGENT_INFO_NODE, &node);

    // Issue a wave-control request; no debugger is registered beforehand.
    HsaDbgWaveMessage msg = {0};
    hsaKmtDbgWavefrontControl(node, HSA_DBG_WAVEOP_TRAP, HSA_DBG_WAVEMODE_SINGLE, 2, &msg);

    return 0;
}

Run:

hcc -hc -lhsa-runtime64 -lhsakmt main.cpp
./a.out

dmesg:

[  279.910283] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
[  279.910345] IP: kfd_dbgmgr_wave_control+0x12/0x60 [amdgpu]
[  279.910347] PGD 7e8155067 P4D 7e8155067 PUD 81419b067 PMD 0 
[  279.910352] Oops: 0000 [#1] SMP NOPTI
[  279.910422] CPU: 17 PID: 7520 Comm: a.out Tainted: G           OE    4.15.0-45-generic #48-Ubuntu
[  279.910424] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./X399 Professional Gaming, BIOS P3.30 08/14/2018
[  279.910477] RIP: 0010:kfd_dbgmgr_wave_control+0x12/0x60 [amdgpu]
[  279.910478] RSP: 0018:ffff9c339056fd28 EFLAGS: 00010246
[  279.910481] RAX: ffff8dee7ce4b800 RBX: ffff9c339056fdb0 RCX: 0000000000000000
[  279.910482] RDX: 000000000000800b RSI: ffff9c339056fd38 RDI: 0000000000000000
[  279.910484] RBP: ffff9c339056fd28 R08: ffff9c3390570000 R09: 0000000000000020
[  279.910485] R10: 0000000000000020 R11: 0000000000000fa0 R12: ffff8deebcf27800
[  279.910486] R13: ffff8dee760cb440 R14: ffff8dee7ce4b800 R15: ffff8dee82a73200
[  279.910489] FS:  00007f284a99ec00(0000) GS:ffff8deedcc40000(0000) knlGS:0000000000000000
[  279.910490] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  279.910492] CR2: 0000000000000000 CR3: 000000084e608000 CR4: 00000000003406e0
[  279.910493] Call Trace:
[  279.910544]  kfd_ioctl_dbg_wave_control+0x120/0x1a0 [amdgpu]
[  279.910593]  kfd_ioctl+0x271/0x450 [amdgpu]
[  279.910640]  ? kfd_ioctl_destroy_queue+0x70/0x70 [amdgpu]
[  279.910645]  ? __handle_mm_fault+0x478/0x5c0
[  279.910650]  do_vfs_ioctl+0xa8/0x630
[  279.910652]  ? handle_mm_fault+0xb1/0x1f0
[  279.910655]  ? __do_page_fault+0x270/0x4d0
[  279.910658]  SyS_ioctl+0x79/0x90
[  279.910662]  do_syscall_64+0x73/0x130
[  279.910666]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[  279.910668] RIP: 0033:0x7f2848e1c5d7
[  279.910670] RSP: 002b:00007ffd97cd0f38 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[  279.910672] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007f2848e1c5d7
[  279.910673] RDX: 00000000010b6600 RSI: 0000000040104b10 RDI: 0000000000000003
[  279.910675] RBP: 00000000010b6600 R08: 00007ffd97cd0fd0 R09: 0000000000000000
[  279.910676] R10: 0000000001003010 R11: 0000000000000246 R12: 0000000040104b10
[  279.910677] R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000000
[  279.910679] Code: c7 c8 bf 83 c0 e8 bf 0d 28 e5 48 c7 c0 ea ff ff ff eb d2 66 0f 1f 44 00 00 0f 1f 44 00 00 55 48 8b 06 48 89 e5 8b 90 90 00 00 00 <39> 17 75 11 48 8b 7f 10 48 8b 47 38 e8 9d fe 9b e5 48 98 5d c3 
[  279.910759] RIP: kfd_dbgmgr_wave_control+0x12/0x60 [amdgpu] RSP: ffff9c339056fd28
[  279.910760] CR2: 0000000000000000
[  279.910763] ---[ end trace 33bd6cf8014cbbaf ]---
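
Reading the register dump above: RDI and CR2 are both 0, and the faulting bytes `<39> 17` decode to `cmp %edx,(%rdi)`, so the first argument passed to kfd_dbgmgr_wave_control appears to have been NULL. The following is only a user-space sketch of the kind of guard that would avoid such a dereference and still release the lock on the error path. `Device`, `DebugManager` and `wave_control` are hypothetical names used for illustration; they are not the driver's real structures, and this is not the fix that was eventually applied.

```cpp
#include <cerrno>
#include <cstdint>
#include <mutex>

// Hypothetical stand-ins; these are NOT the real amdgpu/KFD structures.
struct DebugManager {
    std::uint32_t pasid;  // ID of the process that registered the debugger
};

struct Device {
    std::mutex lock;                 // models a lock held around debug operations
    DebugManager* dbgmgr = nullptr;  // stays null until a debugger registers
};

// Returns 0 on success or a negative errno value.
int wave_control(Device& dev, std::uint32_t caller_pasid)
{
    // RAII guard: the lock is released on every return path, including errors.
    std::lock_guard<std::mutex> guard(dev.lock);

    // Reject the call before touching any field of a missing debug manager.
    if (dev.dbgmgr == nullptr)
        return -EINVAL;

    // Only the process that registered the debugger may issue wave control.
    if (dev.dbgmgr->pasid != caller_pasid)
        return -EPERM;

    return 0;
}
```

With a guard like this, the reproducer's call would be expected to fail with an error code rather than fault, and the lock would not stay held afterwards.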
ppanchad-amd commented 2 months ago

@misos1 Apologies for the lack of response. Can you please check whether your issue still exists with the latest ROCm 6.2? If not, please close the ticket. Thanks!

ppanchad-amd commented 1 week ago

@misos1 Closing the ticket. Please feel free to re-open it if you still see the issue with the latest ROCm. Thanks!

misos1 commented 1 week ago

Yes, I forgot. This seems to be resolved now, as does #71.