ROCm / ROCgdb

This is ROCgdb, the ROCm source-level debugger for Linux, based on GDB, the GNU source-level debugger.
https://rocm.docs.amd.com/projects/ROCgdb/en/latest/
GNU General Public License v2.0

rocgdb fails to run target with no message about driver incompatibility #6

Closed by markdewing 2 months ago

markdewing commented 3 years ago

Running Ubuntu 20.04 with kernel 5.8.0-48-generic. GPU is Vega 56. ROCm is 4.1.

To be clear, the issue is that the debugger does not issue a warning message as to why it is failing.

When I start a simple HIP application under rocgdb, it fails after the run command, with no indication of why it failed.

$ rocgdb ./a.out 
GNU gdb (rocm-rel-4.1-26) 10.1
Copyright (C) 2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://github.com/ROCm-Developer-Tools/ROCgdb/issues>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./a.out...
(gdb) run
Starting program: /mnt/nvme/physics/codes/qmcpack/qmc_kernels/kernels/vector_add/hip/a.out 
amd-dbgapi: fatal error: queue_snapshot failed (rc=-1)
Backtrace:
    #0 0x00007fab9ba94c47 amd::dbgapi::process_t::update_queues() in /home/mdewing/nvme/software/amd/rocm_src/src_41/ROCdbgapi/src/process.cpp:1035
    #1 0x00007fab9bad50c3 amd_dbgapi_process_wave_list in /home/mdewing/nvme/software/amd/rocm_src/src_41/ROCdbgapi/src/wave.cpp:963
    #2 0x000055d7c90aa26e rocm_target_ops::update_thread_list() in /src/rocm-gdb/gdb/rocm-tdep.c:1330
    #3 0x000055d7c90c47d3 rocm_update_solib_list in /src/rocm-gdb/gdb/solib-rocm.c:525
    #4 0x000055d7c8f997a5 std::function<void (target_ops*, int)>::operator()(target_ops*, int) const in /usr/include/c++/7/bits/std_function.h:706
    #5 0x000055d7c8f997a5 gdb::observers::observable<target_ops*, int>::notify(target_ops*, int) const in /src/rocm-gdb/gdb/../gdbsupport/observable.h:106
    #6 0x000055d7c8f997a5 post_create_inferior(target_ops*, int) in /src/rocm-gdb/gdb/infcmd.c:350
    #7 0x000055d7c8f9a84d run_command_1 in /src/rocm-gdb/gdb/infcmd.c:523
    #8 0x000055d7c8e94fe1 cmd_func(cmd_list_element*, char const*, int) in /src/rocm-gdb/gdb/cli/cli-decode.c:2181
    #9 0x000055d7c912c4a7 execute_command(char const*, int) in /src/rocm-gdb/gdb/top.c:668
    #10 0x000055d7c8f47cab command_handler(char const*) in /src/rocm-gdb/gdb/event-top.c:589
    #11 0x000055d7c8f4804c command_line_handler(std::unique_ptr<char, gdb::xfree_deleter<char> >&&) in /src/rocm-gdb/gdb/event-top.c:774
    #12 0x000055d7c8f48658 gdb_rl_callback_handler in /src/rocm-gdb/gdb/event-top.c:219
    #13 0x000055d7c919ef16 rl_callback_read_char in /src/rocm-gdb/readline/readline/callback.c:281
    #14 0x000055d7c8f47215 gdb_rl_callback_read_char_wrapper_noexcept in /src/rocm-gdb/gdb/event-top.c:177
    #15 0x000055d7c8f4850f gdb_rl_callback_read_char_wrapper in /src/rocm-gdb/gdb/event-top.c:194
    #16 0x000055d7c8f4706f stdin_event_handler(int, void*) in /src/rocm-gdb/gdb/event-top.c:516
    #17 0x000055d7c925de8c gdb_wait_for_event in /src/rocm-gdb/gdbsupport/event-loop.cc:701
    #18 0x000055d7c925e050 gdb_do_one_event() in /src/rocm-gdb/gdbsupport/event-loop.cc:237
    #19 0x000055d7c8fed97c start_event_loop in /src/rocm-gdb/gdb/main.c:356
    #20 0x000055d7c8fed97c captured_command_loop in /src/rocm-gdb/gdb/main.c:416
    #21 0x000055d7c8fefa34 captured_main in /src/rocm-gdb/gdb/main.c:1253
    #22 0x000055d7c8fefa34 gdb_main(captured_main_args*) in /src/rocm-gdb/gdb/main.c:1268
    #23 0x000055d7c8dfc01a main in /src/rocm-gdb/gdb/gdb.c:32
    #24 0x00007fab9b59c0b2 __libc_start_main
    #25 0x000055d7c8e01ec9 _start
    #26 0xffffffffffffffff
amd_dbgapi_wave_list failed (rc=-2)
(gdb) 

After some poking around, I added a call to os_driver().check_version() to the attach function in ROCdbgapi/src/process.cpp. With that change, rocgdb issues the warning message about the driver version:

(gdb) run
Starting program: /mnt/nvme/physics/codes/qmcpack/qmc_kernels/kernels/vector_add/hip/a.out 
amd-dbgapi: warning: AMD GPU driver version 1.1 does not match 1.3+ requirement
amd-dbgapi: fatal error: queue_snapshot failed (rc=-1)
Backtrace:
    #0 0x00007f1bf933cc47 amd::dbgapi::process_t::update_queues() in /home/mdewing/nvme/software/amd/rocm_src/src_41/ROCdbgapi/src/process.cpp:1035
...
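
The change described above amounts to a fail-fast version check at attach time. The following is a minimal, self-contained sketch of that pattern, not the ROCdbgapi implementation: `driver_version`, `meets_requirement`, and `check_driver_at_attach` are hypothetical names invented for illustration. The message text and the 1.3 minimum are taken from the warning shown above. The reading of "1.3+" as "same major number, minor number at least 3" is an assumption.

```cpp
#include <string>

// Hypothetical stand-in for a driver interface version; not a ROCdbgapi type.
struct driver_version {
  unsigned major_num;
  unsigned minor_num;
};

// Assumption: "1.3+" means the major number must match exactly and the
// minor number must be at least the required one.
bool meets_requirement(const driver_version &found,
                       const driver_version &required) {
  return found.major_num == required.major_num &&
         found.minor_num >= required.minor_num;
}

// Formats a version as "major.minor".
static std::string to_text(const driver_version &v) {
  return std::to_string(v.major_num) + "." + std::to_string(v.minor_num);
}

// Returns an empty string when the driver is usable. Otherwise returns a
// message that can be shown before any other debugger operation runs.
std::string check_driver_at_attach(const driver_version &found,
                                   const driver_version &required) {
  if (meets_requirement(found, required))
    return "";
  return "AMD GPU driver version " + to_text(found) + " does not match " +
         to_text(required) + "+ requirement";
}
```

Running a check like this as the first step of attach, and reporting the returned message, would let the debugger explain the mismatch up front rather than stopping later in update_queues.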

The 'attach' function puts a version check into the shared library load callback for libhsa-runtime64.so.1, but rocgdb must be calling other debugger functions before that callback runs. Turning on logging gives:

Starting program: /mnt/nvme/physics/codes/qmcpack/qmc_kernels/kernels/vector_add/hip/a.out
amd-dbgapi: > amd_dbgapi_process_attach (0x558e2d64a260, 0x558e2d74eac8)
amd-dbgapi:    > [callback] get_os_pid ()
amd-dbgapi: attaching process_1 to process 86652
amd-dbgapi:    > [callback] enable_notify_shared_library (shared_library_1)
amd-dbgapi: > amd_dbgapi_process_get_info (process_1, PROCESS_INFO_NOTIFIER, 4, 0x558e2d74ead0)
amd-dbgapi: > amd_dbgapi_process_next_pending_event (process_1)
amd-dbgapi: > amd_dbgapi_process_code_object_list (process_1)
amd-dbgapi:    > [callback] allocate_memory ()
amd-dbgapi: > amd_dbgapi_process_wave_list (process_1)
amd-dbgapi: fatal error: process_t::update_agents failed (rc=-1)
Backtrace:
    #0 0x00007f900a2ed0c9 amd_dbgapi_process_wave_list
    #1 0x0000558e2c3c026e rocm_target_ops::update_thread_list() in /src/rocm-gdb/gdb/rocm-tdep.c:1330
 ...

markdewing commented 3 years ago

I updated to kernel 5.11.12 to try to get the driver versions to match. Running rocgdb now gives a different error (using the modified debugger API library that runs check_version right away in attach; the unmodified version stops with the unhelpful update_queues error):

(gdb) run
Starting program: /mnt/nvme/physics/codes/qmcpack/qmc_kernels/kernels/vector_add/hip/a.out 
amd-dbgapi: fatal error: KFD_IOC_DBG_TRAP_GET_VERSION failed

In ROCdbgapi/src/kernel/kfd_ioctls.h, AMDKFD_IOC_DBG_TRAP is listed in "non-upstream ioctls" section.

Do I need to use the kernel in ROCK-Kernel-Driver to get rocgdb to work?

The 5.4.x kernel (I don't remember x) that is the default with Ubuntu 20.04 has a bug where the screen won't wake up again once it blanks. That started me down a path of upgrading to kernels that have the fix for that issue.

jpsamaroo commented 3 years ago

Correct, you need ROCK since those ioctls are not yet upstream.

markdewing commented 3 years ago

I was unclear on how DKMS and the rock-dkms packages worked. The problem was that I wasn't installing the Ubuntu mainline kernels correctly: there are two linux-headers packages to install (one bare and one -generic), and I had only tried installing the -generic one. It would fail to install properly, and the DKMS rebuild step would also fail.

I have now installed 5.4.106 correctly (this fixes the screen blank issue), the DKMS module rebuilds properly, and rocgdb seems to work.

ppanchad-amd commented 2 months ago

@markdewing Apologies for the lack of response. Do you still need assistance with your ticket? Thanks!

markdewing commented 2 months ago

No further assistance needed.