markdewing closed this issue 2 months ago
I updated to kernel 5.11.12 to try to get the driver versions to match. Running rocgdb now gives a different error (using a modified dbg api library that runs check_version right away in attach; the unmodified version stops with the unhelpful error about update_queues failed):
(gdb) run
Starting program: /mnt/nvme/physics/codes/qmcpack/qmc_kernels/kernels/vector_add/hip/a.out
amd-dbgapi: fatal error: KFD_IOC_DBG_TRAP_GET_VERSION failed
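For readers following along, the difference between the unmodified and modified library behavior described above can be sketched roughly as follows. This is only an illustrative model, not the ROCdbgapi source: FakeDriver, attach_deferred, attach_eager, and the message strings are hypothetical names made up for this sketch. It assumes the debugger makes some other driver call before the deferred version check gets a chance to run.

```cpp
#include <exception>
#include <functional>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for the driver interface (not the real ROCdbgapi class).
struct FakeDriver {
    bool version_ok;  // false models a kernel driver without the expected ioctls

    void check_version() const {
        if (!version_ok) throw std::runtime_error("driver version mismatch");
    }
    void update_queues() const {
        if (!version_ok) throw std::runtime_error("update_queues failed");
    }
};

// Deferred check: the version check is only registered as a callback,
// so an earlier driver call fails first with a less helpful message.
std::string attach_deferred(const FakeDriver& drv) {
    std::vector<std::function<void()>> on_runtime_loaded;
    on_runtime_loaded.push_back([&drv] { drv.check_version(); });
    try {
        drv.update_queues();                      // runs before any callback fires
        for (auto& cb : on_runtime_loaded) cb();  // the check would only run here
    } catch (const std::exception& e) {
        return e.what();
    }
    return "ok";
}

// Eager check: the version is verified right away in attach,
// so the mismatch is reported before anything else is attempted.
std::string attach_eager(const FakeDriver& drv) {
    try {
        drv.check_version();
        drv.update_queues();
    } catch (const std::exception& e) {
        return e.what();
    }
    return "ok";
}
```

With a mismatched driver, the deferred variant in this model reports the update_queues failure, while the eager variant reports the version problem directly.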
In ROCdbgapi/src/kernel/kfd_ioctls.h, AMDKFD_IOC_DBG_TRAP is listed in "non-upstream ioctls" section.
Do I need to use the kernel in ROCK-Kernel-Driver to get rocgdb to work?
The 5.4.x (don't remember x) kernel that is default with Ubuntu 20.04 has the bug where once the screen blanks, it won't wake up again. That started me down a path of upgrading to kernels that have the fix for that issue.
Correct, you need ROCK since those ioctls are not yet upstream.
I was unclear on how DKMS and the rock-dkms packages worked. The problem was I wasn't installing the Ubuntu mainline kernels correctly - there are two linux-headers packages to install (one bare and one -generic), and I only tried installing the -generic one. It would fail to install properly, and the DKMS rebuild step would also fail.
I have now installed 5.4.106 correctly (this fixes the screen blank issue), the DKMS rebuild succeeds, and rocgdb seems to work.
@markdewing Apologies for the lack of response. Do you still need assistance with your ticket? Thanks!
No further assistance needed.
Running Ubuntu 20.04 with kernel 5.8.0-48-generic. GPU is Vega 56. ROCm is 4.1.
To be clear, the issue is that the debugger does not issue a warning message as to why it is failing.
When I start a simple HIP application under rocgdb, it fails after the run command, with no indication of why it failed.
After some poking around, I added a call to
os_driver().check_version();
to attach
in ROCdbgapi/src/process.cpp. With that change, rocgdb issues the warning message about the driver version. The 'attach' function puts a version check into the shared library load callback for libhsa-runtime64.so.1, but rocgdb must be calling other debugger functions before that happens. Turning on logging gives