ROCm / ROCgdb

This is ROCgdb, the ROCm source-level debugger for Linux, based on GDB, the GNU source-level debugger.
https://rocm.docs.amd.com/projects/ROCgdb/en/latest/
GNU General Public License v2.0

Rocgdb failing to launch application with "cannot attach to process" rc=-2 #11

Closed. drtpotter closed this issue 2 years ago.

drtpotter commented 2 years ago

Hi there,

I'm trying to use rocgdb to debug an OpenCL application running on my Radeon VII. I'm using ROCm 5.0 with the ROCclr runtime on openSUSE Tumbleweed.

Every time I start the application like this:

rocgdb ./mat_mult_badmem_gdb.exe

and then enter the run command, I get this error:

Starting program: mat_mult_badmem_gdb.exe
Could not attach to process 107215 (rc=-2)

I have tried both the binary release of rocgdb (5.0.0) and a build of rocgdb from the GitHub sources, following the instructions. The outcome is the same with either, and Google hasn't been my friend. The same thing happens when running rocgdb as root. The error does not occur if I use the GitHub source to build and run vanilla gdb without the AMD-specific goodies, but then I can't access the GPU.

If you can provide any insight into what error code rc=-2 means and how I can work around this issue, that would be great. It is preventing me from finishing a course section on how to debug OpenCL programs with rocgdb!

Kind regards, Toby

drtpotter commented 2 years ago

Bug is still present with the 5.1 release.

t-tye commented 2 years ago

@drtpotter please could you get a log using the rocgdb command:

set debug amdgpu log-level verbose
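For reference, one way to capture that log without an interactive session is to drive rocgdb in batch mode. The sketch below is a hypothetical Python helper, not part of ROCgdb. It relies only on the standard GDB options -batch, -ex and --args (ROCgdb is based on GDB), and it merges stdout and stderr because which stream the amd-dbgapi log lines go to is an assumption here.

```python
import subprocess
from typing import List, Sequence


def build_rocgdb_command(program: str, program_args: Sequence[str] = (),
                         gdb: str = "rocgdb") -> List[str]:
    """Build a batch-mode rocgdb command line that enables verbose
    amd-dbgapi logging before starting the program (hypothetical helper)."""
    return [
        gdb,
        "-batch",                                      # exit once the -ex commands finish
        "-ex", "set debug amdgpu log-level verbose",   # command requested in this thread
        "-ex", "run",                                  # start the inferior
        "--args", program, *program_args,              # --args must be the last option
    ]


def capture_rocgdb_log(program: str, program_args: Sequence[str] = (),
                       gdb: str = "rocgdb", timeout: float = 300.0) -> str:
    """Run rocgdb once and return its combined stdout/stderr as one string."""
    result = subprocess.run(
        build_rocgdb_command(program, program_args, gdb),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # assumption: the log may appear on either stream
        text=True,
        timeout=timeout,
        check=False,               # a failed attach is exactly what we want to record
    )
    return result.stdout
```

For example, calling capture_rocgdb_log("./mat_mult_badmem_gdb.exe") and saving the returned string to a file gives a log that can be attached to the issue as-is.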

drtpotter commented 2 years ago

Hey @t-tye

Thanks for looking into this. With the verbose debug log enabled, I get the output below. I'm using Linux kernel 5.17.

(gdb) run
Starting program: ~/OpenCL_Course/course_material/L4_Debugging/mat_mult_badmem.exe 
amd-dbgapi: amd_dbgapi_process_attach (client_process_id=0x22236f0, process_id=0x23c4fc8) {
amd-dbgapi:    callback: get_os_pid (pid=0x7ffec8f03e04) {
amd-dbgapi:    callback: } = STATUS_SUCCESS, *pid=12657
amd-dbgapi:    attaching process_1 to OS process 12657
amd-dbgapi:    detached process_1
amd-dbgapi:    linux_driver_t statistics (pid 12657): 0 reads (0), 0 writes (0)
amd-dbgapi: fatal error: KFD_IOC_DBG_TRAP_GET_VERSION failed
amd-dbgapi: } = STATUS_FATAL
Could not attach to process 12657 (rc=-2)

Kind regards, Toby

jpsamaroo commented 2 years ago

@drtpotter mainline kernels are not supported by rocgdb right now, because the necessary debugging ioctls are not yet present upstream (and the above failing ioctl is one of those). You would need to be running the https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver kernel to have rocgdb work.
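In the verbose log above, the attach call ends with STATUS_FATAL right after the KFD_IOC_DBG_TRAP_GET_VERSION failure, and rocgdb reports that as rc=-2. A minimal hypothetical Python helper that pulls those facts out of a saved log might look like this. The regular expressions assume exactly the message formats shown in this thread, and treating any failing KFD_IOC_DBG_TRAP ioctl as a sign of missing kernel debug support is a heuristic based only on the explanation above, not an official diagnostic.

```python
import re
from typing import NamedTuple, Optional


class AttachFailure(NamedTuple):
    pid: int
    rc: int
    status: Optional[str]        # last top-level amd-dbgapi status, if logged
    fatal_reason: Optional[str]  # text after "fatal error:", if logged


# Hypothetical patterns: they only match the log formats seen in this thread.
_FATAL_RE = re.compile(r"^amd-dbgapi: fatal error: (?P<reason>.+)$")
_STATUS_RE = re.compile(r"^amd-dbgapi: \} = (?P<status>STATUS_\w+)$")
_ATTACH_RE = re.compile(
    r"^Could not attach to process (?P<pid>\d+) \(rc=(?P<rc>-?\d+)\)$")


def parse_attach_failure(log: str) -> Optional[AttachFailure]:
    """Return the first attach failure in a rocgdb log, or None if there is none."""
    fatal_reason: Optional[str] = None
    status: Optional[str] = None
    for raw in log.splitlines():
        line = raw.strip()
        m = _FATAL_RE.match(line)
        if m:
            fatal_reason = m.group("reason")
            continue
        # Nested callback results ("callback: } = ...") deliberately do not match.
        m = _STATUS_RE.match(line)
        if m:
            status = m.group("status")
            continue
        m = _ATTACH_RE.match(line)
        if m:
            return AttachFailure(int(m.group("pid")), int(m.group("rc")),
                                 status, fatal_reason)
    return None


def looks_like_missing_kfd_debug_ioctl(failure: AttachFailure) -> bool:
    """Heuristic from this thread only: a failing KFD_IOC_DBG_TRAP_* ioctl
    suggested a kernel without the debugger ioctls rocgdb needs."""
    return (failure.fatal_reason is not None
            and "KFD_IOC_DBG_TRAP" in failure.fatal_reason)


sample = """\
amd-dbgapi: fatal error: KFD_IOC_DBG_TRAP_GET_VERSION failed
amd-dbgapi: } = STATUS_FATAL
Could not attach to process 12657 (rc=-2)
"""
failure = parse_attach_failure(sample)
print(failure)
# AttachFailure(pid=12657, rc=-2, status='STATUS_FATAL', fatal_reason='KFD_IOC_DBG_TRAP_GET_VERSION failed')
```

Without the verbose setting, only the final "Could not attach" line is present, so status and fatal_reason come back as None and the heuristic returns False.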

drtpotter commented 2 years ago

Hi @jpsamaroo

Oh, OK, switching to a custom kernel seems quite the task! Would it be enough to switch to the proprietary AMD OpenCL driver, or can we expect this functionality to land in the mainline Linux kernel soon?

drtpotter commented 2 years ago

Hi folks, I got around the problem by switching to openSUSE Leap 15.3 and using amdgpu-install to install both the legacy and ROCr OpenCL implementations. I'm not sure which change fixed it, but it works on that system.