ROCm / ROCgdb

This is ROCgdb, the ROCm source-level debugger for Linux, based on GDB, the GNU source-level debugger.
https://rocm.docs.amd.com/projects/ROCgdb/en/latest/
GNU General Public License v2.0

How do I trigger the debugger? #5

Closed: jpsamaroo closed this issue 3 years ago

jpsamaroo commented 3 years ago

I'm developing support for AMDGPU computing in Julia in the AMDGPU.jl package. We use HSA directly through ROCR and do codegen directly through LLVM (we don't use HIP or OpenCL). I'd like to use rocgdb to debug my programs and speed up development, but I haven't been able to get rocgdb to recognize that I'm using the GPU at all. When I run GPU kernels through Julia under rocgdb, info agents, info queues, etc. show no information, even though the kernel launches and completes successfully. We do emit DWARF debuginfo via LLVM into our HSA executables. I'm running a very recent (~2 weeks old) ROCK kernel with a similarly recent ROCgdb and ROCdbgapi. I can emit an LLVM debugtrap call, which is supposed to trigger the debugger, but it just hangs the process.

What I'd like to know is: how do I get rocgdb to recognize that I'm launching GPU kernels, and let me attach to the GPU process when it inevitably fails? Are the requirements for this documented somewhere?

t-tye commented 3 years ago

Try ‘set debug amdgpu log-level info’ before ‘run’ and report the log after running the application in ROCgdb.

jpsamaroo commented 3 years ago

I got nothing from info, but this is what I got with verbose:

Starting program: /home/jpsamaroo/bin/julia-master --project /home/jpsamaroo/amdgpu-simple.jl
amd-dbgapi: > amd_dbgapi_process_attach (0x7f7ecf7355e0, 0x7f7ec56d56a8)
amd-dbgapi:    > [callback] get_os_pid ()
amd-dbgapi:    > [callback] enable_notify_shared_library (shared_library_1)
amd-dbgapi: > amd_dbgapi_process_get_info (process_3, PROCESS_INFO_NOTIFIER, 4, 0x7f7ec56d56b0)
amd-dbgapi: > amd_dbgapi_next_pending_event (process_3)
amd-dbgapi: > amd_dbgapi_code_object_list (process_3)
amd-dbgapi:    > [callback] allocate_memory ()
[New LWP 18310]
[New LWP 18311]
[New LWP 18312]
[New LWP 18313]
[New LWP 18314]
[New LWP 18315]
[New LWP 18318]
'+fp64-fp16-denormals' is not a recognized feature for this target (ignoring feature)
'-fp32-denormals' is not a recognized feature for this target (ignoring feature)
'+fp64-fp16-denormals' is not a recognized feature for this target (ignoring feature)
'-fp32-denormals' is not a recognized feature for this target (ignoring feature)
[Detaching after vfork from child process 18382]
[Detaching after vfork from child process 18389]
[ Info: Done!

[LWP 18318 exited]
[LWP 18315 exited]
[LWP 18314 exited]
[LWP 18312 exited]
[LWP 18311 exited]
[LWP 18310 exited]
[LWP 18288 exited]
[Inferior 1 (process 18288) exited normally]
amd-dbgapi: > amd_dbgapi_process_detach (process_3)
amd-dbgapi:    > [callback] disable_notify_shared_library ()
jpsamaroo commented 3 years ago

From testing with a simple C debugger with ROCdbgapi integration, what rocgdb seems to be doing wrong is not reporting that libhsa-runtime64.so.1 gets loaded (if it did, we'd expect to see the get_symbol_address and insert_breakpoint callbacks called). Julia is multithreaded, and code may not execute on the main thread; is this potentially a situation that rocgdb doesn't know how to handle?

t-tye commented 3 years ago

> From testing with a simple C debugger with ROCdbgapi integration, what rocgdb seems to be doing wrong is not reporting that libhsa-runtime64.so.1 gets loaded (if it did, we'd expect to see the get_symbol_address and insert_breakpoint callbacks called). Julia is multithreaded, and code may not execute on the main thread; is this potentially a situation that rocgdb doesn't know how to handle?

I am not sure I understand the question. rocgdb should report shared library loading in the same way as standard gdb. I am not familiar with Julia. Is that the program executing in the inferior that is being debugged? Note that the HIP language runtime does deferred code object loading, as described in the AMD GPU section of the rocgdb User Manual installed in /opt/rocm/share/doc.

simark commented 3 years ago

I think @jpsamaroo is saying that they expect this to happen:

  1. dbgapi informs us (through the enable_notify_shared_library callback) that it wants to know if/when the libhsa-runtime64.so.1 library gets loaded
  2. We take note in info->notify_solib_map
  3. They load libhsa-runtime64.so.1 in their program
  4. rocm_target_solib_loaded gets called
  5. We inform dbgapi of it, which enables GPU debugging

If so, @jpsamaroo, you would need to debug and/or add debug prints in rocm-tdep.c, find out whether rocm_target_solib_loaded gets called with libhsa-runtime64.so.1, and work your way from there.
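
The five steps above can be modelled in a few lines of Python. This is a toy sketch, not code from rocm-tdep.c: the class `SolibNotifier`, its method `solib_loaded`, and the exact-basename matching rule are assumptions made for illustration. Only the callback name `enable_notify_shared_library` comes from the log.

```python
import os


class SolibNotifier:
    """Toy model of the notification handshake (hypothetical, for illustration)."""

    def __init__(self):
        # Step 2: library names the debug API asked to be told about.
        self.requested = set()
        self.gpu_debugging_enabled = False

    def enable_notify_shared_library(self, name):
        # Step 1: the debug API registers interest in a library name.
        self.requested.add(name)

    def solib_loaded(self, path):
        # Step 4: the debugger observes the inferior loading a library.
        # Assumption: matching is an exact comparison on the file name.
        name = os.path.basename(path)
        if name in self.requested:
            # Step 5: report the load, which enables GPU debugging.
            self.gpu_debugging_enabled = True
            return True
        return False
```

With `libhsa-runtime64.so.1` registered, a call such as `solib_loaded("/opt/rocm/lib/libhsa-runtime64.so.1")` sets `gpu_debugging_enabled`, while a library with any other file name leaves it unset.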

jpsamaroo commented 3 years ago

Found the issue! AMDGPU.jl (which is a package written for the Julia language) loads libhsa-runtime64.so, but dbgapi only requests to be notified for libhsa-runtime64.so.1, so the string comparisons in rocm_target_solib_loaded (https://github.com/ROCm-Developer-Tools/ROCgdb/blob/e4413be7472b6ec41a8d145dbe9385a907e380b2/gdb/rocm-tdep.c#L1367) fail to match our loaded library. Modifying this comparison to explicitly check for libhsa-runtime64.so causes dbgapi to be notified, and I get a working GPU debugger!

jpsamaroo commented 3 years ago

I'm going to modify AMDGPU.jl to load only libhsa-runtime64.so.1. I also propose that ROCgdb first compare library names without the version suffix. If it finds a matching library whose readlink'd name has no suffix or a different major version, it would output a warning that behavior might be unexpected (or simply not notify dbgapi, while still telling the user that they messed up). I'd be happy to put together a patch for this if it would be considered.
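
A rough sketch of this proposal, in Python rather than ROCgdb's own code. The function name `compare_solib`, the returned strings, and the regular expression are all hypothetical. It assumes library names follow the usual `name.so[.major[.minor...]]` pattern and uses `os.path.realpath` in place of a single `readlink`.

```python
import os
import re

# Splits "libfoo.so.1.2" into stem "libfoo.so" and major version "1".
_SONAME = re.compile(r"^(?P<stem>.+?\.so)(?:\.(?P<major>\d+)(?:\.\d+)*)?$")


def compare_solib(loaded_path, requested_name):
    """Classify a loaded library against the name the debug API requested."""
    # Follow symlinks so the version is taken from the real file name.
    resolved = os.path.basename(os.path.realpath(loaded_path))
    loaded = _SONAME.match(resolved)
    wanted = _SONAME.match(requested_name)
    if not loaded or not wanted or loaded["stem"] != wanted["stem"]:
        return "unrelated"
    if loaded["major"] == wanted["major"]:
        return "match"
    if loaded["major"] is None:
        return "warn: no version suffix"
    return "warn: different major version"
```

Under these assumptions, an unversioned symlink that resolves to a `.so.1.x` file counts as a match, while a plain `libhsa-runtime64.so` file produces a warning instead of silently failing.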

t-tye commented 3 years ago

Only the versioned library should be loaded by a component, using its standard name, and rocgdb supports only the version of the library that it checks for. Linux allows hard links, so making what you describe robust would presumably involve comparing the inodes of the files to see if they match; it cannot simply be based on the file path name suffix. I am not sure that diagnosing mistakes in other components is the responsibility of rocgdb :-)
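
For illustration, the inode comparison mentioned here can be sketched in Python as follows. The helper name `same_file` is hypothetical and this is not something rocgdb does; the standard library's `os.path.samefile` performs an equivalent check.

```python
import os


def same_file(path_a, path_b):
    """Return True if both paths refer to the same inode on the same device."""
    # os.stat follows symlinks, so symlinks and hard links to one file
    # both compare equal to the original path.
    a = os.stat(path_a)
    b = os.stat(path_b)
    return (a.st_dev, a.st_ino) == (b.st_dev, b.st_ino)
```

Both the device and the inode number are compared because inode numbers are only unique within one filesystem.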

jpsamaroo commented 3 years ago

I guess that's fair, but I'm concerned that, since this problem was hard for me to figure out, someone else will be led down the same road. Is there any potential for at least logging the name of the shared library that dbgapi has requested be tracked?

t-tye commented 3 years ago

I think it is not unreasonable for language developers to understand how the system works and debug related issues. The OpenCL/HIP language runtimes are responsible for loading the correct ROCr runtime library, which is versioned.

How is Julia interacting with the ROCr library? Is it linked in so that the normal loader will load it as a dependency? Or is the Julia runtime dynamically loading it?

We ship rocdbgapi with rocr, so they are always a matched pair; I am not sure where you suggest this be logged. Would mentioning it as a dependency in the rocdbgapi README.md make sense?

jpsamaroo commented 3 years ago

I think that in the verbose logging from ROCdbgapi we could show which library is being tracked, by name (not just by ID). I haven't convinced myself that this is strictly required, but it might be something to consider.

Julia is loading the library dynamically (with dlopen IIRC), upon the first call to one of the library's functions. AMDGPU.jl is the one specifying the library to load, so I pushed some changes that force it to only load v1 of the library, which seems to be working.
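
As a rough analogue of that change, here is a minimal Python sketch that loads the library by its versioned name only. It is not AMDGPU.jl code: `load_hsa_runtime` and `HSA_SONAME` are hypothetical names, and Python's `ctypes.CDLL` (which calls `dlopen` on Linux) stands in for Julia's dynamic loading.

```python
import ctypes

# The versioned name that the debug API asks to be notified about.
HSA_SONAME = "libhsa-runtime64.so.1"


def load_hsa_runtime(loader=ctypes.CDLL):
    """Load the versioned runtime library; return None if it is unavailable."""
    try:
        # Never fall back to the unversioned "libhsa-runtime64.so" name.
        return loader(HSA_SONAME)
    except OSError:
        return None
```

The `loader` parameter exists only so that the function can be exercised without ROCm installed; a real caller would use the default.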

I'm running on Alpine Linux, so ROCdbgapi and ROCR can certainly be out of sync, at least until someone steps up and provides APKBUILDs for these libraries (which I'll probably end up doing). I think a comment in the ROCdbgapi README would be very helpful, though.

t-tye commented 3 years ago

I added the following to the README.md file:

The ROCdbgapi library requires that the ROCr library is loaded in the inferior
to enable AMD GPU debugging.  This can be installed as part of the AMD ROCm
release by the ``hsa-rocr-dev`` package:

- ``libhsa-runtime64.so.1``

This should be part of the next push.

The plan is to eliminate the need to detect this library being loaded, so the whole problem you encountered will go away :-) So I am not sure it is worth the effort to update the logging at this point.

jpsamaroo commented 3 years ago

Awesome, thanks so much @t-tye ! That all sounds good to me.