ROCm / ROCgdb

This is ROCgdb, the ROCm source-level debugger for Linux, based on GDB, the GNU source-level debugger.
https://rocm.docs.amd.com/projects/ROCgdb/en/latest/
GNU General Public License v2.0

Printing Lambda Variable Captured by Value in a HIP Kernel Fails #24

Closed verdog closed 4 weeks ago

verdog commented 1 year ago

This code:

  1 #include "hip/hip_runtime.h"
  2
  3 template <typename L>
  4 __global__
  5 void callLambda(L l) {
  6   l();
  7 }
  8
  9 int main(const int argc, const char** argv) {
 10   const int num_floats = 1024 * 1024;
 11   const int bytes = num_floats * sizeof(float);
 12   const int block_size = 256;
 13   const int blocks = num_floats / block_size;
 14
 15   float *buf = (float*)malloc(bytes);
 16   float *d_buf;
 17   hipMalloc(&d_buf, bytes);
 18
 19   auto l = [=](){
 20     const int i = hipBlockDim_x * hipBlockIdx_x + hipThreadIdx_x;
 21     d_buf[i] = i;
 22   };
 23
 24   hipLaunchKernelGGL(callLambda, dim3(blocks), dim3(block_size), 0, 0, l);
 25   hipMemcpy(buf, d_buf, bytes, hipMemcpyDeviceToHost);
 26
 27   free(buf);
 28   hipFree(d_buf);
 29
 30   return 0;
 31 }

launches a simple kernel that calls the passed-in lambda. The kernel works as expected: every element of d_buf (and then buf) is set to its index.

If I set a breakpoint inside the lambda, rocgdb cannot print variables that were captured by value, although it can print the local variable i:

$ hipcc -g -O0 --offload-arch=gfx908 bug.cpp -o bug
$ rocgdb ./bug
GNU gdb (rocm-rel-5.5-74) 12.1
Copyright (C) 2022 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://github.com/ROCm-Developer-Tools/ROCgdb/issues>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./bug...
(gdb) break 27
Breakpoint 1 at 0x217650: file bug.cpp, line 27.
(gdb) break 21
Breakpoint 2 at 0x2175b9: file bug.cpp, line 24.
(gdb) r
Starting program: /cray/css/users/chandlej/home/chandlej/raja/raja/build/HIP-Examples/mini-nbody/hip/bug
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
[New Thread 0x7fffe17a7700 (LWP 1606283)]
[New Thread 0x7ffbc77ff700 (LWP 1606284)]
[Thread 0x7ffbc77ff700 (LWP 1606284) exited]

Thread 1 "bug" hit Breakpoint 2, main (argc=1, argv=0x7fffffff9438) at bug.cpp:24
24        hipLaunchKernelGGL(callLambda, dim3(blocks), dim3(block_size), 0, 0, l);
(gdb) c
Continuing.
[New Thread 0x7ffba6fff700 (LWP 1606285)]
[Thread 0x7ffba6fff700 (LWP 1606285) exited]
[New Thread 0x7fffcf247700 (LWP 1606286)]
[Switching to thread 6, lane 0 (AMDGPU Lane 1:2:1:1/0 (0,0,0)[0,0,0])]

Thread 6 "bug" hit Breakpoint 2, with lanes [0-63], main::{lambda()#1}::operator()() const (this=0x2000000000008) at bug.cpp:21
21          d_buf[i] = i;
(gdb) p i // printing the local variable is ok
$1 = 0
(gdb) p d_buf // printing the captured pointer is not ok
Cannot access memory at address 0x2000000000008
(gdb) n
[Switching to thread 2152, lane 0 (AMDGPU Lane 1:2:1:2147/0 (536,0,0)[128,0,0])]

Thread 2152 "bug" hit Breakpoint 2, with lanes [0-63], main::{lambda()#1}::operator()() const (this=0x2000000000008) at bug.cpp:21
21          d_buf[i] = i;
(gdb) p i
$2 = 137344
(gdb) p d_buf
Cannot access memory at address 0x2000000000008
(gdb) disable 2
(gdb) c
Continuing.
[Switching to thread 1 (Thread 0x7fffed9b5a80 (LWP 1606274))]

Thread 1 "bug" hit Breakpoint 1, main (argc=1, argv=0x7fffffff9438) at bug.cpp:27
27        free(buf);
(gdb) p buf[0]@10
$3 = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
(gdb) p buf[777777]@10
$4 = {777777, 777778, 777779, 777780, 777781, 777782, 777783, 777784, 777785, 777786}
(gdb) q

This workflow is common in RAJA; see their example programs.

ROCm Versions

$ hipcc --version
HIP version: 5.5.30202-eaf00c0b
AMD clang version 16.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-5.5.1 23194 69ef12a7c3cc5b0ccf820bc007bd87e8b3ac3037)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/rocm-5.5.1/llvm/bin

$ rocgdb --version
GNU gdb (rocm-rel-5.5-74) 12.1
Copyright (C) 2022 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Using an MI100 GPU.

lancesix commented 4 months ago

Thanks for reporting the issue, and please accept our apologies for the delayed response. There is an internal ticket tracking this issue so we can provide a long-term solution.

In the meantime, here is a quick patch that works around the issue you are seeing. It is not a permanent (or clean) solution, but it could unblock anyone hitting a similar problem.

diff --git a/gdb/amd-dbgapi-target.c b/gdb/amd-dbgapi-target.c
index 830ef40644a..9c079069a11 100644
--- a/gdb/amd-dbgapi-target.c
+++ b/gdb/amd-dbgapi-target.c
@@ -1047,6 +1047,12 @@ amd_dbgapi_target::xfer_partial (enum target_object object, const char *annex,
   uint64_t dwarf_address_space
     = (uint64_t) amdgpu_address_space_id_from_core_address (offset);

+  /* Promote GLOBAL addresses to GENERIC.
+
+     See DWARF_*_ADDRESS_CLASS in gdb/amdgpu-tdep.c.  */
+  if (dwarf_address_space == 0 /* DWARF_GLOBAL_ADDR_CLASS */)
+    dwarf_address_space = 1 /* DWARF_GENERIC_ADDR_CLASS */;
+
   amd_dbgapi_segment_address_t segment_address
     = amdgpu_segment_address_from_core_address (offset);
ppanchad-amd commented 1 month ago

@verdog Do you still need assistance with this ticket? If not, please close the ticket. Thanks!

verdog commented 4 weeks ago

I have visibility on the internal ticket and will watch it from there. Thanks!