ROCm / ROCgdb

This is ROCgdb, the ROCm source-level debugger for Linux, based on GDB, the GNU source-level debugger.
https://rocm.docs.amd.com/projects/ROCgdb/en/latest/
GNU General Public License v2.0
48 stars 9 forks source link

Unknown CFA rule. A problem internal to GDB has been detected #9

Open zjin-lcf opened 2 years ago

zjin-lcf commented 2 years ago

While working with DPCPP (https://github.com/intel/llvm) compiling SYCL kernels targeting ROCm (gfx908) I came across the gdb error. It is true that there are some issues with compiling the SYCL kernels. I am not familiar with gdb internals. Does the gdb error provide some hint/clue to the compiler issue ?

https://github.com/zjin-lcf/oneAPI-DirectProgramming/tree/master/babelstream-sycl

rocgdb ./main
Precision: float
Array size: 134.2 MB (=0.1 GB)
Total size: 402.7 MB (=0.4 GB)
Memory access fault by GPU node-2 (Agent handle: 0x532080) on address 0x7ffeb0019000. Reason: Page not present or supervisor privilege.

Thread 6 "main" received signal SIGSEGV, Segmentation fault.
[Switching to AMDGPU Thread 2:3:1:1 (130944,0,0)/0]
copy<float>(cl::sycl::queue&, cl::sycl::buffer<float, 1, cl::sycl::detail::aligned_allocator<char>, void>&, cl::sycl::buffer<float, 1, cl::sycl::detail::aligned_allocator<char>, void>&)::{lambda(cl::sycl::handler&)#1}::operator()(cl::sycl::handler&) const::{lambda(cl::sycl::nd_item<1>)#1}::operator()(cl::sycl::nd_item<1>) const (
/home/release/git/aomp13/ROCgdb/gdb/dwarf2/frame.c:1029: internal-error: Unknown CFA rule.
A problem internal to GDB has been detected,
further debugging may prove unreliable.
t-tye commented 2 years ago

Thank you for the report. We actually recently encountered the same issue and have a fix prepared in the compiler to address the issue of the CFI being incorrect. In addition, I believe gdb was not gracefully handling this and that issue was also being fixed.

Will report back once those fixes are upstreamed.

t-tye commented 2 years ago

ROCgdb requires correct CFI DWARF information to be present in the GPU code objects. The upstream LLVM compiler does not contain this code.

Gdb does not handle unwinding well in the absence of CFI information which results in the internal error. That needs to be fixed in upstream gdb but is non-trivial due to the current design so a fix is not expected soon.

The ROCm LLVM based AMD GPU compiler is open source in github so it may be possible to use that to generate correct CFI information. There were some recent fixes which may not be available in the github until the next release is published.

zjin-lcf commented 2 years ago

Thank you for explaining the technical issue and updates.

npmiller commented 1 year ago

The ROCm LLVM based AMD GPU compiler is open source in github so it may be possible to use that to generate correct CFI information. There were some recent fixes which may not be available in the github util the next release is published.

I've just ran into this issue again, do you know if there are any plans to upstream CFI generation for AMD targets?

t-tye commented 1 year ago

The ROCm LLVM based AMD GPU compiler is open source in github so it may be possible to use that to generate correct CFI information. There were some recent fixes which may not be available in the github util the next release is published.

I've just ran into this issue again, do you know if there are any plans to upstream CFI generation for AMD targets?

The debug extension support for AMD GPU has started to be upstreamed through the RFC at:

https://discourse.llvm.org/t/rfc-heterogeneous-debug-info/66872

karthik-man commented 2 weeks ago

I am seeing the same issue when I run bt after a SEGV. I am not able to figure out if the required changes have already been upstreamed to llvm. Kindly let me know if this error is still expected or if the error is related to my setup.

lancesix commented 2 weeks ago

Hi @karthik-man, are you using DPCPP? The debug information extensions required to debug GPU code are still in the process of being standardised. Work to integrate our approach is ongoing on LLVM, but might ultimately be tied to standardization progress. Until then, we only support debug information produced by ROCm's llvm (this is also compatible to debug CPU code using standard dwarf-5 debug info).

If you are using ROCm's llvm, can you share a reproducer that triggers the failure you are seeing so we can investigate?

karthik-man commented 1 week ago

I am not using ROCM's LLVM. The program I am trying to debug is a Triton kernel and Triton uses LLVM for codegen.