dbeckwith opened this issue 6 years ago
I've found that I can get the kernel to pause on an exception and then attach `cuda-gdb` to it, but I still need a PTX file with debug symbols to be able to debug.
Actually, this is pretty easy. I can make a PR.
Could you please tell more about your CUDA debugging experience? I'm curious because I never tried this before even with C++ CUDA.
This is my first go at it as well. Reading the NVIDIA docs it seems pretty straightforward. The best feature I've seen so far is that if you set the environment variable `CUDA_DEVICE_WAITS_ON_EXCEPTION`, the CUDA kernel will pause for you when it encounters an error and then you can attach a debugging session with `cuda-gdb`.
However, I'm still having a problem with my application and haven't actually gotten to the point of being able to use the debugger yet. Building with the latest `ptx-builder`, `ptx-linker`, and LLVM 6.0 in debug mode, I still get a linker error (truncated because there was a lot of debug output):
```
[PTX] [INFO] Linking without optimisations
[PTX] LLVM ERROR: Cannot select: 0x7f64bb3f2ea0: f64 = fpow 0x7f64bb3e43a8, 0x7f64bb3e4138
[PTX]   0x7f64bb3e43a8: f64,ch = CopyFromReg 0x7f64bc0c5840, Register:f64 %11
[PTX]     0x7f64bb31d2d8: f64 = Register %11
[PTX]   0x7f64bb3e4138: f64 = fdiv ConstantFP:f64<1.000000e+00>, 0x7f64baa42068
[PTX]     0x7f64bb3e4bc8: f64 = ConstantFP<1.000000e+00>
[PTX]     0x7f64baa42068: f64 = uint_to_fp 0x7f64bb3f2958
[PTX]       0x7f64bb3f2958: i16 = and 0x7f64bb3f2d00, Constant:i16<255>
[PTX]         0x7f64bb3f2d00: i16,ch = CopyFromReg 0x7f64bc0c5840, Register:i16 %13
[PTX]           0x7f64bb17d548: i16 = Register %13
[PTX]         0x7f64baa42c30: i16 = Constant<255>
[PTX] In function: _ZN6common4util17sample_hemisphere17h943cfe33e8d056d2E
```
I believe the line of code it's trying to compile is this:

```rust
let z = unsafe { core::intrinsics::powf64(r1, 1. / (n + 1) as f64) };
```
I don't get this error when I compile in release mode, however, which is strange. Any ideas?
From playing around a bit, in both release and debug mode, I get linker errors saying it cannot select things like `fpow` and `fcos`, which wasn't happening with LLVM 5.0. How do I use these math functions in PTX code?
Also, with a very simple kernel that does compile and link in debug mode, when I try to load it the driver gives me `CUDA_ERROR_INVALID_PTX`. It still works in release mode though. So, to summarize where I'm at right now:

- `CUDA_ERROR_INVALID_PTX`
I think it's worth trying an external `libm` implementation for the math functions. The `fpow` issue is kind of expected because there is no `pow` PTX instruction. A few more details can be found in the tutorial.
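Since there is no `pow` PTX instruction, one possible workaround (a hedged sketch, not what `ptx-linker` or `libm` actually does) is to rewrite the power through the identity x^y = exp(y · ln x), which holds for x > 0. On the device the same shape would use the corresponding intrinsics or an external `libm`; this version uses `std` so it runs on the host. The helper name is hypothetical:

```rust
// Hypothetical sketch: emulate powf via exp/ln, valid for x > 0,
// because PTX has no pow instruction. On the device you would call
// the equivalent exp/log intrinsics or an external libm instead.
fn pow_via_exp_ln(x: f64, y: f64) -> f64 {
    (y * x.ln()).exp()
}

fn main() {
    let r1 = 0.25_f64;
    let n = 3_u8;
    // Mirrors the failing line above: r1^(1 / (n + 1))
    let z = pow_via_exp_ln(r1, 1.0 / f64::from(n + 1));
    // On the host, the result matches the library powf closely.
    assert!((z - r1.powf(0.25)).abs() < 1e-12);
    println!("{z}");
}
```

Accuracy differs slightly from a true `pow` near the edges of the domain (and it is undefined for x <= 0), so it is only a stopgap until a proper `libm` is wired in.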
About the `CUDA_ERROR_INVALID_PTX` issue, it can likely be solved in `ptx-linker`. We already make "corrections" to the produced LLVM IR, for example replacing `.` with `_` in globals (because a variable named `const.3` is not valid PTX). We need to figure out what parts of the debug assembly cause the issue.
My plan is to finish rust-crate-compile-test first, and then we will be able to test `ptx-linker` more reliably. The linker is the core of Rust's CUDA stack and unfortunately it currently lacks proper testing.
I'm afraid I need to make debug mode optional for now :(
Thanks, I'll look into using a library to provide the math functions.
Would it help if I provided the full PTX causing `CUDA_ERROR_INVALID_PTX`?
I think it would be nice if that's possible.
I suspect that the PTX of even very small crates can also cause `CUDA_ERROR_INVALID_PTX`. I've seen this even when compiling a tutorial crate in debug mode.
https://www.dropbox.com/s/kghoah6wobqcx5a/proxy.ptx?dl=1
I've noticed that the PTX file has names with `.` in them, which you said before was invalid, so maybe that's part of it?
I think the files are okay; `.file` is a PTX directive. The syntax should be fine:

```
.file N "path"
```
What is actually weird is this:

```
.visible .func (.param .b32 func_retval0) _ZN76_$LT$core..marker..PhantomData$LT$T$GT$$u20$as$u20$core..cmp..PartialOrd$GT$11partial_cmp17h0fb7853047282263E
(
.param .b64 _ZN76_$LT$core..marker..PhantomData$LT$T$GT$$u20$as$u20$core..cmp..PartialOrd$GT$11partial_cmp17h0fb7853047282263E_param_0,
.param .b64 _ZN76_$LT$core..marker..PhantomData$LT$T$GT$$u20$as$u20$core..cmp..PartialOrd$GT$11partial_cmp17h0fb7853047282263E_param_1
)
```
You are right, `.` in the function name and params isn't okay. I'll try to make a minimal reproducible example and fix this on the `ptx-linker` side.
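For reference, a quick host-side sanity check makes such names easy to spot. This is a hedged sketch loosely following the identifier grammar in the PTX ISA reference (first character a letter, `_`, `$`, or `%`; the rest letters, digits, `_`, `$`); the helper name is hypothetical:

```rust
/// Rough validity check for PTX identifiers. Note that `.` is never
/// allowed inside an identifier, which is exactly what breaks the
/// demangled-style names above.
fn is_valid_ptx_ident(name: &str) -> bool {
    let mut chars = name.chars();
    match chars.next() {
        Some(c) if c.is_ascii_alphabetic() || "_$%".contains(c) => {}
        _ => return false,
    }
    chars.all(|c| c.is_ascii_alphanumeric() || c == '_' || c == '$')
}

fn main() {
    // A name containing `..` fails the check...
    assert!(!is_valid_ptx_ident("_ZN76_$LT$core..marker..PhantomData$GT$E"));
    // ...while a plain mangled symbol passes.
    assert!(is_valid_ptx_ident(
        "_ZN6common4util17sample_hemisphere17h943cfe33e8d056d2E"
    ));
}
```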
Could you please try with `ptx-linker@0.6.1` and `ptx-builder@0.3.6`? At least the function names should be okay.
There is a good chance that the PTX won't cause `CUDA_ERROR_INVALID_PTX` anymore :)
It no longer causes `CUDA_ERROR_INVALID_PTX`, but I think that's just because it's not building in debug mode.
Ah, never mind, I see that I just have to set it to run in debug mode. I will try that out.
Okay, confirming that my PTX built in debug mode does not cause `CUDA_ERROR_INVALID_PTX` anymore. However, when I try to debug using `cuda-gdb`, it keeps telling me "No symbol table info available." It could be that I'm using the debugger wrong, but I was expecting to see a more readable stack trace and the values of local variables.
Okay, thank you for testing!
Could you please create a simple repo with general steps, so I or somebody else can play around with debugging?
I've started a repo with a similar layout to my main application I've been working on. I will add some steps to the README on how to introduce a bug to the code and then debug it.
Thanks for sharing, looks impressive! Glad to see that the whole Rust - CUDA thing (at least partially) works not only for me :)
About debugging... I was able to set up and hit a breakpoint in the kernel.
```
cuda-gdb --args target/debug/cuda-debugging-tutorial 20 ~/rust-inline-cuda-tutorial/fixtures/input-1024.png output-1024.png
(cuda-gdb) b blur
Function "blur" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
(cuda-gdb) r
...
Thread 1 "cuda-debugging-" hit Breakpoint 1, 0x00007ffff5cf2008 in blur<<<(32,32,1),(32,32,1)>>> ()
(cuda-gdb) info cuda kernels
  Kernel Parent Dev Grid Status SMs Mask   GridDim   BlockDim  Invocation
* 0      -      0   1    Active 0x000fffff (32,32,1) (32,32,1) blur()
(cuda-gdb) info cuda devices
  Dev PCI Bus/Dev ID Name             Description SM Type SMs Warps/SM Lanes/Warp Max Regs/Lane Active SMs Mask
* 0   01:00.0        GeForce GTX 1080 GP104-A     sm_61   20  64       32         256           0x000000000000000000000000000ffff
```
So at least in theory debugging is possible :) Unfortunately it's far from usable, because somehow `cuda-gdb` doesn't care about the source code locations defined in the debug PTX...
```
(cuda-gdb) list
1 <cuda-builtins>: No such file or directory.
(cuda-gdb) bt
#0 0x00007ffff5cf2008 in blur<<<(32,32,1),(32,32,1)>>> ()
(cuda-gdb) display/10i $pc
2: x/10i $pc
=> 0x7ffff5cf2008 <blur+8>:  MOV R1, c[0x0][0x20]
   0x7ffff5cf2010 <blur+16>: { IADD32I R1, R1, -0xc0
   0x7ffff5cf2018 <blur+24>: S2R R0, SR_CTAID.X
   0x7ffff5cf2020 <blur+32>:
   0x7ffff5cf2028 <blur+40>: { XMAD R5, R0.reuse, c[0x0] [0x8], RZ
   0x7ffff5cf2030 <blur+48>: PBK 0x258
   0x7ffff5cf2038 <blur+56>: { XMAD R2, R0.reuse, c[0x0] [0x8], RZ
   0x7ffff5cf2040 <blur+64>: Cannot disassemble instruction
   0x7ffff5cf2044 <blur+68>: Cannot disassemble instruction
   0x7ffff5cf2048 <blur+72>: SSY 0x230
(cuda-gdb) s
Single stepping until exit from function blur,
which has no line number information.
```
I think we could try to create a small app with CUDA in Clang. Then we can compare Rust and Clang PTX assemblies for debugging-related things.
Nice, you got a bit farther than I did. Setting a breakpoint before the program ran was the next thing I was trying to figure out how to do.
If we need an example of PTX code we know is debuggable, I had actually already done some work in that direction. My first approach to this, which I gave up on, was to use the technique mentioned here, but adding the debug options to the original `nvcc` call. So I think we could make a sample application in CUDA C, compile it using `nvcc` with debug options and `--keep`, and then inspect the PTX it produces in the middle of the pipeline. I really don't know much about PTX or what format `cuda-gdb` expects for debug symbols, but that might give us some real examples to work with.
Caveat: Minimally tested and super hacky.
I did some experiments on this issue and managed to get source code locations working. Debugging support for NVPTX was added in LLVM just after LLVM 8.0 (on which Rust is currently based) was branched off.
I managed to backport this commit and some others by the same author, and create custom versions of LLVM, rustc (with the required interface and compile options added), ptx-linker, and ptx-builder.
What is working:
What is not working:
I did this mainly as a quick experiment, so do not expect production-ready code quality ;)
@Vespasian these are good findings, thanks!
Am I right that it's only these 2 LLVM commits you had to backport? https://github.com/llvm/llvm-project/commit/897129dc3fe79137e00c5dccc30e11221dbd6d7f https://github.com/llvm/llvm-project/commit/f3a9150324c4fe88b7f9e28be18e0f989dff8af7
I think the `ptx-linker` and `ptx-builder` changes can safely be upstreamed; I'd be glad to review the PRs :)
I'm trying to figure out how to produce a debuggable binary from the PTX file that I can debug with the NVIDIA tools. It's looking like I'm going to need a non-release-mode version of the PTX file. Could you expose whether to pass the `--release` option to `xargo` or not? Is this even the right thing to do when trying to debug a CUDA kernel produced with this tool?