dbeckwith opened this issue 6 years ago
I've found that I can get the kernel to pause on an exception and then attach `cuda-gdb` to it, but I still need a PTX file with debug symbols to be able to debug.
Actually, this is pretty easy. I can make a PR.
Could you please tell more about your CUDA debugging experience? I'm curious because I never tried this before even with C++ CUDA.
This is my first go at it as well. Reading the NVIDIA docs it seems pretty straightforward. The best feature I've seen so far is that if you set the environment variable `CUDA_DEVICE_WAITS_ON_EXCEPTION`, the CUDA kernel will pause for you when it encounters an error and then you can attach a debugging session with `cuda-gdb`.
However, I'm still having a problem with my application and haven't actually gotten to the point of being able to use the debugger yet. Building with the latest `ptx-builder`, `ptx-linker`, and LLVM 6.0 in debug mode, I still get a linker error (truncated because there was a lot of debug output):
```
[PTX] [INFO] Linking without optimisations
[PTX] LLVM ERROR: Cannot select: 0x7f64bb3f2ea0: f64 = fpow 0x7f64bb3e43a8, 0x7f64bb3e4138
[PTX]   0x7f64bb3e43a8: f64,ch = CopyFromReg 0x7f64bc0c5840, Register:f64 %11
[PTX]     0x7f64bb31d2d8: f64 = Register %11
[PTX]   0x7f64bb3e4138: f64 = fdiv ConstantFP:f64<1.000000e+00>, 0x7f64baa42068
[PTX]     0x7f64bb3e4bc8: f64 = ConstantFP<1.000000e+00>
[PTX]     0x7f64baa42068: f64 = uint_to_fp 0x7f64bb3f2958
[PTX]       0x7f64bb3f2958: i16 = and 0x7f64bb3f2d00, Constant:i16<255>
[PTX]         0x7f64bb3f2d00: i16,ch = CopyFromReg 0x7f64bc0c5840, Register:i16 %13
[PTX]           0x7f64bb17d548: i16 = Register %13
[PTX]         0x7f64baa42c30: i16 = Constant<255>
[PTX] In function: _ZN6common4util17sample_hemisphere17h943cfe33e8d056d2E
```
I believe the line of code it's trying to compile is this:

```rust
let z = unsafe { core::intrinsics::powf64(r1, 1. / (n + 1) as f64) };
```
I don't get this error when I compile in release mode, however, which is strange. Any ideas?
From playing around a bit, in both release and debug mode, I get linker errors saying it cannot select things like `fpow` and `fcos`, which wasn't happening with LLVM 5.0. How do I use these math functions in PTX code?
Also, with a very simple kernel that does compile and link in debug mode, when I try to load it the driver gives me `CUDA_ERROR_INVALID_PTX`. It still works in release mode though. So, to summarize where I'm at right now:

- `CUDA_ERROR_INVALID_PTX`
I think it's worth trying an external `libm` implementation for the math functions. The `fpow` issue is kind of expected because there is no `pow` PTX instruction. A few more details can be found in the tutorial.
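Since there is no `pow` PTX instruction, one possible workaround (a hedged sketch, not what `ptx-linker` or `libm` actually does) is to rewrite the power through the identity x^y = exp(y · ln x), which holds for x > 0. On the device the same shape would use the corresponding intrinsics or an external `libm`; this version uses `std` so it runs on the host. The helper name is hypothetical:

```rust
// Hypothetical sketch: emulate powf via exp/ln, valid for x > 0,
// because PTX has no pow instruction. On the device you would call
// the equivalent exp/log intrinsics or an external libm instead.
fn pow_via_exp_ln(x: f64, y: f64) -> f64 {
    (y * x.ln()).exp()
}

fn main() {
    let r1 = 0.25_f64;
    let n = 3_u8;
    // Mirrors the failing line above: r1^(1 / (n + 1))
    let z = pow_via_exp_ln(r1, 1.0 / f64::from(n + 1));
    // On the host, the result matches the library powf closely.
    assert!((z - r1.powf(0.25)).abs() < 1e-12);
    println!("{z}");
}
```

Accuracy differs slightly from a true `pow` near the edges of the domain (and it is undefined for x <= 0), so it is only a stopgap until a proper `libm` is wired in.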
About the `CUDA_ERROR_INVALID_PTX` issue, it can likely be solved in `ptx-linker`. We already make "corrections" to the produced LLVM IR, for example replacing `.` with `_` in globals (because a variable named `const.3` is not valid PTX). We need to figure out what parts of the debug assembly cause the issue.
My plan is to finish rust-crate-compile-test first, and then we will be able to test `ptx-linker` more reliably. The linker is the core of Rust's CUDA stack and unfortunately it currently lacks proper testing.
I'm afraid I need to make debug mode optional for now :(
Thanks, I'll look into using a library to provide the math functions.
Would it help if I provided the full PTX causing `CUDA_ERROR_INVALID_PTX`?
I think it would be nice if that's possible.
I suspect that the PTX of even very small crates can also cause `CUDA_ERROR_INVALID_PTX`. I've seen this even when compiling a tutorial crate in debug mode.
https://www.dropbox.com/s/kghoah6wobqcx5a/proxy.ptx?dl=1
I've noticed that the PTX file has names with `.` in them, which you said before was invalid, so maybe that's part of it?
I think the files are okay; `.file` is a PTX directive. The syntax should be fine:

```
.file N "path"
```
What is actually weird is this:

```
.visible .func (.param .b32 func_retval0) _ZN76_$LT$core..marker..PhantomData$LT$T$GT$$u20$as$u20$core..cmp..PartialOrd$GT$11partial_cmp17h0fb7853047282263E
(
.param .b64 _ZN76_$LT$core..marker..PhantomData$LT$T$GT$$u20$as$u20$core..cmp..PartialOrd$GT$11partial_cmp17h0fb7853047282263E_param_0,
.param .b64 _ZN76_$LT$core..marker..PhantomData$LT$T$GT$$u20$as$u20$core..cmp..PartialOrd$GT$11partial_cmp17h0fb7853047282263E_param_1
)
```
You are right, `.` in the function name and params isn't okay. I'll try to make a minimal reproducible example and fix this on the `ptx-linker` side.
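For reference, a quick host-side sanity check makes such names easy to spot. This is a hedged sketch loosely following the identifier grammar in the PTX ISA reference (first character a letter, `_`, `$`, or `%`; the rest letters, digits, `_`, `$`); the helper name is hypothetical:

```rust
/// Rough validity check for PTX identifiers. Note that `.` is never
/// allowed inside an identifier, which is exactly what breaks the
/// demangled-style names above.
fn is_valid_ptx_ident(name: &str) -> bool {
    let mut chars = name.chars();
    match chars.next() {
        Some(c) if c.is_ascii_alphabetic() || "_$%".contains(c) => {}
        _ => return false,
    }
    chars.all(|c| c.is_ascii_alphanumeric() || c == '_' || c == '$')
}

fn main() {
    // A name containing `..` fails the check...
    assert!(!is_valid_ptx_ident("_ZN76_$LT$core..marker..PhantomData$GT$E"));
    // ...while a plain mangled symbol passes.
    assert!(is_valid_ptx_ident(
        "_ZN6common4util17sample_hemisphere17h943cfe33e8d056d2E"
    ));
}
```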
Could you please try with `ptx-linker@0.6.1` and `ptx-builder@0.3.6`? At least the function names should be okay.
There is a good chance that the PTX won't cause `CUDA_ERROR_INVALID_PTX` anymore :)
It no longer causes `CUDA_ERROR_INVALID_PTX`, but I think that's just because it's not building in debug mode.
Ah, never mind, I see that I just have to set it to run in debug mode. I will try that out.
Okay, confirming that my PTX built in debug mode does not cause `CUDA_ERROR_INVALID_PTX` anymore. However, when I try to debug using `cuda-gdb`, it keeps telling me "No symbol table info available." It could be that I'm using the debugger wrong, but I was expecting to see a more readable stack trace and the values of local variables.
Okay, thank you for testing!
Could you please create a simple repo with general steps, so I or somebody else can play around with debugging?
I've started a repo with a similar layout to my main application I've been working on. I will add some steps to the README on how to introduce a bug to the code and then debug it.
Thanks for sharing, looks impressive! Glad to see that the whole Rust - CUDA thing (at least partially) works not only for me :)
About debugging... I was able to set up and hit a breakpoint in the kernel.
```
cuda-gdb --args target/debug/cuda-debugging-tutorial 20 ~/rust-inline-cuda-tutorial/fixtures/input-1024.png output-1024.png
(cuda-gdb) b blur
Function "blur" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
(cuda-gdb) r
...
Thread 1 "cuda-debugging-" hit Breakpoint 1, 0x00007ffff5cf2008 in blur<<<(32,32,1),(32,32,1)>>> ()
(cuda-gdb) info cuda kernels
  Kernel Parent Dev Grid Status SMs Mask   GridDim   BlockDim  Invocation
* 0      -      0   1    Active 0x000fffff (32,32,1) (32,32,1) blur()
(cuda-gdb) info cuda devices
  Dev PCI Bus/Dev ID Name             Description SM Type SMs Warps/SM Lanes/Warp Max Regs/Lane Active SMs Mask
* 0   01:00.0        GeForce GTX 1080 GP104-A     sm_61   20  64       32         256           0x000000000000000000000000000ffff
```
So at least in theory debugging is possible :) Unfortunately it's far from usable, because somehow `cuda-gdb` doesn't care about the source code locations defined in the debug PTX...
```
(cuda-gdb) list
1 <cuda-builtins>: No such file or directory.
(cuda-gdb) bt
#0 0x00007ffff5cf2008 in blur<<<(32,32,1),(32,32,1)>>> ()
(cuda-gdb) display/10i $pc
2: x/10i $pc
=> 0x7ffff5cf2008 <blur+8>:  MOV R1, c[0x0][0x20]
   0x7ffff5cf2010 <blur+16>: { IADD32I R1, R1, -0xc0
   0x7ffff5cf2018 <blur+24>: S2R R0, SR_CTAID.X
   0x7ffff5cf2020 <blur+32>:
   0x7ffff5cf2028 <blur+40>: { XMAD R5, R0.reuse, c[0x0] [0x8], RZ
   0x7ffff5cf2030 <blur+48>: PBK 0x258
   0x7ffff5cf2038 <blur+56>: { XMAD R2, R0.reuse, c[0x0] [0x8], RZ
   0x7ffff5cf2040 <blur+64>: Cannot disassemble instruction
   0x7ffff5cf2044 <blur+68>: Cannot disassemble instruction
   0x7ffff5cf2048 <blur+72>: SSY 0x230
(cuda-gdb) s
Single stepping until exit from function blur,
which has no line number information.
```
I think we could try to create a small app with CUDA in Clang. Then we can compare Rust and Clang PTX assemblies for debugging-related things.
Nice, you got a bit farther than I did. Setting a breakpoint before the program ran was the next thing I was trying to figure out how to do.
If we need an example of PTX code we know is debuggable, I had actually already done some work in that direction. My first approach to this, which I gave up on, was to use the technique mentioned here, but adding the debug options to the original `nvcc` call. So I think we could make a sample application in CUDA C, compile it using `nvcc` with debug options and `--keep`, and then inspect the PTX it produces in the middle of the pipeline. I really don't know much about PTX or what format `cuda-gdb` expects for debug symbols, but that might give us some real examples to work with.
Caveat: Minimally tested and super hacky.
I did some experiments on this issue and managed to get source code locations working. Debugging support for NVPTX was added in LLVM just after LLVM 8.0 (on which Rust is currently based) was branched off.
I managed to backport this commit and some others by the same author, and create custom versions of LLVM, rustc (with the required interface and compile options added), ptx-linker, and ptx-builder.
What is working:
What is not working:
I did this mainly as a quick experiment, so do not expect production-ready code quality ;)
@Vespasian these are good findings, thanks!
Am I right that it's only these 2 LLVM commits you had to backport? https://github.com/llvm/llvm-project/commit/897129dc3fe79137e00c5dccc30e11221dbd6d7f https://github.com/llvm/llvm-project/commit/f3a9150324c4fe88b7f9e28be18e0f989dff8af7
I think the `ptx-linker` and `ptx-builder` changes can safely be upstreamed; I'd be glad to review the PRs :)
I'm trying to figure out how to produce a debuggable binary from the PTX file that I can debug with the NVIDIA tools. It's looking like I'm going to need a non-release-mode version of the PTX file. Could you expose whether to pass the `--release` option to `xargo` or not? Is this even the right thing to do when trying to debug a CUDA kernel produced with this tool?