Closed: seanbaxter closed this issue 1 year ago
I think I managed to cut this down to a repro. Will close this out if that works.
Sorry, I am going to need help with this mutex/condition variable stall. I want to figure out how to cut down this example into a repro I can debug.
What is the event that this line is waiting on? Is there a line number? I have no idea what code is trying to signal the run_loop. https://github.com/NVIDIA/stdexec/blob/main/include/stdexec/execution.hpp#L5957
Which code is pushing a task that gets loaded out of here? https://github.com/NVIDIA/stdexec/blob/main/include/nvexec/detail/queue.cuh#L158
I'm comparing the nvc++- and circle-generated binaries, and after the 3rd or 4th iteration through that while loop, the nvc++-generated binary gets a non-null task. But I haven't been able to figure out what actually schedules that task, or why it isn't happening in the circle-generated binary.
Looks like line 117 in that file.
It's not line 117. I built with nvc++ and set a breakpoint on that line, and it never hits. I've also been using hardware watchpoints on the memory for next_ref on line 158, and that hasn't helped either. The watchpoint fires when the initial object is constructed, which is before the thread running line 158 has even been entered.
The hardware watchpoint isn't seeing anything on the affected address. Is it possible that a kernel is writing to that host memory address (through UVM), that the worker is polling until the updated memory eventually becomes visible to it, and that this is something my compiler isn't handling? Does this example rely on -stdpar to replace the heap allocations? I can't actually find the kernel source code; there's so much tag_invoke that I don't know what is being written where.
I think it is line 117, executed on the device, that sets that flag in host memory. I set the function to __host__ and generated this ODR-usage backtrace:
https://gist.github.com/seanbaxter/3caa4926f341d4ef07ace32f47f94af5
I can printf from that kernel in the nvc++ version, but no output appears from my binary. Maybe it's one of the atomics in that kernel. libcu++ hijacks the definitions, so they aren't the well-tested libstdc++ versions I mostly use.
It's fixed. I was passing kernel reference parameter types incorrectly into the cudaLaunchKernel argument buffer. Before -stdpar, this path never got any use; in nvexec, it gets a lot of use. That was very hard to track down, but now it works.
$ time /opt/nvidia/hpc_sdk/Linux_x86_64/23.1/compilers/bin/nvc++ -stdpar -I ../../include -std=c++20 maxwell_gpu_s.cpp -o maxwell-nv -lpthread --gcc-toolchain=/home/sean/projects/gcc-11.2.0/bin
real 0m14.878s
user 0m13.288s
sys 0m1.364s
$ time circle maxwell_gpu_s.cpp -sm_75 -stdpar -I ../../include -std=c++20 -lpthread
note: compiled 404767 lines (14380840 bytes) in 1096 files and 1098677 tokens
real 0m2.297s
user 0m2.084s
sys 0m0.205s
$ ./maxwell_gpu_s --run-cuda --run-stream-scheduler --run-stdpar
method, elapsed [s], BW [GB/s]
GPU (cuda), 0.060, 195.419
GPU (snr cuda stream), 0.061, 192.137
GPU (stdpar), 0.066, 178.235
I have -stdpar working with Circle now. I'm working on maxwell_gpu_s.cpp.
The stream-scheduler process stalls, waiting on a condition variable here: https://github.com/NVIDIA/stdexec/blob/main/include/stdexec/execution.hpp#L4545 called from here: https://github.com/NVIDIA/stdexec/blob/main/include/stdexec/execution.hpp#L5957 called from here: https://github.com/NVIDIA/stdexec/blob/main/examples/nvexec/maxwell/snr.cuh#L321
Thread 5 in the process appears to be the one wanting to yield, and it's stuck in __gthread_yield:
That's called from std::this_thread::yield here: https://github.com/NVIDIA/stdexec/blob/main/include/nvexec/detail/queue.cuh#L162
Can you confirm that the example is at least supposed to get to this point, where the loop containing that yield eventually lets the condition variable be signaled? I'm totally unfamiliar with gthreads and want to know if I'm looking in a reasonable place.