Closed: seanbaxter closed this issue 1 year ago
I think I managed to cut this down to a repro. Will close this out if that works.
Sorry, I am going to need help with this mutex/condition variable stall. I want to figure out how to cut down this example into a repro I can debug.
What is the event that this line is waiting on? Is there a line number? I have no idea what code is trying to signal the run_loop. https://github.com/NVIDIA/stdexec/blob/main/include/stdexec/execution.hpp#L5957
Which code is pushing a task that gets loaded out of here? https://github.com/NVIDIA/stdexec/blob/main/include/nvexec/detail/queue.cuh#L158
I'm comparing the nvc++- and circle-generated binaries, and after the 3rd or 4th iteration through that while loop, the nvc++-generated binary gets a non-null task. But I haven't been able to figure out what actually schedules that task, or why it isn't happening in the circle-generated binary.
Looks like line 117 in that file.
It's not line 117. I built with nvc++ and set a breakpoint on that line, and it never hits. I've also been using hardware watchpoints on the memory for next_ref on line 158, and that hasn't helped either. The watchpoint fires when the initial object is constructed, which is before the thread running line 158 has even been entered.
The hardware watchpoint isn't seeing anything on the affected address. Is it possible that a kernel is writing to that host memory address (through UVM), that the worker is polling until the updated memory eventually becomes visible to it, and that this is something my compiler isn't handling? Does this example rely on -stdpar to replace the heap allocations? I can't actually find the kernel source code; there's so much tag_invoke that I don't know what is being written where.
I think it is line 117, executed on the device, that sets that flag in host memory. I set the function to __host__ and generated this ODR-usage backtrace:
https://gist.github.com/seanbaxter/3caa4926f341d4ef07ace32f47f94af5
I can printf from that kernel in the nvc++ version, but no output appears from my binary. Maybe it's one of the atomics in that kernel. libcu++ hijacks the definitions, so they aren't the well-tested libstdc++ versions I mostly use.
It's fixed. I was passing kernel reference parameter types incorrectly into the cudaLaunchKernel argument buffer. Before -stdpar, this path never got any use; in nvexec, it gets a lot of use. That was very hard to track down, but now it works.
$ time /opt/nvidia/hpc_sdk/Linux_x86_64/23.1/compilers/bin/nvc++ -stdpar -I ../../include -std=c++20 maxwell_gpu_s.cpp -o maxwell-nv -lpthread --gcc-toolchain=/home/sean/projects/gcc-11.2.0/bin
real 0m14.878s
user 0m13.288s
sys 0m1.364s
$ time circle maxwell_gpu_s.cpp -sm_75 -stdpar -I ../../include -std=c++20 -lpthread
note: compiled 404767 lines (14380840 bytes) in 1096 files and 1098677 tokens
real 0m2.297s
user 0m2.084s
sys 0m0.205s
$ ./maxwell_gpu_s --run-cuda --run-stream-scheduler --run-stdpar
method, elapsed [s], BW [GB/s]
GPU (cuda), 0.060, 195.419
GPU (snr cuda stream), 0.061, 192.137
GPU (stdpar), 0.066, 178.235
I have -stdpar working with Circle now. I'm working on maxwell_gpu_s.cpp.
The stream-scheduler process stalls, waiting on a condition variable here: https://github.com/NVIDIA/stdexec/blob/main/include/stdexec/execution.hpp#L4545 called from here: https://github.com/NVIDIA/stdexec/blob/main/include/stdexec/execution.hpp#L5957 called from here: https://github.com/NVIDIA/stdexec/blob/main/examples/nvexec/maxwell/snr.cuh#L321
Thread 5 in the process appears to be the one wanting to yield, and it's stuck in __gthread_yield:
That's called from std::this_thread::yield here: https://github.com/NVIDIA/stdexec/blob/main/include/nvexec/detail/queue.cuh#L162
Can you confirm that the example is at least supposed to get to this point, where the loop containing that yield eventually lets the condition variable be signaled? I'm totally unfamiliar with gthreads and want to know if I'm looking in a reasonable place.