What do you mean by CRS stack? It's for restoring all the state that cannot be manipulated using the CUDA Debugger API. That includes the PC, warp masks, barriers, the call stack (although this rarely happens, as CUDA inlines everything by default), and maybe more that I can't think of right now.
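To make that list concrete, here is a minimal sketch of the per-warp state involved. This is an illustrative assumption, not cricket's actual data structure; all names are made up.

#include <stddef.h>
#include <stdint.h>

/* Illustrative only: the per-warp state that has to be recreated on restore
 * because the CUDA Debugger API cannot simply write it back. */
typedef struct {
    uint64_t pc;            /* program counter at checkpoint time */
    uint32_t active_mask;   /* which lanes of the warp are currently active */
    uint32_t barrier_state; /* pending barriers (e.g. __syncthreads) */
    uint64_t crs_stack[32]; /* CRS stack: reconvergence/return entries */
    size_t   crs_depth;     /* number of valid entries in crs_stack */
} warp_state_t;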
What do you mean by CRS stack?
I've found this definition here (p. 33); it is similar to what you mentioned.
But I am not sure cricket restores it accurately: when we encounter nested branches, I think we cannot simply re-execute only the nearest SSY.
Or maybe I do not fully understand cricket's approach?
Please see below for an example of the jump table. After each SSY instruction there is a branch instruction. "Jumping around" in this code allows any number of SSY instructions to be executed when restoring. This enables cricket to fully restore the stack of possibly multiple nested branches. Note that this code only works for the Pascal architecture. YMMV for newer archs (although I believe they handle this similarly).
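(The original jump-table listing is not reproduced here. As a rough conceptual model only, the restore logic can be sketched as below; set_warp_pc and single_step_warp are hypothetical stand-ins for the CUDA Debugger API calls, stubbed so the model runs on the host, and all addresses are made up.)

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-ins for debugger calls, stubbed with prints. */
static void set_warp_pc(uint64_t pc) { printf("set pc to 0x%llx\n", (unsigned long long)pc); }
static void single_step_warp(void)   { printf("step: SSY pushes one CRS entry\n"); }

/* Conceptual model: every enclosing SSY whose scope contains the
 * checkpointed PC is executed once, outermost first, so each pushes its
 * reconvergence entry back onto the CRS stack. In cricket, the branch after
 * each SSY in the patched jump table does the hopping; here a host loop
 * drives it for clarity. */
static void restore_crs_stack(const uint64_t *ssy_addrs, size_t n, uint64_t ckpt_pc) {
    for (size_t i = 0; i < n; i++) {
        set_warp_pc(ssy_addrs[i]); /* jump to the next needed SSY */
        single_step_warp();        /* execute it: one CRS stack push */
    }
    set_warp_pc(ckpt_pc);          /* finally resume at the checkpointed PC */
}

int main(void) {
    const uint64_t ssys[] = { 0x120, 0x1a8 }; /* two nested SSYs */
    restore_crs_stack(ssys, 2, 0x230);
    return 0;
}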
I hope this answers your question.
"Jumping around" in this code allows any number of SSY instructions to be executed when restoring. This enables cricket to fully restore the stack of possibly multiple nested branches.
Yes, I think so. But in the source code I found that it seems to restore only the nearest SSY before relative_pc.
Please check the functions cricket_cr_rst_ssy and cricket_elf_pc_info.
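As a minimal sketch of what I mean (illustrative C, not the actual code of cricket_elf_pc_info; all names and addresses are made up):

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Picking only the closest SSY preceding relative_pc: with nested branches
 * this recovers just the innermost synchronization point. */
static uint64_t nearest_ssy_before(const uint64_t *ssy_addrs, size_t n,
                                   uint64_t relative_pc) {
    uint64_t best = 0;
    for (size_t i = 0; i < n; i++)
        if (ssy_addrs[i] < relative_pc && ssy_addrs[i] > best)
            best = ssy_addrs[i];
    return best; /* 0 means no SSY precedes relative_pc */
}

int main(void) {
    const uint64_t ssys[] = { 0x120, 0x1a8 };
    /* Prints 0x1a8: only the innermost of the two nested SSYs is found. */
    printf("0x%llx\n", (unsigned long long)nearest_ssy_before(ssys, 2, 0x230));
    return 0;
}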
Besides nested branches, another problem is warp divergence.
Please take a look at the following code snippet.
if (expr) {
    block A;
}
block B;
At run time, some threads in a warp may evaluate expr to true while others evaluate it to false; take expr = threadIdx.x < 16 as an example.
That means only some of the threads in a warp would take the branch and execute block A. This condition is called warp divergence, and we can hardly handle it (how do we checkpoint/restore the state of the inactive threads?). We can also imagine that block A contains a call to the next level of the call stack.
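As a concrete, runnable version of the snippet above (the kernel and device function names are made up for illustration):

#include <cstdio>

__device__ int block_a(int v) { return v * 2; } // stands in for "block A"

__global__ void diverge(int *out) {
    int v = threadIdx.x;
    if (threadIdx.x < 16)   // expr: true for only half of a 32-thread warp
        v = block_a(v);     // block A: executed only by lanes 0-15
    out[threadIdx.x] = v;   // block B: lanes reconverge here (the SSY target)
}

int main(void) {
    int *d_out, h_out[32];
    cudaMalloc(&d_out, sizeof(h_out));
    diverge<<<1, 32>>>(d_out);
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    printf("lane 0 -> %d, lane 31 -> %d\n", h_out[0], h_out[31]);
    cudaFree(d_out);
    return 0;
}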
However, I believe warp divergence is rare in practical CUDA programs.
The code you linked also deals with nesting. Warp divergence is the same thing: SSY sets the convergence point for diverged threads, and cricket restores diverged branches using the debugger API. In modern architectures this works differently, though.
Why are you interested in all of these details?
I have read through the GPU part of the code and found that the binary file is patched to create a jump table during the restore phase.
I think it is for restoring the CRS stack of the warps to its state at the time the partially executed kernel was checkpointed. Is that right?