Rust-GPU / Rust-CUDA

Ecosystem of libraries and tools for writing and executing fast GPU code fully in Rust.
Apache License 2.0

Dynamic Parallelism | implementation strategy #94

Open thedodd opened 1 year ago

thedodd commented 1 year ago

Well ... once again, I find myself in need of another feature. This time, dynamic parallelism.

Looks like this is also part of the C++ runtime API, similar to cooperative groups, for which I already have a PR.

I'm considering using a similar strategy for implementing this feature. I would love to just pin down the PTX directly, but so far that has proven to be a bit unclear; still, I will definitely start my search in the PTX ISA and see if there are any quick wins. If not, then I'll probably take a similar approach to the one used for the cooperative groups API.

Thoughts?

thedodd commented 1 year ago

The generated PTX from a C++ program using dynamic parallelism will tend to include the following .extern declarations in the PTX (comments added by me based on studying the PTX):

.extern .func  (.param .b64 func_retval0) cudaGetParameterBufferV2
(
    .param .b64 cudaGetParameterBufferV2_param_0, // Function pointer.
    .param .align 4 .b8 cudaGetParameterBufferV2_param_1[12], // Grid size.
    .param .align 4 .b8 cudaGetParameterBufferV2_param_2[12], // Block size.
    .param .b32 cudaGetParameterBufferV2_param_3 // Shared mem.
)
;
.extern .func  (.param .b32 func_retval0) cudaLaunchDeviceV2
(
    .param .b64 cudaLaunchDeviceV2_param_0, // Param buffer.
    .param .b64 cudaLaunchDeviceV2_param_1 // Stream.
)
;

This is inserted by nvcc when the device-side triple-chevron syntax is used. These appear to be updated V2 ABIs compared to what is documented here.
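To make the inferred ABI concrete, here is a rough Rust mirror of those two externs. The signatures are assumptions read off the PTX above (not confirmed against the CUDA device runtime headers), and the 12-byte `.b8` parameters are assumed to be a packed `(x, y, z)` triple of `u32`s, i.e. a `dim3`:

```rust
use core::ffi::c_void;

// Hypothetical Rust mirror of a dim3: the 12-byte .b8 grid/block parameters
// in the PTX are assumed to be three packed u32 values.
#[repr(C)]
#[derive(Clone, Copy)]
pub struct Dim3 {
    pub x: u32,
    pub y: u32,
    pub z: u32,
}

extern "C" {
    // Signatures inferred from the PTX extern declarations above; these
    // symbols would only resolve when linked against the CUDA device
    // runtime (cudadevrt). Declared here purely for illustration.
    pub fn cudaGetParameterBufferV2(
        func: *const c_void, // .b64: device function pointer
        grid: Dim3,          // .b8 [12]: grid size
        block: Dim3,         // .b8 [12]: block size
        shared_mem: u32,     // .b32: shared memory bytes
    ) -> *mut c_void;

    pub fn cudaLaunchDeviceV2(
        param_buf: *mut c_void, // .b64: parameter buffer
        stream: u64,            // .b64: stream handle
    ) -> i32;
}

fn main() {
    // Sanity check: the Dim3 mirror matches the 12-byte PTX parameter size.
    assert_eq!(core::mem::size_of::<Dim3>(), 12);
    println!("Dim3 is {} bytes", core::mem::size_of::<Dim3>());
}
```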

The V2 ABIs are much simpler, and building up the PTX for them seems pretty straightforward. I should have a PTX-based solution for this in PR form quite soon.

I will likely just copy the launch macro that we currently have in cust, and either move it to a shared location or copy it directly into the cuda_std module; we can decide what to do with it in the PR. To clarify: the launch macro extracts the block and grid size declarations quite nicely, which is why I want to reuse it and feed that code into the PTX ASM.
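The dimension extraction mentioned above essentially amounts to normalizing whatever the caller passes (a scalar, a pair, or a triple) into a full `(x, y, z)` value. A minimal sketch of that normalization, with the `Dim3` type and `From` impls assumed for illustration (not cust's actual definitions):

```rust
// Hypothetical sketch of the grid/block normalization a device-side launch
// macro would need: accept a scalar, pair, or triple and produce the full
// (x, y, z) triple that the V2 ABI's 12-byte parameters expect.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Dim3 {
    pub x: u32,
    pub y: u32,
    pub z: u32,
}

impl From<u32> for Dim3 {
    fn from(x: u32) -> Self {
        // A bare scalar means a 1-D launch.
        Dim3 { x, y: 1, z: 1 }
    }
}

impl From<(u32, u32)> for Dim3 {
    fn from((x, y): (u32, u32)) -> Self {
        // A pair means a 2-D launch.
        Dim3 { x, y, z: 1 }
    }
}

impl From<(u32, u32, u32)> for Dim3 {
    fn from((x, y, z): (u32, u32, u32)) -> Self {
        Dim3 { x, y, z }
    }
}

/// Normalize a caller-supplied dimension into a full (x, y, z) triple.
fn normalize(dim: impl Into<Dim3>) -> Dim3 {
    dim.into()
}

fn main() {
    assert_eq!(normalize(64u32), Dim3 { x: 64, y: 1, z: 1 });
    assert_eq!(normalize((8u32, 8u32)), Dim3 { x: 8, y: 8, z: 1 });
    assert_eq!(normalize((4u32, 4u32, 4u32)), Dim3 { x: 4, y: 4, z: 4 });
    println!("ok");
}
```

A macro wrapper could then hand the two normalized triples straight to the `cudaGetParameterBufferV2` call when emitting the PTX ASM.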

thedodd commented 1 year ago

Well, as it turns out, a great deal of the code (if not all) is already in place: https://github.com/Rust-GPU/Rust-CUDA/tree/master/crates/cuda_std/src/rt . I had originally searched for this in the docs and was unable to find it; looking in the code, there it is.

I will enable that rt module and start experimenting with it. I'll compare the generated PTX with an equivalent C++ program compiled via nvcc.