Enable code for dynamic parallelism

thedodd commented 1 year ago

Closes https://github.com/Rust-GPU/Rust-CUDA/issues/94

thedodd commented 1 year ago

So, interestingly, I'm running into an issue where the generated code cannot be loaded by Module::from_ptx. It returns the error "a PTX JIT compilation failed".

Some background on current testing:

- I've put together a reference C++ program which uses dynamic parallelism (ultra simple).
- I can execute the reference program and all is good: expected output and behavior.
- I also have a reference Rust program which attempts to use this updated code for dynamic parallelism, with the exact same functionality and data types (fixed-size types in C++).
- When I compare the PTX between the two programs, it is nearly identical.
- The C++ program runs with the expected behavior and output.

Now, what is quite strange is that if I copy the PTX from the working C++ program over to the Rust program (disabling PTX gen in the Rust program to ensure the C++ PTX is not overwritten), the Rust program aborts with that same error: "a PTX JIT compilation failed".

- According to ptxas, both PTX files are valid and compile to object code (ptxas -c ...).
- This issue is triggered even by attempting to construct a stream device side.
  - Note that in my tests to narrow this down, I've removed stream construction and I am just passing in a null stream to the CUDA launch call on the device.
  - It is just interesting that the module loader does not like the stream or the launch.

So, I am wondering:

- Is there something intrinsically wrong with attempting to call cuda::cuModuleLoadDataEx when the PTX is using dynamic parallelism?
- Is there a way we can bypass this?

This is where my experimentation is currently at.

thedodd commented 1 year ago

Perhaps we need to manually construct a linker, link the PTX with cudadevrt.lib, and then compile the result to a cubin. Will try that.

thedodd commented 1 year ago

Yea, that was it. We need to create a linker, add the PTX, and add libcudadevrt. Right now I have the libcudadevrt path hard-coded, but I need to create a dynamic search mechanism, as I don't think the CUDA linker will do this on its own ... we'll see.
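For illustration, the dynamic search mentioned above could look roughly like the sketch below. Everything in it is an assumption rather than what this PR implements: the helper names (`devrt_file_name`, `lib_dirs`, `toolkit_roots`, `find_cudadevrt`) are hypothetical, and the environment variables and directory layout are just commonly seen toolkit conventions, not something the thread confirms.

```rust
use std::env;
use std::path::{Path, PathBuf};

/// File name of the static device runtime library on this platform.
/// (Assumed naming: `cudadevrt.lib` on Windows, `libcudadevrt.a` elsewhere.)
fn devrt_file_name() -> &'static str {
    if cfg!(windows) {
        "cudadevrt.lib"
    } else {
        "libcudadevrt.a"
    }
}

/// Library directories to probe under one toolkit root (assumed layout).
fn lib_dirs(root: &Path) -> Vec<PathBuf> {
    if cfg!(windows) {
        vec![root.join("lib").join("x64")]
    } else {
        vec![
            root.join("lib64"),
            root.join("lib"),
            root.join("targets/x86_64-linux/lib"),
        ]
    }
}

/// Returns the first existing device runtime library under the given roots.
fn find_cudadevrt_in(roots: &[PathBuf]) -> Option<PathBuf> {
    roots
        .iter()
        .flat_map(|root| lib_dirs(root))
        .map(|dir| dir.join(devrt_file_name()))
        .find(|candidate| candidate.is_file())
}

/// Toolkit roots taken from the environment, plus a default Unix location.
fn toolkit_roots() -> Vec<PathBuf> {
    let mut roots: Vec<PathBuf> = ["CUDA_PATH", "CUDA_HOME", "CUDA_ROOT"]
        .iter()
        .filter_map(|var| env::var_os(var))
        .map(PathBuf::from)
        .collect();
    if !cfg!(windows) {
        roots.push(PathBuf::from("/usr/local/cuda"));
    }
    roots
}

/// Convenience wrapper: search the environment-derived roots.
fn find_cudadevrt() -> Option<PathBuf> {
    find_cudadevrt_in(&toolkit_roots())
}
```

A caller would pass the path returned by `find_cudadevrt()` to the linker in place of the hard-coded one, and report a clear error when it returns `None`. Keeping the root list separate from the probing logic makes the search easy to test against a temporary directory.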

From there, I was able to successfully execute the PTX from the sample C++ app of mine. The generated Rust PTX has an invalid memory access taking place, and it looks like it is coming from how the buffer is being populated. This is still a step forward, as the code gen is much easier to fix. I at least know what I'm dealing with, instead of some opaque "JIT compilation failed" error.

thedodd commented 1 year ago

Yea, that did it. Code gen is far from optimal for loading the param buffer. But it works, and I am able to successfully use dynamic parallelism from the Rust generated PTX end to end. Expected output and behavior.

The macro codegen that populates the buffer can be optimized further. I'll focus on that later.
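The thread does not say exactly what was wrong with the buffer population, so the following is only a host-side sketch of the layout rule that a parameter buffer for a device-side launch is generally expected to follow: each argument sits at an offset that is a multiple of its own size. `ParamBuffer` and `align_up` are hypothetical names for illustration, not the macro in this PR, and the alignment rule is stated here as an assumption to check against the CUDA device runtime documentation.

```rust
/// Rounds `offset` up to the next multiple of `align`.
fn align_up(offset: usize, align: usize) -> usize {
    if align <= 1 {
        return offset;
    }
    (offset + align - 1) / align * align
}

/// Host-side model of a kernel parameter buffer (hypothetical type).
struct ParamBuffer {
    bytes: Vec<u8>,
}

impl ParamBuffer {
    fn new() -> Self {
        ParamBuffer { bytes: Vec::new() }
    }

    /// Appends one scalar argument given as raw bytes and returns the
    /// offset it was written at. The argument is aligned to its own size,
    /// with zero padding inserted before it when needed.
    fn push(&mut self, raw: &[u8]) -> usize {
        let offset = align_up(self.bytes.len(), raw.len());
        self.bytes.resize(offset, 0);
        self.bytes.extend_from_slice(raw);
        offset
    }
}

/// Example layout for a kernel taking (u8, u32, u64): returns the three
/// offsets and the total buffer length.
fn example_layout() -> (usize, usize, usize, usize) {
    let mut buf = ParamBuffer::new();
    let a = buf.push(&7u8.to_ne_bytes());
    let b = buf.push(&42u32.to_ne_bytes());
    let c = buf.push(&0xdead_beefu64.to_ne_bytes());
    (a, b, c, buf.bytes.len())
}
```

Writing arguments back to back without the padding step would place the `u32` at offset 1 and the `u64` at offset 5, which is the kind of misaligned access that can show up as an invalid memory access on the device.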

Rust-GPU / Rust-CUDA

Enable code for dynamic parallelism #96