thedodd opened 2 years ago
@RDambrosio016 whenever you get some time (no rush), let me know what you think. I am testing this out as I go on a fairly large project of mine which brought about this need in the first place.
Overall, the bridging code is quite simple. I've given an outline of how I think this should be exposed; let me know what you think, happy to modify things as I go.
Also, for this first pass, I would like to stay focused only on the grid-level components of the cooperative groups API, as well as the basic cooperative launch host-side function. We can add multi-device support and the other cooperative group components later.
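For context, the device-side usage this is ultimately meant to enable looks roughly like the sketch below. The `cg` module path and the `this_grid`/`sync` names are placeholders for illustration only, not the API in this PR:

```rust
// Illustrative sketch only; the actual module path and names may differ.
use cuda_std::prelude::*;

#[kernel]
pub unsafe fn two_phase(data: *mut f32, len: usize) {
    let idx = thread::index_1d() as usize;

    // Phase 1: per-thread work.
    if idx < len {
        *data.add(idx) *= 2.0;
    }

    // Grid-wide barrier: every thread in every block must arrive here before
    // any thread proceeds. This is only valid when the kernel is launched
    // cooperatively, hence the host-side cooperative launch support.
    let grid = cuda_std::cg::this_grid(); // placeholder for the bridged grid group
    grid.sync();

    // Phase 2: work that depends on all of phase 1 being complete.
    // ...
}
```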
This looks neat, but if I'm not mistaken those functions map directly to single PTX intrinsics, so wouldn't it be easier to use inline assembly? Though I haven't actually looked into this, so I'm not sure if they map to more than one PTX instruction.
> wouldn't it be easier to use inline assembly?
I started down that path at first, and for a few of the pertinent functions the corresponding PTX was clear. I was using a base C++ program compiled down to PTX to verify, in addition to cross-referencing with the PTX ISA spec. However, I will say that many of the interfaces were not as clear, and this seemed to be a potentially more reliable way to generate the needed code.
Perhaps we can replace some of the clearer interfaces with inline ASM instead. Happy to iterate on this in the future.
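For the handful of interfaces that really are a single instruction, the wrappers would be small; something like the following rough sketch, assuming inline asm works through the codegen for the nvptx target (names here are just for illustration):

```rust
// Rough sketch: hand-written wrappers for interfaces whose PTX lowering is a
// single, unambiguous instruction. Assumes inline asm support for the nvptx
// target in the codegen.
use core::arch::asm;

/// Block-wide barrier; lowers to the single PTX instruction `bar.sync 0`.
#[inline(always)]
pub unsafe fn block_sync() {
    asm!("bar.sync 0;", options(nostack));
}

/// Number of thread blocks in the x dimension, read from the `%nctaid.x`
/// special register.
#[inline(always)]
pub unsafe fn grid_dim_x() -> u32 {
    let out: u32;
    asm!("mov.u32 {}, %nctaid.x;", out(reg32) out, options(nostack, nomem));
    out
}
```

The less obvious interfaces are the sticking point: the grid-wide sync in particular does not appear to lower to a single instruction, which is where the generated bridge seemed safer than hand-written PTX.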
This works as follows:
- Users will create their `CudaBuilder` as normal.
- In their `build.rs`, just after building their PTX, they will: create a `cuda_builder::cg::CooperativeGroups` instance, configure it (`-arch=sm_*` and so on), and call `.compile(..)`, which will spit out a fully linked `cubin`.
- Instead of using `launch!` to schedule their GPU work, they will now use `launch_cooperative!`.

todo: wrap `cuLaunchCooperativeKernel` in a nice interface. We can add the cooperative multi-device bits later, along with all of the other bits from the cooperative API.
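To make the above concrete, here is a rough sketch of the intended `build.rs` step. Only `CudaBuilder` and its existing methods are real today; the `CooperativeGroups` method names, arch value, and output paths are provisional:

```rust
// build.rs sketch. The overall flow (build PTX, then link against the
// cooperative groups bridge into a cubin) is the point; names are provisional.
use cuda_builder::CudaBuilder;

fn main() {
    // Build the kernel crate to PTX as normal.
    CudaBuilder::new("kernels")
        .copy_to("target/cuda/kernels.ptx")
        .build()
        .unwrap();

    // New step: link the PTX against the cooperative groups bridge for a
    // specific arch, producing a fully linked cubin.
    cuda_builder::cg::CooperativeGroups::new()
        .arch("sm_80") // corresponds to `-arch=sm_80`; method name provisional
        .compile("target/cuda/kernels.ptx", "target/cuda/kernels.cubin")
        .unwrap();
}
```

On the host side, `launch_cooperative!` would follow the same shape as the existing `launch!` macro. In this sketch, `func`, `grid_size`, `block_size`, `stream`, and `data` are assumed to already be set up as they would be for a normal launch:

```rust
// Host-side sketch: same syntax as `launch!`, but dispatching through
// cuLaunchCooperativeKernel so that grid-wide sync is valid in the kernel.
unsafe {
    launch_cooperative!(
        func<<<grid_size, block_size, 0, stream>>>(
            data.as_device_ptr(),
            data.len()
        )
    )?;
}
stream.synchronize()?;
```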