thedodd opened 2 years ago
@RDambrosio016 whenever you get some time (no rush), let me know what you think. I am testing this out as I go on a fairly large project of mine which brought about this need in the first place.
Overall, the bridging code is quite simple. I've given an outline of how I think this should be exposed; let me know what you think, happy to modify things as I go.
Also, for this first pass, I would like to stay focused only on the grid-level components of the cooperative groups API, as well as the basic cooperative launch host-side function. We can add multi-device support and the other cooperative group components later.
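For context, the device-side usage this is ultimately meant to enable looks roughly like the sketch below. The `cg` module path and the `this_grid`/`sync` names are placeholders for illustration only, not the API in this PR:

```rust
// Illustrative sketch only; the actual module path and names may differ.
use cuda_std::prelude::*;

#[kernel]
pub unsafe fn two_phase(data: *mut f32, len: usize) {
    let idx = thread::index_1d() as usize;

    // Phase 1: per-thread work.
    if idx < len {
        *data.add(idx) *= 2.0;
    }

    // Grid-wide barrier: every thread in every block must arrive here before
    // any thread proceeds. This is only valid when the kernel is launched
    // cooperatively, hence the host-side cooperative launch support.
    let grid = cuda_std::cg::this_grid(); // placeholder for the bridged grid group
    grid.sync();

    // Phase 2: work that depends on all of phase 1 being complete.
    // ...
}
```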
This looks neat, but if I'm not mistaken those functions map directly to single PTX intrinsics, so wouldn't it be easier to use inline assembly? Though I haven't actually looked into this, so I'm not sure if they map to more than one PTX instruction.
> wouldn't it be easier to use inline assembly?
I started down that path at first, and for a few of the pertinent functions the corresponding PTX was clear. I was using a base C++ program compiled down to PTX to verify, in addition to cross-referencing with the PTX ISA spec. However, I will say that many of the interfaces were not as clear, and this seemed to be a potentially more reliable way to generate the needed code.
Perhaps we can replace some of the clearer interfaces with inline ASM instead. Happy to iterate on this in the future.
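For the handful of interfaces that really are a single instruction, the wrappers would be small; something like the following rough sketch, assuming inline asm works through the codegen for the nvptx target (names here are just for illustration):

```rust
// Rough sketch: hand-written wrappers for interfaces whose PTX lowering is a
// single, unambiguous instruction. Assumes inline asm support for the nvptx
// target in the codegen.
use core::arch::asm;

/// Block-wide barrier; lowers to the single PTX instruction `bar.sync 0`.
#[inline(always)]
pub unsafe fn block_sync() {
    asm!("bar.sync 0;", options(nostack));
}

/// Number of thread blocks in the x dimension, read from the `%nctaid.x`
/// special register.
#[inline(always)]
pub unsafe fn grid_dim_x() -> u32 {
    let out: u32;
    asm!("mov.u32 {}, %nctaid.x;", out(reg32) out, options(nostack, nomem));
    out
}
```

The less obvious interfaces are the sticking point: the grid-wide sync in particular does not appear to lower to a single instruction, which is where the generated bridge seemed safer than hand-written PTX.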
This works as follows:
- Users will create their `CudaBuilder` as normal.
- In their `build.rs`, just after building their PTX, they will: create a `cuda_builder::cg::CooperativeGroups` instance, configure it (`-arch=sm_*` and so on), and call `.compile(..)`, which will spit out a fully linked `cubin`.
- Instead of using `launch!` to schedule their GPU work, they will now use `launch_cooperative!`.

todo: wrap `cuLaunchCooperativeKernel` in a nice interface. We can add the cooperative multi-device bits later, along with all of the other bits from the cooperative API.
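To make the above concrete, here is a rough sketch of the intended `build.rs` step. Only `CudaBuilder` and its existing methods are real today; the `CooperativeGroups` method names, arch value, and output paths are provisional:

```rust
// build.rs sketch. The overall flow (build PTX, then link against the
// cooperative groups bridge into a cubin) is the point; names are provisional.
use cuda_builder::CudaBuilder;

fn main() {
    // Build the kernel crate to PTX as normal.
    CudaBuilder::new("kernels")
        .copy_to("target/cuda/kernels.ptx")
        .build()
        .unwrap();

    // New step: link the PTX against the cooperative groups bridge for a
    // specific arch, producing a fully linked cubin.
    cuda_builder::cg::CooperativeGroups::new()
        .arch("sm_80") // corresponds to `-arch=sm_80`; method name provisional
        .compile("target/cuda/kernels.ptx", "target/cuda/kernels.cubin")
        .unwrap();
}
```

On the host side, `launch_cooperative!` would follow the same shape as the existing `launch!` macro. In this sketch, `func`, `grid_size`, `block_size`, `stream`, and `data` are assumed to already be set up as they would be for a normal launch:

```rust
// Host-side sketch: same syntax as `launch!`, but dispatching through
// cuLaunchCooperativeKernel so that grid-wide sync is valid in the kernel.
unsafe {
    launch_cooperative!(
        func<<<grid_size, block_size, 0, stream>>>(
            data.as_device_ptr(),
            data.len()
        )
    )?;
}
stream.synchronize()?;
```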