Cooperative Groups Impl

thedodd commented 2 years ago

I believe I am at a point where I need the cooperative groups API. Instead of re-writing my kernel code in C++, or using CXX to bridge the Rust code into C++, I would prefer to implement the Cooperative Groups API instead (at least some portion of it).

I've read the documentation on it a few times now. Not sure if others have already looked into this. Just wanted to touch base if folks have concerns or pointers as I dig into implementation.

RDambrosio016 commented 2 years ago

A basic version of cooperative groups could probably be done, the difficulty with them is that it's a C++ API, which means digging into the C++ code for them in the SDK files, which will probably be very painful.

thedodd commented 2 years ago

Currently experimenting with creating a C++ bridge via: https://github.com/dtolnay/cxx. Bindgen itself does support C++ bindings, but there is a fair number of known limitations.

Update: I'll try using bindgen first, and just enable the C++ features (-std=c++11). If I run into any serious difficulties there, I'll cut over to CXX.

thedodd commented 2 years ago

@RDambrosio016 I'm having trouble determining how you were originally generating the cust_raw bindings via bindgen. The setup that is currently in master is a bit non-intuitive, and I don't see any docs for this.

I see the bindgen.sh script which invokes the bindgen CLI, however the script is not even executable, which makes me wonder if it was ever actually used, and given that it is invoked outside of the build.rs context (at least, it is not invoked as part of build.rs), then I'm not sure how it would be able to utilize the find_cuda_helper crate's functionality.

I will probably expand things in the script so that:

it is executable;
will generate bindings for cooperative_groups.h separately with -std=c++11;
make the script target the headers probably via changing dir to the cuda include dir, and then spitting out the rust code via absolute path (derived early in the script).

As it is right now, the script is not portable and does not work on its own.

thedodd commented 2 years ago

Quick update. My approach was off a bit initially. I did indeed need to update the bindgen.sh script for cust_raw, as it was not working correctly. I've updated it to work in a fairly nice way now with minimal updates.

However, the cooperative groups API is pretty much all kernel side, so as long as the cuLaunchCooperativeKernel (and cuLaunchCooperativeKernelMultiDevice which is deprecated) are exposed in the generated cuda.rs, then we are good. Everything looks good on that front.

Next, looks like the real task is to update the cuda_std crate with some gpu_only code which will link to the correct symbols from the cooperative_groups API. Experimenting with that now.

thedodd commented 2 years ago

@RDambrosio016 ok, another update here. Neither bindgen nor CXX seem suited to exposing the cooperative_groups internals to our cuda_std code.

Bindgen can't handle the various symbols in the C++ code which are unique to nvcc.
CXX can use nvcc as the backend, however linking the CXX build outputs with cuda_std when it is being compiled down to PTX has a few issues.
- I actually had to fork CXX to modify a few lines of code such that the generated C++ code would have a .cu extension (instead of .cc) so that nvcc would compile the device code properly.

I'm wondering if it would be reasonable to define an extern "C" wrapper around the needed cooperative_groups classes and functions, then we compile that down to PTX, ship it with cuda_std, and then have the cuda_builder just link our wrapper PTX with whatever PTX is generated for users. We would then simply update cuda_std to declare the extern bits and wrap them as needed.

I'm just not sure what other options we have. Are you familiar with any good ways to expose the following code to Rust (simplified C++ from cooperative_groups.h):

class grid_group : public thread_group_base<details::grid_group_id>
{
    _CG_STATIC_CONST_DECL unsigned int _group_id = details::grid_group_id;
    friend _CG_QUALIFIER grid_group this_grid();
 private:
    // .. snip ..  
 public:
    _CG_QUALIFIER void sync() const {
        if (!is_valid()) {
            _CG_ABORT();
        }
        details::grid::sync(&_data.grid.gridWs->barrier);
    }
    // .. snip ..
}

_CG_QUALIFIER grid_group this_grid() {
    grid_group gg(details::get_grid_workspace());
    return gg;
}

The things that I need most right now are grid_group this_grid() & grid_group.sync().

Thoughts?

thedodd commented 2 years ago

Else ... best approach might be to use c-bindgen and expose Rust bits to C++ kernels. Compile those down to PTX and then just launch from the Rust code.

thedodd commented 2 years ago

Ok, after lots of experimentation and dead ends, I've got a working solution here: https://github.com/Rust-GPU/Rust-CUDA/pull/87. More to be done, but this proves that there is a viable path forward. Now I just need to make it pretty.

Rust-GPU / Rust-CUDA

Cooperative Groups Impl #80