Rust-GPU / Rust-CUDA

Ecosystem of libraries and tools for writing and executing fast GPU code fully in Rust.
Apache License 2.0
2.97k stars 112 forks source link

Cooperative Groups Impl #80

Open thedodd opened 1 year ago

thedodd commented 1 year ago

I believe I am at a point where I need the cooperative groups API. Instead of re-writing my kernel code in C++, or using CXX to bridge the Rust code into C++, I would prefer to implement the Cooperative Groups API instead (at least some portion of it).

I've read the documentation on it a few times now. Not sure if others have already looked into this. Just wanted to touch base if folks have concerns or pointers as I dig into implementation.

RDambrosio016 commented 1 year ago

A basic version of cooperative groups could probably be done, the difficulty with them is that it's a C++ API, which means digging into the C++ code for them in the SDK files, which will probably be very painful.

thedodd commented 1 year ago

Currently experimenting with creating a C++ bridge via: https://github.com/dtolnay/cxx. Bindgen itself does support C++ bindings, but there is a fair number of known limitations.

Update: I'll try using bindgen first, and just enable the C++ features (-std=c++11). If I run into any serious difficulties there, I'll cut over to CXX.

thedodd commented 1 year ago

@RDambrosio016 I'm having trouble determining how you were originally generating the cust_raw bindings via bindgen. The setup that is currently in master is a bit non-intuitive, and I don't see any docs for this.

I see the bindgen.sh script which invokes the bindgen CLI, however the script is not even executable, which makes me wonder if it was ever actually used, and given that it is invoked outside of the build.rs context (at least, it is not invoked as part of build.rs), then I'm not sure how it would be able to utilize the find_cuda_helper crate's functionality.

I will probably expand things in the script so that:

As it is right now, the script is not portable and does not work on its own.

thedodd commented 1 year ago

Quick update. My approach was off a bit initially. I did indeed need to update the bindgen.sh script for cust_raw, as it was not working correctly. I've updated it to work in a fairly nice way now with minimal updates.

However, the cooperative groups API is pretty much all kernel side, so as long as the cuLaunchCooperativeKernel (and cuLaunchCooperativeKernelMultiDevice which is deprecated) are exposed in the generated cuda.rs, then we are good. Everything looks good on that front.

Next, looks like the real task is to update the cuda_std crate with some gpu_only code which will link to the correct symbols from the cooperative_groups API. Experimenting with that now.

thedodd commented 1 year ago

@RDambrosio016 ok, another update here. Neither bindgen nor CXX seem suited to exposing the cooperative_groups internals to our cuda_std code.

I'm wondering if it would be reasonable to define an extern "C" wrapper around the needed cooperative_groups classes and functions, then we compile that down to PTX, ship it with cuda_std, and then have the cuda_builder just link our wrapper PTX with whatever PTX is generated for users. We would then simply update cuda_std to declare the extern bits and wrap them as needed.

I'm just not sure what other options we have. Are you familiar with any good ways to expose the following code to Rust (simplified C++ from cooperative_groups.h):

class grid_group : public thread_group_base<details::grid_group_id>
{
    _CG_STATIC_CONST_DECL unsigned int _group_id = details::grid_group_id;
    friend _CG_QUALIFIER grid_group this_grid();
 private:
    // .. snip ..  
 public:
    _CG_QUALIFIER void sync() const {
        if (!is_valid()) {
            _CG_ABORT();
        }
        details::grid::sync(&_data.grid.gridWs->barrier);
    }
    // .. snip ..
}

_CG_QUALIFIER grid_group this_grid() {
    grid_group gg(details::get_grid_workspace());
    return gg;
}

The things that I need most right now are grid_group this_grid() & grid_group.sync().

Thoughts?

thedodd commented 1 year ago

Else ... best approach might be to use c-bindgen and expose Rust bits to C++ kernels. Compile those down to PTX and then just launch from the Rust code.

thedodd commented 1 year ago

Ok, after lots of experimentation and dead ends, I've got a working solution here: https://github.com/Rust-GPU/Rust-CUDA/pull/87. More to be done, but this proves that there is a viable path forward. Now I just need to make it pretty.