Rust-GPU / Rust-CUDA

Ecosystem of libraries and tools for writing and executing fast GPU code fully in Rust.
Apache License 2.0

Port to more current rust-nightly #98

Open apriori opened 1 year ago

apriori commented 1 year ago

breaking:

to be further tested:

apriori commented 1 year ago

It seems that cuda_std::println is broken when it has to format a number. If one changes the add_gpu example to this:

extern crate alloc;

use cuda_std::prelude::*;

#[kernel]
#[allow(improper_ctypes_definitions, clippy::missing_safety_doc)]
pub unsafe fn add(a: &[f32], b: &[f32], c: *mut f32) {
    let idx = thread::index_1d() as usize;
    if idx < a.len() {
        let elem = &mut *c.add(idx);
        *elem = a[idx] + b[idx];

        if idx == 0 {
            cuda_std::println!("Elem 0: {}", *elem);
        }
    }
}

The resulting PTX is invalid, and ptxas reports:

ptxas examples/cuda/resources/add.ptx, line 1370; error   : State space mismatch between instruction and address in instruction 'ld'
ptxas examples/cuda/resources/add.ptx, line 1399; error   : State space mismatch between instruction and address in instruction 'ld'

An offending PTX section looks like:

Top section, truncated:

.const .align 8 .u8 _ZN4core3fmt12USIZE_MARKER17h8e203fb7dfec90c9E[8] = {0XFF(_ZN4core3ops8function6FnOnce9call_once17h95dfe8b893b0399cE), 0xFF00(_ZN4core3ops8function6FnOnce9call_once17h95dfe8b893b0399cE), 0xFF0000(_ZN4core3ops8function6FnOnce9call_once17h95dfe8b893b0399cE), 0xFF000000(_ZN4core3ops8function6FnOnce9call_once17h95dfe8b893b0399cE), 0xFF00000000(_ZN4core3ops8function6FnOnce9call_once17h95dfe8b893b0399cE), 0xFF0000000000(_ZN4core3ops8function6FnOnce9call_once17h95dfe8b893b0399cE), 0xFF000000000000(_ZN4core3ops8function6FnOnce9call_once17h95dfe8b893b0399cE), 0xFF00000000000000(_ZN4core3ops8function6FnOnce9call_once17h95dfe8b893b0399cE)};
$L__BB6_5:
    mov.u64     %rd112, 0;
    ld.v2.u32   {%r5, %r6}, [%rd108+32];
    ld.u8   %rs3, [%rd108+40];
    st.local.u8     [%rd7+8], %rs3;
    st.local.v2.u32     [%rd7], {%r5, %r6};
    ld.u64  %rd109, [%rd108+24];
    ld.u16  %rs4, [%rd108+16];
    and.b16     %rs2, %rs4, 3;
    setp.eq.s16     %p6, %rs2, 2;
    mov.u64     %rd110, %rd112;
    @%p6 bra    $L__BB6_10;

    setp.ne.s16     %p7, %rs2, 1;
    @%p7 bra    $L__BB6_9;

    shl.b64     %rd63, %rd109, 4;
    add.s64     %rd64, %rd115, %rd63;
    add.s64     %rd18, %rd64, 8;
    ld.u64  %rd65, [_ZN4core3fmt12USIZE_MARKER17h8e203fb7dfec90c9E];
    ld.u64  %rd66, [%rd64+8];
    setp.ne.s64     %p8, %rd66, %rd65;
    mov.u64     %rd110, %rd112;
    @%p8 bra    $L__BB6_10;

    ld.u64  %rd68, [%rd18+-8];
    ld.u64  %rd109, [%rd68];
    mov.u64     %rd110, 1;
    bra.uni     $L__BB6_10;

Both of those references are to the core::fmt::USIZE_MARKER constant. I am not quite sure what is going on there.

Despite this error, a much more complex example (not using cuda_std::println) works. So codegen is not "entirely" broken, but apparently something is odd with such cases of global, static constants.
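For context, here is a minimal host-side sketch of the pattern that USIZE_MARKER appears to implement in the core::fmt of that era: a static holds a function pointer, and an argument counts as a "count" only if its formatter pointer equals the value loaded from that static. That load from a global static is the kind of access the failing `ld` instructions perform. This is a simplified illustration from memory, not the actual core source; the names FmtFn, COUNT_MARKER, Arg and as_count are hypothetical.

```rust
use std::fmt;

// Hypothetical alias for the formatter function pointer type.
type FmtFn = fn(&usize, &mut fmt::Formatter<'_>) -> fmt::Result;

// Marker function: only its address matters, it is never meant to format anything.
fn count_marker(_: &usize, _: &mut fmt::Formatter<'_>) -> fmt::Result {
    Ok(())
}

// An ordinary formatter with a different body, hence a different address.
fn display_usize(v: &usize, f: &mut fmt::Formatter<'_>) -> fmt::Result {
    write!(f, "{}", v)
}

// Hypothetical stand-in for core::fmt::USIZE_MARKER: a global static whose
// value is a function pointer. Reading it is a load from static data.
static COUNT_MARKER: FmtFn = count_marker;

// Simplified stand-in for a formatting argument.
struct Arg {
    value: usize,
    formatter: FmtFn,
}

// Returns the value only if the argument carries the marker formatter,
// decided by comparing function pointer addresses.
fn as_count(arg: &Arg) -> Option<usize> {
    if arg.formatter as usize == COUNT_MARKER as usize {
        Some(arg.value)
    } else {
        None
    }
}
```

On the host this compiles to a plain load and compare. In the PTX above, the static is placed in `.const` while the load is a plain `ld` without a state space, which would be consistent with the mismatch ptxas reports.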

thedodd commented 1 year ago

@apriori hello there. Just wanted to check in and see if you've been having any success on this branch. I have a few open PRs, some of which I am actively using, and I'm thinking about rebasing them onto this branch in order to gain the benefits of the updated rustc.

Think it is reasonable to rebase onto this branch?

thedodd commented 1 year ago

I am getting an "an illegal memory access was encountered" error on this branch, whereas on master I do not get that error. I'll see if I can pin down the issue. This appears to be related to the use of shared memory.

I moved back to master because I have a few parallel reduction algorithms that make heavy use of shared memory, and I don't want to take the time right now to debug the code gen issue :).

thedodd commented 1 year ago

TBH, I really hope that @RDambrosio016 (hope all is well) comes back some day. Having to move over to a C++ wrapper pattern, building lots of shared libraries, multi-stage nvcc build pipelines and such ... not fun.

This framework on the other hand already has a lot of work put into it, and keeping it up-to-date and moving forward is a huge boon to the community. I'm still holding out hope that it will be revitalized soon :).

apriori commented 1 year ago

TBH, I really hope that @RDambrosio016 (hope all is well) comes back some day. Having to move over to a C++ wrapper pattern, building lots of shared libraries, multi-stage nvcc build pipelines and such ... not fun.

This framework on the other hand already has a lot of work put into it, and keeping it up-to-date and moving forward is a huge boon to the community. I'm still holding out hope that it will be revitalized soon :).

I wish the same, but so far it seems @RDambrosio016 has lost interest or no longer has the time. This port should be considered more of a hack. I have little to no knowledge in the field and was merely attempting the port (similar to how @RDambrosio016 did it) by taking rustc_codegen_llvm as a template.

For me, non-trivial programs were working as long as cuda_std::println was not used. I do not recall whether they used shared memory. I think I did ... gotta recheck.

Anyway, some more work should happen on this, or this framework will lose its connection to rustc development entirely and will not gain acceptance. Unfortunately, using internal rustc libraries means a continuous maintenance effort.

RDambrosio016 commented 1 year ago

Sorry, I've just been really busy with my degree and other things. I think being tied to a different codegen, and especially to libNVVM, is not the way to go for the future. I think adding the required linking logic for nvptx in rustc is much easier and better. I'm doing some experiments trying to do that.

thedodd commented 1 year ago

@RDambrosio016 nice! Hope all is going well with your studies.

apriori commented 1 year ago

@RDambrosio016 so you would prefer to use the already existing nvptx codegen backend of rustc? I remember you mentioning it has inferior optimizations compared to libNVVM. Then the long-term approach would be to improve that upstream, right?

thedodd commented 1 year ago

BTW, something I've done to help mitigate the issue with having to use the older compiler version:

There are a few ways to optimize this. It doesn't need to be an example; there are other ways. Keeping it out of the build.rs of the larger project helps ensure that the Rust toolchain limitation doesn't spread.

dssgabriel commented 1 year ago

@apriori Hello there!

I would like to port Rust-CUDA to the latest libNVVM (version 2.0) that came out with CUDA 12.0 (see #100). Is this draft up to date with current nightly (1.71.0 as of writing)? I think it would be better to base ourselves on a more recent version of rustc if we aim at bringing the whole crate up to date.

Despite what @RDambrosio016 said a few weeks ago about abandoning the NVVM codegen and moving to what's already implemented in rustc, I think we are much better off with rustc-codegen-nvvm at the moment. I haven't been able to generate valid PTX using the nvptx64-nvidia-cuda target implemented in rustc, even on very simple AXPY-like kernels. There don't seem to be any efforts toward better support by the compiler either, despite nvptx being a Tier 2 target. Moreover, the better optimization opportunities and the fact that NVIDIA will continue supporting libNVVM in the future make it much more appealing to stay on this codegen IMHO. I think this is a great project and it would be a shame to throw all that hard work away.

I also heard that NVIDIA might be in the process of updating their tools to a much more recent LLVM version as even for them it's too difficult to rely on something as old as v7.0.1. This would probably simplify some of the logic implemented in rustc-codegen-nvvm but we shall see. Finally, it seems that some of the guys working on the NVHPC toolkit at NVIDIA are also Rust enjoyers and they'd be willing to push things for Rust if NVIDIA gets enough demand for it. I would very much like to see Rust carve a bigger spot in the field of HPC and GPGPU computing and this project feels like the best place to do so!

RDambrosio016 commented 1 year ago

@dssgabriel interesting, what do you mean by invalid PTX? I was not able to build anything, since it requires a custom linker that doesn't work on Windows (my proposal would put the linking logic in rustc itself). The LLVM PTX backend is mature enough that I would expect it to generate valid code unless rustc is emitting something very suspicious.

apriori commented 1 year ago

@dssgabriel

I would like to port Rust-CUDA to the latest libNVVM (version 2.0) that came out with CUDA 12.0 (see #100). Is this draft up to date with current nightly (1.71.0 as of writing)? I think it would be better to base ourselves on a more recent version of rustc if we aim at bringing the whole crate up to date.

Unfortunately no. rustc is a rapidly moving target. I once tried a slightly more recent nightly, from after 2022/12/10, and compilation failed. There are two approaches to this that I would consider "valid":

a) Fix the issues in this MR and continue from there
b) Start over from current HEAD

One can and should use rustc_codegen_llvm as a template. But here and there, more detailed knowledge about CUDA PTX is required. Some solutions I merely guessed, and I bet some of those guesses were wrong.

As far as I know though, libNVVM 2.0 is very different from prior versions. I think @RDambrosio016 can comment more on the feasibility of this. I would also prefer to have these efforts more "upstream", but we are kind of lost if upstream rustc is not moving on and/or improving the PTX backend.

Despite what @RDambrosio016 said a few weeks ago about abandoning the NVVM codegen and moving to what's already implemented in rustc, I think we are much better off with rustc-codegen-nvvm at the moment. I haven't been able to generate valid PTX using the nvptx64-nvidia-cuda target implemented in rustc, even on very simple AXPY-like kernels.

I cannot comment on this, other than to say that I never really tried the official rustc PTX backend. Rust-CUDA was simply the far more compelling and accessible solution. This is also due to @RDambrosio016's good documentation and immediately runnable examples, let alone all his hard work on building pretty much an ecosystem of libraries.

I think this is a great project and it would be a shame to throw all that hard work away.

It is amazing work by @RDambrosio016 indeed. Still, I would imagine you could rebase this ecosystem of libs, bindings and APIs on a different codegen.

I also heard that NVIDIA might be in the process of updating their tools to a much more recent LLVM version as even for them it's too difficult to rely on something as old as v7.0.1. This would probably simplify some of the logic implemented in rustc-codegen-nvvm but we shall see.

As the interfacing would still be via libNVVM, I doubt that it has any impact on general accessibility. The developer install experience might improve a bit when no longer depending on ancient LLVM versions, but that is pretty much it.

Finally, it seems that some of the guys working on the NVHPC toolkit at NVIDIA are also Rust enjoyers and they'd be willing to push things for Rust if NVIDIA gets enough demand for it. I would very much like to see Rust carve a bigger spot in the field of HPC and GPGPU computing and this project feels like the best place to do so!

So far, my experience with Rust-CUDA has also been that single-source is something I would really love to see, but I imagine it is hard with the Rust compilation model, especially with cfg_if. NVIDIA pours a lot of resources into the CUDA ecosystem and they do an amazing job. I am not sure what their take would be on "rewriting quite an amount of it". See for example https://github.com/NVIDIA/cub, which is absolutely crucial if you do not want to reinvent the square wheel all the time when writing high-performance custom kernels. Still, even without it, Rust-CUDA was a much better experience for me than plain CUDA and C++. CUDA is the best-in-class ecosystem when it comes to GPGPU, but it still feels decades behind general software tooling and language development. The match with C++ resulted in the worst compilation and tooling experience possible (compared to tooling in other languages), and there is absolutely nothing technological dictating this combination. Also, when you look at e.g. the shuffle instructions, C-like imperative languages feel like a fundamental misfit for describing the underlying computation model.
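To illustrate the point about shuffle instructions, here is a small CPU-side simulation in plain Rust of a warp-level sum using a shuffle-down exchange. It is only a sketch of the semantics, not GPU code and not a Rust-CUDA API; WARP_SIZE, shfl_down and warp_reduce_sum are hypothetical names. The whole warp is modelled as one array so that the collective nature of the operation is explicit: every lane reads a register of another lane in the same step, which a per-thread imperative program can only express implicitly.

```rust
// Number of lanes in a simulated warp (32 on current NVIDIA hardware).
const WARP_SIZE: usize = 32;

// Simulated shuffle-down: lane i receives the value held by lane i + delta.
// Lanes whose source would fall outside the warp keep their own value.
fn shfl_down(vals: &[f32; WARP_SIZE], delta: usize) -> [f32; WARP_SIZE] {
    let mut out = *vals;
    for lane in 0..WARP_SIZE {
        if lane + delta < WARP_SIZE {
            out[lane] = vals[lane + delta];
        }
    }
    out
}

// Tree reduction over the warp: halve the shuffle distance each step.
// After log2(WARP_SIZE) steps, lane 0 holds the sum of all lanes.
fn warp_reduce_sum(mut vals: [f32; WARP_SIZE]) -> f32 {
    let mut delta = WARP_SIZE / 2;
    while delta > 0 {
        let shifted = shfl_down(&vals, delta);
        for lane in 0..WARP_SIZE {
            vals[lane] += shifted[lane];
        }
        delta /= 2;
    }
    vals[0]
}
```

Only lane 0 ends up with the full sum; the other lanes hold partial or redundant values, just as in the usual device-side version of this pattern.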

David-OConnor commented 8 months ago

Is there anything I can do to help? Is it just an issue of putting some line of code in the right place? Can we write a new bindgen wrapper to get low-level access?