MixmasterFresh / rust-on-gpu

A fork of the Rust Language for experimenting with GPU support.
https://www.rust-lang.org

Discussion of rust on gpu design #1

Open MixmasterFresh opened 8 years ago

MixmasterFresh commented 8 years ago

This should be the best place to hold a discussion of the work we are going to do here.

MixmasterFresh commented 8 years ago

I think that to start we should establish that this project should provide effective support for both the NVPTX and AMDGPU targets. With this in mind, I think it would be awesome to see how much commonality we can extract from the two targets (in terms of intrinsics and the like) in order to create something resembling a write-once-run-anywhere notion for Rust on the GPU. There are definitely steps we can take to make this a reality, and I would be really interested to see how close we can get. There are, of course, some limitations to this strategy, but abstracting out these common intrinsics could make Rust on the GPU a far more powerful tool.

japaric commented 8 years ago

Copy pasting my thoughts about the design from an e-mail exchange:

I'm of the opinion that we should change/extend the language as little as possible; I think that approach increases the chances of this work landing upstream. In particular, I believe (though I have no definitive proof) that the only required change is to extend rustc to produce PTX from Rust code; CUDA-like single-source programs can then be written with the help of a plugin. I've sketched a tentative design for such a plugin here.
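
As a purely illustrative sketch (not the linked design): once rustc can produce PTX, even a trivial function could be compiled for the GPU, e.g.:

// Illustrative only: compile with something like
//     rustc --emit=asm --target nvptx64-nvidia-cuda kernel.rs
// (the target triple is borrowed from LLVM's NVPTX backend and is an
// assumption here, not a target rustc currently has)
#[no_mangle]
pub fn axpy(a: f32, x: f32, y: f32) -> f32 {
    a * x + y
}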


I'm not familiar with AMDGPU. Can you use it to write number-crunching programs as you would with CUDA, perhaps in conjunction with OpenCL? Any links to what such a program looks like?

MixmasterFresh commented 8 years ago

I agree that we should not be dealing too much in changing the language itself. I am simply saying that if we have three sets of "intrinsics" (AMD, PTX, and a generic set determined at compile time based on the target), there are some definite upsides. We should only be trying to expose new targets, leaving everything else to crates. I think we are on the same page here. All I am saying is that where we have the opportunity to generalize concepts shared between the two targets, we should do so in the interest of cross-compatibility.

As for AMDGPU, it is simply AMD's GPU assembly, just as PTX is NVIDIA's. They do pretty much the same thing. The biggest difference is that you can't use AMDGPU with CUDA (and there are some minor functionality differences). You can use AMDGPU with OpenCL.

japaric commented 8 years ago

I am simply saying that if we have three sets of "intrinsics" (AMD, PTX, and a generic set determined at compile time based on the target), there are some definite upsides.

Oh right, I agree. We can provide a stable, device-agnostic interface to the "intrinsics", plus device-specific opt-ins à la std::os. Something like this:

// in libcore maybe? or in a new libgpu crate

// Unstable forever
pub mod intrinsics {
    extern "rust-intrinsic" {
        // ..

        // Emits the `llvm.cuda.syncthreads` intrinsic
        #[cfg(target_arch = "nvptx")]
        fn cuda_syncthreads();

        // Emits the equivalent AMDGPU intrinsic
        #[cfg(target_arch = "amdgpu")]
        fn amdgpu_syncthreads();

        // These GPU intrinsics can hopefully be kept private
    }
}

pub mod gpu {
    /// Acts as a synchronization barrier for all the GPU threads
    pub fn syncthreads() {
        // intrinsics are unsafe to call; this wrapper hides that
        #[cfg(target_arch = "amdgpu")]
        unsafe { ::intrinsics::amdgpu_syncthreads() }

        #[cfg(target_arch = "nvptx")]
        unsafe { ::intrinsics::cuda_syncthreads() }
    }

    // AMDGPU-specific code (hopefully empty or minimal)
    #[cfg(target_arch = "amdgpu")]
    pub mod amdgpu {
        // ..
    }

    // NVPTX-specific code
    #[cfg(target_arch = "nvptx")]
    pub mod ptx {
        // ..
    }
}

On the language side, I think we'll have to at minimum expose the OpenCL/CUDA address-space qualifiers (constant, global, etc.) in one form or another -- it could be via attributes, e.g. #[constant] static FOO. Both the PTX and the AMDGPU backends support these address spaces via LLVM-IR's addrspace qualifier.
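
A minimal sketch of what that surface syntax might look like (the attribute name and the address-space mapping are assumptions, not settled design):

// Hypothetical syntax: place a static in the constant address space.
// On NVPTX this would lower to a global in LLVM's addrspace(4);
// AMDGPU uses its own numbering for the constant address space.
#[constant]
static LUT: [f32; 4] = [1.0, 2.0, 4.0, 8.0];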


SPIR-V

I haven't read much of the SPIR-V spec yet, but I did look at the spirv-llvm repository.

It appears that in principle we can add a spir(64)-unknown-unknown (rustc) target that compiles executables to *.spv(64) files. How? There is an llvm-spirv tool that translates LLVM bitcode to SPIR-V, so we could do the same thing the asmjs target does: have rustc emit LLVM bitcode instead of object files and then run the llvm-spirv tool as a "linker" over it to produce SPIR-V from the bitcode.
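
Concretely, the pipeline might look something like this (a sketch; the exact target name and flags are assumptions):

# emit LLVM bitcode instead of an object file
rustc --target spir64-unknown-unknown --emit=llvm-bc kernel.rs -o kernel.bc

# "link" by translating the bitcode to SPIR-V
llvm-spirv kernel.bc -o kernel.spv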


Testing

It would be nice to have three minimal hello-world-ish programs written in CUDA, OpenCL, and Vulkan (kinda like my CUDA memcpy test) that just launch a kernel, to test the PTX/AMDGPU/SPIR-V we produce. Eventually these could become an integration (run-make) test.

MixmasterFresh commented 8 years ago

SPIR-V

The spirv-llvm repo is a nightmare; I would avoid wasting too much time on it. I spent several weeks trying to get it to work, and it just kept breaking in new and interesting ways. I couldn't even get it to compile when I tried to move it up to LLVM 3.8. I think it is likely that at some point someone will add a SPIR-V target to LLVM, but until that happens, I wouldn't worry about it.

Testing

Right now we can only run the Travis tests, so we should throw all of the current (code generation) tests in there. Eventually we can export the PTX and AMDGPU artifacts they create and test those elsewhere, but that is still a ways off. I think you were right when you said we should write unit tests that generate PTX or AMDGPU and check that the output contains certain elements. Eventually we can also add regression tests that check the emitted PTX or AMDGPU against the last stable build on master.
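
Such a test might look like rustc's FileCheck-based codegen tests (a sketch; the NVPTX target name and the exact PTX directives emitted are assumptions):

// compile-flags: --emit=asm --target nvptx64-nvidia-cuda
// CHECK: .visible .func
#[no_mangle]
pub fn foo() {}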

Device Manufacturer Abstraction

That was exactly what I was thinking.

Getting to work

It looks like you have done most of the work on the PTX target, so soon we can start merging that in here. I have just gotten started on the AMDGPU side of things. I am also trying to get some AMD (technically, OpenCL) programs ready so that I can test that target.

japaric commented 8 years ago

The spirv-llvm repo is a nightmare. [...] I think it is likely that at some point someone will add a SPIR-V target to LLVM, but until that happens, I wouldn't worry about it.

Ah well. We can wait.

eddyb commented 8 years ago

@japaric FWIW, such intrinsics currently live under "platform-intrinsic", and things like <arch>_syncthreads would fit right in. They're auto-generated by a Python script from some JSON files in src/etc/platform-intrinsics, if you want to play with that.
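
For illustration, such an intrinsic would be declared behind the "platform-intrinsic" ABI roughly like this (the nvptx_syncthreads name is hypothetical):

#![feature(platform_intrinsics)]

extern "platform-intrinsic" {
    // hypothetical name; would lower to `llvm.cuda.syncthreads` on NVPTX
    fn nvptx_syncthreads();
}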

japaric commented 8 years ago

Device Manufacturer Abstraction

So, I think this shouldn't be done in this repository; instead, it should be iterated on as a crate on crates.io, just like the simd crate. That crate can use the platform intrinsics exposed by the compiler. I think that ultimately the Rust project should only provide:

MixmasterFresh commented 8 years ago

That makes sense. As I thought about it more, I started to see some issues (lack of a common framework, lack of common terminology).