EmbarkStudios / rust-gpu

🐉 Making Rust a first-class language and ecosystem for GPU shaders 🚧
https://shader.rs
Apache License 2.0
7.35k stars 245 forks source link

API Design: StorageBuffers #249

Open charles-r-earp opened 4 years ago

charles-r-earp commented 4 years ago

Summary

This proposal attempts to address #232, #180, and #8. StorageBuffers, aka Buffer Blocks, are the primary inputs / outputs to compute shaders, though they can also be used in other stages. Iterators are one of Rust's highlights, and I propose emulating them in gpu code.

Example: Scaled Add aka Saxpy

// Rust Cpu
fn scaled_add(x: &[f32], mut y: &mut [f32], alpha: f32) {
    for (x, mut y) in x.iter().copied().zip(y) {
        *y += alpha * x;
    }
}

// OpenCL 
__kernel void scaled_add(__global const float* x, __global float* y, float alpha, uint n) {
    uint gid = get_global_id(0);
    if gid < n {
        y[gid] += alpha * x[gid];
    }
}

// GLSL 
#version 450

layout(set=0, binding=0) buffer Input {
    float x[];    
}

layout(set=0, binding=1) buffer Output {
    float y[];    
}

layout(push_constant) uniform PushConsts {
    float alpha;
    uint n;
}

void main() {
    uint gid = gl_GlobalInvocationID.x;
    if gid < n {
        y[gid] += alpha * x[gid];
    }
}

Saxpy is trivially parallel, that is, it can be separated into n independent operations. There are plenty of other similar kinds of operations common in CUDA / OpenCL code, which do not require any synchronization / barriers. In fact, this is probably the most common case.

Possible Naive Implementation in rust-gpu

#[allow(unused_attributes)]
#[spirv(gl_compute)]
pub fn scaled_add(x: Buffer<[f32]>, mut y: BufferMut<[f32]>, alpha: f32, n: u32) {
    let gid = spirv_std::global_x();
    if gid < n {
        unsafe {
            *y.get_mut_unchecked(gid) = x.get_unchecked(gid);
        }    
    } 
} 

This is the most straightforward translation of the above to rust gpu code. If we neglect concerns about aliasing with other shaders, then the only potential failure mode would be that the user provided n is outside the bounds of either x or y. The programmer is responsible, by using unsafe, to ensure no aliasing within the shader.

The rayon crate allows for easy parallel processing on cpu's, with its ParallelIterator trait. It looks like this:

// rayon
fn scaled_add(x: &[f32], mut y: &mut [f32], alpha: f32) {
    use rayon::iter::ParallelIterator;
    for (mut y, x) in y.par_iter_mut().zip(x.iter().copied()) {
        *y += alpha * x;
    }
}

Rayon divides the work into n parts, where n is the number of workers. It knows that partioning a slice, even a mutable one, is safe. I propose a similar api for rust-gpu.

Proposal

Buffers: Runtime Arrays of T: Copy

Arrays: Like buffers, but with a const size, either const generics or array type ie <T = f32, const N: usize = 1024>, or <T = [f32; 1024]>

Matrices!?? Would probably want const generics, ie Matrix2<T, D1 = 100, D2 = 64>

Blocks: A single T struct item, T: Copy

If necessary, mutable Block fields could be accomplished via some sort of Mutex or ArcCell equivalent, but that would require relaxing the Copy requirement.

Arrays and Blocks are safer because the runtime can validate the inputs prior to launching. Start with blocks, then arrays, then buffers. Note that all of these require a special "Block" decorated struct wrapper, at least per SPIR-V specification. Slices are also tricky / not allowed in exports because of Rust's unstable abi.

Iterators

GlobalIterator trait

GroupIterator trait

*Iter's are like slice::Iter, they iterate over a borrow.

IntoGlobalIterator / IntoGroupIterator

*IntoIter's: consume their container, but still yield borrowed values. This allows them to mutate the output, but they are consumed from the scope of the shader.

Unsafe

Buffers, Arrays, may have unsafe access to their internal Slice. It may not always be possible to prove that a program is safe, and it will take time to implement enough safe wrappers to fit every need.

Likewise, access to invocation specific values, like the global xyz, must be unsafe or even not allowed at all to ensure that the safe GlobalIterator construct is in fact safe. This means that the shader cannot get the u32 value, but could say, manipulate it mathmatically, and index a Buffer / slice / or pointer with it (this would require unsafe). The key thing is that it can't be read and it can't be used in control flow outside of the inner closure of for_each. This could be implemented as a wrapper, ie Idx(u32), which implements the appropriate traits.

Barriers

Barriers are emitted by Iter and IterMut iterators as required. IntoIter's should be able to ommit barriers, since they consume their inputs, so that they cannot be read / written to again within the shader.

Globally Const vs Per Invocation

For trivial cases like axpy, there is no need for the shader to access non-const memory (outside of the closure passed to for_each). The closure cannot mutate it's environment, or even borrow it, and nothing is returned from it. This prevents non-static control flow. Only the push_constants would be copied into the closure (via move).

Single Invocation operations

In some cases, it may be necessary to have only one invocation peform some work. This could potentially handled with a special SingleGlobalBufferIntoIter or the like.

Putting it all together

#[allow(unused_attributes)]
#[spirv(gl_compute)]
pub fn scaled_add(x: GlobalArray<f32, 100>, y: GlobalArrayMut<f32, 100>, alpha: f32) {
    y.into_global_iter_mut().zip(x)
        .for_each(|(mut y, x)| *y += alpha * x);
}
Jasper-Bekkers commented 3 years ago

Related to #216 so might be good to get @Tobski's eyes on this as well.

Jasper-Bekkers commented 3 years ago

For trivial cases like axpy, there is no need for the shader to access non-const memory (outside of the closure passed to for_each). The closure cannot mutate it's environment, or even borrow it, and nothing is returned from it. This prevents non-static control flow. Only the push_constants would be copied into the closure (via move).

Would you mind elaborating "non-const memory" here?

I've had discussions with @Tobski before, and the iterator based approach to data access is actually quite appealing to us - especially having something like you've proposed for the simpler cases.

One extension we discussed back then was to have a UniformIndex-style type which would allow safe remapping operations on the equivalent of gl_globalinvocationid (at least in such a way that they wouldn't clobber other elements in the buffer). Kind of like this:

#[allow(unused_attributes)]
#[spirv(gl_compute)]
pub fn scaled_add(x: GlobalArray<f32, 100>, y: GlobalArrayMut<f32, 100>, alpha: f32) {
    y.into_global_iter_mut().map_idx(|idx| idx ^ 2).zip(x)
        .for_each(|(mut y, x)| *y += alpha * x);
}
XAMPPRocky commented 3 years ago

We discussed this at the meeting today, and the consensus that we're generally in favour of idea of adding storage buffers, and we really like the style of using iterators for this task. However we think that there's still a lot of design work to be done in this, and would like to see a full RFC on this topic.

Additionally before we'd accept an RFC on a safe API for storage buffers, we'd like to first see an unsafe API for storage buffers available, that allows people to prototype and build their own safe abstractions before merging it into spirv-std.

charles-r-earp commented 3 years ago

For trivial cases like axpy, there is no need for the shader to access non-const memory (outside of the closure passed to for_each). The closure cannot mutate it's environment, or even borrow it, and nothing is returned from it. This prevents non-static control flow. Only the push_constants would be copied into the closure (via move).

Would you mind elaborating "non-const memory" here?

Hmmm, by non-const I mean the data in buffers, and global_id's, sizes etc. The idea is that the closure passed to for_each can't borrow anything, and doesn't return anything, and is the only way, at least safely, to access the data in buffers or the global id. If we know that at entry level that the code only has access to things that are uniform across all invocations, then control flow must also be uniform. The non uniform part is either in the closure, or expressed via the iterator, ie if the buffer is len 100 and the global size is 128, then the extra 28 are invalidated, rather than reading out of bounds. This acts like zipping a collection with a range.