EmbarkStudios / rust-gpu

🐉 Making Rust a first-class language and ecosystem for GPU shaders 🚧
https://shader.rs
Apache License 2.0
7.28k stars 246 forks

Language feature: barriers #8

Open Jasper-Bekkers opened 4 years ago

Jasper-Bekkers commented 4 years ago

We want the same kinds of safety on the GPU as we have on the CPU. One area where it's easy to introduce race conditions is memory barriers: HLSL and GLSL require the programmer to explicitly put them in the right locations, without any formal verification.

This is an attempt at solving that within the Rust type system (the attempt still has racy bugs and is largely incomplete).

use core::ops::Deref;
use core::ops::DerefMut;
use core::ops::Index;
use core::ops::IndexMut;

struct MyFoo {
    test: GroupSharedArray<u32>,
}

struct GroupSharedArray<T>
where
    T: Default + Copy,
{
    data: Box<[T]>,
    size: usize,
}

impl<T: Default + Copy> GroupSharedArray<T> {
    fn new() -> Self {
        Self {
            size: 100,
            data: Box::new([T::default(); 100]),
        }
    }
}

struct GroupSharedWriter<T> {
    data: T,
}

impl<T> Deref for GroupSharedWriter<T> {
    type Target = T;

    fn deref(&self) -> &Self::Target {
        &self.data
    }
}

impl<T> DerefMut for GroupSharedWriter<T> {
    fn deref_mut(&mut self) -> &mut Self::Target {
        &mut self.data
    }
}

impl<T> GroupSharedWriter<T> {
    fn new(data: T) -> Self {
        Self { data }
    }
    fn barrier(self) -> GroupSharedReader<T> {
        GroupSharedReader { data: self.data }
    }
}

struct GroupSharedReader<T> {
    data: T,
}

impl<T> Deref for GroupSharedReader<T> {
    type Target = T;

    fn deref(&self) -> &Self::Target {
        &self.data
    }
}

impl<T> GroupSharedReader<T> {
    fn barrier(self) -> GroupSharedWriter<T> {
        GroupSharedWriter { data: self.data }
    }
}

impl<T: Default + Copy> Index<Uniform<u32>> for GroupSharedArray<T> {
    type Output = T;

    fn index(&self, index: Uniform<u32>) -> &Self::Output {
        &self.data[index.data as usize]
    }
}

impl<T: Default + Copy> IndexMut<Uniform<u32>> for GroupSharedArray<T> {
    fn index_mut(&mut self, index: Uniform<u32>) -> &mut Self::Output {
        &mut self.data[index.data as usize]
    }
}

struct Uniform<T> {
    data: T,
}

impl<T> Uniform<T> {
    fn new(data: T) -> Self {
        Self { data }
    }
}

fn thread_idx() -> Uniform<u32> {
    Uniform::new(0)
}

fn main() {
    let mut data = GroupSharedWriter::new(MyFoo {
        test: GroupSharedArray::new(),
    }); // new does barrier

    data.test[thread_idx()] = thread_idx().data;
    data.test[thread_idx()] = 666; // should be illegal

    let value = data.test[thread_idx()]; // should be illegal

    //data.test[0] = thread_idx().data; // what happens in this case? it's unsafe/unsound: race condition over the contents of `test`

    let other_data = data.barrier(); // `barrier` returns a GroupSharedReader

    // some race-y cases handled correctly:
    // data.test[thread_idx()] = 666; // is correctly marked illegal
    // other_data.test[thread_idx()] = 666; // is correctly marked illegal

    let test = other_data.data.test[thread_idx()];
}

// tl;dr
// - use move semantics to enforce barriers between reading and writing
// - broken right now because DerefMut and IndexMut require Deref and Index to be implemented
// - broken right now because we can write to the same location multiple times (data race)
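One way around the Deref/Index coupling noted in the tl;dr is to skip the `Index`/`IndexMut` traits entirely and give each typestate only inherent methods, so the writer state exposes only `write` and the reader state only `read`. A minimal CPU-side sketch with illustrative names (not rust-gpu API):

```rust
struct SharedArray<T> {
    data: Box<[T]>,
}

// Typestates: a Writer can only write, a Reader can only read.
struct Writer<T>(SharedArray<T>);
struct Reader<T>(SharedArray<T>);

impl<T: Copy> Writer<T> {
    // Only writes are possible in the writer state.
    fn write(&mut self, index: usize, value: T) {
        self.0.data[index] = value;
    }
    // Consuming `self` models the barrier: the writer must be given up
    // before anyone may read.
    fn barrier(self) -> Reader<T> {
        Reader(self.0)
    }
}

impl<T: Copy> Reader<T> {
    // Only reads are possible in the reader state.
    fn read(&self, index: usize) -> T {
        self.0.data[index]
    }
    fn barrier(self) -> Writer<T> {
        Writer(self.0)
    }
}

fn main() {
    let mut w = Writer(SharedArray {
        data: vec![0u32; 4].into_boxed_slice(),
    });
    w.write(0, 7);
    let r = w.barrier(); // `w` is moved: further writes are a compile error
    assert_eq!(r.read(0), 7);
}
```

This doesn't address the second bug (a writer can still store to the same slot twice), but it removes the trait coupling that forced reads to be available in the write state.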
Jasper-Bekkers commented 4 years ago

The Uniform type is currently very limited, but it prevents neighboring threads from writing into each other's LDS buckets concurrently (because we can't do any sort of math on it). However, we should investigate whether we can relax this a bit; when reading, in particular, it should be totally fine to index into a neighbor's groupshared memory.
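That relaxation could look something like the following sketch (illustrative CPU-side types, not rust-gpu API): writes keep requiring the thread's own `Uniform` index, while the post-barrier read-only state accepts an arbitrary `u32`, since concurrent reads of a neighbor's slot cannot race once all writes are barriered out.

```rust
// A uniform index: each thread may only write its own slot.
struct Uniform(u32);

struct SharedWriter { data: Vec<u32> }
struct SharedReader { data: Vec<u32> }

impl SharedWriter {
    // Writes are restricted to the thread's own `Uniform` index.
    fn write(&mut self, index: Uniform, value: u32) {
        self.data[index.0 as usize] = value;
    }
    fn barrier(self) -> SharedReader {
        SharedReader { data: self.data }
    }
}

impl SharedReader {
    // An arbitrary index is fine here: the data is immutable in this state.
    fn read(&self, index: u32) -> u32 {
        self.data[index as usize]
    }
}

fn main() {
    let mut w = SharedWriter { data: vec![0; 4] };
    w.write(Uniform(1), 42);
    let r = w.barrier();
    assert_eq!(r.read(1), 42); // reading a "neighbor's" slot is allowed
}
```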

Jasper-Bekkers commented 4 years ago

Examples from @h3r2tic and others using group shared memory:

groupshared uint stack[GROUP_SIZE * LDS_STACK_SIZE];

bool bvhIntersectAny(Ray ray, uint stackBase : SV_GroupIndex) {
    uint bottom = stackBase * LDS_STACK_SIZE;
    uint stackPtr = bottom;
    stack[stackPtr++] = InvalidAddr;

    do {
        Node node = g_nodes[NonUniformResourceIndex(bvh)][addr];

        if(!isLeaf(node)) {
            // ... 
            if(traverseLeft || traverseRight) {
                uint postpone = InvalidAddr;
                // ...
                if(traverseLeft && traverseRight) {
                    stack[stackPtr++] = postpone; // needs manual bounds checking
                }

                continue;
            }
        } else {
            if(isInstance(node)) {
                stack[stackPtr++] = InvalidAddr; // needs manual bounds checking
                continue;
            } else {
                // 
            }
        }

        addr = stack[--stackPtr]; // needs manual bounds checking

        if(addr == InvalidAddr && stackPtr > 0 && instanceId != ~0) {
            addr = stack[--stackPtr]; // needs manual bounds checking
        }
    } while(addr != InvalidAddr);
}
Tobski commented 3 years ago

We had the following discussion on Discord about bringing some of Rust's "fearless concurrency" to rust-gpu by tweaking the way accesses to shared memory (buffers/UAVs, workgroup memory, etc.) are managed when translating to SPIR-V. So we don't lose the conversation, I've pasted it below - we can tease out more details later, but for now the point is just to stow it somewhere it won't get lost...

Tobias Hector Today at 00:30 So I've been thinking about the buffer/workgroup memory interface and how it ideally needs to interact with synchronization. There are two key parts that need to interact. The first is memory ordering - your typical acquire/release semantics are the baseline here. Vulkan/SPIR-V also have some interesting additional semantics once you look at how the memory model works; notably, values are not "coherent" by default like they are on a CPU - you have to explicitly tell the compiler to make them visible to other invocations at a given scope. Acquire and release are also scoped - typically you want these scopes to agree with the visibility scope. (The handful of use cases for NOT doing that are really not worth worrying about in the first iteration, outside of a single exception which is easy to deal with.)

The other part is execution synchronization - telling threads in e.g. a workgroup, subgroup (/warp/wave), or whatever else, to get together so you don't race accesses. This needs to interact with the acquire/release operations, but the crucial part of it initially is simply that you probably need the acquire/release operations to know about sync operations in order to do useful compile time checks. The details of that aren't my concern right now. So what I'm thinking is that what would be incredibly nice, is if you could write code like this:

struct Buffer {
    data: [u32; 32],
}

fn shader(buffer: &mut Buffer) {
    buffer.data[SubgroupInvocationId()] = 2;
}

And have it fill in the appropriate acquire/release/coherency - without toooo much pain in terms of annotations.

Exploding out the "shader" function above, it should generate SPIR-V something like this:

fn shader(buffer: &mut Buffer) {
    let blah = &mut buffer.data[SubgroupInvocationId()];    // OpMemoryBarrier, Acquire

    *blah = 2;                                              // OpStore, MakePointerAvailable

    drop(blah);                                             // OpMemoryBarrier, Release
}

The rust-gpu backend can't just universally emit this for all shader function arguments because you need to handle the different storage classes differently and somewhere the synchronization scope needs annotating. Trying to turn this concept into something rusty is eluding me however - borrow/drop/access all need overriding somewhere, but the only place I can see to do that is by overriding the basic types or adding annoying syntax. I think/hope I'm just being a rust noob though :upside_down:

What I imagine would be quite nice is if you could just throw some attributes on the buffer struct that tell it that all accesses to its data have these semantics - but I don't know how realistic that is. Maybe this is something that would make more sense if I had a better sense of rustc internals. I'm hoping someone here has some good ideas :sweat_smile:

Jasper-Bekkers Today at 00:31 Is this something we would need to apply to all buffer types? I'm trying to understand the plan a bit

Tobias Hector Today at 00:32 Uniform/constant buffers wouldn't have any of these sync semantics

Jasper-Bekkers Today at 00:32 You're thinking about making safety guarantees for concurrent access to buffers?

Tobias Hector Today at 00:32 yes

Jasper-Bekkers Today at 00:32 Yeah, let's assume UAVs

Tobias Hector Today at 00:32 Trying to make synchronization and lifetime management one and the same as much as possible, because that's... kind of what they are. Synchronization in GLSL/HLSL is enormously painful, and Rust has some nice tools to do much better, so ideally we want to use them

Jasper-Bekkers Today at 00:33 If we add .store and .load functions, could that do it?

Tobias Hector Today at 00:33 You'd probably need explicit acquire/release too. I mean, you could do it that way, but my gut says that seems like unnecessary programmer burden :woman_shrugging:

Jasper-Bekkers Today at 00:34 Fair

Tobias Hector Today at 00:34 Might work as an initial prototype, but I don't think it's tenable long term. If you can make this work for UAVs, it would also work for LDS/workgroup memory too :grimacing:

Jasper-Bekkers Today at 00:35 Right, so this would, on deref, return an RAII object that acquires when constructed and releases when dropped?

Jasper-Bekkers Today at 00:35 And access to the data would go through this RAII guard

Tobias Hector Today at 00:36 Yeah, broadly speaking that's basically it:

  • Acquire on borrow
  • Make available on store
  • Make visible on load
  • Release on drop
Tobski commented 3 years ago

One additional detail that got discussed after this little back and forth was that the aim here is to get borrow checking working across GPU threads - being able to know explicitly that indexing into the buffer is safe. For instance, if you're mutably borrowing from a buffer via a unique index (say indexing by the workgroup ID per workgroup), then a compile time check should be able to tell you your access is safe. For debugging, it should also be possible to automatically insert (somewhat expensive) checks that invocations aren't racing. There's probably also a place in there somewhere for something akin to rust's explicit lifetimes to allow more complex indexing to be compile-time checked.
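The "mutably borrowing via a unique index" idea has a direct analogue in today's CPU-side Rust: `chunks_mut` hands each "workgroup" a disjoint mutable slice, and the borrow checker statically proves the accesses can't alias. A sketch of the shape such a compile-time check could target:

```rust
fn main() {
    let mut buffer = vec![0u32; 12];
    let workgroup_size = 4;

    // Each "workgroup" owns one disjoint chunk; no two chunks alias, and
    // the compiler knows it statically - no runtime race check needed.
    for (workgroup_id, chunk) in buffer.chunks_mut(workgroup_size).enumerate() {
        for (lane, slot) in chunk.iter_mut().enumerate() {
            *slot = (workgroup_id * workgroup_size + lane) as u32;
        }
    }

    assert_eq!(buffer, (0u32..12).collect::<Vec<_>>());
}
```

On the GPU the chunks would be written by different invocations rather than loop iterations, but the safety argument (disjointness of the index expression) is the same one a compile-time check would need to establish.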

From the SPIR-V side, the aim is to make painless the things developers typically get frustrated by or make mistakes with - acquire/release/barriers are a pain in the prominent shading languages, and experience shows developers very much tend to get them wrong. The proposal discussed above takes that decision making away from the developer and hard codes the most useful case as the default behavior - unlike GLSL/HLSL. We may want to provide hooks for less common patterns further down the line, but for now this behaviour should cover the needs of any developer who isn't trying to experiment with synchronization patterns...

Tobski commented 3 years ago

Putting this on the correct issue this time....

Ok so turning this into something a bit more concrete, I think I'd propose three new attributes that affect struct definitions declared as storage buffers/images/texel buffers (UAVs), shared memory (LDS), or anything else we somehow end up being able to touch from multiple threads:

#[SubgroupVisible]
#[WorkgroupVisible]
#[DeviceVisible]

Each of these attributes modifies what happens during borrow and drop, and memory semantics used for OpLoad/OpStore when accessing references to these structs.

Note: If none of these attributes are present it should be a compiler error to borrow a mutable reference from the struct, (unless the code is declared unsafe?).

Scope

The word before "Visible" in the attribute determines the scope at which writes to the structure are made visible to other shader invocations. So "SubgroupVisible" indicates that other invocations in the same subgroup can see values written by other invocations, assuming correct synchronization. This scope is used as the %scope operand for the instructions detailed below.

Acquire on Borrow

When a reference is borrowed from the structure, an OpMemoryBarrier should be performed as follows:

OpMemoryBarrier %scope Acquire|%storage_class

Release on Drop

When a reference to the structure is dropped, an OpMemoryBarrier should be performed as follows:

OpMemoryBarrier %scope Release|%storage_class

Loads

When a reference to the structure is read from, an OpLoad should be generated with the following Memory Operands:

%value = OpLoad %pointer MakePointerVisible|NonPrivatePointer %scope

or similarly for images:

%value = OpImageRead ... MakeTexelVisible|NonPrivateTexel %scope

Stores

When a reference to the structure is written to, an OpStore should be generated with the following Memory Operands:

OpStore %pointer %value MakePointerAvailable|NonPrivatePointer %scope

or similarly for images:

OpImageWrite ... MakeTexelAvailable|NonPrivateTexel %scope
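Putting the pieces together, a worked sketch of how the proposal might read in shader code. The attribute is hypothetical (it does not exist in rust-gpu today), so it is shown commented out to keep the snippet valid Rust; the comments give the SPIR-V the proposal would emit for the Workgroup scope.

```rust
// #[WorkgroupVisible]  // proposed attribute: %scope = Workgroup below
struct Shared {
    data: [u32; 32],
}

fn shader(shared: &mut Shared, lane: usize) {
    // Borrow of a mutable reference into the struct:
    //   OpMemoryBarrier %Workgroup Acquire|%storage_class
    let slot = &mut shared.data[lane];

    // Store through the reference:
    //   OpStore %ptr %value MakePointerAvailable|NonPrivatePointer %Workgroup
    *slot = 2;

    // Load through the reference:
    //   %v = OpLoad %ptr MakePointerVisible|NonPrivatePointer %Workgroup
    let v = *slot;
    assert_eq!(v, 2);

    // `slot` dropped at end of scope:
    //   OpMemoryBarrier %Workgroup Release|%storage_class
}

fn main() {
    let mut shared = Shared { data: [0; 32] };
    shader(&mut shared, 3);
    assert_eq!(shared.data[3], 2);
}
```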
khyperia commented 3 years ago

A couple issues:

1. AFAIK, there is no way to tell in the backend when a borrow starts/stops, and so we cannot emit code at those points.
2. How do you intend to deal with OpAccessChain to a field on the struct? All attributes for the resulting pointer would be dropped, and we would no longer be able to emit the proper memory operands for load/store.

Tobski commented 3 years ago

@khyperia Thanks for looking at this!

AFAIK, there is no way to tell in the backend when a borrow starts/stops, and so we cannot emit code at those points.

My understanding is that it's possible to override borrow/drop in the language itself, rather than in the backend, so I was imagining this being dealt with at least partially above the backend; perhaps by emitting some sort of intrinsic that the backend would turn into the right memory barriers. I was experimenting with this on the rust playground and made some decent progress in getting a struct in regular Rust to output something (println! with the relevant instruction) at borrow/drop, but like a fool I hadn't saved it locally and lost it all when my browser tab crashed 🤦‍♀️
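For the record, a reconstruction of what such an experiment could look like, built only from the description above (all names are illustrative): a wrapper whose `borrow_mut` hands out a guard that prints the acquire barrier on construction and the release barrier in `Drop`, with access going through plain `Deref`/`DerefMut`.

```rust
use std::ops::{Deref, DerefMut};

struct Visible<T> {
    value: T,
}

// RAII guard: lives exactly as long as the borrow.
struct Borrowed<'a, T> {
    value: &'a mut T,
}

impl<T> Visible<T> {
    // Borrowing hands out a guard; this is the "acquire on borrow" point.
    fn borrow_mut(&mut self) -> Borrowed<'_, T> {
        println!("OpMemoryBarrier %scope Acquire|%storage_class");
        Borrowed { value: &mut self.value }
    }
}

impl<'a, T> Deref for Borrowed<'a, T> {
    type Target = T;
    fn deref(&self) -> &T {
        self.value
    }
}

impl<'a, T> DerefMut for Borrowed<'a, T> {
    fn deref_mut(&mut self) -> &mut T {
        self.value
    }
}

impl<'a, T> Drop for Borrowed<'a, T> {
    // Dropping the guard is the "release on drop" point.
    fn drop(&mut self) {
        println!("OpMemoryBarrier %scope Release|%storage_class");
    }
}

fn main() {
    let mut shared = Visible { value: 0u32 };
    {
        let mut guard = shared.borrow_mut(); // prints the Acquire barrier
        *guard = 5;
    } // guard dropped here: prints the Release barrier
    assert_eq!(shared.value, 5);
}
```

In a real backend the `println!` calls would become intrinsics lowered to `OpMemoryBarrier`, but the borrow/drop plumbing is entirely ordinary Rust.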

How do you intend to deal with OpAccessChain to a field on the struct? All attributes for the resulting pointer would be dropped, and we would no longer be able to emit the proper memory operands for load/store.

Ok so again I'm not necessarily expecting this to be a wholly backend issue. My hope here is that we can coax the front-end into tagging the variables (and/or the variable accesses), such that when it gets to the backend all you'd need to do is check if a relevant tag is there or not. Unlike the first issue you pointed out, I don't have a handle on how this might be done with rust's attribute system yet - so this is kind of an open issue that I'm hoping we can have more discussion on.

Tobski commented 3 years ago

So I've been thinking about the implementation complexity, and given the newness of this project and that it's probably useful to bootstrap this in the short term, maybe there should be two phases? The initial phase would have dedicated acquire/release and load/store ops, and the more complex/ergonomic approach described here can be a future investigation. Unless anyone thinks that's a terrible idea, I'll write this into the draft RFC I'm working on.