gfx-rs / wgpu

A cross-platform, safe, pure-Rust graphics API.
https://wgpu.rs
Apache License 2.0

Tracking Issue and Proposal for the Next Step of wgpu Automatic Synchronization Generation #1260

Open cwfitzgerald opened 3 years ago

cwfitzgerald commented 3 years ago

The Problem

Vulkan, DX12, and Metal give us the opportunity to generate very high-quality synchronization that keeps the GPU pipeline fully filled. Currently, we have a system that generates correct, but naive, barriers. These APIs offer a variety of methods by which we can further optimize our barriers.

This proposal will work backward, starting with the barriers that are currently generated and comparing them to an ideal set of barriers. It will then go over the various implementation steps I think we should take to get to this end goal.

An Example

To help illustrate the point, let's look at an example of a workload that could be optimized. The real-life scenario is a compute shader using a set of read-only inputs to generate a set of indirect draw calls, followed by a multi-draw-indirect call to render them. This workload is split in two: the first compute dispatch generates commands for the first multi-draw-indirect, and the second compute dispatch generates commands for the second. This has significantly better pipelining potential than a single compute call and a single multi-draw-indirect.
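Here is a minimal sketch of that workload in terms of the public wgpu API, assuming the pipelines, bind groups, indirect buffers, and render pass descriptor are created elsewhere. Method names such as dispatch_workgroups vary across wgpu versions, and multi_draw_indirect requires the native MULTI_DRAW_INDIRECT feature; this is only to show the pass structure the rest of the proposal reasons about.

fn encode_workload(
    encoder: &mut wgpu::CommandEncoder,
    cull_pipeline: &wgpu::ComputePipeline,
    draw_pipeline: &wgpu::RenderPipeline,
    inputs: &[wgpu::BindGroup; 2],   // read-only inputs for each half
    indirect: &[wgpu::Buffer; 2],    // indirect draw commands, written by compute
    rpass_desc: &wgpu::RenderPassDescriptor,
    max_draws: u32,
) {
    for i in 0..2 {
        // Compute half i: read-only inputs -> indirect draw commands.
        let mut cpass = encoder.begin_compute_pass(&wgpu::ComputePassDescriptor::default());
        cpass.set_pipeline(cull_pipeline);
        cpass.set_bind_group(0, &inputs[i], &[]);
        cpass.dispatch_workgroups(1024, 1, 1);
        drop(cpass);

        // Render half i: consume the indirect buffer the compute pass just wrote.
        let mut rpass = encoder.begin_render_pass(rpass_desc);
        rpass.set_pipeline(draw_pipeline);
        rpass.multi_draw_indirect(&indirect[i], 0, max_draws);
    }
}

Ideally the second compute half overlaps with the first render half on the GPU; how close we get to that depends entirely on the barriers wgpu generates between these passes.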

Current State

Our current barrier system generates barriers Just In Time and all pipeline barriers are ALL -> ALL. Each scope is processed individually with a barrier dropped in the command buffer before it. (I will address The Compute Problem™ later).

[Diagram: Native Barriers]

Our current system leaves us with 4 different waits in this workload, each draining the pipeline entirely before continuing. This is far from ideal.
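Purely as an illustration of what such a barrier amounts to at the Vulkan level (this is not wgpu's backend code), an ALL -> ALL barrier through the ash bindings looks roughly like this; every stage recorded before it must drain before any stage recorded after it may start:

use ash::vk;

unsafe fn full_pipeline_barrier(device: &ash::Device, cmd: vk::CommandBuffer) {
    let barrier = vk::MemoryBarrier {
        src_access_mask: vk::AccessFlags::MEMORY_WRITE,
        dst_access_mask: vk::AccessFlags::MEMORY_READ | vk::AccessFlags::MEMORY_WRITE,
        ..Default::default()
    };
    device.cmd_pipeline_barrier(
        cmd,
        vk::PipelineStageFlags::ALL_COMMANDS, // src: all work recorded before
        vk::PipelineStageFlags::ALL_COMMANDS, // dst: all work recorded after
        vk::DependencyFlags::empty(),
        &[barrier],
        &[], // no buffer barriers
        &[], // no image barriers
    );
}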

Reachback

The next step in improving our barriers is to combine them so that we remove unnecessary ones. In the above example, there are two unnecessary barriers: S0/S1 and S2/S3. The memory barriers they are responsible for could be covered by an earlier barrier, and the code would be just as correct.

[Diagram: Reachback]

This leaves us with 2 different waits, each draining the pipeline entirely before continuing. Much better, but we can do better.
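A small sketch of the merging rule, assuming each barrier and dependency can be described by the scope indices it sits between; Barrier and MemoryDependency are made-up types for illustration, not wgpu internals:

/// A barrier already placed in the stream, sitting right after scope `after_scope`.
struct Barrier {
    after_scope: usize,
    deps: Vec<MemoryDependency>,
}

/// A dependency of `consumer_scope` on data produced in `producer_scope`.
struct MemoryDependency {
    producer_scope: usize,
    consumer_scope: usize,
}

/// Reachback: fold the dependency into the earliest existing barrier that lies
/// between its producer and its consumer; only emit a new barrier if none does.
fn place(dep: MemoryDependency, emitted: &mut Vec<Barrier>) {
    let covering = emitted
        .iter_mut()
        .find(|b| dep.producer_scope <= b.after_scope && b.after_scope < dep.consumer_scope);
    match covering {
        Some(barrier) => barrier.deps.push(dep),
        None => emitted.push(Barrier {
            after_scope: dep.consumer_scope.saturating_sub(1),
            deps: vec![dep],
        }),
    }
}

Applied to the example above, the dependencies behind the S0/S1 and S2/S3 barriers get folded into earlier barriers, which is why those two barriers can be dropped.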

Note about stage flags

At this point the astute reader may notice that all of our barriers are still ALL -> ALL. This is overly conservative, but in these situations more precise stage flags would not buy us much: each barrier is butted right up against the work it is blocking to/from, so it drains the full pipeline either way.

Events

This is the cream of the crop in terms of enabling the most pipelining from our graphics code. By tracking the source and destination of each memory dependency, we can generate events that, in an ideal pipelining case, will cause no waiting on the GPU.

[Diagram: Events]

This leaves us with no waits, allowing the pipeline to stay populated and running at full speed. This is what peak performance looks like.
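Again purely as an illustration of the shape at the Vulkan level (not wgpu code): a split barrier records vkCmdSetEvent right after the producer and vkCmdWaitEvents right before the consumer, so everything recorded in between keeps the GPU busy. Via the ash bindings:

use ash::vk;

unsafe fn split_barrier(device: &ash::Device, cmd: vk::CommandBuffer, event: vk::Event) {
    // ... the producing compute dispatch is recorded here ...
    device.cmd_set_event(cmd, event, vk::PipelineStageFlags::COMPUTE_SHADER);

    // ... unrelated work that does not touch the shared resources ...

    let barrier = vk::MemoryBarrier {
        src_access_mask: vk::AccessFlags::SHADER_WRITE,
        dst_access_mask: vk::AccessFlags::INDIRECT_COMMAND_READ,
        ..Default::default()
    };
    device.cmd_wait_events(
        cmd,
        &[event],
        vk::PipelineStageFlags::COMPUTE_SHADER, // where the event was signalled
        vk::PipelineStageFlags::DRAW_INDIRECT,  // the first stage that needs the data
        &[barrier],
        &[],
        &[],
    );
    // ... the consuming multi-draw-indirect is recorded here ...
}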

How on earth do we implement this?

Here's the section where I try to explain how we could implement this. These steps will be fairly high-level and express concepts more than specifics, but they should be enough to start thinking about an implementation.

We should establish some test cases, such as the one above, so we can measure both how well the sync generation is working and its effect on performance on real hardware, both CPU and GPU.

The Compute Problem™

For those not familiar with the WebGPU standard's lingo on synchronization, a "usage scope" refers to a section of code that is treated as a single "block" of operations; you cannot use resources in conflicting ways within the same usage scope. The standard currently dictates that a render pass is a single usage scope and that each individual dispatch is also a usage scope. This is the easiest model for users, but it leaves wgpu between a rock and a hard place with respect to synchronization: to honor per-dispatch scopes, we end up emitting a full barrier between every pair of dispatches in a compute pass, serializing work that could otherwise overlap.

To solve this problem, I propose adding a native-only feature with a boolean on a compute pass that dictates whether the pass should be treated as many usage scopes (one per dispatch, as the standard requires) or as only a single usage scope.
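Sketch of what that could look like, with the caveat that neither the type nor the field below exists in wgpu; the names are hypothetical and only illustrate the proposed toggle:

/// Hypothetical native-only compute pass descriptor.
pub struct ComputePassDescriptorExt<'a> {
    pub label: Option<&'a str>,
    /// false (the behaviour the WebGPU standard requires): every dispatch is its
    /// own usage scope, and wgpu synchronizes between dispatches.
    /// true (native extension): the whole pass is a single usage scope, wgpu
    /// inserts no synchronization between dispatches, and the user guarantees
    /// there are no conflicting usages within the pass.
    pub single_usage_scope: bool,
}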

Leaving Command Buffers Open

It is an implementation detail of wgpu that each compute and render pass gets its own internal command buffer. Once we apply the above change, each usage scope corresponds to a command buffer; specifically, the beginning of a usage scope corresponds to the ending of the previous command buffer. Instead of finishing this command buffer at the end of the usage scope, we hold it open until submit time.
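A rough sketch of the bookkeeping this implies, using placeholder type parameters rather than wgpu-core's real types:

/// One usage scope whose native command buffer is still in the recording state.
struct OpenScope<CmdBuf, Trackers> {
    cmd_buf: CmdBuf,    // held open instead of being finished at end-of-scope
    trackers: Trackers, // the resource usage recorded for this scope
}

/// Everything queued for a submit: one open scope per usage scope, in order.
/// Synchronization is only written into these buffers at submit time, once the
/// whole picture is known; only then are they finished and submitted.
struct PendingSubmission<CmdBuf, Trackers> {
    scopes: Vec<OpenScope<CmdBuf, Trackers>>,
}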

Late Sync Writing

Once we have access to the ends of all command buffers at submit time, we can start writing barriers at the beginning of every usage scope. This single global picture is vital to properly deduplicating sync and generating events.

Understanding Who We're Waiting For

This is likely an oversimplification. Once we have all the usage scopes together, we globally number them in their order of submission. We then "bake" the tracking information for each usage scope into information about which memory and image barriers are waiting on which scope number. This data structure could treat all barriers waiting on a particular scope as a block. We can then iterate over the usage scopes and either lay down a new barrier or add the memory and image barriers to a previous one. Once this data is all sorted into place, we go through and actually record the barriers.

Properly Encoding Who We're Waiting For

A "fairly trivial" (famous last words) extension to the above scheme is to generate events that encode where these waits start and stop. I believe we can iterate through the generated barriers and figure out the latest scope that barrier is waiting for. We can then note to set an event there which we would wait for where the barrier is. This would get us to our goal

Extras

write_buffer and write_image and mapped_at_creation

This scheme only discusses render and compute passes. When you consider transfers as well, which are currently littered with barriers, we can deploy a special kind of tracking that we keep queue-local (as queues have a single timeline), letting us omit all internal barriers unless the same buffer is written to twice. This command buffer will be added to the mix at submit time and will end with an event at the end of the buffer, which will be waited on as soon as the data from any transfer is needed. This might not be completely ideal, but it costs only a single wait.
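The core decision rule of that tracking could look like this self-contained sketch (buffer ids reduced to plain integers; this is not wgpu's actual tracker):

use std::collections::HashSet;

/// Tracks which buffers the queue's transfer stream has written since the last
/// internal barrier. A barrier is only needed when the same buffer is written twice.
#[derive(Default)]
struct TransferTracker {
    written_since_barrier: HashSet<u64>,
}

impl TransferTracker {
    /// Returns true if a barrier must be recorded before writing `buffer_id`.
    fn needs_barrier_before_write(&mut self, buffer_id: u64) -> bool {
        let conflict = !self.written_since_barrier.insert(buffer_id);
        if conflict {
            // The barrier splits the transfer stream; start tracking afresh.
            self.written_since_barrier.clear();
            self.written_since_barrier.insert(buffer_id);
        }
        conflict
    }
}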

cmd-buffer recorded transfers

Idk how to help these. Use write_buffer :stuck_out_tongue:

Rough Todo List

This is based on the above steps and may be totally meaningless.

Disclaimer

This is best effort and may have many issues; synchronization can be mind-bending at times :smile:. I will keep this document updated as plans change.

DethRaid commented 3 years ago

What is the compute problem, why is it such a large problem that it's a proper noun, and how does this proposal fix it?

The first paragraph in the section The Compute Problem doesn't even mention the compute problem

kvark commented 3 years ago

@cwfitzgerald thank you for investigating this! I'm looking forward to having better (and fewer) barriers in wgpu. One thing that would definitely help me understand your ideas is having a bit of pseudocode (for both algorithms and data) corresponding to the prose.

cwfitzgerald commented 3 years ago

Addressed some grammatical and content issues, thanks @DethRaid.

@kvark Definitely, I'm just not sure how to even write that pseudocode at this point 😄. If it becomes more clear in the future, I definitely will.

cwfitzgerald commented 3 years ago

After much thought, I think I have a proper algorithm for generating split barriers/events while remaining O(uses), like our current sync generation.

Recording side:

Submit side:

// Placeholder types: Resource, MemoryBarrier, CommandBuffer, and Event stand in
// for the real tracking/hal types; this is still pseudocode, just with the
// indexing bugs ironed out.
use std::collections::HashMap;

// For every resource, the index of the last pass that used it.
let mut pass_last_used = HashMap::<Resource, usize>::new();
// For every resource, true if the last usage involved writing.
let mut needs_barrier = HashMap::<Resource, bool>::new();

// For every pass, the list of memory barriers that it is waiting for.
let mut incoming_barriers = vec![Vec::<MemoryBarrier>::new(); tracker_sets.len()];

// For every pass, the index of the _latest_ pass it depends on.
let mut pass_deps = vec![0usize; tracker_sets.len()];

for (pass_idx, tracker) in tracker_sets.iter().enumerate() {
    for resource in &tracker.resources {
        if needs_barrier.get(resource).copied().unwrap_or(false) {
            let pass_last_used_idx = pass_last_used[resource];
            let barrier = MemoryBarrier::new(&tracker_sets[pass_last_used_idx], resource);

            incoming_barriers[pass_idx].push(barrier);
            pass_deps[pass_idx] = pass_deps[pass_idx].max(pass_last_used_idx);
        }
        // A write by this pass means every later user of the resource needs a barrier.
        if resource.is_write() {
            needs_barrier.insert(resource.clone(), true);
        }
        pass_last_used.insert(resource.clone(), pass_idx);
    }
}

// For every pass, the list of later passes whose wait-events it must signal
// (i.e. the events it triggers).
let mut event_triggers = vec![Vec::<usize>::new(); tracker_sets.len()];

for (idx, &pass_dep) in pass_deps.iter().enumerate() {
    if !incoming_barriers[idx].is_empty() {
        event_triggers[pass_dep].push(idx);
    }
}

// For every pass except the first, the command buffer of barriers that goes
// before it / after the previous pass.
let mut barrier_buffers = Vec::<CommandBuffer>::new();
// For every pass, the event that it _waits for_.
let events: Vec<Event> = (0..tracker_sets.len()).map(|_| Event::new()).collect();

for (idx, incoming) in incoming_barriers.iter().enumerate().skip(1) {
    let mut buffer = CommandBuffer::new();
    // Signal the events of every pass whose latest dependency is the pass that
    // just finished (idx - 1), so those waits can start as early as possible.
    for &trigger in &event_triggers[idx - 1] {
        buffer.set_event(&events[trigger], &incoming_barriers[trigger]);
    }
    // Wait on this pass's own event, flushing its incoming memory barriers.
    if !incoming.is_empty() {
        buffer.wait_event(&events[idx], incoming);
    }
    barrier_buffers.push(buffer);
}

bonsairobo commented 2 years ago

Is my understanding correct that, as it stands today, every dispatch of a compute pipeline is surrounded by ALL -> ALL barriers? And does this mean that it's not possible to have parallel compute dispatches on multiple input bindings?

To solve this problem, I propose adding a native-only feature which adds a boolean on a compute pass which dictates if the compute pass should be treated as many usage scopes or only a single usage scope.

I would love to see this use case enabled. @cwfitzgerald How much work would you estimate there is to add that bool?