BGR360 opened 1 month ago
This is a very thorough issue, thank you for filing this! I don't have the time to fix all of these in the docs, but I will try to give you all the information you need to be able to fill it in yourself.
In WebGPU, synchronization is described with a primitive called a "usage scope". A usage scope is defined as:
When you submit things to a queue, all the usage scopes within the encoders submitted to that queue are executed as-if they were executed serially. Within each usage scope, operations may run in parallel. That is, the following may be run in an observably parallel fashion:
The "as-if" rule dictates that they do not have to actually be serial, but such divergence must not be visible to the end user.
This is a very theoretical framework, so I'll answer your questions directly, as well:
Well, this is probably more confusing than helpful now that I've typed it out, so please hit me with more questions if things are still confusing. There is also a more general wrench thrown into this: we let Metal do auto-sync for us, so some of these are a bit different, but this should still accurately describe the observable behavior.
CommandBuffers are just a series of usage scopes, so execute as-if parallel.
Did you mean "as-if serial" here, since you said earlier that usage scopes are executed in series with each other?
Ooops, yup! Fixed
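A rough sketch of that corrected statement, assuming an existing `device` and `queue` (the contents of the encoders are elided):

```rust
// Two CommandBuffers submitted in one submit() call. Their usage scopes are
// executed as-if serially, in the order the buffers appear in the submission.
let mut encoder_a =
    device.create_command_encoder(&wgpu::CommandEncoderDescriptor::default());
let mut encoder_b =
    device.create_command_encoder(&wgpu::CommandEncoderDescriptor::default());
// ... record passes into encoder_a and encoder_b ...
queue.submit([encoder_a.finish(), encoder_b.finish()]);
```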
This is a treasure trove of information, thank you very much Connor! I'll holler back with any clarifying questions I have when I have more time to come back to this and consider documenting things.
One thing I'll note right now: I think there are actually two distinct "user personas" that I want to address with this documentation. Let's call them Sasha and Troy.
Sasha is working on a game engine. She has some ambitious plans for multi-pass renderers, shadow maps, antialiasing, all the works. She likes the flexibility and relative straightforwardness of the `wgpu` API. Reading the current documentation, she thinks she understands how to organize her render stages and feed resources from one into the next. However, she knows that GPUs like to parallelize work, and she's wondering about what she might need to do to guarantee that certain stages of her renderer run at the right times relative to each other. Also, within each `RenderPass`, she wants to know that draw calls will actually render one on top of the other in the right order, which is important for rendering sorted transparent objects.
Largely, Sasha cares about the observable ordering guarantees of wgpu.
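A tiny sketch of the kind of code Sasha is asking about (hypothetical names: `pass` is an active `wgpu::RenderPass` and `transparent_objects` is already sorted back-to-front):

```rust
// Sasha wants the docs to state whether these draws are guaranteed to blend
// one on top of the other in the order they are recorded.
for object in &transparent_objects {
    pass.set_bind_group(1, &object.bind_group, &[]);
    pass.draw(0..object.vertex_count, 0..1);
}
```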
Troy is working on a scientific simulation app with a GUI. He wants to maximize the parallelism / utilization of his GPU so he can do as much simulation as quickly as possible. He also needs his GUI to be completely responsive. Troy has chosen `wgpu` because of its popularity and compatibility across all platforms. He wants to know what things to avoid when using `wgpu` that would reduce the parallelism of his program. He also needs to know under what circumstances his compute work might threaten the responsiveness of his GUI. He cannot find these answers in the current documentation.
Compared to Sasha, Troy cares less about observable orderings, except when it comes to GUI responsiveness. Troy is more curious about how the actual hardware tends to parallelize things, and how to use wgpu to allow that parallelism.
The actual parallelization you will get depends on the vendor. In general, the idea of modern graphics APIs is not to declare what parallelization will happen, but to make the API and driver aware of when work must not overlap, so the driver can make a smart decision about what to schedule when. It is never a guarantee that things will overlap.
The general idea with usage scopes is that two separate compute dispatches that write to the same resource cannot overlap, and barriers are required to keep them apart. If you are doing a lot of dispatches that all write to the same resource, or a compute dispatch that reads from a resource after a previous compute dispatch writes to it, then we'll insert a barrier. Where possible, avoid large chains of work like this, as it will limit your parallelism.
The algorithm can be a bit conservative, and it's possible that wgpu is emitting extraneous barriers causing extra synchronization in some cases, but I wouldn't expect massive wins from changing our barrier strategy.
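To make that advice concrete, here is a hedged sketch of the pattern described above (the helper `make_bind_group` and the other names are assumptions, not wgpu API): instead of funneling every dispatch's writes through one shared buffer, give each dispatch its own output buffer so there is no hazard for wgpu to separate with barriers.

```rust
// Create one output buffer per dispatch so the dispatches do not write the
// same resource.
let outputs: Vec<wgpu::Buffer> = (0..4)
    .map(|_| {
        device.create_buffer(&wgpu::BufferDescriptor {
            label: None,
            size: 1024,
            usage: wgpu::BufferUsages::STORAGE | wgpu::BufferUsages::COPY_SRC,
            mapped_at_creation: false,
        })
    })
    .collect();

// `make_bind_group` is a hypothetical helper that binds one output buffer.
let bind_groups: Vec<wgpu::BindGroup> =
    outputs.iter().map(|buffer| make_bind_group(buffer)).collect();

let mut encoder =
    device.create_command_encoder(&wgpu::CommandEncoderDescriptor::default());
{
    let mut pass =
        encoder.begin_compute_pass(&wgpu::ComputePassDescriptor::default());
    pass.set_pipeline(&pipeline);
    for bind_group in &bind_groups {
        // Each dispatch writes a distinct buffer, so there is no write-write
        // or read-after-write hazard between them and no barrier is required;
        // whether they actually overlap is still up to the driver and GPU.
        pass.set_bind_group(0, bind_group, &[]);
        pass.dispatch_workgroups(256, 1, 1);
    }
}
queue.submit(Some(encoder.finish()));
```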
Here's a good resource for overlapping behavior of some different GPUs, but note that the results might change with different hardware and different driver versions: https://therealmjp.github.io/posts/breaking-down-barriers-part-6-experimenting-with-overlap-and-preemption/
Note how different the overlap patterns are for different IHVs. There is no way to definitively tell what will overlap. Multi-queue will help some IHVs, as you can imagine, but that is going to take some work to get there.
I forgot the most important part:
I had a nice donut for breakfast this morning.
I had one of those giant chocolate chip muffins that are like 95% butter. It was amazing, but now I don't have any more, and I'm sad.
Is your feature request related to a problem? Please describe.
As someone looking to use `wgpu` for both rendering and compute in a project, I am finding it very difficult to understand the timing and ordering guarantees of `wgpu`. There is very little documentation in the rustdocs specifying when certain operations are guaranteed to execute sequentially or not, or whether a certain order is guaranteed. Myself and others have been left puzzled:
Describe the solution you'd like
I would like the rustdocs for `Queue`, `CommandEncoder`/`CommandBuffer`, `RenderPass`, and `ComputePass` to specify what guarantees `wgpu` makes or doesn't make about the order and concurrency of certain operations. For a pair of operations A and B, the docs can say one of the following things about them:

- `wgpu` guarantees that A finishes before B starts.
- `wgpu` guarantees that A and B execute serially, but does not guarantee a particular order between them.
- `wgpu` does not guarantee that A and B execute serially; the underlying hardware may execute them concurrently.
- `wgpu` does not guarantee that A and B execute concurrently, but [most platforms / modern desktop platforms / platform X, Y, Z] will execute them concurrently.
- `wgpu` guarantees that A and B execute concurrently.

Below is a list of operations that I want to see documented. For those that I've been able to find answers to, I've included what I know:
Legend
Operations
- `RenderPass`.
- `dispatch_workgroups()` calls in the same `ComputePass`.
- `RenderPass`es in the same `CommandBuffer`.
- `ComputePass`es in the same `CommandBuffer`.
- A `RenderPass` and a `ComputePass` in the same `CommandBuffer`.
- `CommandBuffer`s in the same `submit()` call.
- `submit()` calls.
- `Queue` before the next call to `submit()`.
- `Queue` and any of the operations in the next `submit()` call.
- `CommandBuffer`.
- A `CommandBuffer` and any of the other operations in that `CommandBuffer`.
- A `submit()` call followed by an `on_submitted_work_done()` call.
- `submit()`? If another thread calls `submit()` in between the current thread calling `submit()` and calling `on_submitted_work_done()`, will the callback wait for the other thread's submission too? (See the sketch below.)
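For the last two items, here is a minimal single-threaded sketch of the pattern in question, assuming an existing `device`, `queue`, and recorded `encoder` (the exact polling API differs between wgpu versions):

```rust
queue.submit(Some(encoder.finish()));

// Register a callback that fires once the work submitted so far has completed
// on the GPU. The multithreaded question above is about whether a submit()
// from another thread, landing between these two calls, delays this callback.
queue.on_submitted_work_done(|| {
    println!("submitted work done");
});

// On native backends the callback is delivered while the device is polled.
device.poll(wgpu::Maintain::Wait);
```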
Describe alternatives you've considered
Experiment with `wgpu`'s behavior by writing actual code and running it on my machine to see how it behaves.

Problems with this:
Additional context
I had a nice donut for breakfast this morning.