BGR360 opened 1 month ago
This is a very thorough issue, thank you for filing this! I don't have the time to fix all of these in the docs, but I will try to give you all the information you need to be able to fill it in yourself.
In WebGPU, synchronization is described with a primitive called a "usage scope". A usage scope is defined as:
When you submit things to a queue, all the usage scopes within the encoders submitted to that queue are executed as-if they were executed serially. Within each usage scope, operations may run in parallel. That is, the following may be run in an observably parallel fashion:
The "as-if" rule dictates that they do not have to actually be serial, but such divergence must not be visible to the end user.
This is a very theoretical framework, so I'll answer your questions directly, as well:
Well, this is probably more confusing than helpful now that I've typed it out, so please hit me with more questions if things are still confusing. There is also a more general wrench thrown into this: we let Metal do auto-sync for us, so some of these are a bit different, but this should still accurately describe the observable behavior.
CommandBuffers are just a series of usage scopes, so execute as-if parallel.
Did you mean "as-if serial" here, since you said earlier that usage scopes are executed in series with each other?
Ooops, yup! Fixed
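A rough sketch of that corrected statement, assuming an existing `device` and `queue` (the contents of the encoders are elided):

```rust
// Two CommandBuffers submitted in one submit() call. Their usage scopes are
// executed as-if serially, in the order the buffers appear in the submission.
let mut encoder_a =
    device.create_command_encoder(&wgpu::CommandEncoderDescriptor::default());
let mut encoder_b =
    device.create_command_encoder(&wgpu::CommandEncoderDescriptor::default());
// ... record passes into encoder_a and encoder_b ...
queue.submit([encoder_a.finish(), encoder_b.finish()]);
```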
This is a treasure trove of information, thank you very much Connor! I'll holler back with any clarifying questions I have when I have more time to come back to this and consider documenting things.
One thing I'll note right now: I think there are actually two distinct "user personas" that I want to address with this documentation. Let's call them Sasha and Troy.
Sasha is working on a game engine. She has some ambitious plans for multi-pass renderers, shadow maps, antialiasing, all the works. She likes the flexibility and relative straightforwardness of the `wgpu` API. Reading the current documentation, she thinks she understands how to organize her render stages and feed resources from one into the next. However, she knows that GPUs like to parallelize work, and she's wondering about what she might need to do to guarantee that certain stages of her renderer run at the right times relative to each other. Also, within each `RenderPass`, she wants to know that draw calls will actually render one on top of the other in the right order, which is important for rendering sorted transparent objects.
Largely, Sasha cares about the observable ordering guarantees of wgpu.
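A tiny sketch of the kind of code Sasha is asking about (hypothetical names: `pass` is an active `wgpu::RenderPass` and `transparent_objects` is already sorted back-to-front):

```rust
// Sasha wants the docs to state whether these draws are guaranteed to blend
// one on top of the other in the order they are recorded.
for object in &transparent_objects {
    pass.set_bind_group(1, &object.bind_group, &[]);
    pass.draw(0..object.vertex_count, 0..1);
}
```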
Troy is working on a scientific simulation app with a GUI. He wants to maximize the parallelism / utilization of his GPU so he can do as much simulation as quickly as possible. He also needs his GUI to be completely responsive. Troy has chosen `wgpu` because of its popularity and compatibility across all platforms. He wants to know what things to avoid when using `wgpu` that would reduce the parallelism of his program. He also needs to know under what circumstances his compute work might threaten the responsiveness of his GUI. He cannot find these answers in the current documentation.
Compared to Sasha, Troy cares less about observable orderings, except when it comes to GUI responsiveness. Troy is more curious about how the actual hardware tends to parallelize things, and how to use wgpu to allow that parallelism.
The actual parallelization you will get depends on the vendor. In general, the idea of modern graphics APIs is not to declare what parallelization will happen, but to make the API and driver aware of when work must not overlap, so the driver can make a smart decision about what to schedule when. It is never a guarantee that things will overlap.
The general idea with usage scopes is that two separate compute dispatches that write to the same resource cannot overlap, and barriers are required to keep them apart. If you are doing a lot of dispatches that all write to the same resource, or a compute dispatch that reads from a resource after a previous compute dispatch writes to it, then we'll insert a barrier. Where possible, avoid large chains of work like this, as it will limit your parallelism.
The algorithm can be a bit conservative, and it's possible that wgpu is emitting extraneous barriers causing extra synchronization in some cases, but I wouldn't expect massive wins from changing our barrier strategy.
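To make that advice concrete, here is a hedged sketch of the pattern described above (the helper `make_bind_group` and the other names are assumptions, not wgpu API): instead of funneling every dispatch's writes through one shared buffer, give each dispatch its own output buffer so there is no hazard for wgpu to separate with barriers.

```rust
// Create one output buffer per dispatch so the dispatches do not write the
// same resource.
let outputs: Vec<wgpu::Buffer> = (0..4)
    .map(|_| {
        device.create_buffer(&wgpu::BufferDescriptor {
            label: None,
            size: 1024,
            usage: wgpu::BufferUsages::STORAGE | wgpu::BufferUsages::COPY_SRC,
            mapped_at_creation: false,
        })
    })
    .collect();

// `make_bind_group` is a hypothetical helper that binds one output buffer.
let bind_groups: Vec<wgpu::BindGroup> =
    outputs.iter().map(|buffer| make_bind_group(buffer)).collect();

let mut encoder =
    device.create_command_encoder(&wgpu::CommandEncoderDescriptor::default());
{
    let mut pass =
        encoder.begin_compute_pass(&wgpu::ComputePassDescriptor::default());
    pass.set_pipeline(&pipeline);
    for bind_group in &bind_groups {
        // Each dispatch writes a distinct buffer, so there is no write-write
        // or read-after-write hazard between them and no barrier is required;
        // whether they actually overlap is still up to the driver and GPU.
        pass.set_bind_group(0, bind_group, &[]);
        pass.dispatch_workgroups(256, 1, 1);
    }
}
queue.submit(Some(encoder.finish()));
```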
Here's a good resource for overlapping behavior of some different GPUs, but note that the results might change with different hardware and different driver versions: https://therealmjp.github.io/posts/breaking-down-barriers-part-6-experimenting-with-overlap-and-preemption/
Note how different the overlap patterns are for different IHVs. There is no way to definitively tell what will overlap. Multi-queue will help some IHVs, as you can imagine, but that is going to take some work to get there.
I forgot the most important part:
I had a nice donut for breakfast this morning.
I had one of those giant chocolate chip muffins that are like 95% butter. It was amazing, but now I don't have any more, and I'm sad.
Is your feature request related to a problem? Please describe.
As someone looking to use `wgpu` for both rendering and compute in a project, I am finding it very difficult to understand the timing and ordering guarantees of `wgpu`. There is very little documentation in the rustdocs specifying when certain operations are guaranteed to execute sequentially or not, or whether a certain order is guaranteed. Myself and others have been left puzzled:
Describe the solution you'd like
I would like the rustdocs for `Queue`, `CommandEncoder`/`CommandBuffer`, `RenderPass`, and `ComputePass` to specify what guarantees `wgpu` makes or doesn't make about the order and concurrency of certain operations. For a pair of operations A and B, the docs can say one of the following things about them:

- `wgpu` guarantees that A finishes before B starts.
- `wgpu` guarantees that A and B execute serially, but does not guarantee a particular order between them.
- `wgpu` does not guarantee that A and B execute serially; the underlying hardware may execute them concurrently.
- `wgpu` does not guarantee that A and B execute concurrently, but [most platforms / modern desktop platforms / platform X, Y, Z] will execute them concurrently.
- `wgpu` guarantees that A and B execute concurrently.

Below is a list of operations that I want to see documented. For those that I've been able to find answers to, I've included what I know:
Legend
Operations
- `RenderPass`.
- `dispatch_workgroups()` calls in the same `ComputePass`.
- `RenderPass`es in the same `CommandBuffer`.
- `ComputePass`es in the same `CommandBuffer`.
- A `RenderPass` and a `ComputePass` in the same `CommandBuffer`.
- `CommandBuffer`s in the same `submit()` call.
- `submit()` calls.
- `Queue` before the next call to `submit()`.
- `Queue` and any of the operations in the next `submit()` call.
- `CommandBuffer`.
- A `CommandBuffer` and any of the other operations in that `CommandBuffer`.
- A `submit()` call followed by an `on_submitted_work_done()` call.
- `submit()`? If another thread calls `submit()` in between the current thread calling `submit()` and calling `on_submitted_work_done()`, will the callback wait for the other thread's submission too? (See the sketch below.)
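For the last two items, here is a minimal single-threaded sketch of the pattern in question, assuming an existing `device`, `queue`, and recorded `encoder` (the exact polling API differs between wgpu versions):

```rust
queue.submit(Some(encoder.finish()));

// Register a callback that fires once the work submitted so far has completed
// on the GPU. The multithreaded question above is about whether a submit()
// from another thread, landing between these two calls, delays this callback.
queue.on_submitted_work_done(|| {
    println!("submitted work done");
});

// On native backends the callback is delivered while the device is polled.
device.poll(wgpu::Maintain::Wait);
```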
Describe alternatives you've considered
Experiment with `wgpu`'s behavior by writing actual code and running it on my machine to see how it behaves.

Problems with this:
Additional context
I had a nice donut for breakfast this morning.