bdero opened 1 year ago
We'll need to do a small amount of thread safety work for parallelizing command encoding, but we shouldn't need to use any heavy stop-the-world primitives like `VkSemaphore`s, etc. Instead, all we need are fine-tuned memory barriers in Vulkan (already being done today) and `MTLFence`s (which are lightweight per-resource barriers).
PSA: The engine is in the process of raising the iOS minimum deployment target to 12, so `MTLEvent` would be available without requiring fallbacks (it's already targeting a minimum of macOS 10.14):

```objc
API_AVAILABLE(macos(10.14), ios(12.0))
@protocol MTLEvent <NSObject>
```
This is one approach to resolve https://github.com/flutter/flutter/issues/120399, https://github.com/flutter/flutter/issues/112648, and https://github.com/flutter/flutter/issues/106519 in an efficient way that avoids costly host<->device syncs.
We may or may not want to write a full design doc around what we actually end up doing here (if we end up writing a bigger doc around this approach, feel free to copy any or all of this). But here's the content from the doc I started writing on this topic many months ago:
Definitions
Device vs. host parallelism

There are two categories of parallelism that this design is concerned with maximizing: parallelism on the device (the GPU) and parallelism on the host (the threads doing command encoding). Today, the Entities framework (`EntityPass` and `FilterContents`) drip-feeds the GPU a single command buffer at a time, and all of these command buffers execute synchronously, even in cases where the GPU has free ALUs that it could be using to execute a RenderPass from a different command buffer which happens to not have any common render targets.

Backend resource timelines
The gist of the problem is that access to device-backed resources (textures/buffers) needs to be ordered (except for parallel reads). One possible way to allow Renderer users (like the Entities framework) to produce these "timeline" events for each resource would be to introduce an explicit `Semaphore` primitive in the Renderer API that Impeller commands can wait on and signal. This way, it's up to Renderer users to hook up these signals to achieve the intended ordering at recording time.

However, another possible approach is to just infer the correct per-resource synchronization timelines at encoding time, without burdening Renderer API users with the need to manage synchronization primitives.
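For concreteness, the explicit-primitive alternative might look something like the sketch below. All names here are invented for illustration; this is not an existing Impeller API.

```cpp
#include <memory>
#include <vector>

// Hypothetical renderer-level semaphore. A real backend would wrap an
// actual primitive (e.g. a VkSemaphore or MTLFence) here.
struct Semaphore {};

using SemaphoreRef = std::shared_ptr<Semaphore>;

// Hypothetical per-command synchronization hooks that Renderer users would
// have to wire up manually under the explicit approach.
struct CommandSync {
  std::vector<SemaphoreRef> wait_on;  // Block execution until these signal.
  std::vector<SemaphoreRef> signal;   // Signal these when the command retires.
};
```

A user ordering a write before a read would create one `Semaphore`, put it in the writer's `signal` list and the reader's `wait_on` list. The rest of this writeup argues for inferring that wiring automatically instead.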
Retain parallelism of device reads
Write operations are hard barriers for ordering, but multiple reads can be grouped together and happen in parallel in-between writes. The resource timeline needs additional state to toggle between a "mutable" mode and an "aliasing" mode. More concretely, reads only need to wait for the previous write to have finished (which is the same as waiting for all of the previous writes to have finished). But writes have the additional constraint of also needing to wait until all of the previously encountered reads have finished.
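As an illustrative sketch of this rule (not engine code), the waits implied by a single resource's recorded access sequence can be computed like so:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

enum class Access { kRead, kWrite };

// For each access i in one resource's recorded sequence, returns the
// indices of the earlier accesses it must wait on: reads wait only on the
// last write, while writes wait on every access since the previous write.
std::vector<std::vector<size_t>> ComputeWaits(const std::vector<Access>& ops) {
  std::vector<std::vector<size_t>> waits(ops.size());
  int64_t last_write = -1;                // Most recent write, or -1.
  std::vector<size_t> reads_since_write;  // Reads grouped after that write.
  for (size_t i = 0; i < ops.size(); i++) {
    if (ops[i] == Access::kRead) {
      if (last_write >= 0) {
        waits[i].push_back(static_cast<size_t>(last_write));
      }
      reads_since_write.push_back(i);
    } else {
      // The write must wait for all grouped reads to finish...
      waits[i] = reads_since_write;
      // ...or directly on the previous write if no reads intervened.
      if (waits[i].empty() && last_write >= 0) {
        waits[i].push_back(static_cast<size_t>(last_write));
      }
      reads_since_write.clear();
      last_write = static_cast<int64_t>(i);
    }
  }
  return waits;
}
```

For the sequence write → read → read → write, both reads wait only on the first write (and may overlap on the GPU), while the final write waits on both reads.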
The below sections describe a minimal example solution for Vulkan 1.1 that retains maximum GPU parallelizability of reads.
Tracked synchronization primitives
First, every resource needs an ordered event timeline, so the backend explicitly tracks this state for every device-allocated resource:

- An ordered list of semaphores (the resource timeline).
- A `read_start` index, defaulting to -1.

Note that all accesses of the resource timeline state should be thread-safe, and the order in which the user adds commands that read/write to textures at recording time should determine how the timeline unfolds (see also the "Thread safety" section below).
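A minimal sketch of that tracked state (types are placeholders; a real backend would store actual semaphore handles):

```cpp
#include <cstdint>
#include <mutex>
#include <vector>

// Placeholder for a backend semaphore handle (e.g. VkSemaphore).
struct BackendSemaphore {};

// Per-resource synchronization state tracked by the backend.
struct ResourceTimelineState {
  std::mutex mutex;                          // All access must be thread-safe.
  std::vector<BackendSemaphore> semaphores;  // Ordered event timeline.
  int64_t read_start = -1;                   // First read since the last
                                             // write; -1 if none
                                             // ("mutable" mode).
};
```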
Example rules for Vulkan 1.1
Using Vulkan 1.1 as an example, the resource timeline can be tracked with the following rules.

When encoding a command that writes to a resource:

1. If `read_start == -1`: append a semaphore wait command for the last semaphore in the resource timeline, if there is one (the previous write's semaphore).
2. If `read_start > -1`: for every semaphore starting at `read_start` in the resource timeline semaphore list, append a semaphore wait command. Then reset the `read_start` index to -1.
3. Append a new semaphore to the resource timeline and signal it when the write completes.

When encoding a command that reads from a resource:

1. If `read_start - 1 > -1`: append a semaphore wait command for the semaphore at index `read_start - 1` (which is the index of the last semaphore appended for a write operation) in the resource timeline.
2. Append a new semaphore to the resource timeline and signal it when the read completes.
3. If `read_start == -1`: set `read_start` to the index of the new semaphore created in step 2.
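The rules above can be sketched as follows. This is one reading of the rules, not engine code; in particular, this sketch has the first read after a write wait on that write's semaphore (the last entry in the timeline), which appears to be the intent. Semaphores are represented only by their index in the timeline.

```cpp
#include <cstdint>
#include <vector>

// Per-resource bookkeeping; semaphores are represented by their index in
// the resource's timeline rather than by backend handles.
struct Timeline {
  int64_t read_start = -1;  // Index of the first read since the last write.
  int64_t size = 0;         // Number of semaphores appended so far.
};

// Encodes a write: returns the semaphore indices to wait on, then appends
// the write's own signal semaphore to the timeline.
std::vector<int64_t> EncodeWrite(Timeline& t) {
  std::vector<int64_t> waits;
  if (t.read_start == -1) {
    // No reads since the last write: wait on the previous write, if any.
    if (t.size > 0) {
      waits.push_back(t.size - 1);
    }
  } else {
    // Wait on every read recorded since the last write, then reset.
    for (int64_t i = t.read_start; i < t.size; i++) {
      waits.push_back(i);
    }
    t.read_start = -1;
  }
  t.size++;  // Append this write's signal semaphore.
  return waits;
}

// Encodes a read: waits only on the last write's semaphore.
std::vector<int64_t> EncodeRead(Timeline& t) {
  std::vector<int64_t> waits;
  // If reads are already grouped, the last write sits just before them;
  // otherwise it's the most recent entry in the timeline.
  int64_t last_write = (t.read_start == -1) ? t.size - 1 : t.read_start - 1;
  if (last_write > -1) {
    waits.push_back(last_write);
  }
  int64_t new_semaphore = t.size++;  // Append this read's signal semaphore.
  if (t.read_start == -1) {
    t.read_start = new_semaphore;
  }
  return waits;
}
```

For a write → read → read → write → read sequence, the reads each wait only on the preceding write, and the second write waits on both grouped reads.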
Thread safety/nondeterministic timeline ordering
We can get away with making all interactions with the timelines thread-safe as a catch-all. If we did so, dependency logic errors at command recording time would just cause nondeterministic usage order -- which wouldn't be a validation/crash problem, but might not have the intended results. Take this scenario, for example:
It's clear that `RenderPassA` should be evaluated before `RenderPassC` and `RenderPassB` should be evaluated before `RenderPassD`, but it's not clear if `RenderPassC` should be evaluated before or after `RenderPassB`. If the user happens to care about this order, the user needs to make sure that the commands which bind or attach `TextureA` are appended in the correct order. But maybe the user doesn't care, or maybe the user happens to know that all these RenderPasses are commutative, and so it chooses to run the two command encoding tasks in parallel jobs.
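A minimal sketch of that catch-all (illustrative only, not engine code): with a mutex around timeline updates, any interleaving of the two jobs produces a structurally valid order, but whether `RenderPassC` lands before or after `RenderPassB` depends on thread scheduling.

```cpp
#include <algorithm>
#include <cstddef>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// A resource timeline whose updates are made thread-safe as a catch-all.
struct SharedTimeline {
  std::mutex mutex;
  std::vector<std::string> accesses;

  void Record(const std::string& name) {
    std::lock_guard<std::mutex> lock(mutex);
    accesses.push_back(name);
  }
};

// Two parallel encoding jobs, as in the scenario above: job 1 records
// RenderPassA then RenderPassC, job 2 records RenderPassB then RenderPassD.
void RecordTwoJobs(SharedTimeline& timeline) {
  std::thread job1([&] {
    timeline.Record("RenderPassA");
    timeline.Record("RenderPassC");
  });
  std::thread job2([&] {
    timeline.Record("RenderPassB");
    timeline.Record("RenderPassD");
  });
  job1.join();
  job2.join();
}

// Position of a pass in the recorded order (for inspecting the result).
size_t IndexOf(const SharedTimeline& t, const std::string& name) {
  return static_cast<size_t>(
      std::find(t.accesses.begin(), t.accesses.end(), name) -
      t.accesses.begin());
}
```

Every run preserves the per-job order (A before C, B before D); only the cross-job interleaving varies, which is exactly the "nondeterministic but valid" tradeoff described above.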