
[Impeller] Track per-resource synchronization timelines #120406

Open bdero opened 1 year ago

bdero commented 1 year ago

This is one approach to resolve https://github.com/flutter/flutter/issues/120399, https://github.com/flutter/flutter/issues/112648, and https://github.com/flutter/flutter/issues/106519 in an efficient way that avoids costly host<->device syncs.

We may or may not want to write a full design doc around what we actually end up doing here (if we end up writing a bigger doc around this approach, feel free to copy any or all of this). But here's the content from the doc I started writing on this topic many months ago:

Definitions

Device vs. host parallelism

There are two categories of parallelism that this design is concerned with maximizing:

  1. Execution time (GPU) parallelism: The Impeller Entities framework (primarily EntityPass and FilterContents) drip-feeds the GPU one command buffer at a time, and all of these command buffers execute serially, even in cases where the GPU has free ALUs that it could use to execute a RenderPass from a different command buffer that happens not to share any render targets.
  2. Recording/encoding time (CPU) parallelism: Non-collapsed sibling EntityPasses have no overlap in the writable resources of the command buffers they construct, so sibling EntityPasses can safely encode their command buffers on separate threads and then send them to the GPU in one batched submit (as sketched below). Command recording/encoding isn't trivial! EntityPass performs all kinds of draw call culling tricks and pass simplification to minimize the memory footprint and repeated work on the GPU.
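To illustrate the batched submit in (2), here's a minimal Vulkan sketch (not Impeller code; the names are illustrative): several independently encoded command buffers are handed to the queue in a single vkQueueSubmit, leaving the driver free to overlap their execution wherever no hazards exist.

```cpp
#include <vulkan/vulkan.h>

#include <vector>

// Submit a batch of independently encoded command buffers in one call,
// rather than issuing one vkQueueSubmit per buffer.
void SubmitBatch(VkQueue queue, const std::vector<VkCommandBuffer>& buffers,
                 VkFence completion_fence) {
  VkSubmitInfo submit_info = {};
  submit_info.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
  submit_info.commandBufferCount = static_cast<uint32_t>(buffers.size());
  submit_info.pCommandBuffers = buffers.data();
  vkQueueSubmit(queue, 1, &submit_info, completion_fence);
}
```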

Backend resource timelines

The gist of the problem is that access to device-backed resources (textures/buffers) needs to be ordered (except for parallel reads). One possible way to let Renderer users (like the Entities framework) produce these "timeline" events for each resource would be to introduce an explicit Semaphore primitive in the Renderer API that Impeller commands can wait on and signal. This way, it's up to Renderer users to hook up these signals to achieve the intended ordering at recording time.
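As a hypothetical sketch of that alternative (none of these names exist in Impeller; they only illustrate the bookkeeping this approach would push onto users):

```cpp
#include <memory>
#include <vector>

class Semaphore;  // Opaque, backend-agnostic sync primitive.

// Every command would carry explicit wait/signal lists, and users would
// have to thread the right semaphores through every dependent pass, e.g.:
//   read_command.sync.wait_on.push_back(texture_write_done);
struct SyncScope {
  std::vector<std::shared_ptr<Semaphore>> wait_on;  // Block until signaled.
  std::vector<std::shared_ptr<Semaphore>> signal;   // Signal on completion.
};
```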

However, another possible approach is to infer the correct per-resource synchronization timelines at encoding time, without burdening Renderer API users with managing synchronization primitives at all.

Retain parallelism of device reads

Write operations are hard barriers for ordering, but multiple reads can be grouped together and happen in parallel in-between writes. The resource timeline needs additional state to toggle between a "mutable" mode and an "aliasing" mode. More concretely, reads only need to wait for the previous write to have finished (which is the same as waiting for all of the previous writes to have finished). But writes have the additional constraint of also needing to wait until all of the previously encountered reads have finished.

The sections below describe a minimal example solution for Vulkan 1.1 that retains maximum GPU parallelizability of reads.

Tracked synchronization primitives

First, every resource needs an ordered event timeline, so the backend explicitly tracks this state for every device-allocated resource:
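A minimal sketch of that tracked state, assuming some backend event/fence wrapper type (field names are illustrative, not Impeller's):

```cpp
#include <memory>
#include <mutex>
#include <vector>

class BackendEvent;  // Wraps a backend primitive, e.g. a VkEvent or MTLFence.

// Per-resource timeline state, following the read/write rules above.
struct ResourceTimeline {
  std::mutex mutex;  // All access must be thread-safe (see below).

  // Signaled by the most recent write. The next write -- and any read
  // issued before then -- must wait on this.
  std::shared_ptr<BackendEvent> last_write;

  // One entry per read issued since `last_write`. Reads may overlap each
  // other freely, but the next write must wait on all of them.
  std::vector<std::shared_ptr<BackendEvent>> reads_since_last_write;
};
```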

Note that all accesses of the resource timeline state should be thread-safe, and the order in which the user adds commands that read/write to textures at recording time should determine how the timeline unfolds (see also the "Thread safety" section below).

Example rules for Vulkan 1.1

Using Vulkan 1.1 as an example, the resource timeline can be tracked with the following rules:
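A sketch of what those rules might look like using VkEvents (illustrative, not Impeller code). Only execution dependencies are shown; a real implementation would also attach memory barriers to vkCmdWaitEvents for visibility:

```cpp
#include <vulkan/vulkan.h>

#include <mutex>
#include <vector>

// Illustrative only: per-resource state holding raw Vulkan events.
struct VulkanResourceTimeline {
  std::mutex mutex;
  VkEvent last_write = VK_NULL_HANDLE;
  std::vector<VkEvent> reads_since_last_write;
};

// Rule: a read only waits on the most recent write.
void EncodeRead(VkCommandBuffer cmd, VulkanResourceTimeline& timeline,
                VkEvent read_done) {
  std::lock_guard<std::mutex> lock(timeline.mutex);
  if (timeline.last_write != VK_NULL_HANDLE) {
    vkCmdWaitEvents(cmd, 1, &timeline.last_write,
                    VK_PIPELINE_STAGE_ALL_COMMANDS_BIT,   // srcStageMask
                    VK_PIPELINE_STAGE_ALL_COMMANDS_BIT,   // dstStageMask
                    0, nullptr, 0, nullptr, 0, nullptr);  // no barriers
  }
  // ... encode the commands that read the resource ...
  vkCmdSetEvent(cmd, read_done, VK_PIPELINE_STAGE_ALL_COMMANDS_BIT);
  timeline.reads_since_last_write.push_back(read_done);
}

// Rule: a write waits on the most recent write AND every read since it.
void EncodeWrite(VkCommandBuffer cmd, VulkanResourceTimeline& timeline,
                 VkEvent write_done) {
  std::lock_guard<std::mutex> lock(timeline.mutex);
  std::vector<VkEvent> wait_events = timeline.reads_since_last_write;
  if (timeline.last_write != VK_NULL_HANDLE) {
    wait_events.push_back(timeline.last_write);
  }
  if (!wait_events.empty()) {
    vkCmdWaitEvents(cmd, static_cast<uint32_t>(wait_events.size()),
                    wait_events.data(), VK_PIPELINE_STAGE_ALL_COMMANDS_BIT,
                    VK_PIPELINE_STAGE_ALL_COMMANDS_BIT,
                    0, nullptr, 0, nullptr, 0, nullptr);
  }
  // ... encode the commands that write the resource ...
  vkCmdSetEvent(cmd, write_done, VK_PIPELINE_STAGE_ALL_COMMANDS_BIT);
  // The write becomes the new barrier; prior reads are folded into it.
  timeline.last_write = write_done;
  timeline.reads_since_last_write.clear();
}
```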

Thread safety/nondeterministic timeline ordering

We can get away with making all interactions with the timelines thread-safe as a catch-all. If we did so, dependency logic errors at command recording time would just cause nondeterministic usage order -- which wouldn't be a validation/crash problem, but might not have the intended results. Take this scenario, for example:

Scenario
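For concreteness, assume two parallel encoding jobs that both touch TextureA (a reconstruction consistent with the discussion below; the exact arrangement is an assumption):

```
Encoding job 1:                      Encoding job 2:
  RenderPassA (attaches TextureA)      RenderPassB (attaches TextureA)
  RenderPassC (binds TextureA)         RenderPassD (binds TextureA)
```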

It's clear that RenderPassA should be evaluated before RenderPassC and RenderPassB should be evaluated before RenderPassD, but it's not clear whether RenderPassC should be evaluated before or after RenderPassB. If the user happens to care about this order, the user needs to make sure that the commands which bind or attach TextureA are appended in the correct order. But maybe the user doesn't care, or maybe the user happens to know that all these RenderPasses are commutative, and so chooses to run the two command encoding tasks as parallel jobs.

bdero commented 1 year ago

We'll need to do a small amount of thread-safety work to parallelize command encoding, but we shouldn't need to use any heavy stop-the-world primitives like VkSemaphores, etc. Instead, all we need are fine-grained memory barriers in Vulkan (already being done today), and MTLFences (which are lightweight per-resource barriers).
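For reference, a sketch of the MTLFence pattern via metal-cpp (the method names mirror the Objective-C selectors; the surrounding structure is illustrative, not Impeller's):

```cpp
#include <Metal/Metal.hpp>

// Hypothetical helper: encode a pass that samples a resource guarded by a
// per-resource fence. `last_write` and `read_done` would come from the
// resource's tracked timeline state.
void EncodeReadingPass(MTL::RenderCommandEncoder* encoder,
                       MTL::Fence* last_write, MTL::Fence* read_done) {
  // Don't start sampling until the producing pass has finished writing.
  encoder->waitForFence(last_write, MTL::RenderStageVertex);
  // ... encode draws that sample the resource ...
  // Let the next write know this read has completed.
  encoder->updateFence(read_done, MTL::RenderStageFragment);
}
```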

jmagman commented 10 months ago

PSA: Work is in progress to increase the engine's iOS minimum deployment target to 12, so MTLEvent would be available without requiring fallbacks (the engine already targets a minimum of macOS 10.14):

```objc
API_AVAILABLE(macos(10.14), ios(12.0))
@protocol MTLEvent <NSObject>
```