
[Impeller] Track per-resource synchronization timelines #120406

Open bdero opened 1 year ago

bdero commented 1 year ago

This is one approach to resolve https://github.com/flutter/flutter/issues/120399, https://github.com/flutter/flutter/issues/112648, and https://github.com/flutter/flutter/issues/106519 in an efficient way that avoids costly host<->device syncs.

We may or may not want to write a full design doc around what we actually end up doing here (if we end up writing a bigger doc around this approach, feel free to copy any or all of this). But here's the content from the doc I started writing on this topic many months ago:

Definitions

Device vs. host parallelism

There are two categories of parallelism that this design is concerned with maximizing:

  1. Execution time (GPU) parallelism: The Impeller Entities framework (primarily EntityPass and FilterContents) drip-feeds the GPU one command buffer at a time, and all of these command buffers execute serially, even in cases where the GPU has free ALUs that it could use to execute a RenderPass from a different command buffer that happens not to share any render targets.
  2. Recording/encoding time (CPU) parallelism: Non-collapsed sibling EntityPasses have no overlap in the writable resources of the command buffers they construct, so sibling EntityPasses can safely encode their command buffers on separate threads and then send them to the GPU in one batched submit (as sketched below). Command recording/encoding isn't trivial! EntityPass performs all kinds of draw call culling tricks and pass simplification to minimize the memory footprint and repeated work on the GPU.
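To illustrate the batched submit in (2), here's a minimal Vulkan sketch (not Impeller code; the names are illustrative): several independently encoded command buffers are handed to the queue in a single vkQueueSubmit, leaving the driver free to overlap their execution wherever no hazards exist.

```cpp
#include <vulkan/vulkan.h>

#include <vector>

// Submit a batch of independently encoded command buffers in one call,
// rather than issuing one vkQueueSubmit per buffer.
void SubmitBatch(VkQueue queue, const std::vector<VkCommandBuffer>& buffers,
                 VkFence completion_fence) {
  VkSubmitInfo submit_info = {};
  submit_info.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
  submit_info.commandBufferCount = static_cast<uint32_t>(buffers.size());
  submit_info.pCommandBuffers = buffers.data();
  vkQueueSubmit(queue, 1, &submit_info, completion_fence);
}
```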

Backend resource timelines

The gist of the problem is that access to device-backed resources (textures/buffers) needs to be ordered (except for parallel reads). One possible way to let Renderer users (like the Entities framework) produce these "timeline" events for each resource would be to introduce an explicit Semaphore primitive in the Renderer API that Impeller commands can wait on and signal. This way, it's up to Renderer users to hook up these signals to achieve the intended ordering at recording time.
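As a hypothetical sketch of that alternative (none of these names exist in Impeller; they only illustrate the bookkeeping this approach would push onto users):

```cpp
#include <memory>
#include <vector>

class Semaphore;  // Opaque, backend-agnostic sync primitive.

// Every command would carry explicit wait/signal lists, and users would
// have to thread the right semaphores through every dependent pass, e.g.:
//   read_command.sync.wait_on.push_back(texture_write_done);
struct SyncScope {
  std::vector<std::shared_ptr<Semaphore>> wait_on;  // Block until signaled.
  std::vector<std::shared_ptr<Semaphore>> signal;   // Signal on completion.
};
```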

However, another possible approach is to infer the correct per-resource synchronization timelines at encoding time, without burdening Renderer API users with managing synchronization primitives at all.

Retain parallelism of device reads

Write operations are hard barriers for ordering, but multiple reads can be grouped together and happen in parallel in-between writes. The resource timeline needs additional state to toggle between a "mutable" mode and an "aliasing" mode. More concretely, reads only need to wait for the previous write to have finished (which is the same as waiting for all of the previous writes to have finished). But writes have the additional constraint of also needing to wait until all of the previously encountered reads have finished.

The sections below describe a minimal example solution for Vulkan 1.1 that retains maximum GPU parallelizability of reads.

Tracked synchronization primitives

First, every resource needs an ordered event timeline, so the backend explicitly tracks this state for every device-allocated resource:
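A minimal sketch of that tracked state, assuming some backend event/fence wrapper type (field names are illustrative, not Impeller's):

```cpp
#include <memory>
#include <mutex>
#include <vector>

class BackendEvent;  // Wraps a backend primitive, e.g. a VkEvent or MTLFence.

// Per-resource timeline state, following the read/write rules above.
struct ResourceTimeline {
  std::mutex mutex;  // All access must be thread-safe (see below).

  // Signaled by the most recent write. The next write -- and any read
  // issued before then -- must wait on this.
  std::shared_ptr<BackendEvent> last_write;

  // One entry per read issued since `last_write`. Reads may overlap each
  // other freely, but the next write must wait on all of them.
  std::vector<std::shared_ptr<BackendEvent>> reads_since_last_write;
};
```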

Note that all accesses of the resource timeline state should be thread-safe, and the order in which the user adds commands that read/write to textures at recording time should determine how the timeline unfolds (see also the "Thread safety" section below).

Example rules for Vulkan 1.1

Using Vulkan 1.1 as an example, the resource timeline can be tracked with the following rules:
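A sketch of what those rules might look like using VkEvents (illustrative, not Impeller code). Only execution dependencies are shown; a real implementation would also attach memory barriers to vkCmdWaitEvents for visibility:

```cpp
#include <vulkan/vulkan.h>

#include <mutex>
#include <vector>

// Illustrative only: per-resource state holding raw Vulkan events.
struct VulkanResourceTimeline {
  std::mutex mutex;
  VkEvent last_write = VK_NULL_HANDLE;
  std::vector<VkEvent> reads_since_last_write;
};

// Rule: a read only waits on the most recent write.
void EncodeRead(VkCommandBuffer cmd, VulkanResourceTimeline& timeline,
                VkEvent read_done) {
  std::lock_guard<std::mutex> lock(timeline.mutex);
  if (timeline.last_write != VK_NULL_HANDLE) {
    vkCmdWaitEvents(cmd, 1, &timeline.last_write,
                    VK_PIPELINE_STAGE_ALL_COMMANDS_BIT,   // srcStageMask
                    VK_PIPELINE_STAGE_ALL_COMMANDS_BIT,   // dstStageMask
                    0, nullptr, 0, nullptr, 0, nullptr);  // no barriers
  }
  // ... encode the commands that read the resource ...
  vkCmdSetEvent(cmd, read_done, VK_PIPELINE_STAGE_ALL_COMMANDS_BIT);
  timeline.reads_since_last_write.push_back(read_done);
}

// Rule: a write waits on the most recent write AND every read since it.
void EncodeWrite(VkCommandBuffer cmd, VulkanResourceTimeline& timeline,
                 VkEvent write_done) {
  std::lock_guard<std::mutex> lock(timeline.mutex);
  std::vector<VkEvent> wait_events = timeline.reads_since_last_write;
  if (timeline.last_write != VK_NULL_HANDLE) {
    wait_events.push_back(timeline.last_write);
  }
  if (!wait_events.empty()) {
    vkCmdWaitEvents(cmd, static_cast<uint32_t>(wait_events.size()),
                    wait_events.data(), VK_PIPELINE_STAGE_ALL_COMMANDS_BIT,
                    VK_PIPELINE_STAGE_ALL_COMMANDS_BIT,
                    0, nullptr, 0, nullptr, 0, nullptr);
  }
  // ... encode the commands that write the resource ...
  vkCmdSetEvent(cmd, write_done, VK_PIPELINE_STAGE_ALL_COMMANDS_BIT);
  // The write becomes the new barrier; prior reads are folded into it.
  timeline.last_write = write_done;
  timeline.reads_since_last_write.clear();
}
```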

Thread safety/nondeterministic timeline ordering

We can get away with making all interactions with the timelines thread-safe as a catch-all. If we did so, dependency logic errors at command recording time would just cause nondeterministic usage order -- which wouldn't be a validation/crash problem, but might not have the intended results. Take this scenario, for example:

Scenario
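For concreteness, assume two parallel encoding jobs that both touch TextureA (a reconstruction consistent with the discussion below; the exact arrangement is an assumption):

```
Encoding job 1:                      Encoding job 2:
  RenderPassA (attaches TextureA)      RenderPassB (attaches TextureA)
  RenderPassC (binds TextureA)         RenderPassD (binds TextureA)
```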

It's clear that RenderPassA should be evaluated before RenderPassC and RenderPassB should be evaluated before RenderPassD, but it's not clear whether RenderPassC should be evaluated before or after RenderPassB. If the user happens to care about this order, the user needs to make sure that the commands which bind or attach TextureA are appended in the correct order. But maybe the user doesn't care, or maybe the user happens to know that all these RenderPasses are commutative, and so chooses to run the two command encoding tasks as parallel jobs.

bdero commented 1 year ago

We'll need to do a small amount of thread-safety work to parallelize command encoding, but we shouldn't need to use any heavy stop-the-world primitives like VkSemaphores, etc. Instead, all we need are fine-grained memory barriers in Vulkan (already being done today), and MTLFences (which are lightweight per-resource barriers).
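For reference, a sketch of the MTLFence pattern via metal-cpp (the method names mirror the Objective-C selectors; the surrounding structure is illustrative, not Impeller's):

```cpp
#include <Metal/Metal.hpp>

// Hypothetical helper: encode a pass that samples a resource guarded by a
// per-resource fence. `last_write` and `read_done` would come from the
// resource's tracked timeline state.
void EncodeReadingPass(MTL::RenderCommandEncoder* encoder,
                       MTL::Fence* last_write, MTL::Fence* read_done) {
  // Don't start sampling until the producing pass has finished writing.
  encoder->waitForFence(last_write, MTL::RenderStageVertex);
  // ... encode draws that sample the resource ...
  // Let the next write know this read has completed.
  encoder->updateFence(read_done, MTL::RenderStageFragment);
}
```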

jmagman commented 10 months ago

PSA: Work is in progress to increase the engine's iOS minimum deployment target to 12, so MTLEvent would be available without requiring fallbacks (the engine already targets a minimum of macOS 10.14):

```objc
API_AVAILABLE(macos(10.14), ios(12.0))
@protocol MTLEvent <NSObject>
```