KhronosGroup / MoltenVK

MoltenVK is a Vulkan Portability implementation. It layers a subset of the high-performance, industry-standard Vulkan graphics and compute API over Apple's Metal graphics framework, enabling Vulkan applications to run on macOS, iOS and tvOS.
Apache License 2.0
4.83k stars 428 forks source link

Issues using MTLFence for VkSemaphore #796

Open apayen opened 4 years ago

apayen commented 4 years ago

Hello,

From what I understand, it seems the scopes for MTLFence and VkSemaphore are radically different.

The vulkan specs states that semaphore are intended for queue<->queue synchronization:

However, metal specs states that:

So a MTLFence's scope only includes commands in the encoder where it's being signaled/waited upon.

This means that, if we have 3 encoders A, B and C, and a MTLFence signaled by A waited by B, we have the guarantee that B will execute after A is completed. But C is free to execute in parallel to both A and B (whereas a VkSemaphore would guarantee that both B and C would wait on A to complete).

Moreover, the current pattern to inject an empty blit encoder to wait or signal a semaphore means that signaling and waiting scope are empty, and are not actually synchronizing anything.

I'm assuming though that it has been tested and found to work on Mac at least. So I'd assume that Mac's GPU doesn't have (for now) the technical capability to do fine-grained encoder only scopes, and result in a MTLFence behaving like a MTLEvent. But this completely breaks on iOS (using iPhone XS 13.2.3).

By un-allowing VkSemaphore to use the MTLFence implementation, and reverting to the use of MTLEvent, the synchronization works as expected.

MTLFence however is appropriate to handle subpass synchronization inside a vk render pass, as it exactly matches VkSubpassDependency's semantic (assuming each subpass have its encoder).

billhollings commented 4 years ago

@apayen

So a MTLFence's scope only includes commands in the encoder where it's being signaled/waited upon.

This does not appear to be the case. The old Apple Metal Programming Guide specifically includes examples discussing cross-encoder synchronization. And even the MTLFence description you quoted above includes "across command encoders".

It is, however, likely true that MTLFence does not support inter-queue sync.

the current pattern to inject an empty blit encoder to wait or signal a semaphore means that signaling and waiting scope are empty, and are not actually synchronizing anything

Can you clarify what you mean by this? Even though the new encoder itself is empty and has no scope itself, the command buffer should guarantee that the scope of any encoders before or after this empty encoder will be synchronized.

There was a lot of discussion about the use of MTLFence and MTLEvent to implement the synchronization of VkSemaphore. Definitely MTLEvent seems to be more suitable from a design perspective.

But one of the main drivers behind using MTLFence is that it seemed to be more stable under conditions here and here.

It may be that we need to figure out how to better make use of MTLEvent when it comes to VkSemaphore.

Currently, the use of either MTLFence or MTLEvent pass the same set of Vulkan CTS tests.

It would be very useful to find some solid test cases that would help us determine the stability of using MTLEvents for VkSemaphores.

apayen commented 4 years ago

So a MTLFence's scope only includes commands in the encoder where it's being signaled/waited upon.

What I meant by this is that the signal operation only wait for commands inside the signaling encoder to complete, and the wait operation only stall commands inside the waiting encoder. Of course, the signaling and waiting encoder are different encoders; hence the cross-encoder synchronization.

Even though the new encoder itself is empty and has no scope itself, the command buffer should guarantee that the scope of any encoders before or after this empty encoder will be synchronized.

This is what seems to be the wrong assumption.

Metal spec:

Updates the given fence to capture all GPU work enqueued by the command encoder up to this specific point

Prevents further GPU commands from being enqueued by the command encoder until the given fence is reached.

Notice that Apple does not use plural here.

This is further supported by the fact that, signaling/waiting MTLEvent must be done outside of any render encoder, which clearly indicates that all encoders before the event are required to complete before signaling, and all encoders after the event must wait.

On the other hand, MTLFence must be signaled/waited within a command encoder, which hints that the synchronization scope is indeed limited to that encoder (otherwise, it would be legal to signal/wait outside of an encoder as well).

Agreed that apple docs are not as explicit as vkspec, but there is no implicit guarantee that an encoder submitted after another will inherit its waiting fences.

From what I understand, the purpose of MTLFence is to create a graph of command encoders, similar to how vulkan render passes works: you can submit subpasses A B C D E F, and define dependencies A->B, B->F, C->D, D->F, E->F, resulting in the following graph:

A->B-\
C->D-+->F
E----/

With an order guarantee this would turn into (with C implicitly waiting on A due to A->B, and E implicitly waiting on C due to C->D)

A->B----\
   C->D-+->F
      E-/

Which greatly reduces the parallelism (though in this sample one could cleverly reorder and record E first - more complex graphs may be defined)

But one of the main drivers behind using MTLFence is that it seemed to be more stable under conditions here and here.

I have just found an issue (Acquiring a swapchain image not signaling the semaphore properly if the image is not readily available), for which I will open another issue. This could be linked to the GPU timeout issue.

billhollings commented 4 years ago

Metal spec:

Updates the given fence to capture all GPU work enqueued by the command encoder up to this specific point

Prevents further GPU commands from being enqueued by the command encoder until the given fence is reached.

Notice that Apple does not use plural here.

However...the specs also say:

Fences are manipulated when the command buffer is submitted to the GPU. This maintains global order and prevents deadlock. Fences are evaluated at command encoder boundaries. Waits occur at the beginning of an encoder and updates occur at the end of the encoder.

AFAIK, the encoders are just a way to get the commands into the command buffer. In doing so they install the fences at the boundaries.

The command buffer itself ensures that commands are executed in order.

The promise upheld by Metal is that the perceived order in which commands are executed is the same as the way you ordered them.

So commands executed by one encoder (including the fence boundaries) should execute in order...and commands from prior encoders should be executed before a wait on a future encoder, and the commands in subsequent encoders should execute after an update on the fence at the tail of a previous encoder.

Commands in all prior encoders are in the scope of a wait on the current encoder...and commands in all subsequent encoders are in the scope of an update at the end of the current encoder.

billhollings commented 4 years ago

Commands in all prior encoders are in the scope of a wait on the current encoder...and commands in all subsequent encoders are in the scope of an update at the end of the current encoder.

Sorry...I said that backwards...

Commands in the current and subsequent encoders are in the scope of a wait on the current encoder...and commands in the current and past encoders are in the scope of an update at the end of the current encoder.

billhollings commented 4 years ago

Having said all this...if we can figure out what sometimes goes wrong with the MTLEvent implementation of VkSemaphore...I'd be happy to make MVK_ALLOW_METAL_EVENTS the default mechanism.

apayen commented 4 years ago

Having said all this...if we can figure out what sometimes goes wrong with the MTLEvent implementation of VkSemaphore...I'd be happy to make MVK_ALLOW_METAL_EVENTS the default mechanism.

I absolutely understand that you want to fix what is wrong with MTLEvent before reverting to using it, and support that.

The following is not to convince you to revert to MTLEvent straight away. But I think that, moving forward, it is critical to understand precisely what MTLFence does. Sadly Apple docs are not the best, hence the argument we're having.

The promise upheld by Metal is that the perceived order in which commands are executed is the same as the way you ordered them. While Metal might reorder some of your commands before processing them, this normally only occurs when there's a performance gain and no other perceivable impact.

What I understand here is that, for instance, blending would process in order, so that two alpha blended plane on top of each other are guaranteed to appear in the submission order. This guarantee is also provided by Vulkan.

However this guarantee is limited only to fragment ordering, not resource hazard, as hinted here

This forces the Gaussian blur filter [a compute encoder] to wait for the downsample filter [a blit encoder] to complete its work before beginning its own work. A waiting period is necessary because the Gaussian blur filter depends on dynamic texture data generated by the downsample filter. Without the fence, the GPU could execute both filters [encoders] in parallel, and thus read uninitialized dynamic texture data allocated from the heap.

This contradicts "this normally only occurs when there's a performance gain and no other perceivable impact." from the previous quote from Apple's doc, showing that perceived order does not apply here.

If there is a render encoder, followed by an empty blit encoder signaling a fence, the blit encoder is free to execute in parallel to the render encoder and signal the fence straightaway, without waiting for the render encoder to complete.

If there is a blit encoder waiting a fence, a render encoder that has been placed after is free to execute in parallel to the blit encoder (or even before). The render encoder has no reason to wait for a fence which is waited inside the blit encoder

If we have two render encoders, the perceived order guarantees would at best guarantee that, if they are rendering to the same targets, the blending would be ordered properly (to be confirmed though that the guarantee indeed works across encoders). But vertex (and fragment if not doing programmable blending) are free to be completely reordered/parallelized.

AFAIK, the encoders are just a way to get the commands into the command buffer.

They also define a synchronization scope for commands inside it. This is not explicitly said (curse Apple), but the API strongly suggests that:

MTL describes MTLRenderCommandEncoder as

The object to use for encoding commands for a render pass.

If we consider a Render Encoder to be the equivalent of a Vulkan subpass, and MTLFence to be an execution dependency (between two subpasses), then all the API limitations makes a lot of sense.

https://www.khronos.org/registry/vulkan/specs/1.1/html/vkspec.html#renderpass

Execution of subpasses may overlap or execute out of order with regards to other subpasses, unless otherwise enforced by an execution dependency [which would be a MTLFence in metal]. Each subpass only respects submission order for commands recorded in the same subpass, and the vkCmdBeginRenderPass and vkCmdEndRenderPass commands that delimit the render pass - commands within other subpasses are not included [in submission order]. This affects most other implicit ordering guarantees.

Not being able to alter render targets within the same encoder, and being able to wait a MTLFence only at the beginning and signal it only at the end. It's almost a 1::1 mapping, except vulkan requires the full render passes and dependencies to be defined ahead of it's actual command encoding.

TellowKrinkle commented 2 years ago

However this guarantee is limited only to fragment ordering, not resource hazard, as hinted here

This is because MTLHeaps default to MTLHazardTrackingModeUntracked. MVK does not currently use anything with that hazard tracking mode, so MTLFences will not provide anything over the existing automatic hazard tracking (except additional unnecessary waiting).

I can confirm that MTLFences only apply to the current command encoder though. I messed that up in PCSX2's renderer. I had the following command buffers:

  1. Vertex Upload command buffer. Contains a blit encoder that copies vertex data from a CPU buffer to an untracked GPU buffer, then updates a MTLFence (goal is to make it so that even though the same buffer is being used for every draw, this upload can run on a DMA queue in parallel to any previous draws)
  2. Draw command buffer. Contains render encoders that consume the vertex data from (1). The first (and only first) encoder on this buffer contains a wait on the MTLFence

Everything worked fine on Intel, Nvidia and AMD, where the GPU never reordered the encoders in (2), probably because they all transitively depended on the first encoder in their respective fragment shaders. But when people started testing it on M1, which runs vertex shaders and fragment shaders separately, they got massive explosions of triangles across the whole screen. Turns out it was moving all vertex shader runs from the draw command buffer that weren't in the first encoder ahead of the vertex upload. Making every encoder in the draw command buffer wait for the fence fixed the issue.

So if we did want to use MTLFences and MTLHazardTrackingModeUntracked, we'd have to ensure every command buffer waited for a MTLFence associated with every previously-waited-on VkSemaphore (not just the most recent one), which sounds like a pain. Probably better to just keep MTLHazardTrackingModeUntracked off where MTLEvent isn't available.