[Metal] Render/Compute pass forces command buffer finish at the end

lukaschod commented 1 month ago

Description During my investigation of case #5721, I noticed that wgpu produces a lot of command buffers. Each render or compute pass forces a command buffer to finish. After further investigation, I found that this happens because of the resource state tracking code, which injects additional command buffers with the correct resource state changes. This is only relevant in Vulkan, Dx12, and maybe some other APIs that require explicit resource state tracking; it is not required for Metal. As a result, Metal simply produces a lot of empty command buffers and unnecessary splitting.

According to the Metal documentation, it is recommended to keep to a single command buffer for single-threaded apps. For multi-threaded apps, it should be kept per subtask. However, as it currently splits per pass, it makes it quite impossible to apply these recommendations.

This functionality also highly obscures GPU workload debugging with Xcode GPU capture, as it produces a high amount of command buffers and push/pop debug groups only function within a command buffer. This is how it looks now: After removing this command buffer splitting:

As for the performance, I did not notice a significant difference on the GPU side as it really depends on the use case. However, for the CPU, I did notice that consolidating into a single command buffer saves encoding time. Multiple command buffers: Single command buffer:

In any case, the benefit of having multiple versus a single command buffer on Metal highly depends on the use case, adding more reason to give control to the developer instead of forcing splitting.

Repro I made a small patch that disables command buffer splitting, so you could test the difference. This patch is for prototype purposes only as it works only on Metal and would break Vulkan/Dx12. A proper fix would need some conditional execution. do-not-split-command-buffer.patch

Wumpf commented 1 month ago

I wonder if a good first step would be to allow all wgpu-hal implementations to skip empty command buffers. The optimal solution for creating only a single metal command buffer per wgpu command buffer will still likely involve the hal backend to report capabilities of some form, but I'm not familiar enough with the code to be sure - if possible it would be great to solve as much as possible on the the wgpu-hal side and not as little as possible into wgpu-core.

This is also likely to improve issues with profiler markers we have today See also https://github.com/Wumpf/wgpu-profiler/issues/64

ErichDonGubler commented 1 month ago

@Wumpf: That doesn't sound like a bad idea. The only consideration I see there is making sure we don't skip validation mandated by the spec before eliding empty draw calls.

As an example, I noticed that in some transfer functions, we actually return an early success when one dimension of texture data is 0. This causes CTS to fail in a few places in Firefox right now.

lukaschod commented 1 month ago

Yeap, making it not produce empty command buffers would be huge improvement.

gfx-rs / wgpu

[Metal] Render/Compute pass forces command buffer finish at the end #5738