KhronosGroup / Vulkan-Docs

The Vulkan API Specification and related tools

[Roadmap Feedback] Forward Progress Guarantees for Compute Workgroups #2233

Open devshgraphicsprogramming opened 1 year ago

devshgraphicsprogramming commented 1 year ago

Problem statement:

There are currently no forward progress guarantees for compute shader workgroups; this makes certain computation patterns, such as having one workgroup wait for another, rely on UB.

The foundational issue here is with workgroup interactions. Even in a 2-workgroup dispatch, if workgroup 1 spins on a coherent read of a memory location, waiting for workgroup 0 to store a certain value to it, the spec gives no guarantee that the driver/GPU scheduler won't put workgroup 0 to sleep indefinitely and never let it make progress.

For example, a homebrew work-graph is impossible: eventually some workgroup executor wouldn't be able to just "grab" extra work while waiting (due to limited parallelism) and would spin, possibly indefinitely, preempting workgroups that actually have work to do and whose completion is necessary for the others to stop spinning.

I am aware that spinning is not efficient, and whenever I resort to it I do try to grab some other work for the workgroup to do, but there are scenarios where you do work expansion/contraction and the optimal flow cut has too few edges in the (often implicit) graph to saturate all workgroups.

Use Case Example(s):

The primary example would be UE5's "ugly-UB" queue used in Nanite (or so I heard); AFAIK they had to disable it on PC and leave it in for the consoles, because only there can you guarantee an ordered dispatch.

Another example I conceptualized would be a CircularBuffer (so Append & Consume in the same dispatch), which would require "toggling" and "mutexing" the buffer between append/consume states so that appends cannot occur at the same time as consumes; this requires that producers wait for the consumers, or vice versa. I am aware that if this were allowed at per-invocation granularity, a subgroup with divergent requests could "deadlock" itself; loop-peeling, or just making sure the requests are subgroup-uniform, would sidestep that.

I have a single-dispatch implementation of a global Blelloch scan (upsweep/downsweep) which uses atomics to assign Virtual Workgroup IDs to "persistent" workgroups (air quotes because I have no guarantee the workgroups will be resident on the GPU), then some scratch memory as atomic counters of workgroup completion at each level. This, however, means that at a certain point one workgroup might be spinning, waiting on another.

Presentation for more Context: https://www.youtube.com/watch?v=JGiKTy_Csv8&t=1050s

This would be fine as long as we assume either:

Because my "work graph" assigns work such that:

Another approach which worked "in practice" was emulating AMD's ordered dispatch and declaring a "critical section" (implemented with a spin-wait until entryIndex+1 was equal to WorkgroupIndex) in which we atomicAdd-ed the whole workgroup's sum to the counter and added the old value to our own values. However, without the special HW juice, this was 20x slower than the Blelloch scan.

(Optional) Suggested Solution(s) (via opening an MR on vulkan-docs repo and creating a Proposal Document) :

Not asking for GCN-style ordered dispatches, just a guarantee that a workgroup which has started running will eventually resume and complete.

The AMDX work-graphs don't meet my needs, as I like to use subgroup ops, shared memory, and workgroup barriers.

A yield(slicesToSleep) would be cool but not required.

marty-johnson59 commented 1 year ago

Thank you for your suggestion! The Vulkan team very much values your feedback. We're collecting suggestions now and will review them in the Vulkan working group shortly.

Tobski commented 11 months ago

Hi @devshgraphicsprogramming, thanks for your feedback, the Vulkan working group wanted to let you know that we are actually working on something like this, and are aiming to put it into a future roadmap milestone. We can't make any firm commitments at the moment, but we'll be sure to update you once we have more information.

devshgraphicsprogramming commented 11 months ago

> Hi @devshgraphicsprogramming, thanks for your feedback, the Vulkan working group wanted to let you know that we are actually working on something like this, and are aiming to put it into a future roadmap milestone. We can't make any firm commitments at the moment, but we'll be sure to update you once we have more information.

As we say in Poland, this is "honey to my ears" 😃

Ipotrick commented 1 week ago

Are there any updates on this?