facebookexperimental / libunifex

Unified Executors

Discussion and thoughts: using senders and receivers with GPU async #378

Open Mrkol opened 2 years ago

Mrkol commented 2 years ago

I am trying to wrap Vulkan's asynchronous GPU calls (i.e. acquiring swapchain images, submitting rendering work, presenting images to the screen) into unifex abstractions. As it turns out, I've bitten off a lot more than I can chew, so I'd like to discuss this with people who know what they are doing with unifex, especially since GPU async is mentioned as a potential P2300 application. Here are my current thoughts. They are probably applicable to any modern low-level GPU API, such as DirectX 12 or Metal (I think).

Vulkan has a mechanism for chaining GPU-side async operations: semaphores. An async call (usually) takes two arrays of semaphores: "wait" and "signal". The call does not actually start (even though it is submitted) until all "wait" semaphores are signaled, and after it finishes it signals all of the "signal" semaphores. Semaphores are a purely GPU-side synchronization primitive, therefore it is impossible to await them on the CPU.

How do we eventually join the computation if semaphores cannot be awaited (or polled) from the CPU? The answer to that are fences, which can be signaled from the GPU but polled/waited for only from the CPU. Those usually "terminate" a chain of async GPU computations.

Therefore an entire chain of GPU computations can easily be represented with a sender: when the operation starts, we register our fence with a polling system, and when the computation is done the polling system will notify our operation that the result is ready and we can call set_value() on the corresponding receiver.

This brings us to the first problem: how do we represent partial computations with senders, i.e. those that do not end in a call that can signal fences? The unifex::sender concept requires that we use the standard set_value channel, but we cannot do that, as the value will never become available CPU-side. And that is OK, since S&R supports other channels. Let's call the new channel work_started (motivation: Vulkan calls don't have asynchronous errors AFAIK; they either return an error immediately, or start an asynchronous call that must complete with success, unless the GPU completely crashes). So the plan for starting such a partial operation is as follows: call a Vulkan function; if it returns an error, call set_error, otherwise call work_started.

Question: these types of senders and receivers representing partial computations do not satisfy the usual unifex concepts and have totally different semantics from the usual networking stuff. Is this even an intended usage of S&R? Am I on my own in this land of dragons, or is unifex designed to handle customization of algorithms well enough to support this?

As far as I know, there's no general way for a receiver/sender to list async channels that it uses, and therefore no way for generic algorithms to perfectly forward these calls when necessary (e.g. let_value's _successor_receiver).

Onto the next problem: where do we get the semaphores that are supposed to chain GPU operations? My current plan is as follows: add two new CPOs for GPU receivers, get_wait_semaphores and get_signal_semaphores. When a GPU sender gets connected and its operation is started, it will use these CPOs to acquire the relevant semaphores. Let's consider a "simple" example: an analogue of let_value, let_work_started(predecessor, successor) (there's no successor factory, as no result value is produced by work_started). When a sender returned by this function gets connected to a receiver R and the operation op0 is started, the following shall occur:

  1. Connect predecessor to a new PredReceiver and start the op1
  2. op1's start function queries PredReceiver for semaphores. The wait semaphore query will be forwarded to R, the signal query will return a semaphore that is created in op0.
  3. op1's start calls work_started on PredReceiver.
  4. PredReceiver's work_started connects successor with SuccReceiver and starts op2.
  5. op2 queries SuccReceiver for semaphores. The wait semaphore is provided by op0, the signal semaphore query is forwarded to R.
  6. op2 calls either work_started or set_value on SuccReceiver (for GPU-only successors and GPU-to-CPU ones respectively), all of which get forwarded to R.

Furthermore, let_work_started could receive the semaphore for op0 as an input as an opportunity for resource reuse (creating semaphores is costly and should NOT be done every frame in a real time rendering application).

A similar construction would be needed for fences to terminate a GPU async chain.

All of this is fine and dandy, but another problem awaits. There's one very important async call in Vulkan that does not behave as well as the others: vkQueuePresentKHR. It has "wait" semaphores, but has neither "signal" semaphores nor "signal" fences, so this async call is literally unjoinable. The way synchronization is done for this call is described in the next paragraph, but if we wrap this call into a sender, its behavior will be very strange. It cannot be chained with any further GPU async calls, but it can be connected to something that uses work_started, i.e. some kind of "discard" receiver.

Another problematic and related call is vkQueueSubmit. It can signal both semaphores and fences, and that is exactly the way it is usually used: the semaphore signal is used to chain it with vkQueuePresentKHR, the fence signal is the one that's used to synchronize with CPU and start the next frame rendering job (several rendering jobs might be in flight at the same time by the way). In fact, the waiting for vkQueuePresentKHR does occur on GPU side, but using a totally different mechanism: the GPU job started by vkQueueSubmit internally waits for vkQueuePresentKHR via GPU magic.

How would one go about chaining vkQueueSubmit with both vkQueuePresentKHR and a CPU receiver that'll start the next frame? A potential way to do this would be adding a way to transform a work_started signal into a set_value signal and doing something like let_work_started(vkQueueSubmit, when_all(CPUSender, work_started_to_set_value(vkQueuePresentKHR))), but that would require when_all's receiver to perfectly forward the new CPOs, which it does not support.

So all in all, using S&R for GPU async seems possible, but very quirky. I would very much like to receive some feedback on my ideas and decide whether this idea is worth pursuing further. In the end, a rendering subsystem is generally pretty isolated from everything else, and doing ad-hoc concurrency there wouldn't impact the rest of the system much, especially considering that GPU synchronization is so finicky compared to traditional CPU stuff.

kirkshoop commented 2 years ago

If when_all is not forwarding arbitrary 'query' CPOs, that would be a bug (query as in CPOs that take the target by const&).

I am not at all familiar with Vulkan. After reading this, I spent about an hour looking at a Vulkan nbody example.

I think that I would prefer to start by understanding what the ideal usage of a sender/receiver expression would look like (ignoring the specific sender/receiver signals initially). Once we find the right expression, then we can map that to the signals.

Mrkol commented 2 years ago

AFAIK nbody is an example of compute capabilities usage, while I am more interested in rendering capabilities of vulkan. A good example of what I am interested in is this: https://github.com/Overv/VulkanTutorial/blob/master/code/15_hello_triangle.cpp#L643.

Let me briefly explain what all the relevant moving parts do here. The OS provides us with a resource called a "swapchain". It consists of N images. At any time some of these images might be in use by the user application, while some might be in use by the OS for presentation (pushing out images to the actual display). Every frame, the following steps need to occur (let's call them render_frame):

  1. Ask Vulkan to acquire an image from the swapchain (vkAcquireNextImageKHR). This call returns the index of the image that will be acquired, but does not block until it is available; it signals a semaphore asynchronously instead. (This differs greatly from usual async calls, as it immediately returns a reference to the result, yet you cannot use the result directly, only asynchronously.)
  2. Use that image for rendering (vkQueueSubmit)
  3. Tell vulkan that we are done rendering and the image can be presented to the screen (vkQueuePresentKHR).

All of these steps are asynchronous and might be run simultaneously, so we need explicit synchronization with semaphores to sequence them. The intention is to call this render_frame procedure in an infinite main loop.

If we try to follow this plan without any further synchronization, the GPU will get overwhelmed with incoming work, and we will also be creating new semaphores without limit. We need to somehow limit the GPU concurrency, i.e. limit the number of frame rendering jobs that are currently submitted but not finished. These are called "in-flight frames". Another problem with acquiring unlimited images is the way windowing systems actually work. I honestly have no idea why, but the Vulkan spec says that a well-behaved cross-platform application should never acquire more than N - C swapchain images, where C is platform dependent. These considerations force us to balance the number of in-flight frames against the swapchain size so that this situation does not occur. As far as I can tell, the example linked above does not respect this limitation of Vulkan and can acquire more images than allowed if we set MAX_FRAMES_IN_FLIGHT to anything greater than 2.

As for inflight frame synchronization, this example creates a fence for each frame in flight (inFlightFences array). These fences form a cyclic buffer and allow us to reuse resources between frames with the same cyclic buffer approach (e.g. the semaphores used for GPU-side synchronization). They are waited for before submitting a new frame rendering job and signalled by the second step (vkQueueSubmit) when it is finished.

Furthermore, the OS might return the same swapchain image twice in a row for some bizarre reason (it's implementation defined), so we not only need to wait on the current in-flight fence, but also on the in-flight fence that was used the previous time this particular swapchain image was acquired. This is done via the imagesInFlight reference array, indexed by swapchain image index.

In order to "jobify" this algorithm, the only thing we need is to get rid of the blocking calls, namely vkWaitForFences (which can block on OS access but, as stated above, not on GPU access) and vkAcquireNextImageKHR. As I stated in the original post, currently the only way to do so is to dedicate a pool of "sleeper" threads to these blocking tasks and delegate the work of waiting to them, while keeping the main application threadpool block-free.

How would I like for this algorithm to look with unifex and structured concurrency? Well, that's a hard question. If we disregard imagesInFlight, something like this maybe?

auto render_frame(size_t inflight_idx) {
    return
        acquire_next_image_async(swapchain)
        | chain(imageAvailableSemaphores[currentFrame],
            [](uint32_t image_idx)
            {
                return when_all(queue_submit_async(...), just(image_idx));
            })
        | chain(renderFinishedSemaphores[currentFrame],
            [](uint32_t image_idx)
            {
                return present_async(image_idx);
            })
        | chain( // no semaphore here, only "work started" chaining
            []()
            {
                return acquire_next_image_async(swapchain);
            });
};

task<void> main_loop() {
   // see #355 
   static_scope<???, 32> frames_scope(actual_swapchain_size);
   size_t frame_idx = 0;
   while (true) {
       co_await frames_scope.spawn(render_frame(frame_idx++ % MAX_IN_FLIGHT_FRAMES));
   }
}

Ideally, I would like to use coroutines, but that wouldn't let us specify semaphores in let_work_started/chain.

task<void> render_frame(size_t inflight_idx)
{
    auto& fence = inFlightFences[inflight_idx];

    // awaits a blocking OS call
    auto swapchain_image_idx = co_await acquire_next_image_async(g_swapchain);

    auto& image_fence = imagesInFlight[swapchain_image_idx];

    if (image_fence) {
        // awaits a blocking wait for GPU call
        co_await wait_fence_async(image_fence); // (2)
    }

    image_fence = fence;

    // ...

    // does not actually wait for anything, only links with the image acquiring via a semaphore somehow
    co_await queue_submit_async(...); // (3)

    // does not wait, links with the previous call via a semaphore
    co_await present_async(...);

    // waits for GPU to finish this rendering job 
    co_await wait_fence_async(fence);

    // when the coroutine finishes, the main loop will be able to spawn a new frame operation on the static scope
}

P.S. I am honestly starting to doubt whether this endeavour is worth it. Most of the actually useful async work is done in Vulkan's GPU command buffers, and I am afraid to even imagine what wrapping those in S&R would look like, let alone whether it would be worth it.

Mrkol commented 2 years ago

I have just noticed that in #13 Lewis Baker states that unifex aims to support chaining IO operations kernel-side (presumably with io_uring). Implementing that is very similar to what I am trying to do with Vulkan operation chaining, and the same problem arises: a different channel is needed for chaining at start time (as opposed to chaining on completion). As far as I can tell, the io_uring context currently does not implement this, so if someone who is proficient with io_uring could try implementing it with a new work_started channel and a receiver query for the current IO operation buffer (similar to my semaphore queries), and see whether this approach works, that would be wonderful.