Queue, Command Buffer & Fence Systems

fcturan20 commented 2 years ago

To make TGFX a lower-level API, we need to expose queue, command buffer and fence functionality to the user/RG. Design, implementation & documentation progress will be tracked in this issue until the feature is fully tested & finalized.

Queues:

D3D12 & Vulkan have a concept mismatch about queues. For simple applications that submit CmdBuffers in execution order, Vulkan behaves the same as D3D12 as long as the D3D12 backend executes “CmdQueue->Wait(), CmdQueue->ExecuteCommandLists(), CmdQueue->Signal()” in this order.

Vulkan:

Vulkan has implicit ordering between submits: a later submit is processed only after the previous one has finished processing (which is not the same as having finished execution). A submit is processed only after all of its wait semaphores are satisfied, so if a previous submit is waiting on semaphores, later submits wait too. Between different vkQueueSubmit calls, there is no guarantee that the previous call’s execution has finished before the later one starts processing. Different types of queue operations should be synchronized with binary semaphores.

To run multiple operations at the same time, all we need to do is submit them with the same (or zero) semaphores and order the submits well.

D3D12:

A queue is a sequential execution engine. Between different ExecuteCmdList calls, it is guaranteed that all cmds of the previous call have finished executing before the later ExecuteCmdList call is processed. This also holds across different types of queue operations. For example, ExecuteCmdList followed by Present works without any semaphore, because all CmdBuffer execution finishes before Present is processed.

To run multiple operations at the same time, you have to use multiple queues (even in the same queue family). Each submit should be as small as possible. With the multi-queue call order shown under EXAMPLES, unrelated cmdBuffers execute at the same time as the CommonSwapchain ones but are synchronized right before Present.

EXAMPLES:

Single Queue, Single Family → Vulkan best, DX12 worst: ExecuteSpecificUnrelatedCmdBuffers-Signal, ExecuteCommonSwapchainRelatedCmdBuffers-Signal, Wait-ExecuteSwapchainRelatedCmdBuffers-Signal, Wait-Present.

Multiple Queue, Single Family → Works best on both APIs: ExecuteCommonSwapchainRelatedCmdBuffers-QueueG, ExecuteSpecificUnrelatedCmdBuffers-QueueC, Signal-QueueC, ExecuteSwapchainRelatedCmdBuffers-QueueG, Wait-QueueG, Present-G.
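The difference between the two call orders can be put into numbers with a small idealized model. This is only a sketch: the `Workload` struct, both function names and the millisecond figures are hypothetical, and it assumes the GPU can overlap work from two queues perfectly and that a single queue executes strictly serially (the D3D12 behaviour described above).

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical GPU durations (in milliseconds) of the three cmdBuffer groups
// used in the examples above.
struct Workload {
    double unrelated;  // SpecificUnrelatedCmdBuffers
    double common;     // CommonSwapchainRelatedCmdBuffers
    double swapchain;  // SwapchainRelatedCmdBuffers
};

// Single strictly serial queue: every group runs after the previous one.
double singleSerialQueueMs(const Workload& w) {
    return w.unrelated + w.common + w.swapchain;
}

// Two queues: QueueG runs common then swapchain work, QueueC runs the
// unrelated work in parallel, and QueueG waits on QueueC right before Present.
double twoQueueMs(const Workload& w) {
    return std::max(w.common + w.swapchain, w.unrelated);
}
```

With 4 ms of unrelated work, 3 ms of common work and 2 ms of swapchain work, the serial queue reaches Present after 9 ms, while the two-queue order reaches it after 5 ms in this model.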

DIFFERENCE:

To get the best GPU performance, we have to use multiple queues per family in both APIs. Vulkan has an extra feature: a single queue may overlap executions too (subject to submission order & semaphore constraints). This makes multiple queues unnecessary in general, but the behaviour is hard to understand as a whole, so most users won’t take advantage of it (or even understand it). Forcing users to use multiple queues is better (the DX12 way). Lots of submissions hurt CPU performance on VK, but this can be handled by batching them into a single QueueSubmit call.

HOW IT WILL BE IMPLEMENTED:

The Queue Submission API should be the same as D3D12's (a linearly executed list of commands; there is no submission list and no wait/signal semaphore list). But we need to make sure that a later submission isn’t processed before the previously sent one has finished execution. We can achieve this in VK by adding an extra semaphore that waits for the previously sent submission to finish execution. The same works for presentation as well. If users want the best performance, they’ll use multiple queues per family. Because processing and execution of cmdBuffers are synchronized the same way in both APIs, inter-queue sync behaves the same in both APIs too.

So if the user wants the best CPU & GPU performance: call ExecuteCmdLists as few times as possible and use as many queues (same family or not) as needed, but no more, because extra queues decrease CPU performance in VK.
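The chaining idea can be sketched without any graphics API. This is not TGFX's actual code: `SubmitRecord` and `SerialQueue` are hypothetical names, and the "extra semaphore" is modelled here as one monotonically increasing counter per queue (timeline style), which is an assumption. Each execute call waits for the value signaled by the previous call, and `flush` hands everything over in one batch, the way several submits can be passed to a single vkQueueSubmit call.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical record of one backend submit (models the wait/signal part only).
struct SubmitRecord {
    std::vector<std::string> cmdBuffers;
    uint64_t waitValue;    // chain value that must be reached before this submit starts
    uint64_t signalValue;  // chain value signaled once this submit finished executing
};

// Hypothetical D3D12-style front end: callers never see the chain semaphore.
class SerialQueue {
public:
    // Records one ExecuteCmdLists-style call and chains it to the previous one.
    SubmitRecord execute(std::vector<std::string> cmdBuffers) {
        SubmitRecord rec;
        rec.cmdBuffers = std::move(cmdBuffers);
        rec.waitValue = lastSignaled_;      // 0 means "nothing to wait for"
        rec.signalValue = ++lastSignaled_;  // next call will wait for this value
        pending_.push_back(rec);
        return rec;
    }

    // Returns all recorded submits as one batch and clears the pending list.
    std::vector<SubmitRecord> flush() {
        std::vector<SubmitRecord> batch;
        batch.swap(pending_);
        return batch;
    }

private:
    uint64_t lastSignaled_ = 0;
    std::vector<SubmitRecord> pending_;
};
```

Because every record waits on the value its predecessor signals, a Vulkan queue driven this way would give the same "previous call finished before the next is processed" guarantee that the D3D12 section describes, at the cost of removing Vulkan's single-queue overlap.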

There is no need to support QueueFamilyOwnershipTransfer until DX12 is implemented. All resources should be shared by all queue families. After DX12 is implemented, we can find a common ground.

fcturan20 commented 2 years ago

5ce1f42c31678c37130470430629978b78706c69 implemented:

TGFX's design for queues and fences. Fences can be set and queried from the CPU (setFence & getFenceValue) and waited on/signaled from the GPU (queueFenceWaitSignal).
As VK doesn't support such fences natively, the backend uses the Timeline Semaphores extension to implement the TGFX fence. But swapchain operations (present & acquire) don't support timeline semaphores, so we need to write a converter.
TGFXVulkan chooses a queue (called the internal queue) to do the binary -> timeline semaphore conversion.
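The fence semantics described above can be modelled on the CPU as a counter that only moves forward. This is a sketch under stated assumptions, not the TGFX implementation: `TimelineFence` and its method names are hypothetical (only setFence, getFenceValue and queueFenceWaitSignal come from the commit), and ignoring a non-increasing signal is a choice made for this model, whereas Vulkan timeline semaphores require each signaled value to be greater than the current one.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical CPU-side model of a timeline-style fence.
class TimelineFence {
public:
    explicit TimelineFence(uint64_t initial = 0) : value_(initial) {}

    // Models a getFenceValue-style query.
    uint64_t getValue() const { return value_.load(std::memory_order_acquire); }

    // Models a setFence-style signal. Returns false (and changes nothing)
    // when newValue is not greater than the current value.
    bool signal(uint64_t newValue) {
        uint64_t cur = value_.load(std::memory_order_relaxed);
        while (cur < newValue) {
            // On failure, cur is reloaded with the latest value and the
            // loop condition is checked again.
            if (value_.compare_exchange_weak(cur, newValue,
                                             std::memory_order_release,
                                             std::memory_order_relaxed)) {
                return true;
            }
        }
        return false;
    }

    // A wait for `target` is satisfied once the counter has reached it.
    bool reached(uint64_t target) const { return getValue() >= target; }

private:
    std::atomic<uint64_t> value_;
};
```

A blocking wait is left out on purpose; a caller would poll `reached` or, in a real backend, use the API's own wait call.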

fcturan20 commented 2 years ago

a6583344057b23de58fa1755c505962cebd6f60d:

1. Timeline -> binary semaphore conversion using the Present's queue. Ideally, present shouldn't wait for anything (it will already wait for the queue's last CmdExecuteList).
2. Presenting, waiting on the binary semaphores defined in Step 1.
3. Acquiring the next image, signaling the image's binary semaphore (in WINDOW_VKOBJ).
4. Submit: Waits = these image binary semaphores, Signals = queueFenceWaitSignal's Signals. This submit is sent to the internal queue.
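The order of these operations matters because a binary semaphore has to be signaled before it is waited on, and each wait resets it. The following model walks through one present/acquire round and asserts that rule at every step. Everything in it is hypothetical (`BinarySemaphore`, `WindowModel`, `presentThenAcquire`, and the single `userFence` counter standing in for the TGFX fences); it only illustrates the sequence described in this comment.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical model of a binary semaphore: one signal, then one wait that resets it.
struct BinarySemaphore {
    bool signaled = false;
    void signal() { assert(!signaled); signaled = true; }
    void wait()   { assert(signaled);  signaled = false; }
};

// Hypothetical per-window state (the source keeps the acquire semaphore in WINDOW_VKOBJ).
struct WindowModel {
    BinarySemaphore presentReady;   // signaled by the timeline -> binary conversion
    BinarySemaphore imageAcquired;  // signaled by the acquire
    uint64_t userFence = 0;         // stand-in for the TGFX fence value
};

// Runs the conversion, present, acquire and internal-queue submit once.
std::vector<std::string> presentThenAcquire(WindowModel& w,
                                            uint64_t waitValue,
                                            uint64_t signalValue) {
    std::vector<std::string> log;
    assert(w.userFence >= waitValue);  // the timeline wait must already be satisfied
    w.presentReady.signal();
    log.push_back("present queue: timeline -> binary");
    w.presentReady.wait();
    log.push_back("present");
    w.imageAcquired.signal();
    log.push_back("acquire");
    w.imageAcquired.wait();
    if (signalValue > w.userFence) w.userFence = signalValue;  // fence only moves forward
    log.push_back("internal queue: binary -> timeline");
    return log;
}
```

After one round both binary semaphores are back in the unsignaled state, so the same pair can be reused for the next frame.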

fcturan20 commented 2 years ago

1a4d161e688d8f751020d7a0a85fc0a33b770eb2: All swapchain textures are presented at swapchain creation time and moved to VK_IMAGE_LAYOUT_GENERAL after that. The code needs refactoring, but creating the swapchain only once works fine right now. As this is DX12's default behaviour, we're making sure that Vulkan is on the same page. In the future, we may use extra functionality in the VK backend to avoid the unnecessary transition to the GENERAL layout. Currently on my device: all presentation modes (FIFO, Mailbox, Immediate) work with 2-3 swapchain textures without any validation layer error (not even in synchronization validation mode).

fcturan20 / TuranLibraries

Queue, Command Buffer & Fence Systems #5
