
Can we get more queues? #1473

Closed · MennoVink closed this issue 2 years ago

MennoVink commented 3 years ago

Can we get more queues, especially in the graphics family?

I'm in a situation where I'm heavily bottlenecked by queue submissions: 45% of my frame time is spent transcoding Vulkan commands to Metal. The problem I have with this is that queue access must be synchronized, which means two things (see the sketch after this list):

  1. Even if I offload submits to be asynchronous and can submit full-time, this would still be too slow compared to Windows' performance. Even at twice the speed I would still be bottlenecked by the transcoding.
  2. I can't submit transfer acquire barriers in parallel with graphics command transcoding, so my background resource-loading threads are blocked waiting for the queue to unlock.
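
To make the contention concrete, here's a minimal sketch of what every submitting thread ends up doing; the names are hypothetical, not from my actual code:

```cpp
#include <mutex>
#include <vulkan/vulkan.h>

std::mutex gQueueMutex;    // hypothetical app-side lock
VkQueue    gGraphicsQueue; // the single queue MoltenVK exposes per family

// With one queue, the render thread and resource loaders all funnel through
// the same lock, and under MoltenVK vkQueueSubmit is where Vulkan commands
// are transcoded to Metal, so this critical section is expensive.
void submitLocked(const VkSubmitInfo& info, VkFence fence) {
    std::lock_guard<std::mutex> lock(gQueueMutex);
    vkQueueSubmit(gGraphicsQueue, 1, &info, fence);
}
```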

With multi-queue support I'm expecting that these queues can transcode in parallel before they have to synchronize on the Metal queue for submitting. That way my resource loaders only have to take care of transcoding their own command buffers, and they don't have to wait until the graphics command buffer that's currently being submitted is done transcoding.

I have seen the prefillMetalCommandBuffers option. While it does seem to allow parallel transcoding, it introduces quite a few limitations. Maybe more importantly, it imposes quite a restrictive design: the recording would have to be done on separate threads, which in my case is overkill just to record a few simple command buffers, especially since this same system running on Windows has no issue at all.
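
For reference, a hedged sketch of how that option is toggled through MoltenVK's private configuration API in vk_mvk_moltenvk.h; I'm going by the field and function names in the header, and the same switch can also be set via the MVK_CONFIG_PREFILL_METAL_COMMAND_BUFFERS environment variable:

```cpp
#include <MoltenVK/vk_mvk_moltenvk.h>

// Sketch: with prefill enabled, Vulkan commands are transcoded to Metal at
// record time instead of inside vkQueueSubmit.
void enablePrefill(VkInstance instance) {
    MVKConfiguration cfg;
    size_t cfgSize = sizeof(cfg);
    vkGetMoltenVKConfigurationMVK(instance, &cfg, &cfgSize);
    cfg.prefillMetalCommandBuffers = VK_TRUE;
    vkSetMoltenVKConfigurationMVK(instance, &cfg, &cfgSize);
}
```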

billhollings commented 3 years ago

Can you clarify a few points, please?

The purpose of prefillMetalCommandBuffers was to allow parallel transcription onto multiple Metal command buffers, which seems to be your bottleneck. It does come with some restrictions, and it is the primary reason why MoltenVK only supports one queue per queue family: a Metal command buffer has to originate from the queue it will execute on, whereas Vulkan only applies this restriction at the queue family level.

The recording would have to be done on separate threads, which in my case is kind of overkill

If recording commands using parallel threads is not in the picture, I'm not sure I understand where you envision the additional parallelization coming from.

With multi-queue support i'm expecting that these queues can transcode in parallel before they have to synchronise on the metal queue for submitting.

Can you clarify what you mean by that? Are you envisioning multiple Vulkan queues, but only one Metal queue? Or are you looking for multiple Metal queues?

If the former, it's possible we might be able to associate a single Metal queue with a Vulkan queue family, and allow multiple Vulkan queues to transcribe to it via parallel submits, as long as the app was able to sync those submits appropriately to avoid race conditions.

If the latter, as long as prefillMetalCommandBuffers was not enabled, we could support more queues (both Vulkan and Metal) per queue family, and the app would handle all the sync between submissions across multiple Metal queues.

Would you be able to lay out an order of operations, so we can understand the flow you are envisioning across the multiple Vulkan queues you're requesting?

MennoVink commented 3 years ago

I'm not using prefillMetalCommandBuffers. I think if you keep one Metal queue for each Vulkan queue family, then you can still have multiple Vulkan queues. Command buffers are created from pools, and pools are created for families, so command buffers have a one-to-one relationship with a queue family, and you can always submit them to that family's Metal queue.
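
That relationship is fixed at pool creation; a quick illustration in standard Vulkan, with illustrative names:

```cpp
#include <vulkan/vulkan.h>

// A command pool is bound to one queue family at creation; every command
// buffer allocated from it may only be submitted to queues of that family.
VkCommandPool makeGraphicsPool(VkDevice device, uint32_t graphicsFamilyIndex) {
    VkCommandPoolCreateInfo poolInfo{};
    poolInfo.sType            = VK_STRUCTURE_TYPE_COMMAND_POOL_CREATE_INFO;
    poolInfo.queueFamilyIndex = graphicsFamilyIndex; // the family tie-in
    VkCommandPool pool = VK_NULL_HANDLE;
    vkCreateCommandPool(device, &poolInfo, nullptr, &pool);
    return pool;
}
```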

- The additional parallelization comes from parallel submits, not recordings.
- I'm looking for multiple Vulkan queues, submitting to their family's single Metal queue (sketched below).
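
On the Vulkan side this is just the standard multi-queue device setup; counts and names below are illustrative, and as noted above MoltenVK currently advertises only one queue per family:

```cpp
#include <vulkan/vulkan.h>

// Hypothetical setup: request several queues in the graphics family at
// device creation, then hand each submit thread its own VkQueue.
VkDeviceQueueCreateInfo makeQueueRequest(uint32_t graphicsFamilyIndex) {
    static const float priorities[3] = {1.0f, 0.5f, 0.5f};
    VkDeviceQueueCreateInfo qci{};
    qci.sType            = VK_STRUCTURE_TYPE_DEVICE_QUEUE_CREATE_INFO;
    qci.queueFamilyIndex = graphicsFamilyIndex;
    qci.queueCount       = 3; // render submits + background loaders
    qci.pQueuePriorities = priorities;
    return qci; // goes into VkDeviceCreateInfo::pQueueCreateInfos
}

// After vkCreateDevice, each submit thread gets a queue of the same family:
//   vkGetDeviceQueue(device, graphicsFamilyIndex, 0, &renderQueue);
//   vkGetDeviceQueue(device, graphicsFamilyIndex, 1, &secondaryQueue);
```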

[sequence diagram omitted] Note the App/MVK prefix in each lane, describing where each step is implemented.

The top left shows how I expect this to affect rendering. The RenderThread dispatches submits to a different thread and then continues recording on its own thread. If the render thread submits before the previous asynchronous submit is done (i.e. still transcribing), a second thread picks it up. That second thread finds another free Vulkan queue in the graphics family and submits to that, so this second Vulkan queue's transcribing executes in parallel with the previous submit's transcribing. Submit threads may execute out of order, so for synchronization I'm using timeline semaphores, which allow wait operations to be submitted before signal operations. I'm not sure whether Metal also provides this property; if it does, the Vulkan queues can submit the Metal command buffer as soon as they're ready, otherwise they need to delay the submit on the CPU side until the appropriate signal has been reached.
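
A hedged sketch of the wait-before-signal property I'm relying on; queue handles and values are illustrative, and it assumes Vulkan 1.2 timeline semaphores (or VK_KHR_timeline_semaphore):

```cpp
#include <vulkan/vulkan.h>

// Queue B submits a wait on value 1 *before* queue A submits the matching
// signal. Timeline semaphores allow this; binary semaphores do not.
void waitBeforeSignal(VkQueue queueA, VkQueue queueB, VkSemaphore timelineSem) {
    const uint64_t value = 1;

    VkTimelineSemaphoreSubmitInfo waitValues{};
    waitValues.sType                   = VK_STRUCTURE_TYPE_TIMELINE_SEMAPHORE_SUBMIT_INFO;
    waitValues.waitSemaphoreValueCount = 1;
    waitValues.pWaitSemaphoreValues    = &value;

    VkPipelineStageFlags stage = VK_PIPELINE_STAGE_ALL_COMMANDS_BIT;
    VkSubmitInfo waitSubmit{};
    waitSubmit.sType              = VK_STRUCTURE_TYPE_SUBMIT_INFO;
    waitSubmit.pNext              = &waitValues;
    waitSubmit.waitSemaphoreCount = 1;
    waitSubmit.pWaitSemaphores    = &timelineSem;
    waitSubmit.pWaitDstStageMask  = &stage;
    vkQueueSubmit(queueB, 1, &waitSubmit, VK_NULL_HANDLE); // submitted first

    VkTimelineSemaphoreSubmitInfo signalValues{};
    signalValues.sType                     = VK_STRUCTURE_TYPE_TIMELINE_SEMAPHORE_SUBMIT_INFO;
    signalValues.signalSemaphoreValueCount = 1;
    signalValues.pSignalSemaphoreValues    = &value;

    VkSubmitInfo signalSubmit{};
    signalSubmit.sType                = VK_STRUCTURE_TYPE_SUBMIT_INFO;
    signalSubmit.pNext                = &signalValues;
    signalSubmit.signalSemaphoreCount = 1;
    signalSubmit.pSignalSemaphores    = &timelineSem;
    vkQueueSubmit(queueA, 1, &signalSubmit, VK_NULL_HANDLE); // signal arrives second
}
```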

The right and bottom show how I think this would affect background resource-loading threads. While the render thread holds a lock on Vulkan graphics queue 1 during its present (which takes long due to vsync), the resource loader can freely submit an acquire barrier to the secondary graphics queue. I don't know how vsync works in MVK, but if the eventual submit to the Metal graphics queue needs to be delayed until after the vsync, that would be fine. Note that the SubmitAsync calls are signals; they don't block on execution, and instead return a std::future-like object. For my resource loaders I'm using fences to poll when transfers are done; if the Metal submit needs to be delayed to the next frame, that fence will just keep returning not-ready for a bit longer. What is important here is that the transcribing can still be done in parallel with vkQueuePresent and other resource loaders running simultaneously.
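
The polling itself is just a fence status check (the std::future-like wrapper is my own abstraction around something like this):

```cpp
#include <vulkan/vulkan.h>

// Returns true once the transfer submission has completed on the GPU. If
// MoltenVK were to defer the Metal submit to the next frame, this would
// simply return false for a few more polls.
bool transferDone(VkDevice device, VkFence transferFence) {
    return vkGetFenceStatus(device, transferFence) == VK_SUCCESS;
}
```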

This flow only shows two graphics queues, but I'd probably need more. If you look at the rendering section, you'll see that a third async submit would need a third submit thread plus Vulkan queue to avoid blocking on the first two submits. In my case I have 5 passes, but this is application specific. The number of required queues also depends on the cost of recording versus transcribing; if recording were free, it would need a Vulkan queue for each submit. For the resource-loader threads, however, I have as many threads as the machine has cores, i.e. 28 on a Mac Pro. This is configured for Windows, where almost all time is spent on file I/O, decoding, decompression, compression and encoding. On macOS I can probably get by with fewer resource loaders, as the async submit threads will fill a couple of cores with transcribing.

MennoVink commented 3 years ago

Hmm, after working with timeline semaphores on a multi-queue system a bit more, I think I'm running into the note mentioned here: https://www.khronos.org/registry/vulkan/specs/1.2-extensions/html/vkspec.html#synchronization-implicit. I'm not sure, but it seems that timeline signals cannot be submitted out of order. That prevents parallel transcribing of multiple submits within a single frame from working this way, because it means submissions must still be ordered: a submission can only be made after the last one has finished, which prevents any parallelism.
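
Under that reading, every per-pass submit has to reach the queue in signal-value order, so the submit threads degenerate into something like this (names illustrative):

```cpp
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <vulkan/vulkan.h>

std::mutex              gOrderMutex;
std::condition_variable gOrderCv;
uint64_t                gNextSignalValue = 1;

// If timeline signals must reach vkQueueSubmit in increasing-value order,
// a pass whose transcoding input is ready early still has to queue up
// behind its predecessors, so the transcoding inside vkQueueSubmit
// effectively serializes again.
void submitPassInOrder(VkQueue queue, const VkSubmitInfo& info,
                       uint64_t passSignalValue) {
    std::unique_lock<std::mutex> lock(gOrderMutex);
    gOrderCv.wait(lock, [&] { return gNextSignalValue == passSignalValue; });
    vkQueueSubmit(queue, 1, &info, VK_NULL_HANDLE); // transcoding happens here
    ++gNextSignalValue;
    gOrderCv.notify_all();
}
```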

Even if multiple queues can't be used to optimize transcribing, having at least two could still be beneficial. The problem that resource loaders can't submit while the queue is locked for presentation still holds: when a resource-loading thread has a command buffer created with prefillMetalCommandBuffers enabled ready to submit, it must still access a queue that is synchronized on access. The vsync behaviour I'm seeing is currently like this: [diagram omitted]

MennoVink commented 2 years ago

@billhollings Initially I made this request because I thought it could be used for parallel command recording. That wasn't really possible, because I had misunderstood timeline semaphores. I then kept the issue open because I was still doing presentation myself, which interfered with async transfers. I've since moved to display links, however, so for me this is a non-issue now. Think we should close it?

billhollings commented 2 years ago

so for me this is a non-issue now. Think we should close it?

Closing for now.