gfx-rs / wgpu

A cross-platform, safe, pure-Rust graphics API.
https://wgpu.rs
Apache License 2.0

Add support for transfer queues so assets can be loaded concurrently with rendering. #5576

Open John-Nagle opened 4 months ago

John-Nagle commented 4 months ago

Is your feature request related to a problem? Please describe.

Right now, you can only have one command queue to talk to WGPU, and the main thread does all the work. This cuts frame rates way down when trying to load assets into the GPU from multiple threads. I can get 60 FPS when other threads are not loading assets, but that drops to 10 FPS or less when assets are being loaded.

Here's a video of an example. This is my Sharpview metaverse client, which uses Rend3 and WGPU.

https://video.hardlimit.com/w/7usCE3v2RrWK6nuoSr4NHJ

Look at all that highly detailed content. It's not preloaded. Assets are being frantically loaded into the GPU ahead of the player's travels through a huge world. 16 threads handle the content loading in parallel with rendering, and that works fine. But the final step, where the data gets copied into the GPU, is done by WGPU in the render thread. This slows the frame rate way down, sometimes below 10 FPS. If the player stays in one place, the asset loaders soon catch up and the frame rate returns to the normal 60 FPS.

This affects both Bevy and Rend3.

More discussion at https://github.com/gfx-rs/wgpu/discussions/5525

Describe the solution you'd like

Support for transfer queues, as provided by Vulkan, DX12, and Metal. (WebAssembly and Android may not support this yet.)

A transfer-only queue has simple interlocking and is fast, because a data transfer to an empty buffer doesn't depend on anything else happening first. That's easier to implement than multiple render queues. I'd rather have transfer queues working in a few months than multiple render queues working in a few years.

From the Vulkan dev guide:

Data upload is another section that is very often multithreaded. In here, you have a dedicated IO thread that will load assets to disk, and said IO thread will have its own queue and command allocators, hopefully a transfer queue. This way it is possible to upload assets at a speed completely separated from the main frame loop, so if it takes half a second to upload a set of big textures, you don’t have a hitch. To do that, you need to create a transfer or async-compute queue (if available), and dedicate that one to the loader thread. Once you have that, it’s similar to what was commented on the pipeline compiler thread, and you have an IO thread that communicates through a parallel queue with the main simulation loop to upload data in an asynchronous way. Once a transfer has been uploaded, and checked that it has finished with a Fence, then the IO thread can send the info to the main loop, and then the engine can connect the new textures or models into the renderer.

Unreal Engine has been doing that since UE4.

Describe alternatives you've considered

Additional context

I have a test case at https://github.com/John-Nagle/render-bench. It tests how badly main-thread rendering is impacted by what other threads are doing with assets.

JMS55 commented 4 months ago

Bypassing WGPU and using a transfer queue that WGPU doesn't see.

That's a fairly feasible option in the interim. For Vulkan, for example, you would:

John-Nagle commented 4 months ago

That's an option, but a desperation one. I'd hate to have to dig into the innards of WGPU's allocation system. Also, I'd be giving up macOS support, which is the main point of using WGPU rather than Vulkan directly. I saw a note that the Bevy devs are considering such a bypass. If they do it, I'll have an example to look at.

Because I'm using WGPU via Rend3, I'd need my own version of Rend3, too.

Another alternative is to fork Rend3, rip out the connection to WGPU, and replace that with Vulkano. That would be a relatively clean and safe Rust solution. Rend3 has a well-designed, clean API, and that's worth retaining.

These are all ugly hacks. Better if transfer queues are implemented inside WGPU, where they belong.

John-Nagle commented 4 months ago

Here's the proposed Bevy workaround for this problem: a plan to bypass WGPU and go directly from Bevy to Vulkan. Comment from the Bevy issue: "This is a bit hacky, and relying on globals in the form of static OnceLock-ed variables, but may be reasonable until wgpu supports multiple queues."

John-Nagle commented 4 months ago

JMS55 commented: "Background thread creates a command buffer from the pool and records a vkCmdCopyBuffer from staging to device local (it might be a good idea to load several assets at once, and use multiple copy commands per command buffer, otherwise you'll have a ton of submits and individual command buffers which is bad for performance, but I'm not sure how engines tend to structure this)"

That raises a good question for implementation, regardless of where this is implemented: how expensive is submit? Is it expensive enough on transfer queues that minimizing submit operations is worth it?

There are at least two ways to approach this:

Simple way:

This depends on Submit being reasonably fast compared to, say, loading a 1 MB texture.

Complicated way:

This potentially has higher performance, especially for single-threaded programs where asset-loading requests come from the same thread that does the rendering. It's unclear whether the added complexity is worth it.

I'd be fine with the simple approach, unless Submit is really slow.