gfx-rs / wgpu

A cross-platform, safe, pure-Rust graphics API.
https://wgpu.rs
Apache License 2.0

Add support for transfer queues so assets can be loaded concurrently with rendering. #5576

Open John-Nagle opened 4 months ago

John-Nagle commented 4 months ago

Is your feature request related to a problem? Please describe.

Right now, you can only have one command queue to talk to WGPU, and the main thread does all the work. This cuts frame rates way down when trying to load assets into the GPU from multiple threads. I can get 60 FPS when other threads are not loading assets, but that drops to 10 FPS or less when assets are being loaded.

Here's a video of an example. This is my Sharpview metaverse client, which uses Rend3 and WGPU.

https://video.hardlimit.com/w/7usCE3v2RrWK6nuoSr4NHJ

Look at all that highly detailed content. It's not preloaded. Assets are being frantically loaded into the GPU ahead of the player's travels through a huge world. 16 threads handle the content loading in parallel with rendering, and that works fine. But the final step, where the data gets copied into the GPU, is done by WGPU in the render thread. This slows the frame rate way down, sometimes below 10 FPS. If the player stays in one place, the asset loaders soon catch up and the frame rate returns to the normal 60 FPS.

This affects both Bevy and Rend3.

More discussion at https://github.com/gfx-rs/wgpu/discussions/5525

Describe the solution you'd like

Support for transfer queues, as provided by Vulkan, DX12, and Metal. (WebAssembly and Android may not support this yet.)

A transfer-only queue has simple interlocking and is fast, because a data transfer to an empty buffer doesn't depend on anything else happening first. That's easier to implement than multiple render queues. I'd rather have transfer queues working in a few months than multiple render queues working in a few years.

From the Vulkan dev guide:

Data upload is another section that is very often multithreaded. In here, you have a dedicated IO thread that will load assets to disk, and said IO thread will have its own queue and command allocators, hopefully a transfer queue. This way it is possible to upload assets at a speed completely separated from the main frame loop, so if it takes half a second to upload a set of big textures, you don’t have a hitch. To do that, you need to create a transfer or async-compute queue (if available), and dedicate that one to the loader thread. Once you have that, it’s similar to what was commented on the pipeline compiler thread, and you have an IO thread that communicates through a parallel queue with the main simulation loop to upload data in an asynchronous way. Once a transfer has been uploaded, and checked that it has finished with a Fence, then the IO thread can send the info to the main loop, and then the engine can connect the new textures or models into the renderer.

Unreal Engine has been doing that since UE4.

Describe alternatives you've considered

Additional context

I have a test case at https://github.com/John-Nagle/render-bench. It tests how badly main-thread rendering is impacted by what other threads are doing with assets.

JMS55 commented 4 months ago

Bypassing WGPU and using a transfer queue that WGPU doesn't see.

That's a fairly feasible option in the interim. For Vulkan, for example, you would:

John-Nagle commented 4 months ago

That's an option, but a desperation one. I'd hate to have to dig into the innards of WGPU's allocation system. Also, I'd be giving up macOS support, which is the main point of using WGPU rather than Vulkan directly. I saw a note that the Bevy devs are considering such a bypass. If they do it, I'll have an example to look at.

Because I'm using WGPU via Rend3, I'd need my own version of Rend3, too.

Another alternative is to fork Rend3, rip out the connection to WGPU, and replace that with Vulkano. That would be a relatively clean and safe Rust solution. Rend3 has a well-designed, clean API, and that's worth retaining.

These are all ugly hacks. Better if transfer queues are implemented inside WGPU, where they belong.

John-Nagle commented 4 months ago

Here's the proposed Bevy workaround for this problem: a plan to bypass WGPU and go directly from Bevy to Vulkan. Comment from the Bevy issue: "This is a bit hacky, and relying on globals in the form of static OnceLock-ed variables, but may be reasonable until wgpu supports multiple queues."

John-Nagle commented 4 months ago

JMS55 commented: "Background thread creates a command buffer from the pool and records a vkCmdCopyBuffer from staging to device local (it might be a good idea to load several assets at once, and use multiple copy commands per command buffer, otherwise you'll have a ton of submits and individual command buffers which is bad for performance, but I'm not sure how engines tend to structure this)"

That raises a good question for implementation, regardless of where this is implemented: how expensive is submit? Is it expensive enough on transfer queues that minimizing submit operations is worth it?

There are at least two ways to approach this:

Simple way:

This depends on Submit being reasonably fast compared to, say, loading a 1 MB texture.

Complicated way:

This potentially has higher performance, especially for single-threaded programs where asset-loading requests come from the same thread that does the rendering. It's unclear whether the added complexity is worth it.

I'd be fine with the simple approach, unless Submit is really slow.