gfx-rs / wgpu

A cross-platform, safe, pure-Rust graphics API.
https://wgpu.rs
Apache License 2.0

Skip staging buffers when using Queue::write_buffer on supported GPUs #3698

Open JMS55 opened 1 year ago

JMS55 commented 1 year ago

Background

Currently, wgpu keeps around a pool of staging buffers that are DEVICE_LOCAL_BIT | HOST_VISIBLE_BIT | HOST_COHERENT_BIT. Writes to GPU buffers land in these first, and are then copied to the final GPU buffers. This is done because systems are sometimes limited by the 256 MB PCI-E BAR, meaning the only DEVICE_LOCAL_BIT | HOST_VISIBLE_BIT | HOST_COHERENT_BIT heap available is 256 MB.

However, for mobile GPUs, integrated GPUs, and now desktops with Resizable BAR, this limitation is gone. These devices have DEVICE_LOCAL_BIT | HOST_VISIBLE_BIT | HOST_COHERENT_BIT heaps that range into the gigabytes. There's no need for staging buffers in this case.

Feature Request

wgpu should check the size of the DEVICE_LOCAL_BIT | HOST_VISIBLE_BIT | HOST_COHERENT_BIT heap it uses for buffer writes, and if it's greater than 256 MB, skip creating staging buffers and write directly to the destination buffers instead.
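
For illustration, a rough sketch of the kind of capability check this implies, written against the raw Vulkan API via the ash crate (not wgpu's actual code; the helper name and the exact 256 MB threshold are assumptions):

```rust
use ash::vk;

// Assumed threshold: the classic 256 MB PCI-E BAR size.
const BAR_LIMIT: u64 = 256 * 1024 * 1024;

/// Hypothetical helper: returns true if any DEVICE_LOCAL | HOST_VISIBLE |
/// HOST_COHERENT memory type sits on a heap larger than 256 MB, i.e. the
/// device effectively has UMA or Resizable BAR and staging could be skipped.
fn can_skip_staging(instance: &ash::Instance, phys: vk::PhysicalDevice) -> bool {
    let mem = unsafe { instance.get_physical_device_memory_properties(phys) };
    let wanted = vk::MemoryPropertyFlags::DEVICE_LOCAL
        | vk::MemoryPropertyFlags::HOST_VISIBLE
        | vk::MemoryPropertyFlags::HOST_COHERENT;

    mem.memory_types[..mem.memory_type_count as usize]
        .iter()
        .filter(|ty| ty.property_flags.contains(wanted))
        .any(|ty| mem.memory_heaps[ty.heap_index as usize].size > BAR_LIMIT)
}
```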

John-Nagle commented 6 months ago

This is great! The general idea is to get assets into the GPU faster. There are several approaches.

I started a discussion on loading assets into the GPU while rendering. I was less concerned about avoiding the extra copy than about getting the upload out of the render thread. I use Rend3/WGPU, and upload all assets from asset upload threads. But since WGPU only has one command queue, Rend3 just puts all that bulk work on the main rendering command queue, where it's done by the rendering thread just before rendering each frame. So I'm pushing for a WGPU-level transfer queue, on platforms where the hardware supports it.

Is it possible to do both? Eliminate the extra copy and do all the work outside the render thread?

I get frame rate hits that take rendering down from 60 FPS to 10 FPS due to asset upload overhead in the main thread. So this is a very real issue for me.

JMS55 commented 6 months ago

There's a whole bunch of considerations around asset uploading. The ideal, modern way is as follows, from what I understand:

  1. Use a dedicated CPU thread [pool] to handle asset data
  2. Allocate a buffer, and write asset data streamed from disk directly into the buffer (see the sketch below)[1]
  3. Done[2]
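
A minimal wgpu sketch of that direct-write step, assuming the allocation actually ends up in a host-visible, device-local heap (exactly the property wgpu can't currently promise); the function name and buffer usage are just for illustration:

```rust
// Sketch only: direct write into a buffer mapped at creation, no staging copy.
fn upload_direct(device: &wgpu::Device, asset_bytes: &[u8]) -> wgpu::Buffer {
    // Mapped buffer sizes must be aligned to COPY_BUFFER_ALIGNMENT (4 bytes).
    let size = (asset_bytes.len() as u64).next_multiple_of(wgpu::COPY_BUFFER_ALIGNMENT);
    let buffer = device.create_buffer(&wgpu::BufferDescriptor {
        label: Some("asset buffer"),
        size,
        usage: wgpu::BufferUsages::VERTEX, // hypothetical usage for a mesh asset
        mapped_at_creation: true,
    });
    // Write the streamed asset data straight into the mapped range, then unmap.
    buffer.slice(..).get_mapped_range_mut()[..asset_bytes.len()]
        .copy_from_slice(asset_bytes);
    buffer.unmap();
    buffer
}
```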

However, if you don't have UMA (i.e. integrated GPU) or Resizable BAR, then it gets complicated.

You instead need one (staging) buffer that you map so it's CPU-visible, write to, and unmap, and then copy it to a second (final) buffer. That copy has to be done using a command buffer submitted to a queue. Using multiple queues is a bit tricky, and wgpu has no support for them at the moment, as you need to have synchronization points between the queues.
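
In today's single-queue wgpu, that path looks roughly like this (a sketch; it assumes `final_buffer` was created with COPY_DST and that `data.len()` is a multiple of 4):

```rust
// Sketch of the staging path: map, write, unmap, then copy on the queue.
fn upload_via_staging(
    device: &wgpu::Device,
    queue: &wgpu::Queue,
    final_buffer: &wgpu::Buffer, // assumed to have BufferUsages::COPY_DST
    data: &[u8],                 // assumed to be a multiple of 4 bytes long
) {
    let staging = device.create_buffer(&wgpu::BufferDescriptor {
        label: Some("staging"),
        size: data.len() as u64,
        usage: wgpu::BufferUsages::COPY_SRC,
        mapped_at_creation: true,
    });
    staging.slice(..).get_mapped_range_mut().copy_from_slice(data);
    staging.unmap();

    // The copy itself is a command, so it goes through the one queue wgpu exposes.
    let mut encoder = device
        .create_command_encoder(&wgpu::CommandEncoderDescriptor { label: Some("upload") });
    encoder.copy_buffer_to_buffer(&staging, 0, final_buffer, 0, data.len() as u64);
    queue.submit(Some(encoder.finish()));
}
```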

For best performance, you want to:

  1. Batch many copy commands (assets) into a single command encoder
  2. Use a secondary transfer/copy queue (not the main one you're submitting your graphics work to)
  3. Submit to the queue early on in the frame, so that the data can start uploading while the CPU is busy recording commands for your actual rendering work, and be ready by the time those commands get submitted.
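
Points 1 and 3 can be expressed with today's wgpu API; point 2 can't. A hedged sketch, where `pending_uploads` is a hypothetical list that asset threads have filled with (staging, destination, size) entries:

```rust
// Sketch: batch all of this frame's staging -> final copies into one encoder
// and submit them at the top of the frame, before render command recording.
fn submit_pending_uploads(
    device: &wgpu::Device,
    queue: &wgpu::Queue,
    pending_uploads: &[(wgpu::Buffer, wgpu::Buffer, u64)],
) -> wgpu::SubmissionIndex {
    let mut encoder = device.create_command_encoder(&wgpu::CommandEncoderDescriptor {
        label: Some("batched asset uploads"),
    });
    for (staging, destination, size) in pending_uploads {
        encoder.copy_buffer_to_buffer(staging, 0, destination, 0, *size);
    }
    // A dedicated transfer queue (point 2) would be used here instead,
    // but wgpu currently only exposes a single queue.
    queue.submit(Some(encoder.finish()))
}
```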

Textures are also a bit more complicated, IIRC, as you need to copy from a staging buffer to a texture.
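
That copy looks something like the following (a sketch; it assumes an uncompressed format and a `bytes_per_row` already padded to wgpu::COPY_BYTES_PER_ROW_ALIGNMENT):

```rust
// Sketch: copy image data from a staging buffer into mip level 0 of a texture.
fn copy_staging_to_texture(
    encoder: &mut wgpu::CommandEncoder,
    staging: &wgpu::Buffer,
    texture: &wgpu::Texture,
    width: u32,
    height: u32,
    bytes_per_row: u32, // assumed padded to COPY_BYTES_PER_ROW_ALIGNMENT (256)
) {
    encoder.copy_buffer_to_texture(
        wgpu::ImageCopyBuffer {
            buffer: staging,
            layout: wgpu::ImageDataLayout {
                offset: 0,
                bytes_per_row: Some(bytes_per_row),
                rows_per_image: Some(height),
            },
        },
        texture.as_image_copy(),
        wgpu::Extent3d { width, height, depth_or_array_layers: 1 },
    );
}
```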

A good read I found is: https://therealmjp.github.io/posts/gpu-memory-pool


So, let's talk about platform support for the modern path.


[1]: For stuff like compressed textures, the CPU instead needs to load the texture data into a CPU buffer, then decompress it into the GPU buffer, and then copy it to the actual texture, unless you go even fancier and use something like DirectStorage.

[2]: When using buffers that you upload to once a frame, unlike a once-and-done asset upload, it gets slightly trickier. You need to make sure not to overwrite data the GPU is currently reading. The easiest way is to wait for the previous frame to finish rendering entirely (synchronize with the GPU) before starting the current frame's writes. More fine-grained schemes are possible, such as putting the fence after your main rendering pass but before post-processing, allowing the CPU to start uploading the next frame's data sooner (if the game simulation is done early enough).
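
In wgpu terms, the coarse version of that wait might look like this (a sketch; the stored submission index and the per-frame buffer it guards are placeholders):

```rust
// Sketch of the coarse scheme: block until the previous frame's submission has
// finished before overwriting its per-frame upload buffer.
fn wait_for_previous_frame(device: &wgpu::Device, last_submission: Option<wgpu::SubmissionIndex>) {
    if let Some(index) = last_submission {
        // Ignore the return value so this compiles across wgpu versions.
        let _ = device.poll(wgpu::Maintain::WaitForSubmissionIndex(index));
    }
    // Now it's safe to write this frame's data into the reused buffer.
}
```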

[3]: This info is from what I understand from reading stuff online. I don't have actual experience using these APIs. I definitely could be wrong somewhere.

John-Nagle commented 6 months ago

Using multiple queues is a bit tricky, and wgpu has no support for them at the moment, as you need to have synchronization points between the queues.

That's an API issue. The Rend3 API solves that problem. The primitives at that level are "add_mesh" and "add_2d_texture". Those just move data into the GPU and return a handle for later use. You can't use the asset until you have a handle. Handles are Rust Arcs, so when all the uses go away, so does the handle, which releases the space in the GPU. Any thread can make those calls, and this approach is Rust thread-safe.

But to make that work on top of WGPU's single-queue architecture, all those requests go onto a work queue processed by the single render thread, so there's no real concurrent asset loading into the GPU at this time.

So having a "load this asset into the GPU" primitive, callable from any thread, targeted only at a buffer not already in use for something else, and implemented in whatever way the platform supports, ought to be basically safe.

As for queue support, Vulkan, DX12, and Metal all support multiple command queues. You're not guaranteed that the underlying hardware has more than one queue, though.