RobertBeckebans / RBDOOM-3-BFG

Doom 3 BFG Edition source port with updated DX12 / Vulkan renderer and modern game engine features
https://www.moddb.com/mods/rbdoom-3-bfg
GNU General Public License v3.0

Divide large constant buffer into subsets and implement push constants for performance #855

Open SRSaunders opened 3 months ago

SRSaunders commented 3 months ago

This PR replaces the performance part of https://github.com/RobertBeckebans/RBDOOM-3-BFG/pull/818, which will be closed and not merged. It has one dependency on nvrhi changes: https://github.com/RobertBeckebans/nvrhi/pull/6 for relaxing nvrhi push constant limits to permit platform-specific runtime checks. (UPDATE: dependency now merged into nvrhi)

This fixes the performance part of #763.

Details are as follows:

  1. Separates the single large constant buffer into renderparm subsets (12 in total: 3 of 128 bytes in size, 6 greater than 128 bytes but less than 256 bytes, and 3 greater than 256 bytes but less than 1024 bytes).
  2. Adds new binding layout types to associate and differentiate between the new subsets (BINDING_LAYOUT_GBUFFER, BINDING_LAYOUT_GBUFFER_SKINNED, BINDING_LAYOUT_TEXTURE, BINDING_LAYOUT_TEXTURE_SKINNED, BINDING_LAYOUT_WOBBLESKY, BINDING_LAYOUT_SSGI, BINDING_LAYOUT_SSGI_SKINNED, BINDING_LAYOUT_POST_PROCESS).
  3. Implements push constants for Vulkan and DX12 across all platforms: Linux, macOS, Windows. The performance improvement varies, the largest being on Vulkan for Linux and macOS. Windows Vulkan shows a modest improvement dependent on the GPU vendor (Nvidia's 256 byte limit is better than AMD's 128 byte limit on Windows). Windows DX12 shows similar performance when using push constants vs. volatile constant buffers. However, DX12 does show a performance improvement due to constant buffer subsetting with better change detection logic. I have defined a new boolean r_useDX12PushConstants cvar which is turned off by default. This can optionally be turned on using autoexec.cfg for experimentation.
  4. Reduced the maximum volatile constant buffer count from 16,384 to 8,192. I believe this is sufficient, but if testing reveals otherwise, it could be raised again. Note that enabling push constants reduces the requirement.
  5. Adds basic infrastructure for static constant buffers but these are disabled for now. This could be a possibility for the future but further subset refactoring would likely be needed, and sync issues would have to be resolved.
  6. Modified uniforms change detection logic (orthogonal to push constants) which has a very significant positive impact on performance. See performance timings below. Also added new cvar r_useVulkanPushConstants (default on) which is useful for performance comparisons.
  7. Don't allocate constant buffers unless required (i.e. when push constants disabled for a given binding layout type).
  8. Fixed ImmediateMode (mainly for debug tools) to work with push constants enabled.
  9. Additional Info: One other benefit of these changes is a reduction in Vulkan GPU memory usage. For example, on the Marine HQ hallway scene, GPU memory usage is reduced from ~2400 MB to ~1800 MB, or about 25%. The benefits come from three areas (from largest to smallest impact): a) splitting up the single large uniforms buffer into smaller subsets which avoids duplication of unused renderparms memory for each binding layout type, b) reducing the number of volatile constant buffer copies required for each binding layout type, and c) enabling push constants. For DX12, interestingly there appears to be no difference in GPU memory usage (~1900 MB) before and after this PR.
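Items 1, 3, 6 and 7 above can be illustrated with a minimal C++ sketch. This is not the PR's code: `UploadPath`, `ChooseUploadPath` and `ParmSubset` are hypothetical names, and the rule "use push constants only when the subset fits the device limit and the cvar is on" is an assumption inferred from the 128/256 byte limits mentioned above.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical names throughout; this is an illustration, not engine code.
enum class UploadPath
{
    PushConstants,
    VolatileConstantBuffer
};

// Pick the upload path for one renderparm subset.
// maxPushBytes stands for the device limit queried at runtime
// (e.g. 128 or 256 bytes); usePushConstants stands for the cvar.
inline UploadPath ChooseUploadPath( std::size_t subsetBytes, std::size_t maxPushBytes, bool usePushConstants )
{
    if( usePushConstants && subsetBytes <= maxPushBytes )
    {
        return UploadPath::PushConstants;
    }
    // Fallback: only in this case would a constant buffer need to be allocated.
    return UploadPath::VolatileConstantBuffer;
}

// One renderparm subset with simple change detection.
struct ParmSubset
{
    std::vector<float> current;   // values set by the renderer for this draw
    std::vector<float> uploaded;  // values last sent to the GPU
    bool everUploaded = false;

    explicit ParmSubset( std::size_t numFloats )
        : current( numFloats, 0.0f ), uploaded( numFloats, 0.0f )
    {
    }

    // Returns true when the values differ from the last upload (or nothing
    // was uploaded yet) and records them as uploaded; otherwise returns
    // false so the caller can skip the upload entirely.
    bool CommitIfChanged()
    {
        const std::size_t bytes = current.size() * sizeof( float );
        if( everUploaded && std::memcmp( current.data(), uploaded.data(), bytes ) == 0 )
        {
            return false;
        }
        uploaded = current;
        everUploaded = true;
        return true;
    }
};
```

Under these assumptions, a 192 byte subset would go through push constants on a device with a 256 byte limit and fall back to a volatile constant buffer on one with a 128 byte limit. Comparing only the subset a binding layout actually uses, rather than one large buffer, keeps the comparison cheap and avoids re-uploading unrelated renderparms.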

Tested on Windows 10 (AMD and Nvidia), Linux Manjaro, and macOS Ventura 13.5

Performance timings for this PR vs. current master, generated using a simple home-made timedemo:

Windows Nvidia System (1070 Ti)
DX12: 263 fps before, 360 fps after (with r_useDX12PushConstants = 0) --> significant improvement
Vulkan: 218 fps before, 333 fps after --> significant improvement

Windows AMD System (6600 XT)
DX12: 295 fps before, 305 fps after (with r_useDX12PushConstants = 0) --> neutral/positive improvement
Vulkan: 150 fps before, 160 fps after --> neutral/positive improvement

Linux AMD System (6600 XT)
Vulkan: 150 fps before, 270 fps after --> large improvement

macOS AMD System (6600 XT)
Vulkan: 77 fps before, 245 fps after --> very large improvement

macOS Apple Silicon System (M1 Air)
Vulkan: 6 fps before, 85 fps after --> massive improvement

See on-screen HUD statistics (FPS, GPU Memory, CPU/GPU Relative Usage % for com_fixedTic = 1) in the following screenshots showing the independent impact of: a) uniforms buffer subsetting, and b) push constants.

Notes re test setup:

  1. For my specific h/w setup (Intel Core i7 + AMD 6600XT GPU) macOS saturates on the CPU before the GPU. This can happen due to Vulkan-to-Metal encoding, which occurs only after the renderer backend is finished. If this takes longer than the GPU takes to finish its work, then the frame must wait for completion and frame sync times will increase. For other hardware setups including Apple Silicon, this may not be the case and the GPU may saturate before the CPU. If the extra encoding step were not required on macOS, FPS numbers would likely be closer to Linux where Vulkan is native.
  2. Note all tests were done at 1920x1080 except for Windows DX12, which was at 2560x1440. This was necessary to show DX12 benchmark differences before and after this PR. Otherwise FPS rates would be CPU-limited (to around 250 fps for this scene) and no differences would be observable. As a result, you cannot directly compare DX12 frame rates with Vulkan frame rates, but only relative before-and-after differences.
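The reasoning in the two notes above can be sketched with a deliberately simplified model in which CPU and GPU work overlap perfectly, so the slower of the two sets the frame time. This is an assumption for illustration only, not how the engine measures anything, and `FramesPerSecond` and `IsCpuBound` are hypothetical helpers.

```cpp
#include <algorithm>

// Simplified model (assumption): CPU and GPU work overlap perfectly,
// so the frame time equals the longer of the two.
inline double FramesPerSecond( double cpuMs, double gpuMs )
{
    return 1000.0 / std::max( cpuMs, gpuMs );
}

// True when the CPU side is the bottleneck in this model.
inline bool IsCpuBound( double cpuMs, double gpuMs )
{
    return cpuMs >= gpuMs;
}
```

In this model, a CPU cost of 4 ms per frame corresponds to the roughly 250 fps cap mentioned above: `FramesPerSecond( 4.0, 3.0 )` and `FramesPerSecond( 4.0, 2.0 )` both give 250, so a GPU-side saving is invisible. Once a higher resolution pushes GPU time above the CPU time (say a hypothetical 5 ms), the frame rate tracks the GPU again and before-and-after differences become observable.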

macOS Vulkan: Baseline using current master + PR #854 but without this PR: [screenshot: rbdoom-3-bfg-20240130-092353-001]

macOS Vulkan: Impact of uniforms buffer subsetting with push constants disabled: [screenshot: rbdoom-3-bfg-20240130-093651-001]

macOS Vulkan: Impact of uniforms buffer subsetting with push constants enabled: [screenshots: rbdoom-3-bfg-20240130-161150-002, fixedTic]

Linux Vulkan: Baseline using current master + PR #854 but without this PR: [screenshot: rbdoom-3-bfg-20240130-125300-003]

Linux Vulkan: Impact of uniforms buffer subsetting with push constants disabled: [screenshots: rbdoom-3-bfg-20240130-161807-002, fixedTic]

Linux Vulkan: Impact of uniforms buffer subsetting with push constants enabled: [screenshots: rbdoom-3-bfg-20240130-162011-002, fixedTic]

Windows Vulkan: Baseline using current master + PR #854 but without this PR: [screenshot: rbdoom-3-bfg-20240130-180511-001]

Windows Vulkan: Impact of uniforms buffer subsetting with push constants disabled: [screenshot: rbdoom-3-bfg-20240130-180959-001]

Windows Vulkan: Impact of uniforms buffer subsetting with push constants enabled: [screenshot: rbdoom-3-bfg-20240130-181205-002]

Windows DX12: Baseline using current master + PR #854 but without this PR: [screenshot: rbdoom-3-bfg-20240130-182409-002]

Windows DX12: Impact of uniforms buffer subsetting with push constants disabled: [screenshot: rbdoom-3-bfg-20240130-182731-002]

Windows DX12: Impact of uniforms buffer subsetting with push constants enabled: [screenshot: rbdoom-3-bfg-20240130-182847-003]

SRSaunders commented 2 months ago

Now updated to your latest master with retro rendering modes and CRT post processing. The merge was large given the number of changes to master, but fairly straightforward with respect to the new shaders. Fortunately they all fell into the BINDING_LAYOUT_POST_PROCESS_FINAL binding layout type with its existing renderparm subset. I did not have to make any changes to the mappings; I simply modified the new shaders to add the pc. prefix and include the correct subset.

Here is the updated mapping spreadsheet: Binding to Shader Mapping v5.xlsx

At first I was not sure about the new modes, but now I really like the PSX + Newpixie CRT setting. Feels very 90s!

RobertBeckebans commented 2 months ago

Yeah, the retro modes are something I have had in mind for a long time, because enough people look into this engine for indie dev, and I personally grew up as a kid with the C64, CPC 6128 and Amiga 600. I also thought about forking this project and doing the retro work there just for standalone development, but everyone is focused on RBDOOM and I don't want to do the marketing and backporting of new features. The PSX rendering isn't done yet. The PSX branch also introduces a new renderparm which is used by many shaders for the vertex snapping and texture warping effects. They will be optional via r_psx* cvars but are still needed for a faithful PSX fake experience.

SRSaunders commented 2 months ago

Thanks for the heads up re your PSX branch and a new renderparm. Is there any way I could have a look at that to understand the implications for this branch and future subset handling?

SRSaunders commented 2 months ago

Updated to be compatible with nvrhi + ShaderMake rebase.

Still depends on https://github.com/RobertBeckebans/nvrhi/pull/6. (this dependency now merged into nvrhi)

SRSaunders commented 1 month ago

r_useVulkanPushConstants renamed to r_vkUsePushConstants on merge. r_useDX12PushConstants renamed to r_dxUsePushConstants on merge.

Also set ShaderMake retryCount=20 to recover from any macOS/Linux shell failures during parallel SPIR-V compilation. This is increased from the default retryCount=10 to handle the doubling of shader combinations due to USE_PUSH_CONSTANTS.