Divide large constant buffer into subsets and implement push constants for Vulkan performance

SRSaunders commented 7 months ago

This PR depends on https://github.com/RobertBeckebans/nvrhi/pull/6

This PR also solves #763 for Apple Silicon performance and rendering artifact elimination, as well as #804 for Intel Integrated GPU support with Vulkan (slow, but works).

This is a fairly large pull request that implements the following:

Separates the single large constant buffer into renderparm subsets (12 in total: 3 of 128 bytes in size, 6 greater than 128 bytes but less than 256 bytes, and 3 greater than 256 bytes but less than 1024 bytes)
Adds new binding layout types to associate and differentiate between the new subsets (BINDING_LAYOUT_GBUFFER, BINDING_LAYOUT_GBUFFER_SKINNED, BINDING_LAYOUT_TEXTURE, BINDING_LAYOUT_TEXTURE_SKINNED, BINDING_LAYOUT_WOBBLESKY, BINDING_LAYOUT_SSGI, BINDING_LAYOUT_SSGI_SKINNED, BINDING_LAYOUT_POST_PROCESS)
Implements push constants for Vulkan and DX12 across all platforms: Linux, macOS, Windows. This has varying degrees of performance improvement, the largest being on Vulkan for Linux and macOS. Windows Vulkan shows modest improvement dependant on the GPU vendor (Nvidia's 256 byte limit is better than AMD's 128 byte limit on Windows). Windows DX12 shows no performance improvement when using push constants vs. volatile constant buffers. Given this I have defined a new boolean r_useDX12PushConstants cvar which is turned off by default. This can optionally be turned on using autoexec.cfg for experimentation.
Reduced the volatile constant max buffer count from 16,384 to 8,192. I believe this is sufficient but if testing reveals differently, then it could be boosted back up. Note that when push constants are enabled it reduces the requirement.
Adds basic infrastructure for static constant buffers but these are disabled for now. This could be a possibility for the future but further subset refactoring would likely be needed, and sync issues would have to be resolved.
Simplified Vulkan code by removing the barrier command list - not needed if command submit is done in correct order
Added CPU and GPU usage % counters to the on-screen HUD display for all platforms
Added MoltenVK's Vulkan-to-Metal encoding time to the HUD when available for macOS only. This capability will be available in the next release of MoltenVK (1.2.6)
Addressed a nasty bug with mis-calibrated GPU timers for macOS caused by a regression in the MoltenVK 1.2.5 release. Use either MoltenVK release <= 1.2.4 or the coming 1.2.6 release with this pull request to avoid the issue.
Modified the ssao_compute fix to support multiple HLSL versions, not just 2021. I found that my changes were required for compiling DXIL for D3D12 while remaining compatible with SPIRV for Vulkan, otherwise I would get compile failures on Windows DX12 (select does not seem to be supported when compiling for DXIL).
A minor CMakeLists fix primarily for Xcode that cleans up precompiled.h-xxxxx.gch.tmp files left around when the ZERO_CHECK target runs for regeneration.
UPDATED: Tested on Apple Silicon (M1) and disabled GPU Skinning for macOS arm64 to eliminate rendering artifacts. I will have to look at this later to see what is going on with GPU skinning on Apple Silicon.
UPDATED: Modified cmake-macos-*.sh and cmake-xcode-*.sh build scripts for openal-soft path portability across x86 and Apple Silicon. Thanks to @asemarafa for the code.
UPDATED: Works around missing Vulkan shaderStorageImageReadWithoutFormat device feature on Intel GPUs, and individually activates VK_KHR_fragment_shading_rate sub-features vs. all or none (this is supported by nvrhi).
FINAL UPDATE: Fixed uniforms change detection logic (orthogonal to push constants) which has a very significant positive impact on performance. See updated performance timings below. Also added new cvar r_useVulkanPushConstants (default on) which is useful for performance comparisons.

Tested on Windows 10 (AMD and Nvidia), Linux Manjaro, and macOS Ventura 13.5

Performance timings for this PR vs. current master, generated using a simple home-made timedemo:

Windows Nvidia System (1070 Ti) DX12: 263 fps before, ~~255~~ 360 fps after (with r_useDX12PushConstants = 0) --> significant improvement Vulkan: 218 fps before, ~~233~~ 333 fps after --> significant improvement

Windows AMD System (6600 XT) DX12: 295 fps before, ~~285~~ 305 fps after (with r_useDX12PushConstants = 0) --> neutral/positive improvement Vulkan: 150 fps before, ~~155~~ 160 fps after --> neutral/positive improvement

Linux AMD System (6600 XT) Vulkan: 150 fps before, ~~210~~ 270 fps after --> large improvement

macOS AMD System (6600 XT) Vulkan: 77 fps before, ~~177~~ 245 fps after --> very large improvement

macOS Apple Silicon System (M1 Air) (new) Vulkan: 6 fps before, 85 fps after --> massive improvement

Some possible explanations are as follows:

Windows nvrhi DX12 push constants are implemented as D3D12 root constants which are just a special constant buffer of limited size (about 240 bytes for RBDoom3BFG). I suspect there is no real performance gain by using these over nvrhi's volatile constant buffers which appear to be quite efficient with low synchronization overhead on Windows. As a result r_useDX12PushConstants = 0 by default.
Nvidia offers 256 bytes of Vulkan push constants on Windows, while AMD offers only 128 bytes of Vulkan push constants on Windows. For this reason AMD Vulkan users on Windows will likely see a small positive change in performance. Nvidia users on Windows will see a greater ~~but still modest~~ improvement for Vulkan.
On Linux both Nvidia and AMD can offer 256 bytes of Vulkan push constants. Perhaps AMD's driver implementation on Linux favours push constants due to reduced synchronization overhead and the performance gains are quite significant. I would be interested in seeing some perf timings for Nvidia on Linux.
On macOS/MoltenVK/Metal there are 4096 bytes of push constants. This means that virtually all of the draw transactions are handled with push constants and volatile constant buffers are not needed. This is a good thing since constant buffer synchronization on macOS/MoltenVK/Metal appears to be expensive. Push constants do not incur any synchronization overheads and are delivered with command submission.
Overall I think this is a useful change, especially for Vulkan users on Linux and macOS. I would go as far as saying it is mandatory for macOS on Apple Silicon, as the game is virtually unplayable without it. The Windows benefits are ~~less, but still in the positive direction for Vulkan~~ significant for DX12 & Vulkan on Nvidia, but less but so on AMD, at least for my test systems/GPUs.

@RobertBeckebans here is the spreadsheet I used to create the binding layout type to shader/renderparm subset mapping: Binding to Shader Mapping v4.xlsx

RobertBeckebans commented 7 months ago

The results are awesome but I need quite a lot of testing and time to review this code. I aim for a new public release in December before Christmas.

SRSaunders commented 7 months ago

Thanks @RobertBeckebans - take your time on review. It took me a while to figure this out anyways, plus a bunch of work with the MoltenVK project to diagnose and optimize performance issues. It was my summer/fall background project :)

If you have questions please don't hesitate to post a message or send an email. I will do my best to answer.

SRSaunders commented 6 months ago

Hi @RobertBeckebans. I have pushed my final design update to this branch. Commits https://github.com/RobertBeckebans/RBDOOM-3-BFG/pull/818/commits/2063c72912a76f07f545dddaa806e0e7869d0b1c and https://github.com/RobertBeckebans/RBDOOM-3-BFG/pull/818/commits/c1f712a3d66ee6177c7eca518034c960b8bc97f2 add one small but critical update for performance - fixing the uniforms change detection logic which reduces how often volatile constant buffers are written for the larger renderparm sets. This has a huge effect which is orthogonal to push constants, and has a positive impact across all platforms: Windows, Linux, & macOS. See updated timing info above.

I am not sure why my AMD 6600XT GPU seems to be limited when running Windows/Vulkan, with other OSs surpassing it with the same card. Perhaps something about AMD’s windows drivers or AMD’s 128 byte limit for Vulkan push constants on Windows.

SRSaunders commented 5 months ago

I added a few minor things due to dependencies on previous changes within this PR:

Added a few additional comments to explain some of the important changes in this PR
Don't allocate constant buffers unless required (i.e. when push constants disabled for binding layout type)
For the statistics overlay HUD, I added GPU Memory usage and smoothing to CPU/GPU usage percentages
Made the VMA header file visible within the IDE source tree under libs/vma (CMakeLists change)

SRSaunders commented 5 months ago

Oops, just realized that ImmediateMode was broken for using debug tools with push constants enabled.

Fixed in https://github.com/RobertBeckebans/RBDOOM-3-BFG/pull/818/commits/ee3b6f9b2f0754cf0bcaf47a2d124f2c337defe1

SRSaunders commented 4 months ago

macOS only: I just added support for the new VK_EXT_layer_settings extension used for configuring the upcoming MoltenVK 1.2.7 / Vulkan SDK 1.3.272.0. See commit https://github.com/RobertBeckebans/RBDOOM-3-BFG/pull/818/commits/f3c65eef3e75ea2846cffd97aff1fd2f9bf92f7f.

@RobertBeckebans The last several commits have been added mostly due to dependencies on earlier changes within this PR. Unless something comes up in your testing, I don't plan any more changes here.

labrnth commented 4 months ago

@SRSaunders I tried building the XCode release using this branch but ran into a couple of errors. The following diff fixed the XCode build:

diff --git a/neo/renderer/GuiModel.cpp b/neo/renderer/GuiModel.cpp
index 5db04303..fb4547b5 100644
--- a/neo/renderer/GuiModel.cpp
+++ b/neo/renderer/GuiModel.cpp
@@ -395,9 +395,9 @@ void idGuiModel::EmitImGui( ImDrawData* drawData )
                        idScreenRect clipRect =
                        {
                                static_cast<short>( pcmd->ClipRect.x ),
-                               io.DisplaySize.y - static_cast<short>( pcmd->ClipRect.w ),
+                static_cast<short>(io.DisplaySize.y - static_cast<short>( pcmd->ClipRect.w )),
                                static_cast<short>( pcmd->ClipRect.z ),
-                               io.DisplaySize.y - static_cast<short>( pcmd->ClipRect.y ),
+                static_cast<short>(io.DisplaySize.y - static_cast<short>( pcmd->ClipRect.y )),
                                0.0f,
                                1.0f
                        };

labrnth commented 4 months ago

Also noticed some strange artifacts on the weapons (not seeing this elsewhere). See attached screenshot. rbdoom-3-bfg-20240101-111545-001

I have observed a massive increase in FPS though, so thats great.

SRSaunders commented 4 months ago

@labrnth Thanks for trying out this PR. The compile issue you found was fixed in master with commit ab663a76. Perhaps you synced before this was merged recently.

Regarding the weapons artifacts, are you running on an Intel or Apple Silicon Mac? I thought this problem was only visible on Apple Silicon - where I disabled GPU skinning to fix it. If you are running on Intel that is quite interesting and may indicate a more general problem with skinning. In this case you could try setting r_useGPUSkinning 0 in your console, or adding set r_useGPUSkinning 0 to your autoexec.cfg file. Please let me know your results.

Good news on the performance improvement. That was my main goal for this effort. I trust you applied https://github.com/RobertBeckebans/nvrhi/pull/6 - this is needed for max performance.

labrnth commented 4 months ago

@SRSaunders Ya that could be.

Yes I'm running on Apple Silicon (M2 Pro). The r_useGPUSkinning "0" fixed it but only while in campaign not the multiplayer.

SRSaunders commented 4 months ago

Thanks for catching the r_useGPUSkinning reset in multiplayer. Solved in https://github.com/RobertBeckebans/RBDOOM-3-BFG/pull/818/commits/3b6598b88d71bc443c431a04e0f075410679ea36 (this cvar is not a cheat in any case).

SRSaunders commented 3 months ago

@RobertBeckebans I would like to know what you are planning with this PR. I have a few other improvements for Vulkan that I would like to submit (e.g. use Vulkan dynamic functions vs. static linkage for VMA setup, Optick, and MoltenVK config), but I don't want to keep adding on here, especially if you are not planning to merge in the near term.

I could simplify things by dividing this into two PRs that could be treated independently:

Separate the Uniforms Subsetting, Push Constants and Uniforms Change Detection improvements which is focussed solely on performance for both Vulkan and DX12. And yes, DX12 benefits from uniforms subsetting with the new change detection logic even without push constants enabled.
Separate the HUD GPU Memory usage & CPU/GPU Usage % features (DX12 and Vulkan), generic Vulkan changes (incl. Intel iGPU fixes, dynamic functions, etc), and macOS/MoltenVK-only stuff.

I would then close this large PR and you could look at two new smaller PRs independently.

Would this help? If so would you look at either of the new PRs for your upcoming release?

RobertBeckebans commented 3 months ago

I can see why it is necessary to split the constant buffer renderparms into smaller push constants but I don't really like the complexity it comes with. I don't see this PR merged with RBDoom 1.6 but rather in 1.7. I also would like to merge this after NVRHI has been updated to the newest version.

Separate the HUD GPU Memory usage & CPU/GPU Usage % features (DX12 and Vulkan), generic Vulkan changes (incl. Intel iGPU fixes, dynamic functions, etc), and macOS/MoltenVK-only stuff.

I also would like to see this separated out into a different branch and then it needs to be merged with the newest NVRHI later.

SRSaunders commented 3 months ago

Thanks @RobertBeckebans for your comments and advice. I will proceed with splitting the PR into two parts and resubmit.

Regarding your comments:

For the push constants work, I am fine with you waiting until 1.7 and merging after updating nvrhi. However, please note the nvrhi dependency is very small, and only requires the nvrhi push constant limits to updated based on platform (i.e. https://github.com/RobertBeckebans/nvrhi/pull/6)
Regarding your comments about complexity, I found the main issue was determining the shader -> subset mapping and then encoding that information into data structures for use at runtime. However, once that work was done, it is relatively static and no changes are required unless shaders are updated in the future. Another factor is that I wrote the code to allow runtime choice of constant buffers vs push constants based on platform capability and cvars - that adds some complexity that I suppose could be reduced if it was hard-coded. And lastly, the renderparm change detection logic is pretty simple, but yields large improvements independent of push constants themselves. I wish I did not have to touch the renderparm naming in the shaders, but as far as I can tell you cannot use globals for push constants and they need to be part of a struct. I also wanted to retain constant buffer compatibility if push constants were disabled at runtime. So I had to do a global rename using the pc. struct preface. If you know a better solution to this please let me know.
Regarding the other HUD, Vulkan, and macOS stuff it works with current master has no nvrhi version dependencies.

RobertBeckebans commented 3 months ago

The pc. suffix for the renderparms is not really a big deal. I was refering to figuring out which renderparms can be put into subsets for the used shaders and hitting the limits of constant buffers when extending the shaders in the future. You did a good job at sorting those using the excel sheet but this still needs to go through some deeper testing and it is also kind of the opposite I generally wish for the overall renderer design. I like it stupid simple and it might be the case that changing the renderer from a multipass forward design where each geometry is rerendered for each light to a clustered forward+ renderer design can outperform your changes while having a design that still works with the old renderparms array. Having all shadows in a big shadow atlas like we have right now was mandatory for forward+ and it is a feature I want to try to implement this year.

SRSaunders commented 3 months ago

Yes, the sorting work with Excel was the main effort, along with figuring out how to dynamically enable/disable push constants at runtime based on platform capabilities. This branch has been my daily driver for about 4 months now, plus a number of linux and macOS users have tried the PR with success. While I think it's pretty stable as is, getting more play and stress testing would give us more confidence.

Regarding your plans for clustered forward+ rendering, are these the papers that descibe it? forward_plus.pdf 087-096.pdf

If so I am wondering whether my work with push constants can co-exist with these improvements? I presume the main issue would be whether existing shaders would need more parameters (or perhaps new shaders), and then bumping up against push constant platform limits (128 or 256 byte limit on Windows & Linux depending on GPU, a non-issue on macOS). However, if it only involves modifying the lighting shaders (ambient, interaction, and interactionSM*) then the solution should be fairly simple: those shaders already use >256 renderparm bytes and have push constants disabled for their associated binding layout types on Windows and Linux. On macOS they can use push constants since the limit is 4096 bytes on that platform. In other words, adding new renderparms to the three subsets for those specific lighting shader groups would not negate the performance gains from my push constants work. Is this a possiblity?

If you are going to continue to support Linux and macOS, Vulkan cannot be left to fall behind in performance. My work was a small attempt to push things forward (sorry for the pun).

SRSaunders commented 3 months ago

This PR will now be closed unmerged. For existing users please see #854 and #855 for up-to-date replacements.

RobertBeckebans / RBDOOM-3-BFG

Divide large constant buffer into subsets and implement push constants for Vulkan performance #818