RobertBeckebans / RBDOOM-3-BFG

Doom 3 BFG Edition source port with updated DX12 / Vulkan renderer and modern game engine features
https://www.moddb.com/mods/rbdoom-3-bfg
GNU General Public License v3.0

Optimize Windows DX12 frame latency via DXGI waitable swap chain #784

Closed SRSaunders closed 1 month ago

SRSaunders commented 11 months ago

This PR attempts to reduce frame latency for Windows DX12 by leveraging DXGI's waitable swap chain option. Note this is not applicable to Vulkan. I found some interesting articles and sample code on the subject:

  1. https://www.intel.com/content/www/us/en/developer/articles/code-sample/sample-application-for-direct3d-12-flip-model-swap-chains.html
  2. https://jackmin.home.blog/2018/12/14/swapchains-present-and-present-latency/

This concept was very easy to implement for RBDoom3BFG and gives good results on my AMD 6600 XT card driving a 60Hz monitor with vsync on:

  1. Baseline latency from queue submit to present = ~53 ms @ 60Hz for GPU triple buffering (NUM_FRAME_DATA = 3) coupled with 3 swap chain buffers (swapChainBufferCount = 3)
  2. Using waitable swap chains, you can program the desired latency at runtime to be from 1 to 3 frames. Using the same GPU triple buffering with 3 swap chain buffers, the resulting latency is about 36 ms @ 60Hz for 2 frames of latency, and 24 ms @ 60Hz for 1 frame of latency. These numbers would be lower if using a 120Hz or 144Hz monitor.
  3. Note these vsync-on latency values are still far higher than with vsync off (i.e. tearing enabled), where latency is in the 3-6 ms range. However, the trade-off with vsync off is reduced image quality.
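
As a rough cross-check of the figures above, the sketch below computes an idealized latency of N queued frames multiplied by the refresh interval. The function names are hypothetical and do not come from the RBDoom3BFG code. The model ignores CPU/GPU work time and compositor overhead, which may be why the measured values (~53, ~36, and ~24 ms) sit somewhat above the nominal 50, 33.3, and 16.7 ms at 60Hz.

```cpp
#include <cassert>

// Length of one refresh interval in milliseconds.
inline double FramePeriodMs( double refreshHz )
{
	assert( refreshHz > 0.0 );
	return 1000.0 / refreshHz;
}

// Idealized present-queue latency (assumption): every queued frame waits
// exactly one refresh interval. Real measurements will be somewhat higher.
inline double NominalQueueLatencyMs( int queuedFrames, double refreshHz )
{
	assert( queuedFrames >= 1 );
	return queuedFrames * FramePeriodMs( refreshHz );
}
```

Under the same assumption, 2 frames of latency at 144Hz would come to roughly 13.9 ms, which is consistent with the remark that faster monitors lower these numbers.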

The above latency numbers were observed in windowed mode using PresentMon in conjunction with the Optick profiler improvements from my previous pull request #780. Borderless fullscreen mode gave similar results.

The tradeoff for lower latency is reduced CPU/GPU overlap and FPS throughput. With a powerful GPU this may not be very noticeable: even with my relatively low-powered 6600 XT, I can easily drive the game in the headquarters hallway scene at 60Hz with the DXGI waitable object set to 1 or 2 frames of latency. However, I am not sure how things would perform with only 1 frame of latency (effectively no CPU/GPU parallelism) during heavy action sequences. For this reason I am recommending a frame latency of 2 to achieve some CPU/GPU parallelism, mirroring the recommendations in article 1 above. In my experiments with RBDoom3BFG, I still found it important to keep NUM_FRAME_DATA = 3 for best performance, coupled with the waitable swap chain set to 2 frames of latency.
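
The latency/throughput tradeoff described above can be illustrated with a toy simulation. This is a simplified model under stated assumptions, not the engine's scheduling code: vsyncs occur at fixed multiples of the refresh interval, the CPU may start frame i only after frame i - maxLatency has been scanned out (standing in for the wait on the swap chain's waitable object), and the GPU processes frames strictly in order. All names and the CPU/GPU timings are hypothetical; times are in microseconds.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

struct SimResult
{
	std::int64_t latencyUs;   // CPU start -> scan-out for the last simulated frame
	std::int64_t intervalUs;  // time between the last two scan-outs
};

// Toy model (assumption): vsyncs occur at k * vsyncUs for k >= 1.
SimResult SimulateFrameLatency( int maxLatency, std::int64_t cpuUs, std::int64_t gpuUs,
								std::int64_t vsyncUs, int frames )
{
	assert( maxLatency >= 1 && frames > maxLatency + 1 && vsyncUs > 0 );
	std::vector<std::int64_t> start( frames ), shown( frames );
	std::int64_t cpuFree = 0;  // when the CPU finished submitting the previous frame
	std::int64_t gpuFree = 0;  // when the GPU finished rendering the previous frame

	for( int i = 0; i < frames; ++i )
	{
		// Gate: wait until frame i - maxLatency has left the present queue.
		const std::int64_t gate = ( i >= maxLatency ) ? shown[i - maxLatency] : 0;
		start[i] = std::max( cpuFree, gate );
		cpuFree = start[i] + cpuUs;
		gpuFree = std::max( cpuFree, gpuFree ) + gpuUs;

		// Scan out at the first vsync at or after GPU completion,
		// and strictly after the previous frame's vsync.
		std::int64_t earliest = gpuFree;
		if( i > 0 )
		{
			earliest = std::max( earliest, shown[i - 1] + 1 );
		}
		const std::int64_t k = std::max<std::int64_t>( 1, ( earliest + vsyncUs - 1 ) / vsyncUs );
		shown[i] = k * vsyncUs;
	}

	return { shown[frames - 1] - start[frames - 1], shown[frames - 1] - shown[frames - 2] };
}
```

With a light load (2 ms CPU, 4 ms GPU, 16667 µs refresh interval) the model settles at 1, 2, and 3 refresh intervals of latency for maxLatency = 1, 2, and 3. With a heavier load (8 ms CPU, 10 ms GPU), maxLatency = 1 serializes the CPU and GPU and the model presents only on every other vsync, while maxLatency = 2 still presents on every vsync. Real hardware will not behave this cleanly, so the numbers are only indicative.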

I have defined a new cvar, r_maxFrameLatency with default value of 2, to allow simple changes and experimentation. Permitted values are 0, 1, 2, and 3. The value of 0 turns off the feature. Values 1 and 2 are useful and correspond to the number of queued back buffers permitted in the swap chain. A value of 3 means using the full set of 3 swap chain buffers (same latency as off). Note this cvar cannot be changed on the fly (CVAR_INIT), and must be set up when the swap chain is first initialized. To change it you must use the autoexec.cfg file or enter seta r_maxFrameLatency <x> in the console and restart the game.
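
The cvar semantics described above could be captured by a small helper like the one below. This is a hypothetical sketch, not the code in this PR: the struct, the function name, and the clamping behaviour are assumptions made for illustration. For context, in DXGI a waitable swap chain is created with the DXGI_SWAP_CHAIN_FLAG_FRAME_LATENCY_WAITABLE_OBJECT flag, after which IDXGISwapChain2::SetMaximumFrameLatency and IDXGISwapChain2::GetFrameLatencyWaitableObject can be used; the helper only decides what values would be passed to such calls.

```cpp
#include <algorithm>
#include <cassert>

struct FrameLatencyConfig
{
	bool useWaitableSwapChain;  // false when the cvar is 0 (feature off)
	int  maxQueuedFrames;       // back buffers allowed in the present queue
};

// Hypothetical helper: interpret an r_maxFrameLatency value against the
// number of swap chain buffers. Out-of-range values are clamped (assumption).
inline FrameLatencyConfig ResolveFrameLatency( int cvarValue, int swapChainBufferCount )
{
	assert( swapChainBufferCount >= 2 );
	const int clamped = std::max( 0, std::min( cvarValue, swapChainBufferCount ) );
	if( clamped == 0 )
	{
		// Feature off: latency is bounded only by the swap chain buffer count.
		return { false, swapChainBufferCount };
	}
	return { true, clamped };
}
```

Because the cvar is read only at initialization (CVAR_INIT), a helper like this would be evaluated once when the swap chain is created rather than every frame.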

Here are a couple of Optick screen grabs to show the results. You can see that the VSync/Present queue is now labeled with the FrameID, allowing direct inspection of latency:

Capture 1 showing ~52 msec of latency (60Hz, vsync on, r_maxFrameLatency = 0, same as current app):

Windows DX12 Windowed Present Buffers=3

Capture 2 showing ~36 msec of latency (60Hz, vsync on, r_maxFrameLatency = 2):

Windows DX12 Windowed Present Buffers=3,FL=2

Capture 3 showing ~24 msec of latency (60Hz, vsync on, r_maxFrameLatency = 1):

Windows DX12 Windowed Present Buffers=3,FL=1

Capture 4 showing the difference between the swap chain waitable object ("DX12_Sync1") and the GPU triple-buffering sync ("DX12_Sync3") (60Hz, vsync on, r_maxFrameLatency = 3). The "DX12_Sync1" point aligns with the start of a new present frame, and "DX12_Sync3" aligns with completion of the GPU work for frame N-2. Depending on the settings and game circumstances, both may be observable, as seen below:

Windows DX12 Windowed Sync1+Sync3, FL=2

Calinou commented 11 months ago

Can a double-buffered V-Sync option be added to minimize input lag? Depending on your settings, you may be able to reach the maximum FPS at all times. In this situation, there's not much reason to use triple-buffered V-Sync.

SRSaunders commented 11 months ago

The short answer is yes, the capability is already there. There are two mechanisms to deliver double-buffering (or equivalent latency):

  1. Set NUM_FRAME_DATA = 2 (in precompiled.h) and swapChainBufferCount = 2 (in DeviceManager.h) and recompile the current code base (i.e. without this PR). This will reduce latency to about 20-24 msec @ 60Hz, but your FPS results will depend entirely on how fast your GPU is. With only 2 GPU frame data slots and 2 swap chain buffers, CPU/GPU processing is serialized and throughput is reduced. For instance, with my 6600 XT, I can no longer achieve 60 FPS (more like 40-45 FPS) in the benchmark headquarters hallway scene, whereas it easily surpasses 60 FPS (more like 200+ FPS unlocked with vsync off) when NUM_FRAME_DATA = 3 and swapChainBufferCount = 3 (i.e. triple buffering, with better CPU/GPU parallelism).
  2. Alternatively, with this PR (if merged), you can leave NUM_FRAME_DATA = 3 and swapChainBufferCount = 3 for higher throughput and FPS, but set r_maxFrameLatency = 2 (the default) to achieve a latency of about 36 msec @ 60Hz, or r_maxFrameLatency = 1 for a latency of 20-24 msec @ 60Hz. Setting r_maxFrameLatency = 1 eliminates CPU/GPU parallelism, as in option 1 above, and reduces the ability to absorb processing spikes without FPS drops. Even so, at least for my setup, this is better than option 1 since I can maintain higher FPS rates for equivalent latency settings. I am recommending a setting of r_maxFrameLatency = 2 for a balanced approach to performance and latency. Your mileage may vary based on the strength of your GPU and display refresh rate.

If you want to experiment, I suggest you try out this PR with PresentMon and provide feedback on your results. I would be very interested in how this works for other setups. Alternatively, you could try out the Intel sample application in link 1 of my PR write-up above. I found this interactive latency demo very helpful in understanding the tradeoffs.