RobertBeckebans / RBDOOM-3-BFG

Doom 3 BFG Edition source port with updated DX12 / Vulkan renderer and modern game engine features
https://www.moddb.com/mods/rbdoom-3-bfg
GNU General Public License v3.0

Frame sync changes for higher performance & fps #750

Closed · SRSaunders closed this 1 year ago

SRSaunders commented 1 year ago

Over the past few weeks I have been doing some profiling using Xcode Instruments to see if I could improve macOS performance / fps. This process allowed me to observe that RBDoom3BFG is largely serializing CPU-side command queue processing and GPU processing. This is especially bad on macOS given that Vulkan commands have to be translated via MoltenVK to Metal commands on the fly, and this overhead makes things worse compared to Windows and Linux where Vulkan is native. The main reason for the serialization is due to the device-level waitForIdle() call inside GL_BlockingSwapBuffers(), where the CPU waits for GPU operations to complete before getting started on the next frame's game thread and command queue processing. This limits the parallel overlap between the CPU and GPU and really slows things down. For vanilla versions of Doom3 this may not be so bad since rendering was fairly simple. However, with RBDoom3BFG rendering is becoming more complex and time consuming, and parallelism is needed.
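For reference, a minimal sketch of the blocking pattern described above, assuming nvrhi's device-level waitForIdle(); the call site is paraphrased, not the verbatim engine code:

    // Paraphrased sketch of the old sync: a full-device wait serializes
    // CPU work on frame N+1 behind the GPU finishing frame N.
    void idRenderBackend::GL_BlockingSwapBuffers()
    {
        // CPU stalls here until the GPU has drained *all* queues
        deviceManager->GetDevice()->waitForIdle();

        // only then is the frame presented and the next game thread frame started
        deviceManager->Present();
    }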

I noticed that NVRHI offers other modes of sync, most notably command queue event queries. This pull request experiments with two things: a) using triple frame buffering vs. double buffering for NVRHI (DX12 & Vulkan), and b) replacing the device waitForIdle() device sync with a command queue sync operation based on completion of the previous frame's graphics command queue. This should allow the CPU to get going without waiting for GPU idle, and thus enable more parallelism (assuming we are using triple buffering). The results with my AMD 6600XT card are fairly significant as seen below:
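The queue-level alternative boils down to nvrhi's event-query API. A simplified sketch of the pattern used in this PR (m_FrameWaitQuery is created once via createEventQuery(); the full Present() is quoted later in this thread):

    // Wait only for the previous frame's graphics queue, not the whole device
    m_NvrhiDevice->waitEventQuery( m_FrameWaitQuery );   // block until last frame's queue completed
    m_NvrhiDevice->resetEventQuery( m_FrameWaitQuery );  // re-arm the query
    m_NvrhiDevice->setEventQuery( m_FrameWaitQuery, nvrhi::CommandQueue::Graphics );  // signal after this frame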

Win32 DX12: before: 50 fps (as of Feb 28, before additional commits), after: 136 fps
Win32 Vulkan: before: 74 fps (as of today using up-to-date master), after: 133 fps
Linux Vulkan: before: 27 fps (as of Feb 28, before additional commits), after: 122 fps
macOS Vulkan: before: 38 fps (as of today using up-to-date master), after: 58 fps

This work has been tested with my AMD 6600XT card on all OSs, as well as with the integrated Intel UHD 630 GPU (on my Intel 8700K) to exercise these sync changes on a slow GPU. So far so good. However, I would like to see how they behave on a faster GPU (i.e. a more powerful Nvidia or AMD chip), to make sure the sync method I have implemented here does not break down.

In addition this pull request also includes:

  1. Save screenshots to fs_savepath (as per original Doom3 default) vs. fs_basepath
  2. Fix SDL display mode size/refresh rate changes when already in fullscreen mode
  3. Fix compilation errors when Optick profiling is OFF. (Note: setting OPTICK=ON for macOS/Linux exposes some issues with including optick.h in precompiled.h, but this may not be the main use case so I did not attempt to fix it. On macOS the default profiler is Instruments' Game Performance tool.)

Win32 DX12 before and after screenshots:

[screenshots: win32-dx12-before-pic1, win32-dx12-after]

Win32 Vulkan before and after screenshots:

[screenshots: win32-vulkan-before, win32-vulkan-after]

RobertBeckebans commented 1 year ago

This is very interesting and will take some time to review. The BFG edition already has improved syncing behaviour compared to vanilla Doom 3: it shifts the GPU work into the next CPU frame. In vanilla D3 it was just BeginFrame() -> build all GL commands -> wait for completion in EndFrame(), and then swap the buffers immediately. I also integrated Optick so we can see better what's happening here.

Could you remove the screenshot path change? Vanilla D3 had a savepath but it wasn't used on Windows, and in general it is just annoying to have to go to the home savepath just to share screenshots. Especially if you work on a mod, you just want to drag them out of your mod's working directory. So that change was intentional in this source port.

RobertBeckebans commented 1 year ago

Also the Optick-related changes seem to be wrong. optick.h should always be included in precompiled.h, and then

    if(OPTICK)
        target_compile_definitions(RBDoom3BFG PUBLIC USE_OPTICK=1)
    else()
        target_compile_definitions(RBDoom3BFG PUBLIC USE_OPTICK=0)
    endif()

sets USE_OPTICK. So all Optick macros do nothing if you configure with cmake -DOPTICK=OFF, and they don't need an extra #ifdef USE_OPTICK wrap.
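In other words, the call sites stay clean because the macros themselves compile away. A simplified illustration of how optick.h behaves (a model for this discussion, not the real header):

    // Simplified model of optick.h (illustration only)
    #if USE_OPTICK
        #define OPTICK_EVENT( ... ) /* real instrumentation */
    #else
        #define OPTICK_EVENT( ... )  // expands to nothing, so no #ifdef wrap is needed at call sites
    #endif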

RobertBeckebans commented 1 year ago

Sorry, I hadn't tested Optick on Linux before merging it into master. The last commit fixes the compilation but requires a rebase of your PR.

RobertBeckebans commented 1 year ago

Holy shit, this is a speed increase from 143 fps to 240 fps on my AMD laptop.

[screenshot: rbdoom-3-bfg-20230305-123920-001]

SRSaunders commented 1 year ago

Thanks for looking at this PR.

  1. I will revert the screenshot change for Windows and Linux. On macOS putting the screenshots into basepath is pretty bad since it may be located inside of an app bundle.
  2. I will also rebase the PR onto your Optick fix and test again for compilation on macOS / linux.
  3. I will add one small revert for compileshaders.cmake to bring macOS up to the SPIRV target = vulkan1.2
  4. I would also like to revert the parallel shader compilation test for macOS, but this depends on a pending PR in your nvrhi clone (https://github.com/RobertBeckebans/nvrhi/pull/3). I did notice that you have a pending PR on updating nvrhi, but are also facing parallel compilation problems on Windows. Does my pending nvrhi tools fix help in any way?

Mainly I am interested in your review of the perf improvement and whether it's a safe change that meets the sync requirements.

SRSaunders commented 1 year ago

Changes completed now. Thanks for the feedback on USE_OPTICK. I was not aware of the optick.config.h file and the setting in there was causing problems on linux/macOS.

RobertBeckebans commented 1 year ago

One of the worst-case scenes in Doom 3, with up to 50 point lights, many of them casting shadows. All shots in 1080p.

RBDOOM-3-BFG 1.4.0 release binaries with OpenGL: 91 fps

[screenshot: rbdoom-3-bfg-20230306-121203-001]

The master branch has a few fewer draw calls because it fades out distant shadow-casting lights, and the shadowmap resolution is only half of 1.4.0's, so all shadows fit into the 8192^2 atlas.

master branch with DX12: 107 fps [screenshot]

sync-changes branch with DX12: 176 fps [screenshot]

master branch with Vulkan: 70 fps [screenshot]

sync-changes branch with Vulkan: 111 fps [screenshot]

I also brought this up for discussion on Discord, and these are the results from Samson (KozGit DOOM-3-BFG VR author):

master branch with DX12: 429 fps [screenshot]

sync-changes branch with DX12: 666 fps (teleporter-to-hell speed) [screenshot]

stephenap07 commented 1 year ago

I totally forgot about the waitForIdle I added. That was only supposed to be temporary.

There's a fence for each swap chain buffer that's supposed to handle the syncing, but it wasn't working right.

deviceManager->BeginFrame() is supposed to wait for the previous buffered frame:

void DeviceManager_DX12::BeginFrame()
{
    // bufferIndex is the swap chain's current back buffer index; the event
    // is signaled once the fence for that buffered frame has completed.
    WaitForSingleObject(m_FrameFenceEvents[bufferIndex], INFINITE);
}

but I don't think that's placed in the correct spot. We could probably simplify this a bit by reusing these fences and moving the DeviceManager:: calls around.
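For context, the signalling side of those fences lives in Present(). Roughly like this in the Donut-derived DeviceManager (paraphrased from memory; details may differ):

    // Paraphrased: signal the frame fence from the graphics queue after Present,
    // and arm the per-buffer event that BeginFrame() above waits on.
    void DeviceManager_DX12::Present()
    {
        UINT bufferIndex = m_SwapChain->GetCurrentBackBufferIndex();
        m_SwapChain->Present( m_DeviceParams.vsyncEnabled ? 1 : 0, 0 );

        m_FrameFence->SetEventOnCompletion( m_FrameCount, m_FrameFenceEvents[bufferIndex] );
        m_GraphicsQueue->Signal( m_FrameFence, m_FrameCount );
        m_FrameCount++;
    }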

RobertBeckebans commented 1 year ago

I don't understand why this requires NUM_FRAME_DATA = 3. If I understand this correctly then there should be only a single synchronization point between 2 frames in GL_BlockingSwapBuffers.

Is it that, at the moment,

    void DeviceManager_DX12::Present()
    {
        // SRS - Sync on previous frame's command queue completion vs. waitForIdle() on whole device
        m_NvrhiDevice->waitEventQuery( m_FrameWaitQuery );
        m_NvrhiDevice->resetEventQuery( m_FrameWaitQuery );
        m_NvrhiDevice->setEventQuery( m_FrameWaitQuery, nvrhi::CommandQueue::Graphics );
        ...
    }

doesn't wait for the completion of the previous frame's NVRHI command list but the one second to last? I get a lot of flickering when I try NUM_FRAME_DATA = 2 on Windows.

It seems that GL_BlockingSwapBuffers() became much faster in the profiles:

[screenshot: Optick profiler capture]

RobertBeckebans commented 1 year ago

This is how the GPU frame overlap looks in Optick when r_swapInterval is 2:

[screenshot: Optick profiler capture]

RobertBeckebans commented 1 year ago

This is the frame overlap with r_swapInterval 0 and com_fixedTic 1 to go over 120 Hz:

[screenshot: Optick profiler capture]

The required changes are in branch 750-better-vsync.

SRSaunders commented 1 year ago

Thanks @RobertBeckebans for stress testing this possible change under various scenarios, and especially for reaching out on Discord for testing on a fast RTX 4090 rig. I believe NUM_FRAME_DATA = 3 helps by increasing the availability of free swap chain images as CPU/GPU parallelism increases (with constant rendering work per frame).

Regarding the flickering with NUM_FRAME_DATA = 2, I am concerned that I may not have implemented the correct sync operation to guarantee proper results under the more stressful condition of fewer swap chain images, i.e. double vs. triple buffering without waitForIdle() slowing things down. At least on Vulkan, I have verified that I implemented a wait on the barrier command list which terminates a frame. And I do believe the wait is on the previous frame's command list and not 2 frames back.
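As a timeline (illustration only, assuming NUM_FRAME_DATA = 3 and the wait-query pattern quoted above):

    // CPU: record frame N   | record frame N+1 | record frame N+2
    // GPU:                  | execute frame N  | execute frame N+1
    //
    // Present() for frame N+1 waits on the event query signaled when frame N's
    // graphics queue drains, i.e. n - 1 frames back. With NUM_FRAME_DATA = 2 the
    // GPU may still be reading frame N's per-frame buffers while the CPU rewrites
    // them, which could explain the flickering.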

If @stephenap07 has an alternative sync idea with fences that could survive NUM_FRAME_DATA = 2 without image corruption I would be happy to see that improvement. As an aside, I played around with this concept on macOS with MoltenVK using the fence in vkAcquireNextImageKHR(), but could not get any improvement over what I did with nvrhi queue events. Perhaps I was doing something wrong so I am curious to see if Stephen P has a better solution here.
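For reference, the MoltenVK experiment mentioned above looked roughly like this (a sketch only; handle creation and error handling omitted, and device / swapchain / imageAvailableSemaphore / acquireFence are assumed to be set up elsewhere):

    // Acquire the next swap chain image and also get a CPU-side fence,
    // then block until the image is actually free for reuse.
    uint32_t imageIndex;
    vkAcquireNextImageKHR( device, swapchain, UINT64_MAX,
                           imageAvailableSemaphore,  // GPU-side sync for rendering
                           acquireFence,             // CPU-side fence for the acquire
                           &imageIndex );
    vkWaitForFences( device, 1, &acquireFence, VK_TRUE, UINT64_MAX );
    vkResetFences( device, 1, &acquireFence );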

UPDATE: I posted this without refreshing the page for your most recent profiler captures. I was very happy to see the CPU/GPU parallelism achieved by this improvement. I will try out your 750-better-vsync branch on macOS and linux.

UPDATE 2: Tried out 750-better-vsync and it works fine. However, I had to apply your refactoring of m_DeviceParams, m_RequestedVSync, and m_windowVisible to sdl_vkimp.cpp. Perhaps you could update that file before merging into master, otherwise the Linux and macOS builds will be broken.

Also if you could add commit https://github.com/RobertBeckebans/RBDOOM-3-BFG/pull/750/commits/5aad7eb005384e195a4ea7f57a0a4edc848569b8 to your 750-better-vsync branch I would appreciate it. This macOS-only change disables Metal API validation for Xcode debug builds - which improves fps while debugging.

RobertBeckebans commented 1 year ago

Well, waitForIdle() was some kind of explicit sync regardless of the v-sync implementation later in DeviceManager::Present(). If we have up to 2 command lists running in parallel on the GPU, then that would explain why NUM_FRAME_DATA = 3 is required. Usually the game thread fills the vertexCache model data for the current frame, then the index is swapped and passed to the renderer backend, which records the DX12/Vulkan command lists using those cache buffers and finally fires them off to the GPU in idRenderBackend::GL_EndFrame(). If we are lagging 1 frame behind, then the command list would require a third vertexCache entry referring to the model data of that frame in order to render correctly.
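A rough model of the frame ring being described (names patterned after idVertexCache but simplified; not the real class):

    // Illustration: with the GPU allowed to run one frame behind, a third
    // ring entry keeps frame N-1's geometry alive until the queue sync passes.
    const int NUM_FRAME_DATA = 3;   // 3 with this PR

    struct vertCacheFrame_t { /* mapped vertex / index buffers for one frame */ };

    vertCacheFrame_t frameData[ NUM_FRAME_DATA ];
    int listNum = 0;       // entry the game thread is writing
    int drawListNum = 0;   // entry the backend reads while recording command lists

    void SwapFrameData()
    {
        // the backend renders what the game thread just finished, while the
        // GPU may still be consuming the entry from one frame earlier
        drawListNum = listNum;
        listNum = ( listNum + 1 ) % NUM_FRAME_DATA;
    }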

However, it seems that triple buffering is the default in the Donut samples, and when debugging the fence mechanism in DeviceManager::Present() that you implemented, I could see that it was still referring to the background swapbuffer index of the previous frame, so it should be waiting for n - 1 and not n - 2. Well, at least when r_swapInterval is 1 or 2, which is the case we want anyway to avoid screen tearing.

The goal should be no screen tearing and a stable high performance.

Regardless of that, it seems to be very stable and a big perf improvement, even if it might add 6 milliseconds of latency in 120Hz mode. You can even tell from the corresponding heat output that the hardware is better utilized.

We can still change this, but I would prefer to build new binaries of this for testing. I merged your last commit and fixed the Linux compile issues. #750 goes into master until we have something even better.

SRSaunders commented 1 year ago

Thanks Robert for doing all this analysis and testing work to validate the performance improvement. I am happy to see that it's stable under a range of conditions.

I compiled the new master, and while it works fine, shouldn't the OPTICK_GPU_CONTEXT below be protected inside the #if USE_OPTICK_GPU wrapper, especially since it's a D3D12-only call, i.e. not so good for Linux/macOS? Compilation works, but the IDE parser (e.g. Xcode, etc.) doesn't like it.

void idRenderBackend::DrawViewInternal( const viewDef_t* _viewDef, const int stereoEye )
{
    OPTICK_EVENT( "Backend_DrawViewInternal" );
    OPTICK_TAG( "stereoEye", stereoEye );

    /*
    if( deviceManager->GetGraphicsAPI() == nvrhi::GraphicsAPI::D3D12 )
    {
    }
    */

    OPTICK_GPU_CONTEXT( ( ID3D12GraphicsCommandList* ) commandList->getNativeObject( nvrhi::ObjectTypes::D3D12_GraphicsCommandList ) );

#if USE_OPTICK_GPU
    //uint32_t swapIndex = deviceManager->GetCurrentBackBufferIndex();
    //idStr eventLabel;
    //eventLabel.Format( "DrawView( frameIndex = %i, swapIndex = %i ) ", taaPass->GetFrameIndex(), swapIndex );
    OPTICK_GPU_EVENT( "DrawView" );
#endif
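
i.e. something along these lines (a sketch of the suggested guard, not a tested patch):

    #if USE_OPTICK_GPU
        // the D3D12 cast only makes sense when GPU profiling is enabled
        OPTICK_GPU_CONTEXT( ( ID3D12GraphicsCommandList* ) commandList->getNativeObject( nvrhi::ObjectTypes::D3D12_GraphicsCommandList ) );
        OPTICK_GPU_EVENT( "DrawView" );
    #endif
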
RobertBeckebans commented 1 year ago

Yeah, it's probably time for some continuous integration setup. It's fixed now.

SRSaunders commented 1 year ago

Thanks.