SpecialKO / SpecialK

Lovingly referred to as the Swiss Army Knife of PC gaming, Special K does a bit of everything.
https://www.special-k.info/
GNU General Public License v3.0

Benchmarking hooks / External API to set frame rate cap or scanline number during Latent Sync #76

Open mdrejhon opened 1 year ago

mdrejhon commented 1 year ago

Hello,

API Requirements

For an automated benchmarking app; these would be used as separate tasks, and sometimes in combination:

  1. API to set the scanline number during Latent Sync. ARGS (1): raster scan line number (integer). Not timing critical, as long as I eventually get the timestamp of the first frame this takes effect on.

  2. API to set the frame rate cap (non-Latent-Sync). ARGS (1): frame rate cap (float). Not timing critical, as long as I eventually get the timestamp of the first frame this takes effect on.

  3. API to enable/disable always doing Flush() after Present(); this stabilizes tearlines. ARGS (1): true or false (boolean). Not timing critical, as long as I eventually get the timestamp of the first frame this takes effect on.

  4. API to set a flag to trigger a white flash (or a specified color flash) in SK's next scheduled frame presentation. It should be at one edge of the screen, either the left or right edge, with enough room for a photodiode tester. ARGS (2): edge color of the frame flash as an RGB value (integer); whether I want that frame to Flush() even if I didn't enable item 3 (boolean). Not timing critical, as long as I eventually get two timestamps: (A) the time the flag was set, and (B) post-Present().

  5. API to set a callback function to send me RDTSC/QPC timestamps for all of the events above. ARGS (1): callback function. SK would call my callback with a timestamp and what it was for (you could use enums such as TIMESTAMP_CAP_CHANGE, TIMESTAMP_FLUSH_CHANGE, TIMESTAMP_POST_PRESENT, etc.). Not timing critical for either the API call timing or the callback timing.

  6. API to set a non-flash color (to prevent false alarms from in-app/in-game content). ARGS (1): edge RGB color to draw for frames that are not flashed, or -1 to disable (signed integer). Not timing critical, as long as I eventually get the timestamp of the first frame this takes effect on.
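
To make the request concrete, here is a rough sketch of what such an exported API surface could look like. Every name, signature, and enum value here is a purely hypothetical illustration of items 1-6 above, not anything SK currently exports:

```cpp
// Hypothetical sketch only -- none of these exports exist in Special K today.
// All names/signatures are illustrative of requests 1-6 above.
#include <cstdint>

enum SK_BenchTimestampKind : int32_t {
  TIMESTAMP_SCANLINE_CHANGE,   // first frame the new Latent Sync scanline applied to
  TIMESTAMP_CAP_CHANGE,        // first frame the new framerate cap applied to
  TIMESTAMP_FLUSH_CHANGE,      // first frame the Flush()-after-Present() toggle applied to
  TIMESTAMP_FLASH_FLAG_SET,    // moment the flash flag was set/observed
  TIMESTAMP_POST_PRESENT       // immediately after Present() (or Present()+Flush())
};

// Callback receives a QPC (or RDTSC) timestamp plus what it refers to.
using SK_BenchTimestampProc = void (*)(SK_BenchTimestampKind kind, int64_t qpc_ticks);

extern "C" {
  void SK_Bench_SetLatentSyncScanline (int      scanline);          // request 1
  void SK_Bench_SetFramerateCap       (float    fps);               // request 2
  void SK_Bench_SetFlushAfterPresent  (bool     enable);            // request 3
  void SK_Bench_RequestFlashNextFrame (uint32_t rgb, bool flush);   // request 4
  void SK_Bench_SetTimestampCallback  (SK_BenchTimestampProc cb);   // request 5
  void SK_Bench_SetNonFlashColor      (int32_t  rgb_or_minus1);     // request 6
}
```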

Immediacy of Item 4

The goal is for item 4 to be able to tell SK's next frame presentation event (whenever it may be, on its existing schedule) to do a "draw flash + Present() + Flush() + immediate timestamp".

As long as the boolean flag in item 4 can be set at any random point during the previous frametime, right up to roughly ~0.1 ms (100 µs) prior to SK's next scheduled frame presentation event -- I am happy. Timestamps provide the data I need for my mathematics. It simply needs to be a thread-safe boolean variable that the frame-presentation thread can check, and that same thread will also draw the white rectangle immediately before Present()+Flush().
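
To illustrate the kind of thread-safe flag I mean (a rough sketch with hypothetical names, not SK code): the benchmarking thread can set it at any random moment, and the presentation thread only reads it once, right before Present():

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical illustration of the flag hand-off -- not SK code.
std::atomic<bool>     g_flashRequested { false };
std::atomic<uint32_t> g_flashColorRGB  { 0xFFFFFF };

// Benchmarking/analyzer thread: may run at any random time during the prior frametime.
void RequestFlash (uint32_t rgb)
{
  g_flashColorRGB.store  (rgb,  std::memory_order_relaxed);
  g_flashRequested.store (true, std::memory_order_release);
}

// Presentation thread: checked once per frame, immediately before Present()+Flush().
bool ConsumeFlashRequest (uint32_t* rgb_out)
{
  if (! g_flashRequested.exchange (false, std::memory_order_acquire))
    return false;

  *rgb_out = g_flashColorRGB.load (std::memory_order_relaxed);
  return true; // caller draws the edge rectangle, then Present()+Flush()+timestamp
}
```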

SK does not need to touch/modify frame timing AT ALL. I was able to figure out I can use timestamps to solve my problem.

Flush() is important for certain things like stabilizing tearlines. It kills GPU performance, but it improves lag-test numbers for the purposes of determining display-only latency (GPU-port-to-photons) by nigh completely filtering the computer out of the lag number.

Timestamps Requirement

I have worked very hard to eliminate timing-precision requirements from the API calls themselves; SK only needs to report timestamps. Timestamps should come from a microsecond-class counter such as QueryPerformanceCounter() or RDTSC.

Except where additional timestamps are specified, timestamps should be grabbed immediately after the Present() event (or immediately after the Present()+Flush() event) corresponding to the frame in question for items 1/2/3/4.
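
As a hedged illustration (not SK's internals, and the callback/enum names are just the hypothetical ones from my sketch above), the timestamp capture I'm imagining is simply something like:

```cpp
#include <windows.h>
#include <d3d11.h>
#include <dxgi.h>
#include <cstdint>

// Sketch only (hypothetical names): capture a QPC timestamp immediately after
// Present() -- or after Present()+Flush() when request 3 is enabled -- and pass
// it to whatever callback the benchmarking app registered.
using TimestampProc = void (*)(int kind, int64_t qpc_ticks);
constexpr int TIMESTAMP_POST_PRESENT = 2; // illustrative enum value

void PresentAndTimestamp (IDXGISwapChain*      swap,
                          ID3D11DeviceContext* ctx,
                          bool                 flushAfterPresent,
                          TimestampProc        callback)
{
  swap->Present (0, 0);            // present on SK's existing schedule

  if (flushAfterPresent)
    ctx->Flush ();                 // request 3: optional post-present flush

  LARGE_INTEGER qpc;
  QueryPerformanceCounter (&qpc);  // timestamp right after Present()(+Flush()) returns

  if (callback != nullptr)
    callback (TIMESTAMP_POST_PRESENT, qpc.QuadPart);
}
```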

I just want to be signalled with the timestamps, so I can math out / validate / invalidate my test results based on the stream of timestamps.

Goal: Beamracing to filter display lag from GPU/software lag

With all of this data, Blur Busters has invented algorithms to successfully filter display lag away from GPU/software lag, during lag testing too. (This is called the "Blur Busters beam raced display latency filter algorithm.")

"beam raced latency measurements" is an algorithm we've invented to be able to filter display latency separately of GPU/software latency). I would just lag-test all scanline numbers one by one (or binary search towards). After algorithming it all out (on the timestamps), the lowest lag number would be the display-latency only.

Blur Busters plans to open-source this algorithm for all photodiode testers, but we are already successfully using it with our in-house photodiode tester to filter display lag numbers away from GPU/software lag. (IMPORTANT: Credit to Blur Busters is mandatory.)

Open to alternatives

I'm open to using RTSS or SpecialK, just need programmatic control of the scanline number from a background analyzer app if possible -- this won't be used in production gaming, but only during benchmarking sessions.

I realize RTSS might be the "right tool for the right job" since it currently has more API capabilities, but I like SK's superior scanline sync, so I'm kind of trying to figure out how to have my cake and eat it too.

If SK can do something like this, even as an undocumented setting, I'd be happy! I realize that this might be a big ask, but I bet this will be a big help to a ton of websites, ourselves included.

Free Blur Busters Invention

If any lag-test vendors such as OSRTT, LDAT, etc. implement beamracing in latency testing, as a method of filtering display lag from GPU/software lag, then please credit Blur Busters -- whether you're implementing this algorithm in your photodiode tester software or in whatever open-source Arduino tester you've homebrewed. Thank you!

Alternatively, wait for us to publish a white paper on this (preferred). Either way, it is revolutionary for a PC to be able to completely self-filter its own latencies, isolating GPU-output-to-photons latency (aka display lag) -- for certain kinds of benchmarking, for color-coding latency bars, or for other future visualizations (like framerate-vs-lag graphs for testing all VRR framerates, etc.).

mdrejhon commented 1 year ago

Some discussion on Discord: https://discord.com/channels/778539700981071872/778539700981071875/1130619440719999059

mdrejhon commented 1 year ago

I have updated the information to simplify the feature request.

  1. I removed API timing-precision requirements completely!
  2. I removed the requirement for SK to modify its existing merry frame presentation timing.

The only immediacy consideration is the thread-safe boolean flag, which needs to be settable at any time during the previous frametime, right up until presentation of the new frame. (And the presentation hook thread will do a last-minute white-flash modification of that frame, even if I happen to set the flag only 0.1 ms prior.)

The flash should be maybe 5% of screen width at either the left or right edge -- enough room for a common photodiode tester not to slam against monitor bezels when testing that location. Paint it as a full-height rectangle along one screen edge.

EDIT: I just remembered I need an additional API to set/remove a non-flash color (e.g. black), so I don't get false triggers from in-game material (when testing a game) or in-app content (custom test patterns optimized to measure specific things, such as display-only lag).
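
For illustration, here's a hedged D3D11 sketch of the rectangle I'm describing (using ID3D11DeviceContext1::ClearView with a rect; just an example of the geometry, not a prescription for how SK should actually draw it):

```cpp
#include <d3d11_1.h>

// Hedged sketch: paint a full-height rectangle (~5% of width) at the left or
// right screen edge, filled with either the flash color or the non-flash color.
// Uses ID3D11DeviceContext1::ClearView (D3D11.1); not a claim about SK internals.
void PaintEdgeRect (ID3D11DeviceContext1*   ctx,
                    ID3D11RenderTargetView* rtv,
                    UINT backbufferWidth, UINT backbufferHeight,
                    bool rightEdge, const FLOAT rgba[4])
{
  const LONG rectWidth = static_cast<LONG> (backbufferWidth / 20); // ~5% of width

  D3D11_RECT rect = { };
  rect.left   = rightEdge ? static_cast<LONG> (backbufferWidth) - rectWidth : 0;
  rect.right  = rect.left + rectWidth;
  rect.top    = 0;
  rect.bottom = static_cast<LONG> (backbufferHeight);              // full height

  ctx->ClearView (rtv, rgba, &rect, 1);
}
```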

Kaldaien commented 1 year ago

Okay, I'm getting to work on some of this now. I can accommodate all of these requests except for item 3/4.

Fundamentally, flushing after submitting a finished frame is meaningless in modern graphics APIs.

To elaborate, Special K already flushes OpenGL/D3D9/D3D11 before adding CPU delays (in all framerate limiter modes). This is because in those APIs there is a small chance that queued render commands haven't been submitted to the GPU yet before the framerate limiter makes everything go idle. Normally, if there were no framerate limiter, the game's Swap Buffers/Present would include an immediate implicit flush; Special K replicates the immediate flush but delays the Present.
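
In simplified terms, the D3D11 ordering is roughly this (a sketch of the idea, not the actual implementation):

```cpp
#include <d3d11.h>
#include <dxgi.h>

// Sketch of the ordering described above (not Special K's actual code):
// flush any queued work immediately -- as an un-limited Present() implicitly
// would -- then let the limiter idle the CPU, then actually present.
void LimitedPresent (IDXGISwapChain* swap, ID3D11DeviceContext* ctx,
                     void (*framerateLimiterWait)(void))
{
  ctx->Flush ();            // replicate the implicit flush of an immediate Present()
  framerateLimiterWait ();  // CPU delay added by the framerate limiter
  swap->Present (0, 0);     // the delayed Present()
}
```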

D3D12 and Vulkan have done away completely with explicit and implicit flushing; they begin executing commands on the GPU as soon as they're submitted.

The only scenario I can think of where a flush after present would do anything at all is single-buffered OpenGL. The command queue in all of these graphics APIs is flushed during any kind of double-buffered present; you can't finish a frame without the API flushing things.

I could eliminate the flush that Special K applies before framerate limiting, but I don't actually know what purpose that would serve. In my mind, that only introduces the possibility that, after the framerate limiter wakes up, there's some extra GPU work remaining during presentation that might cause a missed deadline. It wouldn't increase framerate or anything like that; the GPU still has to do this work to complete a frame, and the sooner it begins, the better.

mdrejhon commented 1 year ago

Request 4 (other than the Flush part) is essential for the display lag filter algorithm

I hope you are only referring to the Flush part of Request 4. My ability to measure display lag depends on being able to fully filter the software/GPU (the computer) out of the lag number.

The margin between setting the flag and the presenter seeing the flag and then doing a last-minute draw before presenting will be my error margin for how well I can filter the computer/software from display lag.

If I cannot set the flag as little as 0.1 ms prior to your frame-present doing a last-minute flash, it throws the algorithm right out the window, and without that critical gating factor the other requests become useless.

That is to say, I need to be able to set a flag to tell your frame-present routine to add a last-minute draw. If that can't be done, the "beamraced display lag filter algorithm" is thrown out the window, as I would be completely unable to separate display lag from GPU/software lag...

That being said, I would still have other use cases for controlling the frame cap externally (e.g. controlling the frame rate of TestUFO running in a fullscreen window, if SK is able to 'latch' onto Chromium's DirectX-based fullscreen buffer). I do have a need for a VRR-compatible TestUFO, and Chrome can run with the --disable-gpu-vsync --disable-frame-throttle command-line options, so an external cap can then control the TestUFO framerate. On Windows I even see tearlines in Chrome when TestUFO runs at 2000fps, and it does kind of make fullscreen WebGL games work with VRR (if I force VRR via NVCP). So, provided SpecialK is able to latch onto the DirectX-based framebuffer of Google Chrome in fullscreen mode -- FINALLY, I can do VRR-TestUFO!

Then I could write a wrapper around Chromium Embedded Framework (CEF) and simply use the SpecialK API to control the framerate of TestUFO -- without modifying the Chromium source code, just forcing VRR and the --disable-gpu-vsync and --disable-frame-throttle flags. So that would be another use case for me, unrelated to measuring display lag... meaning that even if you're unable to do item 4, this isn't useless.

It's possible Flush might be unnecessary, but some findings:

You flush early, even before the frame is presented? Perhaps that's why SK behaves better than RTSS. How stable is the Latent Sync tearline? Can it stay near raster-exact?

I found it kind of depends on the graphics drivers. Flush was necessary on certain GPUs, such as the GTX 1080 Ti, to stabilize the tearlines to near-pixel-exact positions. But I'm happy as long as I can get tearline steering as accurate with SK as with an external Present()+Flush() app.

https://www.youtube.com/watch?v=OZ7Loh830Ec https://www.youtube.com/watch?v=tQW7-VbrD1g https://www.youtube.com/watch?v=6M9XdACBUnk

Also, GPUs often fall asleep while waiting for a frame to finally Present(), so when the GPU wakes up, the Present() is a bit lagged. A repeat Flush(), even if the draw commands are already done, sometimes serves the dual purpose of ending GPU power management right on the spot.

But I did notice that on my RTX 3080 GPU, adding a Flush didn't make as much of a difference.

One great way to monitor whether Flush worked is to intentionally configure Latent Sync or Scanline Sync to move the tearline on-screen (middle of screen) while displaying fast horizontally panning material. In RTSS Scanline Sync, enabling the flush setting in the config (Present+Flush) suddenly stabilized the tearline, at the cost of high GPU utilization (slowdown). That's OK when I'm prioritizing display lag testing (the beam raced lag filtering algorithm is most accurate with perfectly stationary tearlines, à la the YouTube videos of Tearline Jedi).

Tearlines are amazing timer-precision debugging tools! Basically, you intentionally place a tearline at a specific raster, and experiment with various techniques to stabilize it (e.g. busylooping on QueryPerformanceCounter() produced much better timing precision than a timer). What kind of mechanism do you use to 'time' your scanline?
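
For reference, the busy-loop technique I use in my own test apps is essentially this (a generic sketch, not a claim about SK's mechanism):

```cpp
#include <windows.h>

// Generic sketch of the busy-wait technique used in my own test apps:
// spin on QueryPerformanceCounter() until the target time, then Present()
// immediately. Sub-0.1ms tearline placement comes from the spin, not a timer.
void BusyWaitUntil (LONGLONG targetQpcTicks)
{
  LARGE_INTEGER now;
  do {
    QueryPerformanceCounter (&now);   // eats one CPU core, but microsecond-accurate
  } while (now.QuadPart < targetQpcTicks);
}
```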

At a 67 kHz scan rate, aka "horizontal refresh rate" or "number of scanlines per second" (1080p 60 Hz), a 2-pixel jitter in the tearline translates to an error margin of only 2/67000 sec (about 30 µs) of inaccuracy! Incredible that I can beamrace that accurately in mere MonoGame C#... So, the more stable the tearlines are during the beamraced lag filtering algorithm I've invented, the more accurate lag tests can become.

I don't know if Flush is needed here -- perhaps not -- but my litmus test will be comparing tearline stability in SK versus tearline stability in my external app (for graphics of the same complexity). I would test it on both a GTX 1080 Ti and an RTX 3080; they behave very differently in frameslices/sec count and tearline stability. Surprisingly, the 1080 Ti has more stable rasters than the RTX 3080, due to the more hyper-pipelined design of the 3080, but I can still get sub-0.1 ms precision on tearlines.

But since step 4 (the last-minute flash draw) is absolutely essential/germane to the algorithm, and since it buffers a single last-minute draw command (a single white rectangle, likely drawn as two solid triangles), that likely has to be flushed "again" to be deterministic. In other words, you've flushed the earlier draw commands -- but since you're adding a last-minute white rectangle to the framebuffer, it needs to be re-flushed so the draw command doesn't sleep/power-manage/etc. and amplify the tearline jitter. On the other hand, this is very driver- and GPU-dependent; it reduced the tearline-jitter error margin of some GPUs by 90%.

Regardless of what you do, as long as I can achieve near-zero tearline jitter (on at least one of my GPUs), I'm happy, even if I have to cherry-pick a GPU that doesn't need a bit of additional flush help -- although that potentially limits the market.

I have already done experiments with MonoGame in C#, and in theory I could just write a standalone app implementing my "beamraced display latency filtering algorithm" -- since, as long as I time tearlines that precisely, I've clearly achieved almost zero (way sub-1 ms) Present()-to-VGA-output latency (verified with an oscilloscope on the VGA output, via a DVI-I adaptor on an older GPU that still had the analog pins). On that basis, I was able to come up with a method of filtering the computer/system out of display latency, since GPU-output-to-photons is considered largely display latency (albeit digital cables add a bit of transceiver latency that gets included in the display latency).

Regardless, instead of being forced to use only an app I write, I want to be able to use multiple software programs, including software I did not write, which is why I'm asking SK to add an API instead. It also allows measuring game lag, etc. Fewer wheels reinvented.

mdrejhon commented 1 year ago

What kind of mechanism do you use to 'time' your scanline?

Reminder! I'm curious -- I could look at the code, but:

BTW, I often use tearline jitter as a visual (eyeballing) timing-precision debugging tool. Given the display's horizontal scan rate (via QueryDisplayConfig()), your timing imprecision in SpecialK = (vertical tearline jitter in pixels) / (horizontal scan rate).
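
In code terms the conversion is trivial (same arithmetic as the 67 kHz / 1080p60 example in my earlier comment):

```cpp
// Worked example of the conversion (plain arithmetic, nothing SK-specific):
//   timing imprecision (s) = tearline jitter (scanlines) / horizontal scan rate (Hz)
//   e.g.  2 / 67000 Hz  ~=  0.0000299 s  ~=  ~30 microseconds
double TearlineJitterToSeconds (double jitterScanlines, double hScanRateHz)
{
  return jitterScanlines / hScanRateHz;
}
```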

Just as raster jitter in yesteryear's 8-bit games was a debugging tool for beamracing / raster-interrupt precision, tearline jitter is still my timing-precision monitor today.

I don't know if you now use precise GPU-scheduled frame presentation APIs (e.g. telling the GPU that you want a frame presented at a specific exact microsecond). Such APIs are not available on all GPUs, sadly, and I want broad compatibility with all kinds of GPUs (including Intel GPUs). The Beam Raced Display Lag Filter Algorithm is very GPU-independent.

Tearline jitter on one GPU became most stable with a double flush via "Flush-BusyLoop-Present-Flush", although some GPUs had to be thrashed with dummy draws (1-pixel faint-color changes to a corner) to prevent power-management jitter, aka "Draw-Flush-Draw-Present-Flush-Draw-Present-Flush-Draw-Present-Flush-[until-1-to-2ms-prior]-DrawFlash-Flush-BusyLoop-Present-Flush". Basically, busyloop for only 1 ms-ish, just for microsecond tearline alignment. But I can do that in my own app most of the time. It's basically thrashing the GPU to prevent power-management sleeps on GPUs that don't have precisely scheduled asynchronous frame presentation APIs. One can flush before/after Present(), but on one GPU a post-present flush was needed for a rock-stationary tearline.

SpecialK doesn't have to do the thrash trick; it's only useful for preventing GPU power management (sleeping between frames during low GPU utilization). But if I am only doing one tearline per refresh cycle with simple test patterns, I'm only using the GPU 1% of the time, so some GPUs go to sleep and scheduled presentation jitters by a millisecond. All the thrashing/flushing keeps the GPU on its toes -- power use high, clock rate high -- and, assisted by a final CPU busyloop on QueryPerformanceCounter() (eating 100% of one core for 1 ms), produces microsecond-accurate, rock-solid-stationary tearlines (on some systems) that sometimes jitter by only 1 pixel. It's also useful if you need tearingless VSYNC to work properly at high frame rates and low CPU utilization, in reduced-VerticalTotal situations (where it's hard to dejitter the tearline between refresh cycles).
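
In sketch form, the thrash loop from my own experiments looks roughly like this (a generic illustration using a 1-pixel ClearView as the dummy draw; not something SK needs to adopt):

```cpp
#include <d3d11_1.h>
#include <dxgi.h>
#include <windows.h>

// Generic sketch of the GPU "thrash" trick from my own experiments (not an SK
// feature request): keep issuing tiny 1-pixel dummy clears + flushes so the GPU
// never power-gates, then busy-wait the last ~1-2 ms and present at the target time.
void ThrashUntilPresent (ID3D11DeviceContext1*   ctx,
                         ID3D11RenderTargetView* rtv,
                         IDXGISwapChain*         swap,
                         LONGLONG targetQpcTicks, LONGLONG qpcTicksPerMs)
{
  const FLOAT faint[4] = { 0.004f, 0.004f, 0.004f, 1.0f }; // barely visible dummy color
  D3D11_RECT  corner   = { 0, 0, 1, 1 };                   // 1-pixel dummy draw target

  LARGE_INTEGER now;
  QueryPerformanceCounter (&now);

  // Thrash phase: dummy draw + flush until ~1-2 ms before the target raster time.
  while (now.QuadPart < targetQpcTicks - 2 * qpcTicksPerMs)
  {
    ctx->ClearView (rtv, faint, &corner, 1); // keeps clocks up / prevents power management
    ctx->Flush ();
    QueryPerformanceCounter (&now);
  }

  // Final phase: busy-wait the remainder, then Present()+Flush() right on target.
  do { QueryPerformanceCounter (&now); } while (now.QuadPart < targetQpcTicks);

  swap->Present (0, 0);
  ctx->Flush ();
}
```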

I would disable all these flush hacks when playing games or measuring game latency -- they are only useful for filtering the GPU/computer out of display lag, via perfectly stationary tearlines.

I can forgo SpecialK doing the flush and just use my own app for the flushes, but that means I sometimes have to use a separate app to filter display lag before benchmarking the game separately. There are also a few esoteric use cases (stabilizing tearlines in tiny VBIs at very high refresh rates, on GPUs that lack precisely scheduled asynchronous frame presentation APIs) -- like tearingless VSYNC OFF during low GPU utilization, where power management adds annoying tearline jitter unless the GPU is thrashed (with a dummy pixel) between frames, or unless you have an API to tell the GPU not to power-manage.

Now, that said, I can forgo flush and just cherry-pick a GPU, rather than have a GPU-independent "Beam Raced Display Lag Filter Algorithm" that uses a variety of thrash/flush tricks.

It's just a special consideration...