NvFBC retrieves slightly outdated images.

hgaiser commented 7 months ago

Is there an existing issue for this?

[X] I have searched the existing issues

Is your issue described in the documentation?

[X] I have read the documentation

Is your issue present in the nightly release?

[X] This issue is present in the nightly release

Describe the Bug

NvFBC has a few different methods to capture an image. The one used in Sunshine is:

  /*!
   * Capturing does not wait for a new frame nor a mouse move.
   *
   * It is therefore possible to capture the same frame multiple times.
   * When this occurs, the dwCurrentFrame parameter of the
   * NVFBC_FRAME_GRAB_INFO structure is not incremented.
   */
  NVFBC_TOCUDA_GRAB_FLAGS_NOWAIT = (1 << 0),

Basically, this means that whenever Sunshine requests a new frame, one is provided immediately, but that frame can be "old" (at most 1/fps seconds old). At 60fps, that means a frame could be up to roughly 16msec old.

Changing to NVFBC_TOCUDA_GRAB_FLAGS_NOWAIT_IF_NEW_FRAME_READY would mean that the frame request blocks until a new frame becomes available, but returns immediately if NvFBC already has a frame that the client has not seen. This also means that when the host is serving static content, the frame rate drops to 13.33 FPS in my tests (not sure why this amount exactly).

  /*!
   * Similar to NVFBC_TOCUDA_GRAB_FLAGS_NOFLAGS, except that the capture will
   * not wait if there is already a frame available that the client has
   * never seen yet.
   */
  NVFBC_TOCUDA_GRAB_FLAGS_NOWAIT_IF_NEW_FRAME_READY = (1 << 2),

And for context (since that flag basically extends the NOFLAGS flag):

  /*!
   * Default, capturing waits for a new frame or mouse move.
   *
   * The default behavior of blocking grabs is to wait for a new frame until
   * after the call was made.  But it's possible that there is a frame already
   * ready that the client hasn't seen.
   * \see NVFBC_TOCUDA_GRAB_FLAGS_NOWAIT_IF_NEW_FRAME_READY
   */
  NVFBC_TOCUDA_GRAB_FLAGS_NOFLAGS = 0,

As far as I can see, we can simply use NVFBC_TOCUDA_GRAB_FLAGS_NOWAIT_IF_NEW_FRAME_READY instead of NVFBC_TOCUDA_GRAB_FLAGS_NOWAIT. Sunshine's own wait between captures then becomes kinda redundant, but it won't hurt either.

Expected Behavior

N/A

Additional Context

N/A

Host Operating System

Linux

Operating System Version

Arch Linux

Architecture

64 bit

Sunshine commit or version

7fb8c76590f843f28b2061cd0a1543f0710795e3

Package

other (self built)

GPU Type

Nvidia

GPU Model

GeForce RTX 3090

GPU Driver/Mesa Version

550.76

Capture Method (Linux Only)

NvFBC

Config

N/A

Apps

N/A

Relevant log output

N/A

cgutman commented 7 months ago

Does that work properly in the case where the host display's refresh rate is higher than the stream frame rate? IIUC, in that case, we could potentially capture and encode more frames per second than the client is expecting because there will often be a frame ready when we expect to be sleeping until the next frame is due.

hgaiser commented 7 months ago

We are asking NvFBC to sample at the requested framerate, so that isn't an issue:

https://github.com/LizardByte/Sunshine/blob/7fb8c76590f843f28b2061cd0a1543f0710795e3/src/platform/linux/cuda.cpp#L757

gschintgen commented 7 months ago

> We are asking NvFBC to sample at the requested framerate, so that isn't an issue:
>
> https://github.com/LizardByte/Sunshine/blob/7fb8c76590f843f28b2061cd0a1543f0710795e3/src/platform/linux/cuda.cpp#L757

I think that's a red herring. I had a look at this for #2333 and also thought that was the crucial line but it's rather this one: https://github.com/LizardByte/Sunshine/blob/7fb8c76590f843f28b2061cd0a1543f0710795e3/src/platform/linux/cuda.cpp#L814 I.e. we manually capture at the configured rate. The other line is probably just to put the capture feature in a reasonable state.

hgaiser commented 7 months ago

Did you test that with the current NOWAIT flag? Because in that case the call to NvFBC indeed does not wait. The value is presumably still important for the internal loop of NvFBC, but not for the FPS that Sunshine maintains.

Which raises an interesting issue. If NvFBC updates internally every 16msec and Sunshine always waits exactly 16.666msec between frames, then every 24 (16 / 0.666) frames it theoretically misses a frame. If that is true, it would be better to only let NvFBC do the sleeping (even though it can only do 16msec and not 16.666msec, so you'd get 62.5Hz).
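
The arithmetic above can be written out as a short sketch. It assumes, as a simplification, that the NvFBC sampling interval is the requested frame time truncated to whole milliseconds (which is where the 16msec comes from). The function names are hypothetical and are not part of Sunshine or NvFBC:

```cpp
#include <limits>

// Assumption: the sampling interval is a whole number of milliseconds,
// obtained by truncating 1000 / fps (16 ms for 60 fps).
// fps is expected to be in the range 1..1000.
constexpr unsigned sampling_interval_ms(unsigned fps) {
  return 1000u / fps;
}

// Rate at which a loop paced by that truncated interval actually runs.
constexpr double effective_rate_hz(unsigned fps) {
  return 1000.0 / sampling_interval_ms(fps);
}

// Number of consumer frames (paced at exactly 1000 / fps ms) after which the
// drift against the truncated interval adds up to one whole sampled frame.
inline double frames_between_skips(unsigned fps) {
  const double exact_ms = 1000.0 / fps;
  const double sampled_ms = sampling_interval_ms(fps);
  const double drift_ms = exact_ms - sampled_ms;
  // No drift (e.g. 50 fps -> exactly 20 ms) means no periodic skip.
  return drift_ms > 0.0 ? sampled_ms / drift_ms
                        : std::numeric_limits<double>::infinity();
}
```

For 60 fps this gives a 16 ms interval, an effective rate of 62.5 Hz, and one skipped frame roughly every 24 frames, matching the numbers above.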

gschintgen commented 7 months ago

> Did you test that with the current NOWAIT flag? Because in that case the call to NvFBC indeed does not wait. The value is presumably still important for the internal loop of NvFBC, but not for the FPS that Sunshine maintains.
>
> Which raises an interesting issue. If NvFBC updates internally every 16msec and Sunshine always waits exactly 16.666msec between frames, then every 24 (16 / 0.666) frames it theoretically misses a frame. If that is true, it would be better to only let NvFBC do the sleeping (even though it can only do 16msec and not 16.666msec, so you'd get 62.5Hz).

You’re right, I did miss the last (but crucial) part of your initial post, i.e. the part where you point out the “manual” waiting done by Sunshine. Anyway, I only had superficial contact with this detail of the code base since I don’t have nvidia hardware and it came up in the mentioned PR discussion.

I guess it all depends on what exactly NvFBC does with its 16.000 ms sampling rate:

a) It captures at 62.5Hz (i.e. the theoretical next_frame = previous + delay, as currently implemented in Sunshine’s logic), which would probably lead to pacing issues due to the rate mismatch. But that sounds more like the NOWAIT approach.

b) It captures a frame, then waits for 16.000ms and then blocks to wait for the next frame, which will hopefully come at the 16.667 timepoint (due to a framelimiter). With perfect render timing that should theoretically lead to the best results (precise 60fps, no latency). But in that case I do think that 16ms is too long (rendering will still be subject to jitter) and frames will be missed.

Anyway I just wanted to give a heads up, but missed that you correctly saw sunshine’s own waiting.

hgaiser commented 7 months ago

> I guess it all depends on what exactly NvFBC does with its 16.000 ms sampling rate:
>
> a) It captures at 62.5Hz (i.e. the theoretical next_frame = previous + delay, as currently implemented in Sunshine’s logic), which would probably lead to pacing issues due to the rate mismatch. But that sounds more like the NOWAIT approach.
>
> b) It captures a frame, then waits for 16.000ms and then blocks to wait for the next frame, which will hopefully come at the 16.667 timepoint (due to a framelimiter). With perfect render timing that should theoretically lead to the best results (precise 60fps, no latency). But in that case I do think that 16ms is too long (rendering will still be subject to jitter) and frames will be missed.

It seems like your option a) is correct. I tested this in moonshine (similar to sunshine, but only has NvFBC + NVENC), where I rely on NvFBC to block until a new frame arrives. According to moonlight I get roughly 62.5Hz if I stream dynamic content (in this case I just moved the cursor a lot).

I guess sunshine waits to achieve the accurate 60Hz framerate, but by doing so, it skips a frame every 24 frames.

> Anyway I just wanted to give a heads up, but missed that you correctly saw sunshine’s own waiting.

No worries! Appreciate it :).

gschintgen commented 7 months ago

Out of curiosity (since I'm not familiar with the inner workings of hardware pointers and such): Did you try what happens to the framerate as reported by Moonlight if you stream, let's say, an animated title screen of a game, framecapped at 60.00Hz, but without any mouse input?

  /*!
   * Default, capturing waits for a new frame or mouse move.
   *
   * The default behavior of blocking grabs is to wait for a new frame until
   * after the call was made.  But it's possible that there is a frame already
   * ready that the client hasn't seen.
   * \see NVFBC_TOCUDA_GRAB_FLAGS_NOWAIT_IF_NEW_FRAME_READY
   */
  NVFBC_TOCUDA_GRAB_FLAGS_NOFLAGS = 0,

If taken literally the framerate should then drop to 60fps. Maybe it's "just" the mouse input (e.g. 1000Hz gaming mouse) that messes with the capture timing. Is it possible to hide mouse input from NvFBC?

hgaiser commented 7 months ago

I wasn't entirely sure what would happen either, but it seems to still stream at 62.5Hz if I don't move the mouse in a game (even if the game I was streaming shows a framerate of a steady 60Hz according to Steam FPS overlay).

gschintgen commented 7 months ago

Ok, so it seems indeed like NvFBC calls its blocking image capture at a precise 16.00 ms interval, and most of the time it will find a frame that it has not yet shown, so it immediately returns that one. That continues until another period of 24 or 25 frames is over and the difference has accumulated to a whole frame time interval.

So the ideal capture routine for frame-capped (e.g. 60fps) content (which is not really under our control...) would be:

1. run a first blocking capture for a single frame
2. sleep for e.g. 80% of the theoretical frametime in order to avoid capturing way too many frames in case of uncapped rendering
3. emit another blocking capture for a single frame, which will come after ~16.7 ms. But without any latency between rendering and capturing.
4. goto 2

Except that it will break down if the game is running uncapped and step 2 will capture too early and too often.
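
The routine above could be sketched roughly as follows. This is only an illustration: `blocking_grab_fn` is a hypothetical stand-in for a blocking NvFBC grab (it is not a real NvFBC or Sunshine function), and the 80% sleep fraction is simply the example value from the steps above.

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical stand-in for a blocking grab: returns true when a frame was
// captured and false when capturing should stop. Not a real NvFBC call.
using blocking_grab_fn = std::function<bool()>;

// Grab, sleep for a fraction of the frame time, grab again, repeat.
// Returns the number of frames captured before the grab reported a stop.
inline unsigned paced_capture_loop(const blocking_grab_fn &grab,
                                   std::chrono::nanoseconds frametime,
                                   double sleep_fraction = 0.8) {
  unsigned captured = 0;

  // Step 1: first blocking capture.
  if (!grab()) return captured;
  ++captured;

  const auto nap = std::chrono::duration_cast<std::chrono::nanoseconds>(
    frametime * sleep_fraction);

  for (;;) {
    // Step 2: sleep for part of the frame time so that uncapped rendering
    // does not produce far more captures than the stream needs.
    std::this_thread::sleep_for(nap);

    // Step 3: block until the next frame (assumed behaviour of the grab).
    if (!grab()) return captured;
    ++captured;
    // Step 4: back to step 2.
  }
}
```

Whether this paces correctly depends entirely on the grab really blocking until a new frame is rendered, which is the open question in this thread.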

Isn't this then similar to the problem that the Windows side of Sunshine has to deal with when it is using the Desktop Duplication API (which I don't know anything about except the few tidbits I picked up here and there)? I saw some pacing code in there... https://github.com/LizardByte/Sunshine/blob/9d5ee2f57d2c582b5b0f0c7eb67a8b86daffd9a9/src/platform/windows/display_base.cpp#L170-L200 (I did not try to analyze or understand this code in detail.)

hgaiser commented 7 months ago

Why would the second frame come after 16.7msec? I would expect the internal functioning of NvFBC to always retrieve frames at 16msec intervals, so if you wait for 80%, say exactly 13msec, the next blocking call would return after exactly another 3msec.

gschintgen commented 7 months ago

Right, there's also the question of what "blocking" actually means:

a) Blocking until the timer is up.
b) Blocking until a new frame is available.

I suppose it means b). Isn't that the point of blocking vs NOWAIT?

If the game is running at 60fps / 16.667ms, then the blocking call executed 13ms after the previous capture would then wait until a new frame has been rendered and is ready to be captured. And those frames should be emitted at an interval of 16.666ms due to the framecap.

gschintgen commented 7 months ago

I think it all depends on when the timer will trigger next:

1. precisely x ms after the previous theoretical frame instant, independently of when frames are incoming
2. x ms after the previous capture (and then the capture will block until the 16.666ms are up)

It seems like NvFBC is doing the former, while the latter would be what's needed.

Unfortunately I can't test any of this. (I'm on Intel & AMD)
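
The two timer models listed above can be compared with a small idealized simulation. Both functions are hypothetical models of the behaviours described in this comment, with perfectly regular frame arrival; they do not call NvFBC and say nothing about what the driver actually does.

```cpp
#include <cmath>
#include <vector>

// Model 1 (hypothetical): the timer fires on a fixed grid, every timer_ms,
// independently of when frames arrive, and each grab returns immediately.
inline std::vector<double> fixed_grid_captures(double timer_ms, int count) {
  std::vector<double> times;
  for (int k = 1; k <= count; ++k) {
    times.push_back(k * timer_ms);
  }
  return times;
}

// Model 2 (hypothetical): wait timer_ms after the previous capture, then
// block until the next frame, with frames arriving every frame_ms.
inline std::vector<double> relative_captures(double timer_ms, double frame_ms,
                                             int count) {
  std::vector<double> times;
  double previous = 0.0;
  for (int k = 0; k < count; ++k) {
    const double earliest = previous + timer_ms;
    // First frame boundary at or after the moment the timer expires.
    previous = std::ceil(earliest / frame_ms) * frame_ms;
    times.push_back(previous);
  }
  return times;
}
```

With a 16 ms timer and content rendered every 1000/60 ms, 60 captures take 960 ms in the first model (62.5 Hz) and about 1000 ms in the second (60 Hz).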

hgaiser commented 7 months ago

My suspicion is that NvFBC is running an internal loop, separate from the frame generation loop, which polls the latest frame every dwSamplingRateMs milliseconds. I ran the following code with nvfbc-rs:

    capturer.start(BufferFormat::Rgb, 60)?;

    // In case it needs to warm up or initialize something.
    for _ in 0..10 {
        capturer.next_frame(CaptureMethod::Blocking)?;
    }

    let now = std::time::Instant::now();
    for _ in 0..100 {
        std::thread::sleep(std::time::Duration::from_millis(13));
        capturer.next_frame(CaptureMethod::Blocking)?;
    }
    println!("{}", now.elapsed().as_micros() / 100);

Meaning we set the framerate to 60Hz (which is used to set the sampling rate to 16 msec), sleep 13msec and then wait for a next frame in a blocking manner. Repeat this 100 times and get the average time waited on new frames. I am getting this timing:

Without the 13msec sleep I also get 16016 (microseconds per frame), roughly the same amount.

gschintgen commented 7 months ago

Hm, but somehow it must be possible to make NvFBC actually wait for a new frame:

Default, capturing waits for a new frame or mouse move.

     * When using blocking calls each captured frame will have
     * this flag set to NVFBC_TRUE since the blocking mechanism waits for
     * the display server to render a new frame.
      NVFBC_BOOL bIsNewFrame;

     * The default behavior of blocking grabs is to wait for a new frame until
     * after the call was made.  But it's possible that there is a frame already
     * ready that the client hasn't seen.
     * \see NVFBC_TOSYS_GRAB_FLAGS_NOWAIT_IF_NEW_FRAME_READY

At least with NVFBC_TOSYS_GRAB_FLAGS_NOFLAGS the call should always be blocking, no? That should then drag out the interval to the 16.666ms (supposing a framecap on the game as always).

With NVFBC_TOCUDA_GRAB_FLAGS_NOWAIT_IF_NEW_FRAME_READY we could have the same behavior as with the simple NOWAIT? I.e. every time the 16.00ms timer fires it detects that a new frame is already there, it emits it, and 16.00 ms later it again detects that a new frame is immediately ready, etc. until 24-25 frames are up.

hgaiser commented 7 months ago

> Hm, but somehow it must be possible to make NvFBC actually wait for a new frame:
>
> Default, capturing waits for a new frame or mouse move.
>
>      * When using blocking calls each captured frame will have
>      * this flag set to NVFBC_TRUE since the blocking mechanism waits for
>      * the display server to render a new frame.
>       NVFBC_BOOL bIsNewFrame;
>
>      * The default behavior of blocking grabs is to wait for a new frame until
>      * after the call was made.  But it's possible that there is a frame already
>      * ready that the client hasn't seen.
>      * \see NVFBC_TOSYS_GRAB_FLAGS_NOWAIT_IF_NEW_FRAME_READY
>
> At least with NVFBC_TOSYS_GRAB_FLAGS_NOFLAGS the call should always be blocking, no? That should then drag out the interval to the 16.666ms (supposing a framecap on the game as always).
>
> With NVFBC_TOCUDA_GRAB_FLAGS_NOWAIT_IF_NEW_FRAME_READY we could have the same behavior as with the simple NOWAIT? I.e. every time the 16.00ms timer fires it detects that a new frame is already there, it emits it, and 16.00 ms later it again detects that a new frame is immediately ready, etc. until 24-25 frames are up.

Ah sorry, I had renamed NOFLAGS to Blocking in https://github.com/hgaiser/nvfbc-rs/blob/main/nvfbc/src/cuda.rs#L41 . So I was using NOFLAGS, which still waits approximately 16msec, not 16.666msec.

gschintgen commented 7 months ago

That's weird. (And I'm out of ideas.)

hgaiser commented 7 months ago

> That's weird. (And I'm out of ideas.)

That's okay, we tried ;)

A mystery for another day :magic_wand:

Fell commented 4 months ago

Not sure if it helps, but this might be the reason why I experience a weird stutter behavior with NvFBC, but it's completely smooth with KMS screen capture.

When testing with VRRTest set to 60 FPS, I can see that the moving bars hitch back and forth momentarily, as if the stream is showing an outdated frame but then immediately returns to the normal up-to-date frame. I have driver version 555.58.02 with explicit sync enabled (kwin 6.1.3)

When I enforce KMS capture without changing any other settings the stream is buttery smooth.

hgaiser commented 4 months ago

> Not sure if it helps, but this might be the reason why I experience a weird stutter behavior with NvFBC, but it's completely smooth with KMS screen capture.
>
> When testing with VRRTest set to 60 FPS, I can see that the moving bars hitch back and forth momentarily, as if the stream is showing an outdated frame but then immediately returns to the normal up-to-date frame. I have driver version 555.58.02 with explicit sync enabled (kwin 6.1.3)
>
> When I enforce KMS capture without changing any other settings the stream is buttery smooth.

Can you test sunshine with the flag changed? Curious if it helps in your case.

Fell commented 4 months ago

I recompiled Sunshine with the changes outlined below. Unfortunately, the hitching/stuttering was still present, albeit subjectively less intense. It certainly didn't make things worse.

diff --git a/src/platform/linux/cuda.cpp b/src/platform/linux/cuda.cpp
index b5374b18..9d00a4a8 100644
--- a/src/platform/linux/cuda.cpp
+++ b/src/platform/linux/cuda.cpp
@@ -877,7 +877,7 @@ namespace cuda {

           NVFBC_TOCUDA_GRAB_FRAME_PARAMS grab {
             NVFBC_TOCUDA_GRAB_FRAME_PARAMS_VER,
-            NVFBC_TOCUDA_GRAB_FLAGS_NOWAIT,
+            NVFBC_TOCUDA_GRAB_FLAGS_NOWAIT_IF_NEW_FRAME_READY,
             &device_ptr,
             &info,
             0,
@@ -932,7 +932,7 @@ namespace cuda {

         NVFBC_TOCUDA_GRAB_FRAME_PARAMS grab {
           NVFBC_TOCUDA_GRAB_FRAME_PARAMS_VER,
-          NVFBC_TOCUDA_GRAB_FLAGS_NOWAIT,
+          NVFBC_TOCUDA_GRAB_FLAGS_NOWAIT_IF_NEW_FRAME_READY,
           &device_ptr,
           &info,
           (std::uint32_t) timeout.count(),

hgaiser commented 4 months ago

> I recompiled Sunshine with the changes outlined below. Unfortunately, the hitching/stuttering was still present, albeit subjectively less intense. It certainly didn't make things worse.

Ah too bad, then I guess your stuttering is not related to this setting (at least not uniquely).

I don't mean this as advertisement.. but you're welcome to try moonshine, which works with the same components (NvFBC + NVENC) and implements the same protocol, so should be "plug-and-play". Would be interesting to see if you have stuttering there too. If that is the case, it rules out a lot of possible causes.

Fell commented 3 months ago

I did a bit more testing and I believe I have isolated a frame timing issue with NvFBC. I created a diagram (nvfbc-frame-timing).

Observed behavior: Every few hundred milliseconds, a frame is displayed twice. A couple hundred milliseconds later, a frame is skipped.

Imagine labelling all even frames A and all odd frames B. Now, we display them at 60 Hz on the host machine. On the host-connected monitor we see the A/B frames flickering back and forth. If we now stream at 30 Hz, we would expect the client machine to either only see A-frames or only B-frames. However, due to some unknown timing issues when NvFBC is in use, the video stream alternates between showing only A-frames and only B-frames about 2-3 times per second. This behavior is also observable at 60 Hz, although a bit more difficult to notice.
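
The A/B description can be turned into a simple check. The sketch below is illustrative only: it assumes you can somehow recover, for each streamed frame, the index of the host frame it shows (for example from a test pattern), and the function name is hypothetical.

```cpp
#include <cstddef>
#include <vector>

// frame_indices: index of the host frame shown in each streamed frame
// (hypothetical input, e.g. read off a test pattern).
// Returns how often the stream switches between even (A) and odd (B) frames.
inline std::size_t count_parity_flips(const std::vector<unsigned> &frame_indices) {
  std::size_t flips = 0;
  for (std::size_t i = 1; i < frame_indices.size(); ++i) {
    if ((frame_indices[i] % 2) != (frame_indices[i - 1] % 2)) {
      ++flips;
    }
  }
  return flips;
}
```

A 30 Hz stream of 60 Hz content with ideal pacing would show indices such as 0, 2, 4, 6 and therefore zero flips; a sequence such as 0, 2, 3, 5, 6, 8 contains two flips, which is the pattern described above.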

Steps to reproduce:

1. Set up a host system with working NvFBC capture.
2. Set the monitor refresh rate to 60 Hz on both client and host.
3. Start streaming.
4. Launch VRRTest on the host machine.
5. Set VRRTest to 60 FPS using the up/down arrow keys.
6. Observe the moving vertical bars.

On the host machine, the bars are moving smoothly. On the client machine, they are hitching back and forth as frames are skipped and repeated.

Desired result: The bars should be moving as smoothly as on the host.

Another way to reproduce if your display supports 120 Hz:

1. Set up a host system with working NvFBC capture.
2. Set the monitor refresh rate to 120 Hz on the host, but 60 Hz on the client.
3. Start streaming.
4. Open the black frame insertion test on the host machine.
5. Observe the moving UFOs, specifically the bottom one.

On the host machine, the bottom UFO is flickering as black frames are inserted for every second frame. On the client machine, the bottom UFO is appearing and disappearing as the stream is switching between even and odd frames constantly.

Desired result: The bottom UFO should be either always visible or always hidden.

Circumstances which do not affect the bug:

The problem has been confirmed on all of the following configurations:

Host: Wayland, Client: Wayland
Host: X11, Client: Wayland
Host: X11 (no compositing), Client: Wayland
Host: X11, Client: X11
Host: X11 (no compositing), Client: X11 (no compositing)

The problem appears regardless of the following NVIDIA settings (on the host):

Sync to VBlank: on/off
Allow Flipping: on/off
Allow G-SYNC/G-SYNC Compatible: on/off
Force Composition Pipeline: on/off
Force Full Composition Pipeline: on/off

The NVFBC_TOCUDA_GRAB_FLAGS_NOWAIT_IF_NEW_FRAME_READY patch above also does not affect the bug in any way. I have tried both variants and observed the same result.

The problem is absent on Wayland using KMS capture.

The problem seems to be absent on X11 using X11 capture, however a more general frame pacing issue (possibly due to X11 itself) is making it hard to judge.

General system information (host):

Operating System: Arch Linux
KDE Plasma Version: 6.1.4
KDE Frameworks Version: 6.5.0
Qt Version: 6.7.2
Kernel Version: 6.10.7-arch1-1 (64-bit)
Graphics Platform: X11
Processors: 24 × AMD Ryzen 9 3900X 12-Core Processor
Memory: 31.3 GiB of RAM
Graphics Processor: NVIDIA GeForce RTX 2080 Ti/PCIe/SSE2
NVIDIA Driver Version: 560.35.03

hgaiser commented 3 months ago

That's some deep analysis :) I can't really say much about what might cause the issue you're seeing, but it appears unrelated to this issue since changing the flag doesn't seem to make a difference. It might make more sense to open a separate issue?

Did you happen to try moonshine? Since it's using a completely different implementation, I'm curious if you see similar issues there.

LizardByte-bot commented 9 hours ago

It seems this issue hasn't had any activity in the past 90 days. If it's still something you'd like addressed, please let us know by leaving a comment. Otherwise, to help keep our backlog tidy, we'll be closing this issue in 10 days. Thanks!

hgaiser commented 6 hours ago

I think this is still relevant, so posting this so that the kind bot doesn't close it.
