NVIDIAGameWorks / Streamline

Streamline Integration Framework
Other
377 stars 79 forks source link

ResourcePool::allocate() may starve other threads, causing hitches or stutters with dlssg enabled #33

Open Nukem9 opened 6 months ago

Nukem9 commented 6 months ago

When ResourcePool::allocate() reaches a limit (i.e. VRAM budget or maximum queue depth) it'll spin in a busy loop hoping another thread comes around and frees existing allocations via ResourcePool::recycle(). If said busy loop exceeds its time limit, it'll fall back to a brand new allocation instead. I'm assuming this loop doesn't execute under normal circumstances because Streamline doesn't spawn threads and games rarely parallelize slEvaluateFeature() calls.

However, once DLSS-G is enabled there's suddenly 3 threads competing with each other: a game (present) thread, a sl.pacer thread, and a sl.dlssg thread. The game and sl.pacer threads are often contended on ResourcePool's mutex, leading to a problem where ::recycle() is unable to progress after ::allocate() enters its busy loop. Streamline tries to mitigate this deadlock with the following code:

https://github.com/NVIDIAGameWorks/Streamline/blob/7ac42e47c7dd55b5b6dd6176c0228048636541b2/source/platforms/sl.chi/generic.cpp#L135-L149

There's an oversight on line 144 as std::mutex does not guarantee fairness. YieldProcessor() is one instruction and makes no real difference. Unlocking and relocking might wake other threads but ::allocate() can reacquire the lock before anybody else gets a chance. This is often the case on my machine.

Based on my not-so-scientific testing I usually hit that 100000us pause in games every 1-2s which results in a stuttery mess. This only occurs with vertical sync enabled through Nvidia Control Panel. Games are smooth with vertical sync off.

I annotated an Nsight trace while trying to understand what's happening. Possibly useful for someone:

jake-nv commented 6 months ago

Thanks for the detailed analysis and report. We're discussing this, but it will likely take a while to arrive at a fix everyone is happy with.

kirillNVIDIA commented 5 months ago

Yeah - we need Sleep(1) instead of YieldProcessor() there. I think it should be a rare condition though. This happens only when there are no resources to recycle. Normally the presenting thread should release the resource by the time rendering thread needs it. To fix it properly - we need a detailed description of the repro case so we can repro the bad case, then fix it, and then verify that the issue is fixed.

Nukem9 commented 5 months ago

Yeah - we need Sleep(1) instead of YieldProcessor() there. I think it should be a rare condition though.

Agreed. Although I was kind of hoping you guys would use a condition variable instead.

Normally the presenting thread should release the resource by the time rendering thread needs it. To fix it properly - we need a detailed description of the repro case so we can repro the bad case, then fix it, and then verify that the issue is fixed.

There's little information to add besides what's posted above. There's no easy repro. VRAM exhaustion is not a factor. What I do know is that I can reproduce minute stutters in a number of games (say, Cyberpunk 2077) in areas with light CPU load - probably because the kernel thread scheduler doesn't preempt the busy loop. I don't plan on root causing it as there's no source code or symbols available for sl.dlss_g.dll.

No amount of fiddling with settings seems to change things, so I've accepted it as a consequence of my HW/OS (Windows Server 2022) configuration. I binary patched various Streamline DLLs and that's a good enough "fix" for me.

Given the rarity, I don't think it's worth spending time investigating.