Closed coornio closed 2 years ago
@TheLastRar One for you bud
Upon request, confirmed games that experienced this issue:
Splinter Cell (SLUS_20652) Dark Cloud 2 (SCUS_97213) Crash Bandicoot Wrath of Cortex (SLUS_20238) Digimon World 4 (SLUS_20386) Rogue Galaxy (SCUS_97490) Ace Combat Distant Thunder (SCES_50410)
I picked a variety of games to test, and each and every single one suffers from this.
@coornio what builds did you try to confirm where it started happening?
From: https://buildbot.orphis.net/pcsx2/index.php?m=fulllist
Tried the default 1.6 version at first, and it worked fine. Then switched to dev builds:
600 - works 903 - works 1149 - broken 1096 - broken 1002 - broken 942 - works 944 - works 1011 - broken (yes I went the wrong number here) 990 - broken 997 - broken 952 - works 974 - broken 958 - broken 953 - works 955 - broken
Thus 944 (no build) and 955 are the culprits.
I can't seem to reproduce this to the same level of severity that you are (I'm maybe seeing 955 deviate by +- 0.03, while 953 deviates +-0.01 from target fps, much less than the whole fps difference you are seeing).
This is a shot in the dark (and I mean that to the fullest extent), but can you try this build pcsx2_test1.zip
Appears to be more consistent, and the dips are way more infrequent now, only dipping as low as 54, usually 57. I shall make a quick recording for reference and attach it here.
I've also made a build of 954 (unmodified), I suspect that of the two commits, 954 is the more likely one to be at fault, but I would like you to confirm
Will do and report back, in-emu recordings show no OSD nor the lag so I gotta figure something out with obs it seems.
EDIT: it's worse than the first test build you gave me actually, worse than the OG 955 even, EE is all over the place and framerate won't sit anywhere near 60 despite easily doing more when the limiter is off.
In order of performance: 953 > 955 (test 1) > 955 > 954 (test 2)
short videos:
https://youtu.be/cuB0ExMSDIo (954 - test build 2) https://youtu.be/wq4lz155RCI (953 working fine)
For the test build, I uncapped FPS near the end to show you that the computer is indeed capable of running much faster and it's not a hard performance limit. I left the mic open in both of them so volume is a bit off.
I guess that confirms it, although I wasn't expecting it to be worse than 955.
Can you try this build as it reverts one of the changes made in 954. pcsx2-test3.zip
Can you also monitor the CPU frequency you are running at (using something like CPU-Z), incase CPU re-clocking is somehow interfering) and also do the same for 953 & 955.
Seems you're on to something here. It's rock solid 59.94 fps, hardly ever changes, just like in 593.
As for the CPU freq, don't worry, I did keep an eye on it and it's pegged at 4.8 Ghz constantly while the emulator runs. I had Intel XTU open on the side and there were no fluctuations.
Not sure what you changed in this instance but it did the trick.
I removed the condition here https://github.com/PCSX2/pcsx2/blob/0800cdbbc28db490c5efa5f33e4d0edc5a059c8e/pcsx2/DEV9/DEV9.cpp#L222
InitNet() attempts to open a TAP adapter for networking, and if successful, starts a background thread for receiving packets. This is unnecessary, so this was changed in 954 to only run when ethernet was enabled in the DEV9 settings.
I'm assuming you don't have tap, so It should fail to open TAP and return before starting that thread. Can you post you emulog file from test3
a quick search through it shows nothing related to ethernet/dev9/tap maybe I need to enable the Dev/verbose category too? not sure, let me know which log parameters to enable and i'll get a fresh log
The lack of relevant information was actually relevant.
Turns out, the code for opening TAP also checks if ethEnable is true, and aborts opening, but still reports success (due to uninitialised memory), meaning InitNet() continues and creates the extra background thread.
Now the questing remains, how that extra background thread failing to read anything in the loop is improving the framelimiter.
Good to hear that something is indeed wrong and it's not some phantom issue i'm dealing with once more. If there's anything else you'd like me to test down the line, drop a message here or on discord.
Shot in the dark, maybe the frame limiter is designed around the existence of the inactive net thread and not having it around at all is throwing off its metaphorical balance of calculations? Perhaps this setup wasn't an oversight, but intentionally designed that way.
EDIT: Updating title to make issue clearer
This build has a dummy RxThread, to determine if there is some interaction that the rx thread does that I'm not aware of.
Can you test it. pcsx2-test4.zip
Just did, works solidly. Seems my theory about the absence of a thread throwing the calculation off somehow might be on the right track :P
It does strike me as weird, as that part of the code was part of an optional plugin in 1.6, but it certainly has a very strong effect in your case,
Here is another test, which further minimises the code used to initialise the dummy thread. pcsx2-test5.zip
Stable once again.
Now the questing remains, how that extra background thread failing to read anything in the loop is improving the framelimiter.
Active Wait loops are better for precise frame limiting than passive waits that sleep for a % of the wait time in scenario's where the cpu usage can variate significantly - this is due (in part) to less of the cpu being in C1 or higher states when windows is performing its affinity balancing (which happens when thread masks are not explicitly pinned to affinity masks)
Similarly, having a thread spinning its wheels doing nothing significant with the core will behave in much the same way as an active wait just keeping the core from entering a state where the pipes are flushed and core disconnects from the bus.
RTSS as of late last year switched from active to a 90% passive wait (spending only 10% active) and this was found to introduce significant variance to limited framerates that are not present without the limiter or configured back to active.
// If any integer value of milliseconds exists, sleep it off.
// Prior comments suggested that 1-2 ms sleeps were inaccurate on some OSes;
// further testing suggests instead that this was utter bullshit.
Whoever concluded this comment in the code needs to re-evaluate their stance.
@Squall-Leonhart we know that, but we're not talking about the framelimiter loop, that's what makes this strange, this is the DEV9 thread. Please know what we're talking about before replying.
And I along with 3 other of us concluded that by testing multiple windows and linux installations, and the inaccuracy was all < 1ms
Yes, and its possibly just keeping some more of the cpu awake and at max clock mitigating innaccuracy of a sleepy wait loop when ticking against a cpu who's multipliers increase and lower with load
Toss the reportee a build with the Dev9 thread gone, and the framelimiters periodic sleep removed, its an easy thing to test.
PS: I hate windows scheduling
As was suggested in discord just earlier, I will attempt later to disable SpeedStep from the BIOS on my end to check if there might be any difference. Not quite certain if anyone else testing this did so on an AMD system or an Intel one, but squal's idea might have some merit and I see nothing wrong to try and dig into it further seeing how no one else can reproduce this yet.
Also for anyone feeling curious, yes the issue still persists post-954, still present.
@Squall-Leonhart, for quick reference, I was given 5 test builds, the first two did not work, the latter 3 did work fine. I am not clear on the exact changes of all, you'll want to ask @TheLastRar for details.
I don't think the interaction between the framelimiter and the dev9 thread will be anything to do with it. We changed the frame limiter before dev9 was merged and this wasn't a problem then when using a non-threaded dev9null plugin. So I don't think that's going to be the issue.
Wouldn't this imply though that this regression only came to be under these oddly specific circumstances as a result of adding up the changes on both the framelimiter and dev9? Wouldn't be the first time I've seen bugs come to be in this manner from seemingly innocuous changes. Just throwing food for thought. If I'm not mistaken, the latter 3 test builds specifically ensured that there'd be a dev9 thread, regardless of whether it does anything useful, and its absence throws the limiter all outta wack.
In any case, if you have any more tests to suggest to me, poke me whenever and I'll get to them when I have the chance.
well the code is easy to change in the framelimiter, it currently only sleeps if more than 1ms (so 2 or more) of time is needed to wait, and it'll wait for "msec - 1" so you can change that to check msec > 4 and wait for msec-4 then it'll do the rest in a tight loop. I mean it might be tighter anyway but it was so negligible before it didn't matter.
if (msec > 1) // Change this to msec > 4
{
Threading::Sleep(msec - 1); // Change this to msec - 4
}
// Conversion to milliseconds loses some precision; after sleeping off whole milliseconds,
// spin the thread without sleeping until we finally reach our expected end time.
while (GetCPUTicks() < uExpectedEnd)
{
// SKREEEEEEEE
}
I don't think the interaction between the framelimiter and the dev9 thread will be anything to do with it. We changed the frame limiter before dev9 was merged and this wasn't a problem then when using a non-threaded dev9null plugin. So I don't think that's going to be the issue.
I didn't think Rivatuner was causing my 80FPS uncapped instance of FFXIV to variate down to 55fps at a 60cap, but it did until i removed the 90% sleep period from the equation. :P
Lots have changed lately with PCSX2, the removal of the plugin system in itself has assumedly changed th amount of threads and handles operating between Core and IO components.
Wouldn't this imply though that this regression only came to be under these oddly specific circumstances as a result of adding up the changes on both the framelimiter and dev9? Wouldn't be the first time I've seen bugs come to be in this manner from seemingly innocuous changes. Just throwing food for thought. If I'm not mistaken, the latter 3 test builds specifically ensured that there'd be a dev9 thread, regardless of whether it does anything useful, and its absence throws the limiter all outta wack.
I doubt its even entirely changes within PCSX2, windows itself has performance differences thanks to all those lovely exploit patches that nobody was even going to figure out until researchers pointed them out.
As for the CPU freq, don't worry, I did keep an eye on it and it's pegged at 4.8 Ghz constantly while the emulator runs. I had Intel XTU open on the side and there were no fluctuations.
Most applications until very recently only reported clocks as a Multi x FSB, XTU is one of them. Effective Clock rates are often significantly different, especially on Skylake+ processors that have on chip power management.
It would be an idea to supply a Very Sleepy profile from affected and non affected builds.
Downloaded and installed, will give it a run later on when I have the free time of more than a few minutes :P
Anything special I should do with it? I assume I'll just have to pick pcsx2.exe from a working and a borked build, get to my testing spot ingame, then profile all for a while and attach the resulting files here, correct?
Correct.
Somebody else is going to have to test this, I tried the games suggested (that I own) on the latest version with the framelimiter enabled and aside from while loading it was rock solid down to the 10's of an fps for me, barely moved from 59.94 at all.
Before we go down the windows scheduling rabbit hole I think we should address this low hanging fruit: https://github.com/PCSX2/pcsx2/blob/6f7890b709d5e3f7f5b824781e493455efc92339/pcsx2/DEV9/net.cpp#L40-L51
This is almost certainly just pinning one of the threads to 100% Stick a sleep of 1ms in there and see if the issue gets resolved.
I think you misread the thread @tadanokojin
Now the questing remains, how that extra background thread failing to read anything in the loop is improving the framelimiter.
implies that its the absence of the NetRxThread(or a dummy of it) that is causing coornio's frame limiter to misbehave, not the active presence of it (or am i reading it wrong??)
Somebody else is going to have to test this, I tried the games suggested (that I own) on the latest version with the framelimiter enabled and aside from while loading it was rock solid down to the 10's of an fps for me, barely moved from 59.94 at all.
What cpu are you using these days?
Yeah @Ziemas filled me in.
I'm assuming the yield spam is causing the EE thread to take time more often than it normally would. Either way we need to sort that, clearly it hid an issue.
Before we go down the windows scheduling rabbit hole I think we should address this low hanging fruit:
This is almost certainly just pinning one of the threads to 100% Stick a sleep of 1ms in there and see if the issue gets resolved.
When Ethernet is enabled, both pcap & TAP will use blocking reads (inside nif->recv()
).
You might be able to peg the thread at 100% if you are reviving alot of packets, but I suspect rx_fifo_can_rx()
will start returning false pretty quickly in such a situation as the rx FIFO becomes full (at which point the thread would start yielding)
When Ethernet is disabled (pre 954), there is no longer a blocking call, nif->recv()
instead returns false,
This means we are yielding in a loop (along with a failing ReadFile that TAP performs in a such situation)
I've not measured the performance impact myself (being unaffected by the issue at hand), and this thread is only enabled with Ethernet post 954 anyway.
Additionally the extra thread ends up stabilising the framelimiter.
As for switching it to a sleep, prior to switching to std::thread, the windows code used 10ms, while linux code didn't even yield. 1ms is however used in CLR_DEV9, its winsock backend is non-blocking, along with its pcap implementation, although I did pluck that from thin air without much thought.
I kindof went with yield as a middle ground between CLR_DEV9 and the linux code
Okay here's some tests with the framelimiter sleeping at different levels.
Can you try your games to see where the framelimiter becomes stable, start with 2ms and work up to 6, the lower we can keep it, the better.
https://drive.google.com/file/d/1Zn6yfx2H1IJTsVqRKjtR-zOkLXrt3zPN/view?usp=sharing
Okay here's some tests with the framelimiter sleeping at different levels. Can you try your games to see where the framelimiter becomes stable, start with 2ms and work up to 6, the lower we can keep it, the better.
@refractionpcsx2 I gave them a try just earlier. The higher the ms, the less pronounced the dips and inconsistency was. However, at the same time this also increased the EE usage much more. On a working build it idles around 30-40 during my test, but by the 6ms test build it was idling in the 70-60 EE, with peaks in the 80-90 range when the dips occurred.
So the change is having an effect towards the better but is by no means a proper fix in this situation. If anything, the 2ms test build appears to perform the worst out of all so far, hardly managing to hit 60 frames before crapping out and falling again.
I will look into the profiling suggested earlier on next.
Okay here's some tests with the framelimiter sleeping at different levels. Can you try your games to see where the framelimiter becomes stable, start with 2ms and work up to 6, the lower we can keep it, the better.
@refractionpcsx2 I gave them a try just earlier. The higher the ms, the less pronounced the dips and inconsistency was. However, at the same time this also increased the EE usage much more. On a working build it idles around 30-40 during my test, but by the 6ms test build it was idling in the 70-60 EE, with peaks in the 80-90 range when the dips occurred.
So the change is having an effect towards the better but is by no means a proper fix in this situation. If anything, the 2ms test build appears to perform the worst out of all so far, hardly managing to hit 60 frames before crapping out and falling again.
I will look into the profiling suggested earlier on next.
okay then this isn't the problem, there's something else going on because requiring more than 6 msec before it sleeps (and only spinlocks) is ridiculous. And yes that's why I said the lower we can keep it, because the EE usage goes mad.
So the search for the problem continues.
Can you try this build? This eliminates the sleep altogether on windows and uses waitable timers, which may prove to be more accurate.
Nah this is bad, hardly ever reaches 60 frames, EE jumps between 50-90 on idle, and uncapped gets just over half the previous frames that it could.
huh? this doesn't affect uncapped.... can you retest please and make sure you're not running anything else.
My bad, my memory was off. On the same scene, your last test build actually gets more consistent uncapped fps, nearing 190. Meanwhile 593 gets around 175-185 usually, give or take 10 on either end during rare peaks/dips.
Definitely did not do anything for capped framerate consistency though. Might be worth considering keeping this sort of change if it helps uncapped performance consistently in more games :P
I can't express this clearly enough. This does nothing for uncapped FPS, anything you experience uncapped is purely down to your machine or placebo.
All I can about is the capped FPS consistency, and this has completely eliminated the use of Sleep (which is what's in question here), what is the framerate fluctuation? how much? it shouldn't be 60, it should be 59.94.
If you could show a frametime graph with rivatuner that would be good too.
after you've done that, I need the same information from this build. It will peg your EE, but I don't care about that right now, that doesn't matter.
I can't express this clearly enough. This does nothing for uncapped FPS, anything you experience uncapped is purely down to your machine or placebo.
I mean, I wouldn't call it placebo, the numbers were pretty darn clear in this regard, I can get you a video of that if you wish. Whether it's some sort of quirk of my hardware I don't know but I definitely did not imagine it.
As for whether that build helped or not, the answer is a "hell naw". It was consistently worse, hardly going to 59.94 ever, mostly sticking closer to 52. A reckon a video will speak volumes so I'll get you one in short order, including one for the results of that spin version.
so the last build eliminated the sleep completely, like the 2ms 3ms 4ms etc builds you said consistently got better as you went up, this removes the sleep completely and just spinlocks, so it should be a solid framerate going off your previous information.
So I dunno what's going on anymore.
and again, it doesn't even come to the code when you uncap the fps, there's literally a "is it uncapped? return" at the top of the function to make sure it never runs any of it. So I would suggest something else is going on with your PC.
The spinlock version you attached last is golden, practically perfect 59.94, the worst off it ever got being 59.89 so I'll count that as a win. As you said, the EE is also pegged at max, but that's a false reading in this case :)
bit of a table for reference so far:
spinlock: perfect frame consistency, VU/GS averaging 15% on the test scene, false EE reading but basically the same as 593 otherwise.
waittime: absolute mess, never hits full framerate, EE all over the place and VU/GS averaging ~40%
6ms: high EE usage, also results in ~20-25% VU/GS average, frequent dips in the 57 range every few seconds
2ms: frequent dips every few seconds, almost reaching 45 with the bad ones, EE all over the place, VU/GS similar to 6ms version until dips occur where they jump in the 40-50% range
To check if the issue is CPU power states / cache or just windows doing a bad job of waking us up on time, try adding this after the sleep
u64 wake = GetCPUTicks();
if (wake > uExpectedEnd)
{
fprintf(stderr, "FrameLimiter: Woke up %ldµs too late!\n", ((wake - uExpectedEnd) * 1000000) / GetTickFrequency());
}
okay then this isn't the problem, there's something else going on because requiring more than 6 msec before it sleeps (and only spinlocks) is ridiculous. And yes that's why I said the lower we can keep it, because the EE usage goes mad. @refraction
was worth a look atleast.
I see RDTSC is used directly in places such as GSPerfmon,
I actually found RDTSC to impair threading performance in CEMU when it was set to and used directly rather than using it via QPC, this was observed to have less consistent framerates as well as significantly affecting loading times (In BOTW) when using its triple core recompiler - And it had significant reproducibility variation amongst other machines, is there any connection to the framelimiter here or is GSPerfmon purely a report function?
Microsoft themselves advise not using it directly
We strongly discourage using the RDTSC or RDTSCP processor instruction to directly query the TSC because you won't get reliable results on some versions of Windows, across live migrations of virtual machines, and on hardware systems without invariant or tightly synchronized TSCs. Instead, we encourage you to use QPC to leverage the abstraction, consistency, and portability that it offers.
And it had significant reproducibility variation amongst other machines, is there any connection to the framelimiter here or is GSPerfmon purely a report function?
Purely reporting and nopped out on linux/macos for the most part. The spinlock showed the issue is likely inconsistent wake times in the framelimiter.
The spinlock version you attached last is golden, practically perfect 59.94, the worst off it ever got being 59.89
That isolates the wait as the issue.
This part is an edit to summarize the results of all the research and back and forth that took place below. As the current title implies, #4214 introduced an optimization to the net code that basically stopped the net thread from running under all circumstances (when the emulator is running) and doing nothing. A very sensible approach to optimize the code and avoid wasted cycles.
This optimization has resulted in a regression/bug of sorts, though it's not truly the fault of the emulator and the kind people who sustain it, but rather a freak coincidence as it turns out.
Intel Speedstep/Speedshift is the technology of Intel CPUs to adjust the frequency of each core depending on its load, and also balance the max overclock frequency depending on how many cores are burdened with work. It is why you see clock tables for new processors that say up to 5.2 Ghz for example on a single core or 4.7 Ghz on all cores. AMD does the same thing through their own technology, though apparently it's less dumb.
The issue that arose in this particular situation, which, annoyingly, is very difficult to reproduce due to its requirements, is as follows. You start a game, stand on the side, and the frame limiter kind of freaks outs. The FPS counter is all over the place, falling by as much as 20% lower than proper.
This is where SpeedStep/Speedshift comes in. In its duty to regulate the core frequencies to ensure a lower power consumption where appropriate, it destabilizes everything. It basically comes down to having a particular frequency range and a particular core load where one moment it judges the load is insufficient for a high clock, thus it reduces the core frequency, and the next it judges the load too high for a low clock, thus raising them back up. This is most pronounced on the Power Saver plan, and much less evident in the Balanced plan. It effectively goes away when switching to the High Performance plan.
This swinging core frequency has been responsible this entire damn time for the inconsistent framerates the emulator was reporting and all the hitching it was experiencing as a result when trying to catch up to these frequency changes. It explains why the existence of the always-active net thread, the spinlock frame limiter variant, and a locked-core-frequency-to-max resolved the issue before, as they all eliminated this pendulum effect the cores were experiencing.
As for how to fix it.. tough to say. I had a brief talk with refraction and, well, in my opinion it appears unlikely something might be done about it. I suggested a hack of implementing a dummy thread (and a relevant setting in the emulator to enable it) as a means of giving users the ability for the emulator to keep the CPU wide awake, without having to change the power plan (or adjust OC freq) like I'd have to do, but that doesn't look desirable. Making spinlock an "official" workaround instead seems bad from my end because the EE counter would become useless, always being at 100%.
So it seems that for now I'll just mess with the OC on my end to work around this, and hopefully the knowledge of this very particular and odd situation will remain in people's memories so that they may advise someone else about it should they be so darn unlucky to follow in my footsteps, lol
Lastly, you might also ask why this took so long to figure out! Well, it was suggested, but wasn't all that obvious. The frequency changes were happening so fast but in regular intervals that the tools I was using were not catching them on time to compare with the framrate dips. When a low clock number was reported, it was so out of sync with the dip itself I just glossed it over, and so we ended up testing a bunch of things in vain -- not that I do not appreciate the effort and willingness of everyone to participate and try to narrow it down or reproduce it from their end.
I had to up the clock frequency update rate on the monitor to properly catch this in the act, at which point a few short tests confirmed the suspicion for good. I've always had a knack for coming across unusual issues when it comes to computers!
===================== ORIGINAL MESSAGE BELOW =====================
As the title implies, the emulator suffers from inconsistency in the frame limiter when it is enabled, with the fps dipping below the PAL/NTSC standard despite the emulator having actual leeway to go much faster. This, at present, exists all the way to the latest dev version with absolute certainty.
Reproducing it is easy. Pick a lightweight title, be it PAL or NTSC, and navigate to some spot you prefer. Ensure the title you picked can run well over 100% of the standard framerate when the limiter is off, to ensure your framerate isn't limited by the system performance.
Once you are in a suitable spot, either wait for a bit and monitor the framerate through the OSD or just move the camera around. During my testing, the FPS through the OSD is unstable. While versions from 953 and back are nearly rock solid at 59.94 fps (NTSC), 954 and later show instability with it fluctuating often between the 59-61 range, and occasionally dipping way lower, by as much as 1/3. Often in the 52 range, sometimes even lower. EE usage also seems to spike during those moments for whatever reason, but this behavior does not occur when the limiter is off and the game runs as fast as it is able.
It would appear that the changes merged from 954-955 somehow brought about this situation that affects all games I have tried so far. My testing platform for narrowing down the revision was the original Splinter Cell title, as it is very easy to start the game fresh and get into the game in its training area real quick, then hang from the first wall and just pan the camera around to check the limiter's consistency.
For reference, system specs are:
Windows 10 20H2 i7 8700K @ 4.8 Ghz RTX 2080
refraction told me to link #4214 here so I hope this looks right. If any pictures for reference are needed let me know and I'll procure a couple, or a video maybe.
Lastly, I should point out that while I do use MTVU, this was tested with it both on and off, and occurs across all renderers. I would assume it also applies to SW but I can't well run games at 60 fps that way so I can't tell for sure.