Frogging-Family / linux-tkg

linux-tkg custom kernels
GNU General Public License v2.0

Hangs on full load on processors with e-cores #963

Open SheMelody opened 4 days ago

SheMelody commented 4 days ago

OS: Arch Linux

Using linux-tkg causes the system to hang when running under full load for prolonged periods on processors with E-cores. Disabling the E-cores in the BIOS fixes the issue. This is not a hardware issue, as the other kernels I have tested (arch, linux-xanmod, linux-zen) do not have this problem.

Tk-Glitch commented 4 days ago

Can we please have more data? Compiler used, uarch optimization selected, CPU scheduler?

I'd try changing _custom_commandline="intel_pstate=passive kernel.split_lock_mitigate=0" in customization.cfg (or your frogminer cfg if you're using that) to _custom_commandline="intel_pstate=active kernel.split_lock_mitigate=0". If it fixes your issue (which is likely if you're not using a fancy CPU scheduler) it's technically a hardware issue, but to be fair all Intel big.LITTLE CPUs are basically broken hardware from factory so not exactly unexpected, and the workaround is totally acceptable from a user's standpoint imho.
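As a rough illustration of the suggested change, the sketch below rewrites the `intel_pstate=` token inside a `_custom_commandline="..."` line. The helper name `set_pstate_mode` is hypothetical and not part of linux-tkg; it assumes the option sits on a single line with double quotes, as in the comment above, and it only operates on text passed to it.

```python
import re

# Hypothetical helper, not part of linux-tkg: it edits cfg text in memory only.
_CMDLINE_RE = re.compile(r'^(_custom_commandline=")([^"]*)(")', re.MULTILINE)


def set_pstate_mode(cfg_text: str, mode: str) -> str:
    """Return cfg_text with intel_pstate=<mode> set in _custom_commandline."""
    if mode not in ("active", "passive"):
        raise ValueError("this sketch only handles 'active' and 'passive'")

    def _rewrite(match: re.Match) -> str:
        prefix, value, suffix = match.group(1), match.group(2), match.group(3)
        # Drop any existing intel_pstate= token, keep every other parameter.
        params = [p for p in value.split() if not p.startswith("intel_pstate=")]
        params.insert(0, f"intel_pstate={mode}")
        return prefix + " ".join(params) + suffix

    return _CMDLINE_RE.sub(_rewrite, cfg_text)


if __name__ == "__main__":
    before = '_custom_commandline="intel_pstate=passive kernel.split_lock_mitigate=0"\n'
    print(set_pstate_mode(before, "active"), end="")
```

The new value is baked in at build time, so the kernel has to be rebuilt after editing the cfg for it to take effect.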

SheMelody commented 3 days ago

> Can we please have more data? Compiler used, uarch optimization selected, CPU scheduler?
>
> I'd try changing _custom_commandline="intel_pstate=passive kernel.split_lock_mitigate=0" in customization.cfg (or your frogminer cfg if you're using that) to _custom_commandline="intel_pstate=active kernel.split_lock_mitigate=0". If it fixes your issue (which is likely if you're not using a fancy CPU scheduler) it's technically a hardware issue, but to be fair all Intel big.LITTLE CPUs are basically broken hardware from factory so not exactly unexpected, and the workaround is totally acceptable from a user's standpoint imho.

I used the latest compiled Release package for Arch, though that doesn't really matter, to be honest, since I've found what the problem is.

Both Intel and AMD have been silently changing the x86 standard, which could be considered a "hardware issue", and it is a blatant problem when it comes to CCXs, E-cores and the like.

I have tested a bit more, and apparently it's an issue with voltages (not on my side, and not a hardware fault). I've been tinkering with a lot of systems for years, and one thing I know for sure is that all processors will randomly hang if they are not given enough voltage and they suddenly try to transition to a lower idle state (i.e. one with a higher frequency and voltage).

Raising the Load Line Calibration to Level 8, which decreases the Vdroop during power transitions, effectively fixes this problem, but it pushes temperatures to almost 90 °C on full load with a high-end liquid cooler on my 14700K. And no, this is still not a hardware issue, and I certainly don't want to run the CPU at 90 °C on full load with my computer sounding like a jet.

intel_pstate can still handle these power transitions just fine with all governors down to level 2 LLC, which is a pretty good result. It is straightforward to assume that intel_cpufreq does something wrong with these processors when transitioning between power states, and that disabling E-cores mitigates it by reducing the chance of this happening. The fact that raising LLC also fixes the problem strongly suggests that this is a voltage transition problem. Whatever intel_cpufreq is doing, it's doing it wrong in this kernel.
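To tell which of the two drivers a given boot is actually using, the cpufreq sysfs files can be read directly. The sketch below is only illustrative: the function name `read_cpufreq_state` is made up, and it assumes the usual layout under `/sys/devices/system/cpu`, where `scaling_driver` typically reads `intel_pstate` in active mode and `intel_cpufreq` in passive mode. Missing files are reported as `None` rather than raising.

```python
from pathlib import Path
from typing import Optional

# Standard cpufreq sysfs location; taken as a parameter so it can be overridden.
CPU_SYSFS = Path("/sys/devices/system/cpu")


def _read(path: Path) -> Optional[str]:
    """Return the stripped file contents, or None if the file is unreadable."""
    try:
        return path.read_text().strip()
    except OSError:
        return None


def read_cpufreq_state(cpu_root: Path = CPU_SYSFS, cpu: int = 0) -> dict:
    """Collect the scaling driver, governor and intel_pstate status for one CPU."""
    policy = cpu_root / f"cpu{cpu}" / "cpufreq"
    return {
        "scaling_driver": _read(policy / "scaling_driver"),
        "scaling_governor": _read(policy / "scaling_governor"),
        # Only present when the intel_pstate driver is registered.
        "intel_pstate_status": _read(cpu_root / "intel_pstate" / "status"),
    }


if __name__ == "__main__":
    for key, value in read_cpufreq_state().items():
        print(f"{key}: {value}")
```

Including this output in a report answers the driver and governor part of the "more data" question without guesswork.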

Having established this, I tried the same thing on my other system, which has a 14900K and which I mainly use for AI, and I got exactly the same results.

That being said, if your concern is reducing stutters, you're taking the wrong path. You only need to address poorly implemented power management on the motherboard's side; everything else can be left alone. Disable ASPM, PCI port power management and Advanced Power Management. Components will still idle: these settings do not control idling of your components, they only affect the motherboard's power management. This, together with split_lock_detect=0, kernel.split_lock_mitigate=0 and some other tweaks on other modules, is more than enough to get an extremely stable experience in literally any game.
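When comparing setups like these, it helps to confirm which parameters the running kernel was actually booted with. The sketch below parses a kernel command line into a dictionary and reports the parameters named in this thread. The function names are hypothetical, and the parser splits on whitespace only, so quoted values containing spaces are not handled. It reports what is present without judging whether a value is valid.

```python
from pathlib import Path
from typing import Dict, Optional

# Parameters mentioned in this thread; the list itself is just an example.
WANTED = ("intel_pstate", "split_lock_detect", "kernel.split_lock_mitigate")


def parse_cmdline(cmdline: str) -> Dict[str, Optional[str]]:
    """Split a kernel command line into {name: value}; bare flags map to None."""
    params: Dict[str, Optional[str]] = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


def report(cmdline: str, wanted=WANTED) -> Dict[str, str]:
    """Return a printable status for each wanted parameter."""
    params = parse_cmdline(cmdline)
    result = {}
    for name in wanted:
        if name not in params:
            result[name] = "<absent>"
        else:
            value = params[name]
            result[name] = value if value is not None else "<flag>"
    return result


if __name__ == "__main__":
    try:
        current = Path("/proc/cmdline").read_text()
    except OSError:
        current = ""  # not on Linux, or /proc is unavailable
    for name, status in report(current).items():
        print(f"{name}: {status}")
```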

The following video shows Borderlands 3 flawlessly running on Arch Linux on Ultra settings without stutters: https://www.youtube.com/watch?v=hDWkD3LoKr0

This is all I had to say and I hope it helps with development.

Tk-Glitch commented 3 days ago

We are not touching intel_cpufreq, intel_pstate, nor any form of voltage scaling in any way though. All Intel big.LITTLE CPUs have voltage/frequency transition issues with E-cores enabled, or at least they do on most motherboards (my assumption being it's more of a firmware issue). This is most likely exacerbated by our aggressive ondemand governor which is the only possible culprit - again as long as you're using the stock CPU scheduler (EEVDF). Using a different governor or the fake governors from intel_pstate will give you the exact same behavior as stock kernel. Non-big.LITTLE Intel CPUs (tested on xeons and older mainstream series) aren't affected. AMD CPUs don't have such an issue either.
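Since the behavior depends on whether E-cores are enabled, it can be useful to record which logical CPUs are P-cores and which are E-cores when reporting results. The sketch below assumes that the kernel exposes these lists in `/sys/devices/cpu_core/cpus` and `/sys/devices/cpu_atom/cpus` on hybrid Intel systems; treat those paths as an assumption to verify on your machine. The helper names are hypothetical.

```python
from pathlib import Path
from typing import List

# Assumed sysfs locations for hybrid Intel CPUs; verify them on your system.
P_CORE_LIST = Path("/sys/devices/cpu_core/cpus")
E_CORE_LIST = Path("/sys/devices/cpu_atom/cpus")


def parse_cpu_list(text: str) -> List[int]:
    """Expand a kernel CPU list such as '0-3,8,10-11' into individual CPU ids."""
    cpus: List[int] = []
    for part in text.strip().split(","):
        if not part:
            continue  # empty input or stray comma
        first, sep, last = part.partition("-")
        if sep:
            cpus.extend(range(int(first), int(last) + 1))
        else:
            cpus.append(int(first))
    return cpus


def read_cpu_list(path: Path) -> List[int]:
    """Read and expand a CPU list file; an unreadable file yields an empty list."""
    try:
        return parse_cpu_list(path.read_text())
    except OSError:
        return []


if __name__ == "__main__":
    print("P-core CPUs:", read_cpu_list(P_CORE_LIST))
    print("E-core CPUs:", read_cpu_list(E_CORE_LIST))
```

An empty E-core list would be expected on non-hybrid CPUs and may also appear when the E-cores are disabled in the BIOS.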

SheMelody commented 3 days ago

> We are not touching intel_cpufreq, intel_pstate, nor any form of voltage scaling in any way though. All Intel big.LITTLE CPUs have voltage/frequency transition issues with E-cores enabled, or at least they do on most motherboards (my assumption being it's more of a firmware issue). This is most likely exacerbated by our aggressive ondemand governor which is the only possible culprit - again as long as you're using the stock CPU scheduler (EEVDF). Using a different governor or the fake governors from intel_pstate will give you the exact same behavior as stock kernel. Non-big.LITTLE Intel CPUs (tested on xeons and older mainstream series) aren't affected. AMD CPUs don't have such an issue either.

The aggressive ondemand governor does indeed make the problem worse. Still, intel_cpufreq works really trash on processors with e-cores, in general, even on other kernels, so I don't really see a reason to use it on such processors. I understand that, as I said earlier, companies like Intel and AMD have been silently and heavily changing the x86/ACPI standards, but we, as users, even if extremely knowledgeable, can't really do anything about it.

Your kernel works fine when using intel_pstate either way and there's really no other viable alternative on such very recent processors, unless you're using something up to Intel 11th gen (or up to Ryzen 5000 when it comes to other issues that I'm not going to mention because that would lead us off-topic).

Try disabling the motherboard's dumb power management; that's what really causes microstutters and stuttering, notably on MSI, ASRock and Gigabyte boards. Generally, I disable those settings and leave everything else alone (except for a few minor tweaks, of course) and everything is buttery smooth.

Tk-Glitch commented 3 days ago

> Still, intel_cpufreq works really trash on processors with e-cores, in general, even on other kernels

You're absolutely right.

I do have a couple of Gigabyte boards around to test with, I'll try to check this out. I'll need to borrow a 13th/14th gen mainstream CPU though ^^'

SheMelody commented 3 days ago

> > Still, intel_cpufreq works really trash on processors with e-cores, in general, even on other kernels
>
> You're absolutely right.
>
> I do have a couple of Gigabyte boards around to test with, I'll try to check this out. I'll need to borrow a 13th/14th gen mainstream CPU though ^^'

I have plenty of systems on which I have tested quite a lot of kernel parameters. Some motherboards were fine by default when it comes to microstutters, while most of them weren't.

As a small example, a system with an i5-7400 and an MSI board was microstuttering, especially in VKD3D, until I turned all of the motherboard's power management off, while a system with an i7-10700F and an ASUS board had no relevant stutters by default, even with ASPM and power management enabled all along.

The worst case I've seen is my boyfriend's system, which has an i7-14700K and a Gigabyte board. It stutters heavily unless all of the motherboard's power management is disabled; otherwise games are literally unplayable on it. My system, which has a 14700K and an MSI board, microstutters when using the motherboard's power management, just not as badly as my boyfriend's system does.

So yes, I'd definitely give that a try.