cyring / CoreFreq

CoreFreq : CPU monitoring and tuning software designed for 64-bit processors.
https://www.cyring.fr
GNU General Public License v2.0

sporadic system freeze - kernel 5.15+ #343

Closed lhommev closed 2 years ago

lhommev commented 2 years ago

Hi,

First, I'd like to say I really like the tool. Congrats on the achievements!

I experience system freezes (hard reset required) when using CoreFreq with kernels 5.15 onwards. Hardware: Zen 3 Ryzen 5950X, X570 motherboard, Ubuntu Linux 22. The issue doesn't look related to CoreFreq, and it took me months to narrow it down to this software. I lost some hair.

Symptoms:

  • Freezes sometimes when opening a Firefox tab. Some sites are more likely to trigger it. The more embedded videos, the more likely the freeze.
  • Freezes sometimes when opening photo editing software
  • Freezes sometimes when opening videos
  • Freezes about 50% of the time after system wakeup

The freeze happens every 10-20 minutes in my regular workflow.

What has NO effect:

  • swapping hardware
  • disabling power states in BIOS
  • disabling power states on the Linux command line
  • any firmware setting except SMT
  • forcing CPU frequency
  • system load

What does have an effect:

  • Disabling SMT (simultaneous multithreading) in BIOS -> 100% stable
  • Running Ubuntu 20.04, kernels up to 5.11 -> 100% stable
  • Running without corefreqk loaded -> 100% stable

It sounds like some very complicated corner case.

cyring commented 2 years ago

Hello,

Do you have any other SMU drivers running simultaneously with CoreFreq? SMU conflicts will happen with k10temp, ryzen_smu, ryzen_nb_smu and any other software that reads or writes the SMU. Run lsmod to find them and unload them before loading CoreFreq.
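For readers who want to automate that check, here is a minimal C sketch. It assumes the module list is available as text in the /proc/modules format (the same data that lsmod prints), where each line starts with the module name followed by a space. The helper names `module_listed` and `count_smu_conflicts` are hypothetical and are not part of CoreFreq.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Module names mentioned in this thread as conflicting with CoreFreq. */
static const char *const smu_conflicts[] = {
    "k10temp", "ryzen_smu", "ryzen_nb_smu"
};

/* Return 1 if `name` is the first field of any line in `modules`,
 * where `modules` is text in /proc/modules format. Hypothetical helper. */
static int module_listed(const char *modules, const char *name)
{
    assert(modules != NULL && name != NULL);
    size_t len = strlen(name);
    const char *line = modules;

    while (line != NULL && *line != '\0') {
        /* Require a space after the name so "ryzen_smu" does not
         * match a longer name that merely starts the same way. */
        if (strncmp(line, name, len) == 0 && line[len] == ' ')
            return 1;
        line = strchr(line, '\n');
        if (line != NULL)
            line++;            /* move to the start of the next line */
    }
    return 0;
}

/* Count how many of the known conflicting modules are loaded. */
static size_t count_smu_conflicts(const char *modules)
{
    size_t n = 0;
    for (size_t i = 0; i < sizeof smu_conflicts / sizeof smu_conflicts[0]; i++)
        n += module_listed(modules, smu_conflicts[i]) ? 1 : 0;
    return n;
}
```

A caller would read /proc/modules into a NUL-terminated buffer and pass it to `count_smu_conflicts()`. A non-zero result means at least one of the listed modules should be unloaded first.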

lhommev commented 2 years ago

Yes, k10temp is loaded.

cyring commented 2 years ago

Please unload k10temp and use CoreFreq with full SMU control.

For your information, the SMU component provides various registers: temperature, voltage, power and so on.

The SMU offers a select register and a data register:

  1. Software writes the target register into the SMU select port
  2. and then reads/writes its value through the SMU data port.

If several programs are selecting/reading/writing simultaneously, the SMU gets lost and freezes. It may take a while for such a conflict to happen.
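The two-step access described above can be illustrated with a small user-space model in C. This is only a sketch under stated assumptions: `struct sim_smu`, the register indexes and all function names are hypothetical, and no real SMU address or CoreFreq code is involved. The model shows only how a second client's select can land between the first client's two steps. It does not reproduce the freeze.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical register indexes; not real SMU addresses. */
enum { SIM_REG_TEMP = 0, SIM_REG_VOLT = 1, SIM_REG_COUNT = 2 };

/* Simulated select/data pair backed by a small register file. */
struct sim_smu {
    uint32_t select;               /* last index written to the select port */
    uint32_t regs[SIM_REG_COUNT];  /* values exposed through the data port */
};

/* Step 1: choose which register the data port refers to. */
static void sim_select(struct sim_smu *smu, uint32_t reg)
{
    assert(reg < SIM_REG_COUNT);
    smu->select = reg;
}

/* Step 2: read whatever register is currently selected. */
static uint32_t sim_read_data(const struct sim_smu *smu)
{
    return smu->regs[smu->select];
}

/* Exclusive access: nothing happens between the two steps. */
static uint32_t read_temp_exclusive(struct sim_smu *smu)
{
    sim_select(smu, SIM_REG_TEMP);
    return sim_read_data(smu);
}

/* Conflicting access: client B selects another register after
 * client A's step 1, so A's step 2 returns B's register. */
static uint32_t read_temp_interleaved(struct sim_smu *smu)
{
    sim_select(smu, SIM_REG_TEMP);  /* client A, step 1 */
    sim_select(smu, SIM_REG_VOLT);  /* client B, step 1 overwrites select */
    return sim_read_data(smu);      /* client A, step 2 reads the wrong one */
}
```

With the temperature register set to 65 and the voltage register set to 1200, the exclusive read returns 65 while the interleaved read returns 1200. This is why the select and data steps must not be shared between uncoordinated drivers.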

The solution provided by the Linux kernel is to protect SMU accesses with a mutex. CoreFreq does not use this solution because its CPU monitoring runs in interrupt context, where no blocking call (such as taking a mutex) is acceptable.

See Wiki/k10temp-and-zenpower-kernel-modules

So CoreFreq has to drive the AMD Zen SMU exclusively.

lhommev commented 2 years ago

I see, thanks. It makes sense. As an engineer, I see this incompatibility as quite hazardous. IMO, CoreFreq should not bypass the mutex on the assumption that users have taken care of software compatibility issues. I mean, most users won't know.

Does CoreFreq perform write operations to the SMU registers?

cyring commented 2 years ago

Just a few of them (start at Core_AMD_SMN_Write()).

Examples:

cyring commented 2 years ago

... Corefreq should not bypass the mutex ...

I won't even try to use a mutex in a kernel high-resolution timer (hrtimer) handler.

cyring commented 2 years ago

Hello,

If you find the master branch stable when following the SMU conflict recommendations, feel free to close the issue.

I would also appreciate a 5950X report. As an example, see my 3950X page.

Your report will then be added to the CPU support list.

Thank you, CyrIng

lhommev commented 2 years ago

OK. I'll update the CPU list.

regards, Vincent