cyring / CoreFreq

CoreFreq : CPU monitoring and tuning software designed for 64-bit processors.
https://www.cyring.fr
GNU General Public License v2.0

Loading corefreqk module utterly breaks my system (Kernel block layer?) #418

Closed WildPenquin closed 1 year ago

WildPenquin commented 1 year ago

For some reason, loading the corefreqk module utterly breaks my system after running with the module for a few hours. The system is stable (with uptimes of up to several days) when the corefreqk module is not loaded. I'm running Arch Linux on the -zen kernel branch, and I installed CoreFreq from the AUR packages. The hardware is a B550 chipset board (MSI Tomahawk) with a 5950X.

The symptoms could all be explained by the kernel's block layer breaking down so that the kernel can no longer read from or write to any block device, including the root filesystem. Logs are no longer written, and journalctl cannot be run to check the current logs. Running simple commands such as cat (from /bin/cat) results in I/O errors or segmentation faults.

Yesterday, I loaded corefreqk at 17:16:23 and the system broke down at 23:15:00, which is when the journal stopped being written. The system was in the almost unusable state described in the previous paragraph.

I'm a bit reluctant to reproduce the bug, since it takes 30 to 300 minutes to trigger and, as it involves the block layer, I fear it might result in data loss (though so far I haven't lost any data because of this bug). But it is indeed the corefreqk module that makes the system unstable, and I can reproduce the bug (eventually, just not at will).

Any ideas on how to monitor the system or collect useful debug logs before reproducing (if needed)? Would SSHing into the computer before triggering the bug, running journalctl -f (or similar), and saving the output on the other computer be a good idea?
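
The remote-capture idea in the question above could be sketched roughly as follows. This is a minimal sketch, not a tested procedure for this bug: it assumes key-based ssh access to the affected machine, and the function names and the log file name are hypothetical. Whether an already-running journalctl keeps delivering messages once the block layer fails is not known from this thread.

```python
import os
import subprocess
from typing import Iterable


def save_lines(lines: Iterable[str], path: str) -> int:
    """Append each line to `path` and force it to disk immediately,
    so that as little as possible is lost if the capture is cut short."""
    count = 0
    with open(path, "a", encoding="utf-8") as out:
        for line in lines:
            out.write(line if line.endswith("\n") else line + "\n")
            out.flush()
            os.fsync(out.fileno())
            count += 1
    return count


def capture_remote_journal(host: str, path: str) -> int:
    """Follow the kernel journal of `host` over ssh and store it locally.

    Runs on the second (healthy) computer. `-k` limits the output to
    kernel messages and `-f` follows new entries as they arrive.
    """
    cmd = ["ssh", host, "journalctl", "-k", "-f",
           "-o", "short-precise", "--no-pager"]
    with subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True) as proc:
        return save_lines(proc.stdout, path)
```

On the second computer this would be started before loading corefreqk, for example with `capture_remote_journal("user@affected-host", "corefreqk-journal.log")`, where both arguments are placeholders.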

cyring commented 1 year ago

> corefreqk module. I'm running on arch linux, on the -zen kernel branch

I'll stop on this point first: I have not tested this branch. If its patches bring shared components into conflict, lockups can appear. EDIT: where are those Zen patches?

The most fragile component is the SMU, from which the sensors are read periodically. If the commands it receives are interleaved between two sources (corefreqk.ko and the Zen patches), the outcome is unknown.

CSR registers, which are accessed through index and data registers, cannot be interleaved either. See the CSR register definitions in amd_reg.h.

CoreFreq is pretty exclusive in the way it works. The best prerequisites are the mainstream kernel with most SMU, CSR, MSR, P-State, and C-State drivers not loaded.
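
Why interleaving is a problem for an index/data register pair can be illustrated with a toy model. This is only a sketch of the general hazard: the class, the register numbers, and the stored values are hypothetical and do not reflect the actual layout in amd_reg.h.

```python
class IndexedRegisters:
    """Toy model of a register bank reached through an index/data pair:
    one write selects a register, a later read returns its content."""

    def __init__(self, values):
        self._values = dict(values)
        self._index = 0

    def write_index(self, idx):
        self._index = idx

    def read_data(self):
        return self._values[self._index]


# Hypothetical register numbers and contents, for illustration only.
bank = IndexedRegisters({0x10: "temperature", 0x20: "voltage"})

# Uninterrupted access: driver A selects 0x10 and reads it back.
bank.write_index(0x10)
sequential = bank.read_data()

# Interleaved access: driver B selects 0x20 between A's two steps.
bank.write_index(0x10)          # A: select its register
bank.write_index(0x20)          # B: select another register
interleaved = bank.read_data()  # A: read, but receives B's register
```

In the interleaved case driver A silently receives the content of the register that driver B selected, which is why two drivers sharing such a pair need a common lock or must not be loaded together.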

To track your issue, I would suggest ~two~ three approaches:

  1. Dump the kernel log to an external destination using crontab.
  2. Run under virtualization. This is less useful because the faulty registers may not be involved in a VM.
  3. Boot the CoreFreq ISO and let it run for the time required. This will confirm or eliminate the patched-kernel assumption.

WildPenquin commented 1 year ago

Hi cyring, thanks for your reply!

One thing before I proceed: browsing previous bug reports, I noticed there is a section called "Software incompatibilities and workarounds". I had k10temp loaded at the same time. Do you think this could be the cause? I am/was using the sensors constantly, as I need them to adjust my cooling system.

But I also found an old bug report claiming that k10temp can now be used alongside corefreqk, although that is not in the wiki. What is the current state of things?

EDIT: As for the Zen patches, I'm not sure there is an easy way to get a list of all the code changes. I'm in a bit over my head when it comes to knowing which changes could cause problems. But the upstream code repository is here: https://github.com/zen-kernel/zen-kernel - and there's some information in their Wiki and FAQ (both on GitHub) about their branches. Also: https://liquorix.net/#features (the Liquorix kernel is essentially the same, AFAICT?)
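
A quick way to see whether corefreqk and a sensor driver are loaded at the same time is to parse /proc/modules, where the first field of each line is the module name. The following is a minimal sketch: the function names are hypothetical, and the suspect list contains only k10temp because that is the one driver named in this thread.

```python
def loaded_modules(proc_modules_text):
    """Return the set of module names found in /proc/modules content."""
    names = set()
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if fields:
            names.add(fields[0])
    return names


def find_overlap(proc_modules_text, suspects=("k10temp",)):
    """List suspect drivers that are loaded alongside corefreqk.

    Returns an empty list when corefreqk itself is not loaded.
    """
    loaded = loaded_modules(proc_modules_text)
    if "corefreqk" not in loaded:
        return []
    return sorted(name for name in suspects if name in loaded)


def check_running_system(path="/proc/modules"):
    """Apply find_overlap to the module list of the running kernel."""
    with open(path, encoding="utf-8") as handle:
        return find_overlap(handle.read())
```

Calling `check_running_system()` on the affected machine returns the suspect drivers currently loaded together with corefreqk. It only reports co-loading and does not show that the drivers actually conflict.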

cyring commented 1 year ago

I spend hours programming with CoreFreq running in parallel without a crash. But mine is a Matisse 3950X on the plain Arch Linux kernel, with no k10temp or other sensor drivers in use.

Thus, to be sure of the environment, I would say the best way is to boot my ISO image and let it run for a minimum of 30 minutes.

Please let me know if you can proceed with this.

cyring commented 1 year ago

@WildPenquin Hello,

Because your kernel flavor is not supported, did you have a chance to run the ISO master live image with your processor?

Get ISO at www.cyring.fr

cyring commented 1 year ago

I'm about 4 hours in with no crash.

[Screenshots: 2023-03-02-123224_642x410_scrot, 2023-03-02-123219_644x1012_scrot]

Feel free to provide answers.

cyring commented 1 year ago

Thus I suspect a correlation with your kernel environment. That's why I'm asking you to boot and test with my ISO.

cyring commented 6 months ago

Hi, are you still facing a crash using the latest master branch? Can you show screenshots if it is OK?

WildPenquin commented 6 months ago

Hi,

Given all the hassle, I've given up on trying to use corefreqk. The last time I tried, yes, I did get the issue on all the Arch kernel flavors I tried. As I'm a bit time-constrained, I don't have the time to debug this in the near future.