cyring / CoreFreq

CoreFreq : CPU monitoring and tuning software designed for 64-bit processors.
https://www.cyring.fr
GNU General Public License v2.0

[Solved] CoreFreq gets stuck at 100% CPU load when kernel module is not active #352

Closed Betaminos closed 2 years ago

Betaminos commented 2 years ago

Hello there,

I seem to have stumbled across one or two bugs. My device is based around an Intel 1165G7 and I am running Arch Linux, up to date as of now, with the default kernel (5.18.14-arch1-1.1). (CoreFreq did previously - as in a few months back - work on this device, but I have no idea when it stopped working or what has changed) I have installed CoreFreq via yay and have selected the non-git versions: yay corefreq-client corefreq-server corefreq-dkms

When I run "sudo modprobe corefreqk", I only get "Killed" as a reply. However, starting corefreqd seems to work just fine, as it starts and stays up:

[beta@oxp ~]$ sudo systemctl status corefreqd
● corefreqd.service - CoreFreq Daemon
     Loaded: loaded (/usr/lib/systemd/system/corefreqd.service; disabled; preset: disabled)
     Active: active (running) since Thu 2022-07-28 18:01:35 +08; 1min 43s ago
   Main PID: 11602 (corefreqd-pmgr)
      Tasks: 12 (limit: 18904)
     Memory: 948.0K
        CPU: 36ms
     CGroup: /system.slice/corefreqd.service
             ├─11602 corefreqd -q
             └─11603 corefreqd -q

Jul 28 18:01:35 oxp systemd[1]: Started CoreFreq Daemon.

However, upon starting "corefreq-cli", the terminal freezes, in the sense that it no longer accepts any input: neither [CTRL]+[C] nor anything similar has any effect. Checking on the process with htop shows that it is eating a full core, pegging it at 100% indefinitely. I need to forcefully kill corefreq-cli via a SIGTERM to get rid of it.
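One way a client could notice this situation before it starts polling is to check whether the driver is actually listed as loaded. The sketch below is not taken from CoreFreq; `module_is_live` is a hypothetical helper that scans text in the `/proc/modules` layout (name, size, reference count, dependencies, state, address) and reports whether a module is in the "Live" state.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, not part of CoreFreq.
 * Scans text in the /proc/modules layout
 *   name size refcount deps state address
 * and returns true only if `name` is listed with state "Live". */
static bool module_is_live(FILE *modules, const char *name)
{
    char line[4096], mod[64], state[16];

    while (fgets(line, sizeof line, modules)) {
        /* Fields 2 to 4 are skipped; only the name and state matter here. */
        if (sscanf(line, "%63s %*s %*s %*s %15s", mod, state) == 2
            && strcmp(mod, name) == 0)
            return strcmp(state, "Live") == 0;
    }
    return false; /* not listed at all */
}
```

On a real system the stream would come from `fopen("/proc/modules", "r")` and the name would be `corefreqk`; a `false` result could then be turned into an error message instead of a busy wait.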

I would therefore like to propose that corefreq-cli gets a kind of timeout, if it is not able to start up or connect to the kernel module. Alternatively, (going on my gut feeling, sorry if I am completely wrong with this) the 100% CPU load seems to hint at the program getting stuck in a loop and trying to do the same thing indefinitely. If this is the case, getting it to stop with an error message after a set amount of failed attempts would be able to help the user in narrowing down what is happening.
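To illustrate the proposal, here is a minimal sketch of a bounded retry loop. It is not CoreFreq code: `wait_until_ready`, `ready_probe` and `countdown_probe` are hypothetical names, and the probe stands in for whatever check the client would perform against the driver or daemon. The loop sleeps between attempts instead of spinning and gives up with a message after a fixed number of tries.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdbool.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical readiness probe: returns true once the resource is usable. */
typedef bool (*ready_probe)(void *ctx);

/* Call `probe` at most `max_attempts` times, sleeping `delay_ms`
 * milliseconds between tries. Returns 0 on success, -1 after giving up. */
static int wait_until_ready(ready_probe probe, void *ctx,
                            unsigned max_attempts, long delay_ms)
{
    struct timespec pause = { delay_ms / 1000, (delay_ms % 1000) * 1000000L };

    for (unsigned attempt = 1; attempt <= max_attempts; attempt++) {
        if (probe(ctx))
            return 0;
        if (attempt < max_attempts)
            nanosleep(&pause, NULL); /* yield the CPU instead of spinning */
    }
    fprintf(stderr, "gave up after %u attempts; is the driver loaded?\n",
            max_attempts);
    return -1;
}

/* Demonstration probe: fails until the counter it points to reaches zero. */
static bool countdown_probe(void *ctx)
{
    int *remaining = ctx;
    return (*remaining)-- <= 0;
}
```

With a counter of 2, `countdown_probe` succeeds on the third call, so `wait_until_ready(countdown_probe, &counter, 5, 100)` returns 0; with a limit of 3 attempts and a counter that never reaches zero in time, it returns -1 and prints the message.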

Please let me know if I can provide any additional information that can be of help here.

Thank you for your time.

cyring commented 2 years ago

Hello,

The driver corefreqk.ko may have crashed for some reason.

I will need any trace of driver crash from the kernel log (dmesg)

and the CoreFreq version (corefreq-cli -v)

cyring commented 2 years ago

I can't reproduce your issue with [11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80

[Screenshot: 2022-07-29-061640_644x550_scrot]

Try running CoreFreq manually, without any package. You will have to remove the installed packages before following the procedure below:

  1. git clone https://github.com/cyring/CoreFreq.git
  2. cd CoreFreq
  3. make DELAY_TSC=1 clean all

Open Terminal as root to start Driver then Daemon

insmod corefreqk.ko
./corefreqd -d

Open Terminal as a User to start Client

./corefreq-cli

Betaminos commented 2 years ago

I have uninstalled my previous CoreFreq installation, cloned the git and built it using your commands. Afterwards, I did a reboot, logged in and tried to insert the module:

[beta@oxp CoreFreq]$ sudo insmod corefreqk.ko
[sudo] password for beta:
Killed

I have dumped my dmesg into a log file and have attached it: dmesg.log

Please let me know if I can provide anything else.

cyring commented 2 years ago

SMBIOS_Decoder appears to fail on your BIOS

I'll ask you to comment out that function call in the corefreqk.c driver source code, then rebuild and test again.

cyring commented 2 years ago

Go to this line: https://github.com/cyring/CoreFreq/blob/a485cecf002c950c97955e81d20d87059115f7f2/corefreqk.c#L21586

and comment out the function call as below:


    /*  Copy various SMBIOS data [version 3.2]          */
    SMBIOS_Collect();
/*
    SMBIOS_Decoder();
*/

then rebuild and test

Betaminos commented 2 years ago

Commenting out line 21586 and rebuilding CoreFreq does successfully work around the crash- i.e. it loads and runs fine. Is there any help I can provide related to narrowing it down further or making the application more resistant towards this error?

cyring commented 2 years ago

Thanks a lot for your help.

I have to simulate/fake your setup to reproduce the issue.

According to the kernel log, strlen fails somewhere, and I presume this happens around these lines: https://github.com/cyring/CoreFreq/blob/a485cecf002c950c97955e81d20d87059115f7f2/corefreqk.c#L20974
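The thread does not show what the decoder passes to strlen, so the following is only an assumption: a common way for strlen to fault is being handed a NULL pointer (for example a string field the firmware did not provide) or a buffer with no terminator. A minimal user-space sketch of a guarded length, using a hypothetical helper `field_len` and the standard `strnlen`:

```c
#define _POSIX_C_SOURCE 200809L
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not taken from corefreqk.c.
 * Length of a possibly missing string field: 0 for NULL, and never
 * reads more than `max` bytes even if the terminator is absent. */
static size_t field_len(const char *field, size_t max)
{
    return field != NULL ? strnlen(field, max) : 0;
}
```

The same guard should be expressible inside a kernel module, since the kernel provides its own `strnlen`, but this sketch has not been tried against the driver.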

cyring commented 2 years ago

Can you try this attached version ? CoreFreq_develop.tar.gz

Betaminos commented 2 years ago

This version compiles, inserts and runs seemingly fine as well. The daemon outputs this on start:

[beta@oxp CoreFreq]$ sudo ./corefreqd -d
CoreFreq Daemon 1.91.4 Copyright (C) 2015-2022 CYRIL INGENIERIE

Processor [11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz]
Architecture [Tiger Lake/U] 4/4 CPU Online.
SleepInterval(1000), SysGate(2000), 2335 tasks

CPU #000 @ 2803.01 MHz
CPU #001 @ 2803.16 MHz
CPU #002 @ 2802.98 MHz
CPU #003 @ 2803.19 MHz
Thread [7f1fd37b6640] Init CHILD 003
Thread [7f1fd37b6640] Init CYCLE 003
Thread [7f1fd47b8640] Init CYCLE 001
Thread [7f1fd3fb7640] Init CHILD 002
Thread [7f1fd47b8640] Init CHILD 001
Thread [7f1fd3fb7640] Init CYCLE 002
Thread [7f1fd4fb9640] Init CHILD 000
Thread [7f1fd4fb9640] Init CYCLE 000
    NTFY || ....
    RING[1](c60d,0)(0:0,0:8aa)

And the SMBIOS data is populated nicely as well. [image]

cyring commented 2 years ago

Great !

Are you also decoding the DIMM manufacturer and identifiers, like G Skill Intl and F4-3600C16-16GTZN below? [Screenshot: 2022-07-29-114527_644x1012_scrot] These are also printed in the IMC window.

Betaminos commented 2 years ago

I am not sure about this. The SMBIOS lists this: [image]

Memory Controller: [image]

However, this is a Chinese handheld laptop with soldered RAM, so it could very well be that this information is not present in the device. Due to this, I would assume that the output is as good as it gets, but I will gladly test whatever you need to improve your program.

Thank you for the great support!

cyring commented 2 years ago

OK, I see: we are getting the DIMM manufacturer but not the IDs, which are now substituted with an empty string.
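The substitution described here can be sketched as follows. This is an illustration only, not the actual fix that was released: `copy_or_empty` is a hypothetical helper that copies a field into a fixed-size buffer and falls back to an empty string when the source is missing.

```c
#include <stdio.h>

/* Hypothetical helper, not the code from the CoreFreq fix.
 * Copy `src` into `dst` (capacity `cap` bytes), truncating if needed;
 * a NULL `src` is stored as an empty string. */
static void copy_or_empty(char *dst, size_t cap, const char *src)
{
    /* snprintf always terminates the output when cap > 0 and writes
     * nothing at all when cap == 0. */
    snprintf(dst, cap, "%s", src != NULL ? src : "");
}
```

Code that displays the field later can then treat every entry as a valid string, whether or not the firmware supplied one.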

Thank you for helping to troubleshoot this issue. Time to release the bug fix.

cyring commented 2 years ago

The CoreFreq ArchLinux packages have been updated. I think you can now install from the package manager (and close this issue if everything is OK).