cyring / CoreFreq

CoreFreq : CPU monitoring and tuning software designed for 64-bit processors.
https://www.cyring.fr
GNU General Public License v2.0

CoreFreq not displaying any data on AMD Threadripper 3970X system #327

Closed connebeest closed 2 years ago

connebeest commented 2 years ago

The CoreFreq CLI is not displaying any data, even though the kernel module is loaded and the daemon is running.

sudo ./corefreqd -d

`CoreFreq Daemon 1.89.4 Copyright (C) 2015-2022 CYRIL INGENIERIE

Processor [AMD Ryzen Threadripper 3970X 32-Core Processor]
Architecture [Zen2/Castle Peak] 64/64 CPU Online.
SleepInterval(1000), SysGate(2000), 2326 tasks

CPU #000 @ 3693.08 MHz
CPU #001 @ 3693.08 MHz
CPU #002 @ 3693.07 MHz
CPU #003 @ 3693.08 MHz
CPU #004 @ 3693.06 MHz
CPU #005 @ 3693.05 MHz
CPU #006 @ 3693.07 MHz
CPU #007 @ 3693.06 MHz
CPU #008 @ 3693.07 MHz
CPU #009 @ 3693.06 MHz
CPU #010 @ 3693.07 MHz
CPU #011 @ 3693.05 MHz
CPU #012 @ 3693.06 MHz
CPU #013 @ 3693.06 MHz
CPU #014 @ 3693.06 MHz
CPU #015 @ 3693.06 MHz
CPU #016 @ 3693.07 MHz
CPU #017 @ 3693.06 MHz
CPU #018 @ 3693.06 MHz
CPU #019 @ 3693.07 MHz
CPU #020 @ 3693.07 MHz
CPU #021 @ 3693.06 MHz
Thread [7efcae525700] Init CYCLE 015
Thread [7efcaed26700] Init CYCLE 014
Thread [7efcafd28700] Init CYCLE 012
Thread [7efcaf527700] Init CYCLE 013
CPU #022 @ 3693.06 MHz
Thread [7efcb4d32700] Init CYCLE 002
CPU #023 @ 3693.06 MHz
Thread [7efcb4531700] Init CYCLE 003
Thread [7efcb5533700] Shutdown CYCLE 001
CPU #024 @ 3693.06 MHz
CPU #025 @ 3693.06 MHz
CPU #026 @ 3693.08 MHz
CPU #027 @ 3693.06 MHz
CPU #028 @ 3693.06 MHz
CPU #029 @ 3693.06 MHz
CPU #030 @ 3693.06 MHz
CPU #031 @ 3693.06 MHz
CPU #032 @ 3693.08 MHz
CPU #033 @ 3693.08 MHz
CPU #034 @ 3693.08 MHz
CPU #035 @ 3693.08 MHz
CPU #036 @ 3693.08 MHz
CPU #037 @ 3693.08 MHz
CPU #038 @ 3693.07 MHz
CPU #039 @ 3693.08 MHz
CPU #040 @ 3693.08 MHz
CPU #041 @ 3693.08 MHz
CPU #042 @ 3693.07 MHz
Thread [7efcb0529700] Init CYCLE 011
Thread [7efcb1d2c700] Init CYCLE 008
CPU #043 @ 3693.07 MHz
Thread [7efca5d14700] Shutdown CHILD 032
Thread [7efca5513700] Shutdown CHILD 033
Thread [7efc92ffd700] Init CYCLE 043
Thread [7efc98d1a700] Init CYCLE 040
Thread [7efcb5d34700] Shutdown CHILD 000
Thread [7efcabd20700] Init CYCLE 020
Thread [7efc93fff700] Init CYCLE 041
Thread [7efcb5533700] Shutdown CHILD 001
Thread [7efcb3d30700] Init CYCLE 004
Thread [7efc9951b700] Init CYCLE 039
Thread [7efca6515700] Init CHILD 031
Thread [7efcb5d34700] Shutdown CYCLE 000
Thread [7efca6d16700] Init CHILD 030
Thread [7efc937fe700] Init CYCLE 042
Thread [7efcad523700] Init CYCLE 017
Thread [7efcac521700] Init CYCLE 019
Thread [7efcb152b700] Init CYCLE 009
Thread [7efcacd22700] Init CYCLE 018
Thread [7efca1ffb700] Shutdown CYCLE 032
Thread [7efcaa51d700] Init CYCLE 024
Thread [7efca9d1c700] Init CYCLE 025
Thread [7efc9b51f700] Init CYCLE 021
Thread [7efca951b700] Init CYCLE 026
Thread [7efcb4d32700] Init CHILD 002
Thread [7efca27fc700] Init CYCLE 031
Thread [7efca8d1a700] Init CYCLE 027
Thread [7efca37fe700] Init CYCLE 029
Thread [7efca3fff700] Init CYCLE 028
Thread [7efcb4531700] Init CHILD 003
Thread [7efca2ffd700] Init CYCLE 030
Thread [7efc9ad1e700] Init CYCLE 036
Thread [7efcb3d30700] Init CHILD 004
Thread [7efc9a51d700] Init CYCLE 037
Thread [7efcb352f700] Init CHILD 005
Thread [7efc9bfff700] Init CYCLE 035
Thread [7efcb2d2e700] Init CHILD 006
Thread [7efca17fa700] Shutdown CYCLE 033
Thread [7efc99d1c700] Init CYCLE 038
Thread [7efcab51f700] Init CYCLE 022
Thread [7efcb352f700] Init CYCLE 005
Thread [7efcb0d2a700] Init CHILD 010
Thread [7efcb252d700] Init CHILD 007
Thread [7efcb252d700] Init CYCLE 007
Thread [7efcb152b700] Init CHILD 009
Thread [7efcb1d2c700] Init CHILD 008
Thread [7efcb0d2a700] Init CYCLE 010
Thread [7efcaed26700] Init CHILD 014
Thread [7efcb0529700] Init CHILD 011
Thread [7efcacd22700] Init CHILD 018
Thread [7efcab51f700] Init CHILD 021
Thread [7efca8d1a700] Init CHILD 026
Thread [7efcaf527700] Init CHILD 013
Thread [7efcac521700] Init CHILD 019
Thread [7efcad523700] Init CHILD 017
Thread [7efcaad1e700] Init CHILD 022
Thread [7efca951b700] Init CHILD 025
Thread [7efca8519700] Init CHILD 027
Thread [7efcafd28700] Init CHILD 012
Thread [7efcabd20700] Init CHILD 020
Thread [7efca7d18700] Init CHILD 028
Thread [7efca7517700] Init CHILD 029
Thread [7efc9e7fc700] Init CHILD 039
Thread [7efcadd24700] Init CHILD 016
Thread [7efcaa51d700] Init CHILD 023
Thread [7efcadd24700] Init CYCLE 016
Thread [7efc9ffff700] Init CHILD 036
Thread [7efc9effd700] Init CHILD 038
Thread [7efcaad1e700] Init CYCLE 023
Thread [7efca9d1c700] Init CHILD 024
Thread [7efca4d12700] Init CHILD 035
Thread [7efcae525700] Init CHILD 015
Thread [7efc9dffb700] Init CHILD 040
Thread [7efc9d7fa700] Init CHILD 041
Thread [7efc9f7fe700] Init CHILD 037
Thread [7efc1bfff700] Init CHILD 042
Thread [7efc1b7fe700] Init CHILD 043
Thread [7efc1affd700] Init CHILD 044
Thread [7efc1a7fc700] Init CHILD 045
Thread [7efc197fa700] Init CHILD 047
Thread [7efc19ffb700] Init CHILD 046
Thread [7efc18ff9700] Init CHILD 048
Thread [7efbebfff700] Init CHILD 049
Thread [7efbeb7fe700] Init CHILD 050
Thread [7efbeaffd700] Init CHILD 051
Thread [7efbea7fc700] Init CHILD 052
Thread [7efbe9ffb700] Init CHILD 053
Thread [7efbe97fa700] Init CHILD 054
Thread [7efbe8ff9700] Init CHILD 055
Thread [7efbc3fff700] Init CHILD 057
Thread [7efbcbfff700] Init CHILD 056
Thread [7efbcb7fe700] Init CHILD 058
Thread [7efbcaffd700] Init CHILD 059
Thread [7efbca7fc700] Init CHILD 060
Thread [7efbc9ffb700] Init CHILD 061
Thread [7efbc97fa700] Init CHILD 062
Thread [7efc9cd12700] Init CHILD 034
Thread [7efca0ff9700] Init CYCLE 034
Thread [7efbc8ff9700] Init CHILD 063
Thread [7efcb2d2e700] Init CYCLE 006
CPU #044 @ 3693.07 MHz
CPU #045 @ 3693.08 MHz
Thread [7efc927fc700] Init CYCLE 044
CPU #046 @ 3693.08 MHz
CPU #047 @ 3693.08 MHz
Thread [7efc917fa700] Init CYCLE 046
Thread [7efc91ffb700] Init CYCLE 045
Thread [7efc90ff9700] Init CYCLE 047
CPU #048 @ 3693.08 MHz
Thread [7efc4bfff700] Init CYCLE 048
CPU #049 @ 3693.08 MHz
CPU #050 @ 3693.08 MHz
Thread [7efc4b7fe700] Init CYCLE 049
Thread [7efc4affd700] Init CYCLE 050
CPU #051 @ 3693.08 MHz
CPU #052 @ 3693.08 MHz
Thread [7efc4a7fc700] Init CYCLE 051
CPU #053 @ 3693.08 MHz
Thread [7efc49ffb700] Init CYCLE 052
Thread [7efc497fa700] Init CYCLE 053
CPU #054 @ 3693.08 MHz
CPU #055 @ 3693.08 MHz
Thread [7efc48ff9700] Init CYCLE 054
Thread [7efc43fff700] Init CYCLE 055
CPU #056 @ 3693.08 MHz
CPU #057 @ 3693.08 MHz
Thread [7efc437fe700] Init CYCLE 056
CPU #058 @ 3693.07 MHz
Thread [7efc42ffd700] Init CYCLE 057
Thread [7efc427fc700] Init CYCLE 058
CPU #059 @ 3693.09 MHz
Thread [7efc41ffb700] Init CYCLE 059
CPU #060 @ 3693.07 MHz
CPU #061 @ 3693.08 MHz
Thread [7efc417fa700] Init CYCLE 060
CPU #062 @ 3693.08 MHz
Thread [7efc40ff9700] Init CYCLE 061
Thread [7efc3bfff700] Init CYCLE 062
CPU #063 @ 3693.07 MHz
Thread [7efc3b7fe700] Init CYCLE 063

`

./corefreq-cli -k

Linux:
|- Release                                                    [5.4.0-94-generic]
|- Version                         [#106-Ubuntu SMP Thu Jan 6 23:58:14 UTC 2022]
|- Machine                                                              [x86_64]
Memory:
|- Total RAM                                                                0 KB
|- Shared RAM                                                               0 KB
|- Free RAM                                                                 0 KB
|- Buffer RAM                                                               0 KB
|- Total High                                                               0 KB
|- Free High                                                                0 KB
CPU-Freq driver                                               [    acpi-cpufreq]
Governor                                                      [     performance]
CPU-Idle driver                                               [  corefreqk-idle]
|- Idle Limit                                                 <              C2>
   |- State        POLL      C1      C2      C3      C4      C5      C6
   |-           CPUIDLE  I/O-C1  I/O-C2  I/O-C3  I/O-C4  I/O-C5  I/O-C6
   |- Power          -1       0       0       0       0       0       0
   |- Latency         0       1      20      40      60      80     100
   |- Residency       0       2      40      80     120     160     200

Please advise on what could be wrong and how to fix it.

Thanks.

cyring commented 2 years ago

Hello @connebeest

I can see some threads shutting down right after startup. Monitoring is therefore frozen, but you can still send commands from the UI.

Have you set up isolcpus or cpusets on your system?

connebeest commented 2 years ago

Hi @cyring,

Yes, that is correct, we are using isolcpus, here is our GRUB cmdline:

/boot/vmlinuz-5.4.0-94-generic root=UUID=6a72aa38-0163-4252-b2c3-b2e33d3ad986 ro recovery nomodeset dis_ucode_ldr amd_iommu=on iommu=pt modprobe.blacklist=nouveau isolcpus=4-31,36-63 nohz=on nohz_full=4-31,36-63 modprobe.blacklist=acpi_cpufreq idle=halt tsc=unstable libata.fua=1 swapaccount=1 vga=normal nofb nomodeset video=vesafb:off i915.modeset=0 systemd.unified_cgroup_hierarchy=0 systemd.legacy_systemd_cgroup_controller=1

We added modprobe.blacklist=acpi_cpufreq idle=halt tsc=unstable for testing, because we found this in the CoreFreq Q&A. This does not fix the issue though.
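For anyone wanting to see exactly which CPUs a command line like the one above isolates, here is a minimal shell sketch. The function names `isolcpus_value` and `expand_cpu_list` are hypothetical helpers, not part of CoreFreq, and the sketch assumes a plain numeric list such as `4-31,36-63` without optional flags in front of it.

```shell
#!/bin/sh
# Hypothetical helpers (not part of CoreFreq).

# Print the value of the first isolcpus= parameter found in the given string.
isolcpus_value() {
    for word in $1; do
        case "$word" in
            isolcpus=*) printf '%s\n' "${word#isolcpus=}"; return 0 ;;
        esac
    done
    return 1
}

# Expand a CPU list such as "4-31,36-63" into one CPU number per line.
# Assumes purely numeric entries (single CPUs or start-end ranges).
expand_cpu_list() {
    old_ifs=$IFS
    IFS=,
    for range in $1; do
        start=${range%-*}
        end=${range#*-}
        i=$start
        while [ "$i" -le "$end" ]; do
            printf '%s\n' "$i"
            i=$((i + 1))
        done
    done
    IFS=$old_ifs
}
```

On a running system it could be used as `expand_cpu_list "$(isolcpus_value "$(cat /proc/cmdline)")"`.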

Also want to note that we are using isolcpus on a different system, where CoreFreq works perfectly, though it is an older version, 1.76.1. However that system is Intel-based, and running CentOS 7.

Hope this helps. Happy to provide any other useful info.

Thanks!

cyring commented 2 years ago

Isolation

Unfortunately, that is issue #326, which I have not solved yet.

The workaround is to disable the unavailable CPUs from the UI, that is, those reserved by isolation.

  1. Press the shortcut # or go to the menu HotPlug CPU.

  2. Toggle OFF each CPU affected by isolation.

  3. As soon as the CoreFreq map is aligned with the kernel isolation map, monitoring will engage.
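To see the kernel's side of that comparison, the standard sysfs files `online`, `offline` and `isolated` under `/sys/devices/system/cpu` can be printed together. This is only a sketch: the function name `kernel_cpu_maps` is hypothetical, and whether these are exactly the maps CoreFreq compares against is an assumption.

```shell
#!/bin/sh
# Hypothetical helper (not part of CoreFreq).
# Print selected CPU maps from a sysfs-style directory.
# The optional argument defaults to the real sysfs location.
kernel_cpu_maps() {
    cpu_dir=${1:-/sys/devices/system/cpu}
    for map in online offline isolated; do
        # Skip maps that this kernel does not expose.
        if [ -r "$cpu_dir/$map" ]; then
            printf '%-9s %s\n' "$map" "$(cat "$cpu_dir/$map")"
        fi
    done
}
```

Calling `kernel_cpu_maps` with no argument reads the live system. The CPUs listed on the `isolated` line are the candidates to toggle OFF in the UI.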

CoreFreq as the Clock Source, CPU Freq and CPU Idle driver

This Wiki page also provides instructions for setting up this use case.

connebeest commented 2 years ago

Hi @cyring

Thanks for the feedback. The strange thing is that CoreFreq is also not reporting anything for the CPUs that are not isolated, which are 0-3. Technically 2 & 3 are isolated using cgroups, but 0 & 1 are definitely not. Also, is this an issue specific to AMD CPUs and/or Ubuntu? Because, as I said, I do not have this issue on a different system:

Also want to note that we are using isolcpus on a different system, where CoreFreq works perfectly, though it is an older version, 1.76.1. However that system is Intel-based, and running CentOS 7.

Thanks

cyring commented 2 years ago

Thanks for your various tests.

I'm still working out how the kernel handles isolation, but I suspect it structurally changes CPU management. So far I have mastered Hot-Plugging under basic OS usage. The Hot-Plugging source code is probably affected by isolation.

Still more to study...

For evaluating your hardware, I provide images [ 1 ] [ 2 ] with all the prerequisites for CoreFreq. These automated live CDs should help you check whether your processors are well supported.

cyring commented 2 years ago

Duplicate of #326

connebeest commented 2 years ago

Hi @cyring,

Apologies for the delayed response, and thanks for the info.

We got it working in the end. It has to do with the cgroups being used on the system; isolcpus has no effect on CoreFreq. We get it to work with cgroups by running the CoreFreq daemon and CLI in the root cpuset cgroup, using these commands:

cgexec -g cpuset:/ ./corefreqd -d
cgexec -g cpuset:/ ./corefreq-cli

Perhaps this also helps for #326.
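A quick way to confirm that a process really landed in the root cpuset is to read its `/proc/<pid>/cgroup` and `/proc/<pid>/status` files. The sketch below assumes the cgroup v1 layout that the command line above suggests (`systemd.unified_cgroup_hierarchy=0`) and a cpuset controller mounted on its own hierarchy. The function names `allowed_cpus` and `cpuset_cgroup` are hypothetical.

```shell
#!/bin/sh
# Hypothetical helpers (not part of CoreFreq).

# Print the Cpus_allowed_list field from a /proc/<pid>/status style file.
allowed_cpus() {
    status_file=${1:-/proc/self/status}
    sed -n 's/^Cpus_allowed_list:[[:space:]]*//p' "$status_file"
}

# Print the cpuset cgroup path from a /proc/<pid>/cgroup style file.
# Assumes cgroup v1 lines of the form "<id>:cpuset:<path>".
cpuset_cgroup() {
    cgroup_file=${1:-/proc/self/cgroup}
    sed -n 's/^[0-9]*:cpuset:\(.*\)$/\1/p' "$cgroup_file"
}
```

For example, `pid=$(pidof corefreqd); cpuset_cgroup "/proc/$pid/cgroup"; allowed_cpus "/proc/$pid/status"`. When the daemon was started through `cgexec -g cpuset:/`, the first command would be expected to print `/`.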

One thing we noticed, though: if you run perf top after starting CoreFreq with the above commands, it crashes the server. I'm not sure whether that is specific to our situation or a more general problem. If we just run VMs on the system (which is our use case), it does not crash.

cyring commented 2 years ago

Oh, thank you for the cgroup tip.

About the perf top issue: it is linked to all symbols and my driver. See issue #214. I have not solved it yet, but the workaround is to start perf top before loading corefreqk.ko.
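The ordering in that workaround can be guarded with a small wrapper that refuses to launch perf top while the driver is already loaded. This is a sketch only: `module_loaded` and `safe_perf_top` are hypothetical names, and the check simply looks for a `corefreqk` entry in `/proc/modules`.

```shell
#!/bin/sh
# Hypothetical helpers (not part of CoreFreq or perf).

# Return success if the named module appears in a /proc/modules style file.
module_loaded() {
    modules_file=${2:-/proc/modules}
    grep -q "^$1 " "$modules_file"
}

# Launch perf top only if corefreqk is not loaded yet.
# The optional argument overrides the modules file (useful for testing).
safe_perf_top() {
    if module_loaded corefreqk "${1:-/proc/modules}"; then
        echo "corefreqk is loaded: start perf top before loading corefreqk.ko" >&2
        return 1
    fi
    perf top
}
```

With this in place, `safe_perf_top` would be run first and corefreqk.ko loaded afterwards from another terminal.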