cyring / CoreFreq

CoreFreq : CPU monitoring and tuning software designed for 64-bit processors.
https://www.cyring.fr
GNU General Public License v2.0

Zen/Whitehaven - Inverted TDP values #236

Closed h1z1 closed 2 years ago

h1z1 commented 3 years ago

Just noticed this, not sure if it's a bug in Corefreq or platform?

[image: cf]

cyring commented 3 years ago

> Just noticed this, not sure if it's a bug in Corefreq or platform?

Not a bug but experimental with your Zen model: https://github.com/cyring/CoreFreq/blob/717e444e89caa0c74038f9c0d09ae68b725007cd/corefreqk.c#L5682

So these SMU addresses only work with Zen 2 & 3: https://github.com/cyring/CoreFreq/blob/717e444e89caa0c74038f9c0d09ae68b725007cd/corefreqk.c#L5618

Zen1: I believe TDP has to be queried from the PM table using the mailbox protocol. Do you know of any single SMU address to read your TDP from?

What about the others (Min, Max, PPT, EDC, TDC): can you confirm their values are also wrong?

h1z1 commented 3 years ago

No idea tbh. Wasn't aware TDP could be read, thought it was inferred and even that varied with vendors, implementations and phases of the moon. AMD goes so far as to conflate Thermal TDP with Electrical. GN did an excellent job covering it iirc.

cyring commented 3 years ago

Hello,

Can you please show me what values you are getting from this project?

https://gitlab.com/CyrIng/ryzen_smu

h1z1 commented 3 years ago

Hi, sorry not sure why this didn't send a notification... hmm

Which values are you looking for, from syslog?

```
[1343745.127347] ryzen_smu: CPUID: family 0x17, model 0x1, stepping 0x1, package 0x7
[1343745.127917] ryzen_smu: SMU v4.25.118.0
[1343745.128734] ryzen_smu: SMU v4.25.118.0
[1343745.128759] sysfs: cannot create duplicate filename '/kernel/ryzen_smu_drv'
[1343745.128761] CPU: 8 PID: 19460 Comm: insmod Tainted: P           O      5.4.70 #1
[1343745.128763] Hardware name: Gigabyte Technology Co., Ltd. X399 AORUS Gaming 7/X399 AORUS Gaming 7, BIOS F12 12/11/2019
```

```
/sys/kernel/ryzen_smu_drv/codename:05
/sys/kernel/ryzen_smu_drv/drv_version:0.1.0
/sys/kernel/ryzen_smu_drv/mp1_if_version:0
/sys/kernel/ryzen_smu_drv/version:4.25.118.0
```
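
For readers unfamiliar with the `family`, `model` and `stepping` fields in the log line above: they follow the usual decoding of the EAX value returned by CPUID leaf 1. The sketch below shows that decoding using the AMD convention, where the extended fields apply only when the base family is 0xF. The function and struct names are made up for this example, and the raw EAX input is an illustrative value chosen because it decodes to the logged fields; it was not taken from the reporter's machine.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    unsigned family;
    unsigned model;
    unsigned stepping;
} cpu_signature;

/* Decode the EAX value of CPUID leaf 1 (AMD convention). */
cpu_signature decode_cpuid_signature(uint32_t eax)
{
    unsigned base_family = (eax >> 8) & 0xF;
    unsigned base_model  = (eax >> 4) & 0xF;
    unsigned ext_family  = (eax >> 20) & 0xFF;
    unsigned ext_model   = (eax >> 16) & 0xF;
    cpu_signature sig;

    sig.stepping = eax & 0xF;
    /* Extended family/model only contribute when the base family is 0xF. */
    if (base_family == 0xF) {
        sig.family = base_family + ext_family;
        sig.model  = (ext_model << 4) | base_model;
    } else {
        sig.family = base_family;
        sig.model  = base_model;
    }
    return sig;
}

/* Usage: an illustrative EAX value that decodes to the fields in the log. */
void example_usage(void)
{
    cpu_signature sig = decode_cpuid_signature(0x00800F11u);
    assert(sig.family == 0x17 && sig.model == 0x1 && sig.stepping == 0x1);
}
```

Note that family and model alone do not separate the desktop, Threadripper and server parts of this generation, which is presumably why the driver also logs a package value.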

cyring commented 3 years ago

> Which values are you looking for, from syslog?

Because ryzen_smu implements the mailbox protocol, I'm interested in the TDP, PPT, EDC and TDC values. In fact, the whole output from its monitor UI would be fine.

But apparently you are having trouble starting it up?

h1z1 commented 3 years ago

Not sure what you mean by start up. Module appeared to load despite the error though none of the utils run...

```
# ./userspace/monitor_cpu
rd_buf: 0.1.0
PM Tables are not supported on this platform.
# ./scripts/test.py
Failed to write SMU arguments
# ./scripts/
cpuid.py          dump_pm_table.py  monitor_cpu.py    __pycache__/      read_dump.py      test.py
# ./scripts/monitor_cpu.py
PM Table: Unsupported
# ./scripts/dump_pm_table.py
PM Tables are not supported for this model of processor.
#
```

Edit: Just realized what you mean, some sysfs files did not in fact get created in ryzen_smu_drv, pm_table being one.

```
drwxr-xr-x  2 root root    0 May 19 01:27 .
drwxr-xr-x 14 root root    0 May  3 11:35 ..
-r--------  1 root root 4096 May 19 01:27 codename
-r--------  1 root root 4096 May 19 01:27 drv_version
-r--------  1 root root 4096 May 19 01:27 mp1_if_version
-rw-------  1 root root 4096 May 19 01:27 mp1_smu_cmd
-rw-------  1 root root 4096 May 19 00:52 rsmu_cmd
-rw-------  1 root root 4096 May 19 01:27 smn
-rw-------  1 root root 4096 May 19 01:30 smu_args
-r--------  1 root root 4096 May 19 01:27 version
```

Couple lines were truncated above too sigh

```
[1343745.127347] ryzen_smu: CPUID: family 0x17, model 0x1, stepping 0x1, package 0x7
[1343745.127917] ryzen_smu: SMU v4.25.118.0
[1343745.128734] ryzen_smu: SMU v4.25.118.0
[1343745.128759] sysfs: cannot create duplicate filename '/kernel/ryzen_smu_drv'
[1343745.128789]  ryzen_smu_probe+0x114/0x390 [ryzen_smu]
[1343745.128812]  ryzen_smu_driver_init+0x23/0x1000 [ryzen_smu]
[1343745.128841] kobject_add_internal failed for ryzen_smu_drv with -EEXIST, don't try to register things with the same name in the same directory.
[1343745.128843] ryzen_smu: Unable to create sysfs interface
[1343745.128846] ryzen_smu: probe of 0000:40:00.0 failed with error -12
```

cyring commented 3 years ago

> Not sure what you mean by start up. Module appeared to load despite the error though none of the utils run...

With my 3950X I have to force-load the software with the `-f` option. Make sure no other instance of its driver is running and that no other SMU software is already loaded.
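
One way to check for an already-loaded instance before calling `insmod` is to look for the module name in `/proc/modules`, where each line starts with the name of a loaded module. The following is only a sketch of that check; the function name `module_listed` is made up for this example and is not part of ryzen_smu or CoreFreq.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Return 1 if `name` appears as the first field of any line in a stream
 * formatted like /proc/modules, 0 otherwise.
 */
int module_listed(FILE *modules, const char *name)
{
    char chunk[512];
    char first[64];
    int at_line_start = 1;

    assert(modules != NULL && name != NULL);

    while (fgets(chunk, sizeof chunk, modules) != NULL) {
        size_t len = strlen(chunk);

        /* Only compare the first token of a real line, not of a continuation
         * chunk produced when a line is longer than the buffer. */
        if (at_line_start &&
            sscanf(chunk, "%63s", first) == 1 &&
            strcmp(first, name) == 0)
            return 1;

        at_line_start = (len > 0 && chunk[len - 1] == '\n');
    }
    return 0;
}
```

On a live system the stream would come from `fopen("/proc/modules", "r")`, and a result of 1 for `ryzen_smu` would suggest unloading the existing instance with `rmmod` before loading another copy.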

insmod ryzen_smu.ko

ryzen_smu: CPUID: family 0x17, model 0x71, stepping 0x0, package 0x2

./userspace/monitor_cpu -f

╭────────────────────────────────────────────────┬─────────────────────────────────────────────────╮
│                                      CPU Model │             AMD Ryzen 9 3950X 16-Core Processor │
│                            Processor Code Name │                                         Matisse │
│                                          Cores │                                              16 │
│                                      Core CCDs │                                               2 │
│                                      Core CCXs │                                               4 │
│                                  Cores Per CCX │                                               4 │
│                                 SMU FW Version │                                        v46.63.0 │
│                                 MP1 IF Version │                                             v11 │
╰────────────────────────────────────────────────┴─────────────────────────────────────────────────╯
╭─────────┬────────────────┬─────────┬─────────┬─────────┬─────────────┬─────────────┬─────────────╮
│  Core 0 │          1 MHz | 0.003 W | 0.200 V | 30.51 C | C0:   0.0 % | C1:   0.0 % | C6: 100.0 % │
│  Core 1 │          0 MHz | 0.002 W | 0.200 V | 30.32 C | C0:   0.0 % | C1:   0.0 % | C6: 100.0 % │
│  Core 2 │          3 MHz | 0.004 W | 0.201 V | 30.57 C | C0:   0.1 % | C1:   0.0 % | C6:  99.9 % │
│  Core 3 │          1 MHz | 0.003 W | 0.200 V | 30.45 C | C0:   0.0 % | C1:   0.0 % | C6: 100.0 % │
│  Core 4 │          3 MHz | 0.004 W | 0.201 V | 30.43 C | C0:   0.1 % | C1:   0.0 % | C6:  99.9 % │
│  Core 5 │          5 MHz | 0.006 W | 0.201 V | 30.52 C | C0:   0.1 % | C1:   0.0 % | C6:  99.9 % │
│  Core 6 │          0 MHz | 0.003 W | 0.200 V | 30.50 C | C0:   0.0 % | C1:   0.0 % | C6: 100.0 % │
│  Core 7 │          1 MHz | 0.003 W | 0.200 V | 30.42 C | C0:   0.0 % | C1:   0.0 % | C6: 100.0 % │
│  Core 8 │          2 MHz | 0.004 W | 0.201 V | 31.41 C | C0:   0.1 % | C1:   0.0 % | C6:  99.9 % │
│  Core 9 │          2 MHz | 0.008 W | 0.200 V | 31.59 C | C0:   0.1 % | C1:   0.0 % | C6:  99.9 % │
│ Core 10 │          1 MHz | 0.004 W | 0.200 V | 31.40 C | C0:   0.0 % | C1:   0.0 % | C6: 100.0 % │
│ Core 11 │          0 MHz | 0.002 W | 0.200 V | 31.41 C | C0:   0.0 % | C1:   0.0 % | C6: 100.0 % │
│ Core 12 │          2 MHz | 0.002 W | 0.201 V | 31.56 C | C0:   0.1 % | C1:   0.0 % | C6:  99.9 % │
│ Core 13 │         26 MHz | 0.059 W | 0.204 V | 31.77 C | C0:   0.6 % | C1:   0.1 % | C6:  99.4 % │
│ Core 14 │          2 MHz | 0.003 W | 0.201 V | 31.61 C | C0:   0.1 % | C1:   0.0 % | C6:  99.9 % │
│ Core 15 │         12 MHz | 0.013 W | 0.202 V | 31.63 C | C0:   0.3 % | C1:   0.1 % | C6:  99.7 % │
╰─────────┴────────────────┴─────────┴─────────┴─────────┴─────────────┴─────────────┴─────────────╯
╭────────────────────────────────────────────────┬─────────────────────────────────────────────────╮
│                            Peak Core Frequency │                                          26 MHz │
│                               Peak Temperature │                                         45.25 C │
│                                  Package Power │                                       16.8100 W │
│                           Peak Core(s) Voltage │                                      0.255874 V │
│                           Average Core Voltage │                                      0.200765 V │
│                                    Package CC6 │                                     92.028168 % │
│                                       Core CC6 │                                     99.890877 % │
╰────────────────────────────────────────────────┴─────────────────────────────────────────────────╯
╭────────────────────────────────────────────────┬─────────────────────────────────────────────────╮
│                         Thermal Junction Limit │                                         95.00 C │
│                            Current Temperature │                                         32.13 C │
│                                SoC Temperature │                                         28.75 C │
│                                     Core Power │                                        0.1460 W │
│                                      SoC Power │                8.3634 W | 7.6601 A | 1.092350 V │
│                                            PPT │             16.8060 W |     142 W  |    11.84 % │
│                                            TDC │              0.1376 A |      95 A  |     0.14 % │
│                                            EDC │              0.1376 A |     140 A  |     0.10 % │
│                                      FIT Limit │                                      0.243201 % │
╰────────────────────────────────────────────────┴─────────────────────────────────────────────────╯
╭────────────────────────────────────────────────┬─────────────────────────────────────────────────╮
│                                   Coupled Mode │                                              ON │
│                         Fabric Clock (Average) │                                         293 MHz │
│                                   Fabric Clock │                                        1833 MHz │
│                                   Uncore Clock │                                        1833 MHz │
│                                   Memory Clock │                                        1833 MHz │
│                                      VDDCR_Mem │                                        6.5007 W │
│                                      VDDCR_SoC │                                        1.1000 V │
│                                      cLDO_VDDM │                                        0.9504 V │
│                                      cLDO_VDDP │                                        0.9976 V │
│                                      cLDO_VDDG │                                        1.0477 V │
╰────────────────────────────────────────────────┴─────────────────────────────────────────────────╯
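
The third column of the PPT, TDC and EDC rows above appears to be the measured value expressed as a percentage of its limit. A minimal sketch of that arithmetic follows, assuming this interpretation is right; the helper name is hypothetical and not taken from the monitor's source.

```c
#include <assert.h>

/* Share of a limit currently in use, in percent (e.g. PPT watts vs. PPT limit). */
double limit_usage_percent(double value, double limit)
{
    assert(limit > 0.0);  /* a zero or negative limit has no meaningful percentage */
    return 100.0 * value / limit;
}
```

Using the figures from the table, 16.8060 W against a 142 W limit gives about 11.84 %, 0.1376 A against 95 A gives about 0.14 %, and 0.1376 A against 140 A gives about 0.10 %, which matches the displayed column.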

h1z1 commented 3 years ago

No change with -f. Doesn't appear that it supports Threadripper? There is no case for it in smu_init

cyring commented 3 years ago

> No change with -f. Doesn't appear that it supports Threadripper? There is no case for it in smu_init

Perhaps you will have better luck with the original ryzen_smu project?

h1z1 commented 3 years ago

* [`CODENAME_THREADRIPPER`](https://gitlab.com/CyrIng/ryzen_smu/-/blob/smu_debug/smu.c#L212)

Neither of the references I could find actually initializes it though: `int smu_resolve_cpu_class`

`smu_init` has no case:

```c
    switch (g_smu.codename) {
        case CODENAME_SUMMITRIDGE:
        case CODENAME_THREADRIPPER:
        case CODENAME_PINNACLERIDGE:
            g_smu.addr_rsmu_mb_cmd  = 0x3B1051C;
            g_smu.addr_rsmu_mb_rsp  = 0x3B10568;
            g_smu.addr_rsmu_mb_args = 0x3B10590;
            goto LOG_RSMU;
```

cyring commented 3 years ago

Hello,

The latest develop branch brings the HSMP mailbox, which is specified for AMD Family 19h Model 01h. Fortunately, it works with my 3950X. See this discussion.

I would like to know if it works with your Threadripper?

h1z1 commented 3 years ago

Hi

Would depend on what you mean by works :) TDP is reported as missing now

[image: 236-02]

FWIW, AMD as part of their boosting intentionally over-reports temperatures so the OS/BIOS preemptively increases cooling.

```
   Thermal Design Power       Platform   [Disable]
   Power Limit                    PL1   [Missing]
   Power Limit                    PL2   [Missing]
   Package Power Tracking          PPT   [Missing]
   Electrical Design Current       EDC   [Missing]
   Thermal Design Current          TDC   [Missing]
```

cyring commented 3 years ago

Sorry, I forgot to add that you need to start the driver in Experimental mode.

```
insmod corefreqk.ko Experimental=1
```

h1z1 commented 3 years ago

Eeep thought I did but nope. Sorry.

Values still appear backward though

[image: blah]

cyring commented 3 years ago

I really can't find any good registers for the Zen1 and Zen+ generations of Ryzen, Threadripper and EPYC.

I was hoping that the HSMP protocol to the SMU would provide a universal means to query those power values. Unfortunately, HSMP is only specified in the Zen3 PPR documentation. I'm lucky it works with Zen2/Matisse; however, I still have not received any report from CoreFreq users who own other APU, TR or EPYC parts of generations 2 and 3.

It's up to you whether to close this issue, but I have to admit that I'm hitting the limits of the AMD public specifications.

cyring commented 3 years ago

I'm still researching the subject: do you see any option in your BIOS to activate HSMP, or names like BMC?

cyring commented 3 years ago

According to this readme, https://github.com/AMDESE/libhsmp/commit/a3bb6efac327645a18835b2efcc542a12d49012c, can you confirm the HSMP state?

cyring commented 3 years ago

[image: 113583]

h1z1 commented 3 years ago

I don't see anything about those but then again it's not exactly a server board (especially wrt the BMC).

cyring commented 3 years ago

I have updated the BIOS to 3703 and I now see a value change in PL2:

[image: CoreFreq_C8HW_3703]

h1z1 commented 3 years ago

Which motherboard is that with?

cyring commented 3 years ago

> Which motherboard is that with?

This 3950X setup: https://github.com/cyring/CoreFreq/wiki/CoreFreq-Lab

cyring commented 3 years ago

Today I tried Ryzen Master (Windows) and learned from it that 395 W is the maximum PPT power.

[image: RM-3307-MAX-PWR-395-2021-08-15]

This confirms for my 3950X that PL2 = 395 W is the value read from HSMP when PPT is left on AUTO in the BIOS.

Your Whitehaven should have other Max value(s) when PPT is AUTO. You may be able to read them using Ryzen Master. Please let me know what you get.

cyring commented 2 years ago

> Just noticed this, not sure if it's a bug in Corefreq or platform?

I have noticed that with PBO enabled and Power Limits set to AUTO, my 3950X shows specific constant values: 395 W for both PL1 and PL2.

[image: 2021-09-17-011103_413x380_scrot]

I wonder if you are getting the same behavior but with different constants?

cyring commented 2 years ago

HSMP is now a per model feature.

h1z1 commented 2 years ago

Still haven't had a chance to install Windows on these boards to test, sorry. With the latest head, though, they appear the same?

[image: blarg2]

h1z1 commented 2 years ago

Was chasing some crashes that I'm not entirely sure are directly related to corefreq or not, but I can confirm at least that no amount of disabling C6/power states appears to have any impact. Values are still inverted shrug

The other crashes happen with kernel tracing and corefreq. It's unfortunately a hard lock and only appears to happen in X. Maybe a race somewhere

cyring commented 2 years ago

> Was chasing some crashes that I'm not entirely sure are directly related to corefreq or not, but I can confirm at least that no amount of disabling C6/power states appears to have any impact. Values are still inverted shrug
>
> The other crashes happen with kernel tracing and corefreq. It's unfortunately a hard lock and only appears to happen in X. Maybe a race somewhere

h1z1 commented 2 years ago

Don't think so, from what I gather of variorum and a tool from AMD. NGL, Corefreq runs circles around both. https://github.com/amd/esmi_oob_library looks neat too but limited.

cyring commented 2 years ago

> Don't think so, from what I gather of variorum and a tool from AMD. NGL, Corefreq runs circles around both. https://github.com/amd/esmi_oob_library looks neat too but limited.

Can you please post the output you're getting from those tools?