cyring / CoreFreq

CoreFreq : CPU monitoring and tuning software designed for 64-bit processors.
https://www.cyring.fr
GNU General Public License v2.0

[AMD][Zen] SMU > Data Fabric > UMC #196

Closed cyring closed 3 years ago

cyring commented 4 years ago

Development notes

2020-07-15

[  340.567031] CoreFreq(12:28): Processor [ 8F_71] Architecture [Zen2/Matisse] SMT [32/32]
[  340.572209] Welcome to the Data Fabric UMC(0) @ 0x00050000:
               0x030[0x00150508] 0x080[0x00000000] 0x100[0x80000200]
               0x104[0xb040808b] 0x14c[0x00000000]
               0xdf0[0x00010030] 0xdf4[0x00000000]
[  340.572217] Welcome to the Data Fabric UMC(1) @ 0x00150000:
               0x030[0x00150508] 0x080[0x00000000] 0x100[0x80000200]
               0x104[0xb040808b] 0x14c[0x00000000]
               0xdf0[0x00010030] 0xdf4[0x00000000]
[  340.572218] CHA[0]   CHIP_BAR[0][0]=0x00050000 CHIP_BAR[0][1]=0x00050020
                        CHIP_BAR[1][0]=0x00050010 CHIP_BAR[1][1]=0x00050028
[  340.572219] CHA[0] CHIP[0:0] @ 0x00050000[0x00000000] Disable
[  340.572221] CHA[0] MASK[0:0] @ 0x00050020[0x00000000]
[  340.572222] CHA[0] CHIP[0:1] @ 0x00050010[0x00000000] Disable
[  340.572223] CHA[0] MASK[0:1] @ 0x00050028[0x00000000]
[  340.572225] CHA[0] CHIP[1:0] @ 0x00050004[0x00000000] Disable
[  340.572226] CHA[0] MASK[1:0] @ 0x00050020[0x00000000]
[  340.572227] CHA[0] CHIP[1:1] @ 0x00050014[0x00000000] Disable
[  340.572229] CHA[0] MASK[1:1] @ 0x00050028[0x00000000]
[  340.572230] CHA[0] CHIP[2:0] @ 0x00050008[0x00000001] Enable
[  340.572232] CHA[0] MASK[2:0] @ 0x00050024[0x03fffdfe] ChipSize[8388608]
[  340.572233] CHA[0] CHIP[2:1] @ 0x00050018[0x00000000] Disable
[  340.572235] CHA[0] MASK[2:1] @ 0x0005002c[0x00000000]
[  340.572236] CHA[0] CHIP[3:0] @ 0x0005000c[0x00000201] Enable
[  340.572238] CHA[0] MASK[3:0] @ 0x00050024[0x03fffdfe] ChipSize[8388608]
[  340.572239] CHA[0] CHIP[3:1] @ 0x0005001c[0x00000000] Disable
[  340.572240] CHA[0] MASK[3:1] @ 0x0005002c[0x00000000]
[  340.572241] Memory Size[16777216 KB] [16384 MB]
[  340.572242] CHA[1]   CHIP_BAR[0][0]=0x00150000 CHIP_BAR[0][1]=0x00150020
                        CHIP_BAR[1][0]=0x00150010 CHIP_BAR[1][1]=0x00150028
[  340.572243] CHA[1] CHIP[0:0] @ 0x00150000[0x00000000] Disable
[  340.572244] CHA[1] MASK[0:0] @ 0x00150020[0x00000000]
[  340.572246] CHA[1] CHIP[0:1] @ 0x00150010[0x00000000] Disable
[  340.572247] CHA[1] MASK[0:1] @ 0x00150028[0x00000000]
[  340.572248] CHA[1] CHIP[1:0] @ 0x00150004[0x00000000] Disable
[  340.572250] CHA[1] MASK[1:0] @ 0x00150020[0x00000000]
[  340.572251] CHA[1] CHIP[1:1] @ 0x00150014[0x00000000] Disable
[  340.572253] CHA[1] MASK[1:1] @ 0x00150028[0x00000000]
[  340.572254] CHA[1] CHIP[2:0] @ 0x00150008[0x00000001] Enable
[  340.572255] CHA[1] MASK[2:0] @ 0x00150024[0x03fffdfe] ChipSize[8388608]
[  340.572257] CHA[1] CHIP[2:1] @ 0x00150018[0x00000000] Disable
[  340.572258] CHA[1] MASK[2:1] @ 0x0015002c[0x00000000]
[  340.572260] CHA[1] CHIP[3:0] @ 0x0015000c[0x00000201] Enable
[  340.572261] CHA[1] MASK[3:0] @ 0x00150024[0x03fffdfe] ChipSize[8388608]
[  340.572262] CHA[1] CHIP[3:1] @ 0x0015001c[0x00000000] Disable
[  340.572264] CHA[1] MASK[3:1] @ 0x0015002c[0x00000000]
[  340.572264] Memory Size[16777216 KB] [16384 MB]
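The register addresses in the dump above follow a regular pattern, which can be written down as a small C sketch. The constants and the helper names (`umc_base`, `chip_addr`, `mask_addr`) are illustrative and inferred from this log only; they are not taken from CoreFreq or from AMD documentation.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: both values are read off the log above. */
#define UMC_BASE   0x00050000u  /* UMC(0) base address */
#define UMC_STRIDE 0x00100000u  /* UMC(1) appears at 0x00150000 */

/* Base address of the UMC serving a given channel. */
uint32_t umc_base(unsigned int channel)
{
    return UMC_BASE + channel * UMC_STRIDE;
}

/* Address of CHIP[chip:sec]: one 32-bit register per chip select,
 * with the second set (sec = 1) located 0x10 higher. */
uint32_t chip_addr(unsigned int channel, unsigned int chip, unsigned int sec)
{
    assert(chip < 4 && sec < 2);
    return umc_base(channel) + 4u * chip + 0x10u * sec;
}

/* Address of MASK[chip:sec]: masks start at +0x20, each one is shared
 * by a pair of chip selects, and the second set is 8 bytes higher. */
uint32_t mask_addr(unsigned int channel, unsigned int chip, unsigned int sec)
{
    assert(chip < 4 && sec < 2);
    return umc_base(channel) + 0x20u + 4u * (chip >> 1) + 8u * sec;
}
```

For example, `chip_addr(1, 2, 0)` gives 0x00150008 and `mask_addr(0, 3, 0)` gives 0x00050024, matching the `CHA[1] CHIP[2:0]` and `CHA[0] MASK[3:0]` lines.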

2020-07-14

[11986.233958] CoreFreq(9:-1): Processor [ 8F_71] Architecture [Zen2/Matisse] CPU [16/16]
[11986.235034] Welcome to the Data Fabric UMC(0) @ 50000:
               0x30[0x150508] 0x80[0x0] 0x100[0x80000200]
               0x104[0xb040808b] 0x14c[0x0]
               0xdf0[0x10030] 0xdf4[0x0]
[11986.235042] Welcome to the Data Fabric UMC(1) @ 150000:
               0x30[0x150508] 0x80[0x0] 0x100[0x80000200]
               0x104[0xb040808b] 0x14c[0x0]
               0xdf0[0x10030] 0xdf4[0x0]
[11986.235042] 0xe8@d18f3[0x0]
[11986.235043] CHA[0]   CHIP_BAR[0][0]=0x50000 CHIP_BAR[0][1]=0x50020
                        CHIP_BAR[1][0]=0x50010 CHIP_BAR[1][1]=0x50028
[11986.235044] CHA[0] CHIP[0:0] @ 0x50000[0x0]
[11986.235046] CHA[0] CHIP[0:1] @ 0x50010[0x0]
[11986.235047] CHA[0] CHIP[1:0] @ 0x50004[0x0]
[11986.235048] CHA[0] CHIP[1:1] @ 0x50014[0x0]
[11986.235050] CHA[0] CHIP[2:0] @ 0x50008[0x1]
[11986.235051] CHA[0] CHIP[2:1] @ 0x50018[0x0]
[11986.235052] CHA[0] CHIP[3:0] @ 0x5000c[0x201]
[11986.235054] CHA[0] CHIP[3:1] @ 0x5001c[0x0]
[11986.235055] CHA[0] MASK[0:0] @ 0x50020[0x0]
[11986.235056] CHA[0] MASK[0:1] @ 0x50028[0x0]
[11986.235058] CHA[0] MASK[1:0] @ 0x50024[0x3fffdfe]
[11986.235059] CHA[0] MASK[1:1] @ 0x5002c[0x0]
[11986.235060] CHA[1]   CHIP_BAR[0][0]=0x150000 CHIP_BAR[0][1]=0x150020
                        CHIP_BAR[1][0]=0x150010 CHIP_BAR[1][1]=0x150028
[11986.235061] CHA[1] CHIP[0:0] @ 0x150000[0x0]
[11986.235063] CHA[1] CHIP[0:1] @ 0x150010[0x0]
[11986.235064] CHA[1] CHIP[1:0] @ 0x150004[0x0]
[11986.235065] CHA[1] CHIP[1:1] @ 0x150014[0x0]
[11986.235067] CHA[1] CHIP[2:0] @ 0x150008[0x1]
[11986.235068] CHA[1] CHIP[2:1] @ 0x150018[0x0]
[11986.235070] CHA[1] CHIP[3:0] @ 0x15000c[0x201]
[11986.235071] CHA[1] CHIP[3:1] @ 0x15001c[0x0]
[11986.235072] CHA[1] MASK[0:0] @ 0x150020[0x0]
[11986.235074] CHA[1] MASK[0:1] @ 0x150028[0x0]
[11986.235075] CHA[1] MASK[1:0] @ 0x150024[0x3fffdfe]
[11986.235076] CHA[1] MASK[1:1] @ 0x15002c[0x0]

2020-07-13

[17611.473676] Welcome to the Data Fabric UMC(0):
               0x80[0x0] 0x100[0x80000200] 0x104[0xb040808b]
               0xdf0[0x10030] 0xdf4[0x0]
[17611.473682] Welcome to the Data Fabric UMC(1):
               0x80[0x0] 0x100[0x80000200] 0x104[0xb040808b]
               0xdf0[0x10030] 0xdf4[0x0]

UMC Config

0x80000200 = 0b10000000000000000000001000000000

Thus bits 9 and 31 are set.

SDP

0xb040808b = 0b10110000010000001000000010001011

Bit 31 (SdpInit) is set in both UMCs, so we have two channels.
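The two observations above come down to testing single bits. A minimal C sketch, assuming (as the notes do) that bit 31 of the 0x104 register is SdpInit and that one initialised SDP port means one channel; `bit_is_set` and `count_channels` are hypothetical helper names:

```c
#include <assert.h>
#include <stdint.h>

/* Return 1 when bit `pos` (0 = least significant) of `reg` is set. */
int bit_is_set(uint32_t reg, unsigned int pos)
{
    assert(pos < 32);
    return (int)((reg >> pos) & 1u);
}

/* Count the UMCs whose SDP register has bit 31 set.
 * Assumption: bit 31 is SdpInit, as stated in the notes above. */
unsigned int count_channels(const uint32_t *sdp, unsigned int n)
{
    unsigned int i, count = 0;

    for (i = 0; i < n; i++)
        count += (unsigned int)bit_is_set(sdp[i], 31);
    return count;
}
```

With the UMC Config value 0x80000200, `bit_is_set` returns 1 for positions 9 and 31 and 0 for every other position. Passing the two SDP values 0xb040808b to `count_channels` returns 2.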

cyring commented 4 years ago

Alpha Source Code

2020-07-16 17:35

removed

Thank you for testing: please post the CoreFreq driver part of the kernel log.

adatum commented 4 years ago

This crashed my system immediately upon attempting to insert the kernel module: sudo insmod corefreqk.ko

I tried three times, with and without Experimental=1, and once again after redownloading the source code in case it had gotten corrupted.

cyring commented 4 years ago

> This crashed my system immediately upon attempting to insert the kernel module: sudo insmod corefreqk.ko
>
> I tried three times, with and without Experimental=1, and once again after redownloading the source code in case it had gotten corrupted.

COMPATIBLE mode will be confirmed during the build by this pragma message: Building with Kernel amd_smn_read()

adatum commented 4 years ago

No change. I tried make HWM_CHIPSET=COMPATIBLE clean all, and I also tried the same after removing k10temp and asus_wmi_sensors with rmmod.

Btw, by crash, I mean freeze. The display freezes, mouse or keyboard inputs don't work, and the system fails to respond to pings.

cyring commented 4 years ago

> No change. I tried make HWM_CHIPSET=COMPATIBLE clean all, and I also tried the same after removing k10temp and asus_wmi_sensors with rmmod.
>
> Btw, by crash, I mean freeze. The display freezes, mouse or keyboard inputs don't work, and the system fails to respond to pings.

Found one bug within the asm code compiled by gas. Now I'm struggling to make an SMU locking mechanism...

cyring commented 4 years ago

Locking added to the alpha version 2020-07-16 17:35. Thank you for testing.

adatum commented 4 years ago

Unfortunately, the latest version froze the system too. I used make HWM_CHIPSET=COMPATIBLE clean all again.

cyring commented 4 years ago

> Unfortunately, the latest version froze the system too. I used make HWM_CHIPSET=COMPATIBLE clean all again.

Thanks for this test.

cyring commented 4 years ago

Hello

Here is zencli.c

cc -g zencli.c -o zencli

sudo ./zencli smu 0x50030
0x150508 (1377544)

sudo ./zencli smu 0x50100
0x80000200 (2147484160)

sudo ./zencli smu 0x50104
0xb040808b (2957017227)

sudo ./zencli smu 0x50df0
0x10030 (65584)

Thank you

adatum commented 4 years ago

zencli was a bit more successful:

sudo ./zencli smu 0x50030
0x150508 (1377544)

sudo ./zencli smu 0x50100
0x80000200 (2147484160)

sudo ./zencli smu 0x50104
0xb0408082 (2957017218)

sudo ./zencli smu 0x50df0
0x1fe2c (130604)
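Two of these reads differ from the values cyring posted for the same addresses (0xb040808b at 0x50104 and 0x10030 at 0x50df0). To see which bits differ between two machines, the values can be XORed. This is only an illustrative C sketch; `differing_bits` is a hypothetical helper and says nothing about what the individual bits mean.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Store the positions of the bits that differ between a and b in out[]
 * (lowest position first) and return how many there are.
 * out must have room for 32 entries. */
unsigned int differing_bits(uint32_t a, uint32_t b, unsigned int out[32])
{
    uint32_t x = a ^ b;
    unsigned int pos, n = 0;

    assert(out != NULL);
    for (pos = 0; pos < 32; pos++)
        if ((x >> pos) & 1u)
            out[n++] = pos;
    return n;
}
```

For 0xb040808b against 0xb0408082 this reports two differing positions, bits 0 and 3.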

cyring commented 4 years ago

> zencli was a bit more successful:

Great. Let's go further. Please download and build the latest zencli.c, then read the UMC as below:

sudo ./zencli umc 0x0

Welcome to the Data Fabric: UMC has 2 x Channel(s)

CHA[0]  CHIP_BAR[0][0]=0x00050000 CHIP_BAR[0][1]=0x00050020
                CHIP_BAR[1][0]=0x00050010 CHIP_BAR[1][1]=0x00050028
CHA[0] CHIP[0:0] @ 0x00050000[0x00000000] Disable
CHA[0] MASK[0:0] @ 0x00050020[0x00000000]
CHA[0] CHIP[0:1] @ 0x00050010[0x00000000] Disable
CHA[0] MASK[0:1] @ 0x00050028[0x00000000]
CHA[0] CHIP[1:0] @ 0x00050004[0x00000000] Disable
CHA[0] MASK[1:0] @ 0x00050020[0x00000000]
CHA[0] CHIP[1:1] @ 0x00050014[0x00000000] Disable
CHA[0] MASK[1:1] @ 0x00050028[0x00000000]
CHA[0] CHIP[2:0] @ 0x00050008[0x00000001] Enable
CHA[0] MASK[2:0] @ 0x00050024[0x03fffdfe] ChipSize[8388608]
CHA[0] CHIP[2:1] @ 0x00050018[0x00000000] Disable
CHA[0] MASK[2:1] @ 0x0005002c[0x00000000]
CHA[0] CHIP[3:0] @ 0x0005000c[0x00000201] Enable
CHA[0] MASK[3:0] @ 0x00050024[0x03fffdfe] ChipSize[8388608]
CHA[0] CHIP[3:1] @ 0x0005001c[0x00000000] Disable
CHA[0] MASK[3:1] @ 0x0005002c[0x00000000]
Memory Size[16777216 KB] [16384 MB]
CHA[1]  CHIP_BAR[0][0]=0x00150000 CHIP_BAR[0][1]=0x00150020
                CHIP_BAR[1][0]=0x00150010 CHIP_BAR[1][1]=0x00150028
CHA[1] CHIP[0:0] @ 0x00150000[0x00000000] Disable
CHA[1] MASK[0:0] @ 0x00150020[0x00000000]
CHA[1] CHIP[0:1] @ 0x00150010[0x00000000] Disable
CHA[1] MASK[0:1] @ 0x00150028[0x00000000]
CHA[1] CHIP[1:0] @ 0x00150004[0x00000000] Disable
CHA[1] MASK[1:0] @ 0x00150020[0x00000000]
CHA[1] CHIP[1:1] @ 0x00150014[0x00000000] Disable
CHA[1] MASK[1:1] @ 0x00150028[0x00000000]
CHA[1] CHIP[2:0] @ 0x00150008[0x00000001] Enable
CHA[1] MASK[2:0] @ 0x00150024[0x03fffdfe] ChipSize[8388608]
CHA[1] CHIP[2:1] @ 0x00150018[0x00000000] Disable
CHA[1] MASK[2:1] @ 0x0015002c[0x00000000]
CHA[1] CHIP[3:0] @ 0x0015000c[0x00000201] Enable
CHA[1] MASK[3:0] @ 0x00150024[0x03fffdfe] ChipSize[8388608]
CHA[1] CHIP[3:1] @ 0x0015001c[0x00000000] Disable
CHA[1] MASK[3:1] @ 0x0015002c[0x00000000]
Memory Size[16777216 KB] [16384 MB]
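The ChipSize and Memory Size figures can be reproduced from the MASK values alone. The sketch below appears to match the arithmetic the Linux EDAC driver applies to Zen address masks (register bits [31:1] taken as address bits [39:9], with interleave holes squeezed out), but whether zencli and CoreFreq compute it exactly this way is an assumption. Reading bit 0 of a CHIP register as the enable flag is likewise inferred from the Enable/Disable column, and all function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Index of the highest set bit (0-based); v must be non-zero. */
unsigned int highest_bit(uint32_t v)
{
    unsigned int pos = 0;

    assert(v != 0);
    while (v >>= 1)
        pos++;
    return pos;
}

/* Number of set bits in v. */
unsigned int count_bits(uint32_t v)
{
    unsigned int n = 0;

    for (; v != 0; v &= v - 1)
        n++;
    return n;
}

/* Chip-select size in KB derived from an address mask register. */
uint32_t chip_size_kb(uint32_t mask)
{
    unsigned int msb, zeros;
    uint64_t contiguous;

    mask &= ~1u;                     /* bit 0 is not part of the mask */
    if (mask == 0)
        return 0;
    msb = highest_bit(mask);
    zeros = msb - count_bits(mask);  /* holes below the top bit */
    /* Rebuild a hole-free mask covering bits [msb - zeros : 1]. */
    contiguous = ((uint64_t)1 << (msb - zeros + 1)) - 2u;
    return (uint32_t)(contiguous >> 2) + 1u;
}

/* Sum the sizes of the enabled chip selects of one channel. */
uint32_t channel_size_kb(const uint32_t chip[4], const uint32_t mask[4])
{
    uint32_t total = 0;
    unsigned int i;

    for (i = 0; i < 4; i++)
        if (chip[i] & 1u)            /* assumption: bit 0 = Enable */
            total += chip_size_kb(mask[i]);
    return total;
}
```

For the mask 0x03fffdfe, `chip_size_kb` yields 8388608. Summing the two enabled chip selects (CHIP[2:0] and CHIP[3:0]) gives 16777216 KB, that is 16384 MB per channel, as printed above.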

adatum commented 4 years ago

sudo ./zencli umc 0x0

Welcome to the Data Fabric: UMC has 2 x Channel(s)

CHA[0]  CHIP_BAR[0][0]=0x00050000 CHIP_BAR[0][1]=0x00050020
        CHIP_BAR[1][0]=0x00050010 CHIP_BAR[1][1]=0x00050028
CHA[0] CHIP[0:0] @ 0x00050000[0x00000000] Disable
CHA[0] MASK[0:0] @ 0x00050020[0x00000000]
CHA[0] CHIP[0:1] @ 0x00050010[0x00000000] Disable
CHA[0] MASK[0:1] @ 0x00050028[0x00000000]
CHA[0] CHIP[1:0] @ 0x00050004[0x00000000] Disable
CHA[0] MASK[1:0] @ 0x00050020[0x00000000]
CHA[0] CHIP[1:1] @ 0x00050014[0x00000000] Disable
CHA[0] MASK[1:1] @ 0x00050028[0x00000000]
CHA[0] CHIP[2:0] @ 0x00050008[0x00000001] Enable
CHA[0] MASK[2:0] @ 0x00050024[0x03fffdfe] ChipSize[8388608]
CHA[0] CHIP[2:1] @ 0x00050018[0x00000000] Disable
CHA[0] MASK[2:1] @ 0x0005002c[0x00000000]
CHA[0] CHIP[3:0] @ 0x0005000c[0x00000201] Enable
CHA[0] MASK[3:0] @ 0x00050024[0x03fffdfe] ChipSize[8388608]
CHA[0] CHIP[3:1] @ 0x0005001c[0x00000000] Disable
CHA[0] MASK[3:1] @ 0x0005002c[0x00000000]
Memory Size[16777216 KB] [16384 MB]
CHA[1]  CHIP_BAR[0][0]=0x00150000 CHIP_BAR[0][1]=0x00150020
        CHIP_BAR[1][0]=0x00150010 CHIP_BAR[1][1]=0x00150028
CHA[1] CHIP[0:0] @ 0x00150000[0x00000000] Disable
CHA[1] MASK[0:0] @ 0x00150020[0x00000000]
CHA[1] CHIP[0:1] @ 0x00150010[0x00000000] Disable
CHA[1] MASK[0:1] @ 0x00150028[0x00000000]
CHA[1] CHIP[1:0] @ 0x00150004[0x00000000] Disable
CHA[1] MASK[1:0] @ 0x00150020[0x00000000]
CHA[1] CHIP[1:1] @ 0x00150014[0x00000000] Disable
CHA[1] MASK[1:1] @ 0x00150028[0x00000000]
CHA[1] CHIP[2:0] @ 0x00150008[0x00000001] Enable
CHA[1] MASK[2:0] @ 0x00150024[0x03fffdfe] ChipSize[8388608]
CHA[1] CHIP[2:1] @ 0x00150018[0x00000000] Disable
CHA[1] MASK[2:1] @ 0x0015002c[0x00000000]
CHA[1] CHIP[3:0] @ 0x0015000c[0x00000201] Enable
CHA[1] MASK[3:0] @ 0x00150024[0x03fffdfe] ChipSize[8388608]
CHA[1] CHIP[3:1] @ 0x0015001c[0x00000000] Disable
CHA[1] MASK[3:1] @ 0x0015002c[0x00000000]
Memory Size[16777216 KB] [16384 MB]

cyring commented 4 years ago

So I'm lost! It is nearly the same UMC code as in the driver.

In this version of CoreFreq the whole UMC code is commented out, just to check whether the issue comes from somewhere else.

Be prepared for a crash

removed

adatum commented 4 years ago

Still crashing.

Yes, I didn't do a diff, but my UMC code output looks identical!

cyring commented 4 years ago

Maybe you already get a crash with develop commit 790ce5f1dd423c1e0cc2d363dd762bd84b8fc678, where I added two mitigation MSRs. Their availability might depend on the firmware.

To avoid a crash, please test those registers as root:

modprobe msr
rdmsr -aX 0x00000048
0
0
...
0
rdmsr -aX 0x00000049
rdmsr: CPU 0 cannot read MSR 0x00000049

adatum commented 4 years ago

modprobe msr
rdmsr -aX 0x00000048
rdmsr: CPU 0 cannot read MSR 0x00000048
rdmsr -aX 0x00000049
rdmsr: CPU 0 cannot read MSR 0x00000049

cyring commented 4 years ago

> modprobe msr
> rdmsr -aX 0x00000048
> rdmsr: CPU 0 cannot read MSR 0x00000048
> rdmsr -aX 0x00000049
> rdmsr: CPU 0 cannot read MSR 0x00000049

Here we are! SPEC_CTRL (0x00000048) cannot be read.
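The same probe can be done from a small C program instead of rdmsr. This is a user-space sketch, not the driver fix: it assumes the Linux msr module is loaded and exposes nodes such as /dev/cpu/0/msr, where the file offset selects the MSR index. `read_msr` is a hypothetical helper name.

```c
#define _XOPEN_SOURCE 700

#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

/* Read one MSR through an msr device node such as /dev/cpu/0/msr.
 * Returns 0 on success, -1 when the node cannot be opened or the
 * 8-byte read fails (which is what an unreadable MSR looks like). */
int read_msr(const char *node, uint32_t msr, uint64_t *value)
{
    ssize_t got;
    int fd;

    assert(node != NULL && value != NULL);
    fd = open(node, O_RDONLY);
    if (fd < 0)
        return -1;
    got = pread(fd, value, sizeof(*value), (off_t)msr);
    close(fd);
    return got == (ssize_t)sizeof(*value) ? 0 : -1;
}
```

Run as root, `read_msr("/dev/cpu/0/msr", 0x48, &v)` would be expected to fail on a machine where `rdmsr -aX 0x00000048` fails, so a caller can skip the register instead of touching it blindly.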

Can you roll back to CoreFreq master, then dump the CPUID? The output for CPU #0 will be enough.

corefreq-cli -u

adatum commented 4 years ago

Out of curiosity (before your reply), I confirmed that commit 790ce5f (1.79-33-g790ce5f, develop branch) causes the crash too.

With 1.79-23-gbefddf5 (master branch):

$ corefreq-cli -u
CPU #0   function         EAX          EBX          ECX          EDX            
|- 00000000:00000000    0000000d     68747541     444d4163     69746e65         
   |- Largest Standard Function=0000000d                                        
|- 80000000:00000000    8000001f     68747541     444d4163     69746e65         
   |- Largest Extended Function=8000001f                                        
|- 00000001:00000000    00800f82     00100800     7ed8320b     178bfbff         
|- 00000002:00000000    00000000     00000000     00000000     00000000         
|- 00000003:00000000    00000000     00000000     00000000     00000000         
|- 00000004:00000000    00000000     00000000     00000000     00000000         
|- 00000004:00000001    00000000     00000000     00000000     00000000         
|- 00000004:00000002    00000000     00000000     00000000     00000000         
|- 00000004:00000003    00000000     00000000     00000000     00000000         
|- 00000005:00000000    00000040     00000040     00000003     00000011         
|- 00000006:00000000    00000004     00000000     00000001     00000000         
|- 00000007:00000000    00000000     209c01a9     00000000     00000000         
|- 00000007:00000001    00000000     00000000     00000000     00000000         
|- 00000009:00000000    00000000     00000000     00000000     00000000         
|- 0000000a:00000000    00000000     00000000     00000000     00000000         
|- 0000000b:00000000    00000000     00000000     00000000     00000000         
|- 0000000d:00000000    00000007     00000340     00000340     00000000         
|- 0000000d:00000001    0000000f     00000340     00000000     00000000         
|- 0000000d:00000002    00000100     00000240     00000000     00000000         
|- 0000000d:00000003    00000000     00000000     00000000     00000000         
|- 0000000d:00000004    00000000     00000000     00000000     00000000         
|- 0000000d:0000003e    00000000     00000000     00000000     00000000         
|- 0000000f:00000000    00000000     00000000     00000000     00000000         
|- 0000000f:00000001    00000000     00000000     00000000     00000000         
|- 00000010:00000000    00000000     00000000     00000000     00000000         
|- 00000010:00000001    00000000     00000000     00000000     00000000         
|- 00000010:00000002    00000000     00000000     00000000     00000000         
|- 00000010:00000003    00000000     00000000     00000000     00000000         
|- 00000012:00000000    00000000     00000000     00000000     00000000         
|- 00000012:00000001    00000000     00000000     00000000     00000000         
|- 00000012:00000002    00000000     00000000     00000000     00000000         
|- 00000014:00000000    00000000     00000000     00000000     00000000         
|- 00000014:00000001    00000000     00000000     00000000     00000000         
|- 00000015:00000000    00000000     00000000     00000000     00000000         
|- 00000016:00000000    00000000     00000000     00000000     00000000         
|- 00000017:00000000    00000000     00000000     00000000     00000000         
|- 00000017:00000001    00000000     00000000     00000000     00000000         
|- 00000017:00000002    00000000     00000000     00000000     00000000         
|- 00000017:00000003    00000000     00000000     00000000     00000000         
|- 00000018:00000000    00000000     00000000     00000000     00000000         
|- 00000018:00000001    00000000     00000000     00000000     00000000         
|- 0000001a:00000000    00000000     00000000     00000000     00000000         
|- 0000001b:00000000    00000000     00000000     00000000     00000000         
|- 0000001f:00000000    00000000     00000000     00000000     00000000         
|- 80000001:00000000    00800f82     20000000     35c233ff     2fd3fbff         
|- 80000002:00000000    20444d41     657a7952     2037206e     30303732         
|- 80000003:00000000    69452058     2d746867     65726f43     6f725020         
|- 80000004:00000000    73736563     2020726f     20202020     00202020         
|- 80000005:00000000    ff40ff40     ff40ff40     20080140     40040140         
|- 80000006:00000000    26006400     66006400     02006140     00808140         
|- 80000007:00000000    00000000     0000001b     00000000     00006799         
|- 80000008:00000000    00003030     00001007     0000400f     00000000         
|- 8000000a:00000000    00000001     00008000     00000000     0001bcff         
|- 80000019:00000000    f040f040     00000000     00000000     00000000         
|- 8000001a:00000000    00000003     00000000     00000000     00000000         
|- 8000001b:00000000    000003ff     00000000     00000000     00000000         
|- 8000001c:00000000    00000000     00000000     00000000     00000000         
|- 8000001d:00000000    00004121     01c0003f     0000003f     00000000         
|- 8000001d:00000001    00004122     00c0003f     000000ff     00000000         
|- 8000001d:00000002    00004143     01c0003f     000003ff     00000002         
|- 8000001d:00000003    0001c163     03c0003f     00001fff     00000001         
|- 8000001e:00000000    00000000     00000100     00000000     00000000         
|- 40000000:00000000    00000000     00000000     00000000     00000000         
|- 40000001:00000000    00000000     00000000     00000000     00000000         
|- 40000002:00000000    00000000     00000000     00000000     00000000         
|- 40000003:00000000    00000000     00000000     00000000     00000000         
|- 40000004:00000000    00000000     00000000     00000000     00000000         
|- 40000005:00000000    00000000     00000000     00000000     00000000         
|- 40000006:00000000    00000000     00000000     00000000     00000000         
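The EAX value of leaf 00000001 in this dump (0x00800f82) encodes the processor signature. A short C sketch of the standard CPUID field split follows; the struct and function names are hypothetical. The decoded fields line up with the `[ 8F_08]` tag that the driver prints for this machine, although reading that tag as "ExtFamily, Family, ExtModel, Model" is an inference from the logs in this thread.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    unsigned int ext_family;  /* EAX[27:20] */
    unsigned int ext_model;   /* EAX[19:16] */
    unsigned int family;      /* EAX[11:8]  */
    unsigned int model;       /* EAX[7:4]   */
    unsigned int stepping;    /* EAX[3:0]   */
} cpu_signature;

/* Split CPUID leaf 1 EAX into its raw fields. */
cpu_signature decode_signature(uint32_t eax)
{
    cpu_signature s;

    s.stepping   = eax & 0xfu;
    s.model      = (eax >> 4) & 0xfu;
    s.family     = (eax >> 8) & 0xfu;
    s.ext_model  = (eax >> 16) & 0xfu;
    s.ext_family = (eax >> 20) & 0xffu;
    return s;
}

/* Effective family: the extended family is added when the base is 0xF. */
unsigned int display_family(cpu_signature s)
{
    return s.family == 0xfu ? s.family + s.ext_family : s.family;
}

/* Effective model: the extended model supplies the high nibble
 * when the base family is 0xF. */
unsigned int display_model(cpu_signature s)
{
    return s.family == 0xfu ? (s.ext_model << 4) | s.model : s.model;
}
```

For 0x00800f82 this gives family 0x17, model 0x08, stepping 2.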

cyring commented 4 years ago

AMD Processor Programming Reference

CPUID_Fn8000000A_EDX [SVM Revision and Feature Identification]

Bits  Description
20    GuestSpecCtrl. Read-only. Reset: Fixed,1. 1=Indicates support for Guest SPEC_CTRL.

Yours

CPU #0   function         EAX          EBX          ECX          EDX            
|- 8000000a:00000000    00000001     00008000     00000000     0001bcff

0x1BCFF = 0b000011011110011111111

Mine

CPU #0   function         EAX          EBX          ECX          EDX            
|- 8000000a:00000000    00000001     00008000     00000000     0013bcff 

0x13BCFF = 0b100111011110011111111
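The comparison above is a test of bit 20 in the EDX value of leaf 8000000a. As a minimal C sketch (the macro and function names are hypothetical; the bit position comes from the reference excerpt quoted above):

```c
#include <assert.h>
#include <stdint.h>

/* CPUID Fn8000_000A EDX bit 20: GuestSpecCtrl, per the excerpt above. */
#define SVM_EDX_GUEST_SPEC_CTRL (1u << 20)

/* Return 1 when the EDX value advertises Guest SPEC_CTRL support. */
int has_guest_spec_ctrl(uint32_t edx)
{
    return (edx & SVM_EDX_GUEST_SPEC_CTRL) != 0;
}
```

It returns 0 for 0x0001bcff and 1 for 0x0013bcff.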

Fix

adatum commented 4 years ago

It worked :)

Not sure if necessary but I used make HWM_CHIPSET=COMPATIBLE clean all.

mitigation

In case you still want the driver output from the system log:

CoreFreq(4:12): Processor [ 8F_08] Architecture [Zen+ Pinnacle Ridge] SMT [16/16]
Welcome to the Data Fabric UMC(0) @ 0x00050000:
0x030[0x00150508] 0x080[0x00000000] 0x100[0x80000200]
0x104[0xb0408082] 0x14c[0x00000000]
0xdf0[0x0001fe2c] 0xdf4[0x00000000]
Welcome to the Data Fabric UMC(1) @ 0x00150000:
0x030[0x00150508] 0x080[0x00000000] 0x100[0x80000200]
0x104[0xb0408082] 0x14c[0x00000000]
0xdf0[0x0001fe2c] 0xdf4[0x00000000]
CHA[0]        CHIP_BAR[0][0]=0x00050000 CHIP_BAR[0][1]=0x00050020
                CHIP_BAR[1][0]=0x00050010 CHIP_BAR[1][1]=0x00050028
CHA[0] CHIP[0:0] @ 0x00050000[0x00000000] Disable
CHA[0] MASK[0:0] @ 0x00050020[0x00000000]
CHA[0] CHIP[0:1] @ 0x00050010[0x00000000] Disable
CHA[0] MASK[0:1] @ 0x00050028[0x00000000]
CHA[0] CHIP[1:0] @ 0x00050004[0x00000000] Disable
CHA[0] MASK[1:0] @ 0x00050020[0x00000000]
CHA[0] CHIP[1:1] @ 0x00050014[0x00000000] Disable
CHA[0] MASK[1:1] @ 0x00050028[0x00000000]
CHA[0] CHIP[2:0] @ 0x00050008[0x00000001] Enable
CHA[0] MASK[2:0] @ 0x00050024[0x03fffdfe] ChipSize[8388608]
CHA[0] CHIP[2:1] @ 0x00050018[0x00000000] Disable
CHA[0] MASK[2:1] @ 0x0005002c[0x00000000]
CHA[0] CHIP[3:0] @ 0x0005000c[0x00000201] Enable
CHA[0] MASK[3:0] @ 0x00050024[0x03fffdfe] ChipSize[8388608]
CHA[0] CHIP[3:1] @ 0x0005001c[0x00000000] Disable
CHA[0] MASK[3:1] @ 0x0005002c[0x00000000]
Memory Size[16777216 KB] [16384 MB]
CHA[1]        CHIP_BAR[0][0]=0x00150000 CHIP_BAR[0][1]=0x00150020
                CHIP_BAR[1][0]=0x00150010 CHIP_BAR[1][1]=0x00150028
CHA[1] CHIP[0:0] @ 0x00150000[0x00000000] Disable
CHA[1] MASK[0:0] @ 0x00150020[0x00000000]
CHA[1] CHIP[0:1] @ 0x00150010[0x00000000] Disable
CHA[1] MASK[0:1] @ 0x00150028[0x00000000]
CHA[1] CHIP[1:0] @ 0x00150004[0x00000000] Disable
CHA[1] MASK[1:0] @ 0x00150020[0x00000000]
CHA[1] CHIP[1:1] @ 0x00150014[0x00000000] Disable
CHA[1] MASK[1:1] @ 0x00150028[0x00000000]
CHA[1] CHIP[2:0] @ 0x00150008[0x00000001] Enable
CHA[1] MASK[2:0] @ 0x00150024[0x03fffdfe] ChipSize[8388608]
CHA[1] CHIP[2:1] @ 0x00150018[0x00000000] Disable
CHA[1] MASK[2:1] @ 0x0015002c[0x00000000]
CHA[1] CHIP[3:0] @ 0x0015000c[0x00000201] Enable
CHA[1] MASK[3:0] @ 0x00150024[0x03fffdfe] ChipSize[8388608]
CHA[1] CHIP[3:1] @ 0x0015001c[0x00000000] Disable
CHA[1] MASK[3:1] @ 0x0015002c[0x00000000]
Memory Size[16777216 KB] [16384 MB]

cyring commented 4 years ago

> It worked :)

But I'm using the wrong capability bits. The correct ones belong to CPUID leaf 0x80000008:EBX. Can you test this version?
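As a sketch of what checking leaf 0x80000008:EBX can look like: AMD documents IBPB at bit 12, IBRS at bit 14, STIBP at bit 15 and SSBD at bit 24 of that register. The macro and function names below are hypothetical, and the rule "SPEC_CTRL is expected when any of IBRS, STIBP or SSBD is advertised" is a simplification rather than the actual CoreFreq fix.

```c
#include <assert.h>
#include <stdint.h>

/* CPUID Fn8000_0008 EBX feature bits. */
#define EBX_IBPB  (1u << 12)
#define EBX_IBRS  (1u << 14)
#define EBX_STIBP (1u << 15)
#define EBX_SSBD  (1u << 24)

/* SPEC_CTRL (MSR 0x48) is expected when one of its controls is advertised. */
int expects_spec_ctrl(uint32_t ebx)
{
    return (ebx & (EBX_IBRS | EBX_STIBP | EBX_SSBD)) != 0;
}

/* PRED_CMD (MSR 0x49) is expected when IBPB is advertised. */
int expects_pred_cmd(uint32_t ebx)
{
    return (ebx & EBX_IBPB) != 0;
}
```

With the EBX value 0x00001007 from the CPUID dump posted earlier, `expects_spec_ctrl` returns 0, which agrees with the failed read of MSR 0x48 on that machine, and `expects_pred_cmd` returns 1. PRED_CMD is a write-only command register, which would explain why rdmsr fails on 0x49 on both machines.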

removed

> Not sure if necessary but I used make HWM_CHIPSET=COMPATIBLE clean all.

You don't have to if no other driver (such as k10temp) is running. Without COMPATIBLE, CoreFreq queries the sensors directly, which improves latency.

> In case you still want the driver output from the system log:

Yes, I'm trying to match the UMC output we're getting with the DIMM locations. Can you tell me how your DIMMs are populated on the motherboard?

My 2 x 16 GB DIMMs are slotted like this:

DIMM B1[ ]  DIMM B2[X]  DIMM A1[ ] DIMM A2[X]

adatum commented 4 years ago

I do have k10temp and asus-wmi-sensors running all the time, so I'll continue using make HWM_CHIPSET=COMPATIBLE clean all for now.

CoreFreq Daemon 1.80.1

mitigation2

system log output

```
CoreFreq(3:11): Processor [ 8F_08] Architecture [Zen+ Pinnacle Ridge] SMT [16/16]
Welcome to the Data Fabric UMC(0) @ 0x00050000:
0x030[0x00150508] 0x080[0x00000000] 0x100[0x80000200]
0x104[0xb0408082] 0x14c[0x00000000]
0xdf0[0x0001fe2c] 0xdf4[0x00000000]
Welcome to the Data Fabric UMC(1) @ 0x00150000:
0x030[0x00150508] 0x080[0x00000000] 0x100[0x80000200]
0x104[0xb0408082] 0x14c[0x00000000]
0xdf0[0x0001fe2c] 0xdf4[0x00000000]
CHA[0]        CHIP_BAR[0][0]=0x00050000 CHIP_BAR[0][1]=0x00050020
                CHIP_BAR[1][0]=0x00050010 CHIP_BAR[1][1]=0x00050028
CHA[0] CHIP[0:0] @ 0x00050000[0x00000000] Disable
CHA[0] MASK[0:0] @ 0x00050020[0x00000000]
CHA[0] CHIP[0:1] @ 0x00050010[0x00000000] Disable
CHA[0] MASK[0:1] @ 0x00050028[0x00000000]
CHA[0] CHIP[1:0] @ 0x00050004[0x00000000] Disable
CHA[0] MASK[1:0] @ 0x00050020[0x00000000]
CHA[0] CHIP[1:1] @ 0x00050014[0x00000000] Disable
CHA[0] MASK[1:1] @ 0x00050028[0x00000000]
CHA[0] CHIP[2:0] @ 0x00050008[0x00000001] Enable
CHA[0] MASK[2:0] @ 0x00050024[0x03fffdfe] ChipSize[8388608]
CHA[0] CHIP[2:1] @ 0x00050018[0x00000000] Disable
CHA[0] MASK[2:1] @ 0x0005002c[0x00000000]
CHA[0] CHIP[3:0] @ 0x0005000c[0x00000201] Enable
CHA[0] MASK[3:0] @ 0x00050024[0x03fffdfe] ChipSize[8388608]
CHA[0] CHIP[3:1] @ 0x0005001c[0x00000000] Disable
CHA[0] MASK[3:1] @ 0x0005002c[0x00000000]
Memory Size[16777216 KB] [16384 MB]
CHA[1]        CHIP_BAR[0][0]=0x00150000 CHIP_BAR[0][1]=0x00150020
                CHIP_BAR[1][0]=0x00150010 CHIP_BAR[1][1]=0x00150028
CHA[1] CHIP[0:0] @ 0x00150000[0x00000000] Disable
CHA[1] MASK[0:0] @ 0x00150020[0x00000000]
CHA[1] CHIP[0:1] @ 0x00150010[0x00000000] Disable
CHA[1] MASK[0:1] @ 0x00150028[0x00000000]
CHA[1] CHIP[1:0] @ 0x00150004[0x00000000] Disable
CHA[1] MASK[1:0] @ 0x00150020[0x00000000]
CHA[1] CHIP[1:1] @ 0x00150014[0x00000000] Disable
CHA[1] MASK[1:1] @ 0x00150028[0x00000000]
CHA[1] CHIP[2:0] @ 0x00150008[0x00000001] Enable
CHA[1] MASK[2:0] @ 0x00150024[0x03fffdfe] ChipSize[8388608]
CHA[1] CHIP[2:1] @ 0x00150018[0x00000000] Disable
CHA[1] MASK[2:1] @ 0x0015002c[0x00000000]
CHA[1] CHIP[3:0] @ 0x0015000c[0x00000201] Enable
CHA[1] MASK[3:0] @ 0x00150024[0x03fffdfe] ChipSize[8388608]
CHA[1] CHIP[3:1] @ 0x0015001c[0x00000000] Disable
CHA[1] MASK[3:1] @ 0x0015002c[0x00000000]
Memory Size[16777216 KB] [16384 MB]
```

> My 2 x 16 GB DIMMs are slotted like this:
>
> DIMM B1[ ] DIMM B2[X] DIMM A1[ ] DIMM A2[X]

Exactly the same here.

cyring commented 4 years ago

Pushing this fix to the develop branch. Thank you for your help.

cyring commented 4 years ago

This is where I am now: basically, the DIMM size and the channel mode.

CoreFreq_DIMM_geometry_WiP

Available in develop for your testing.

adatum commented 4 years ago

So far so good:

dimms

cyring commented 4 years ago

Nice. Thank you. More to come...

adatum commented 4 years ago

Btw, I used simple make -j and then insmod corefreqk.ko and it was fine.

cyring commented 4 years ago

> Btw, I used simple make -j and then insmod corefreqk.ko and it was fine.

When working with develop, I recommend fully rebuilding and reloading CoreFreq, because I might have changed the API without updating the version. This could lead to a crash.

Thus, always build this way:

make clean all
rmmod corefreqk
insmod corefreqk.ko

Besides those AMD functionalities, the Experimental mode is not required:

adatum commented 4 years ago

I never keep corefreqk running after testing, and the daemon is also run in the foreground, not background. After each test my routine is to Ctrl-C the running corefreqd and then rmmod corefreqk.

I also make clean before doing git pull and make. Is make clean all equivalent to make clean then make?

cyring commented 4 years ago

> Is make clean all equivalent to make clean then make?

Yes, the same in one command.

cyring commented 4 years ago

New develop version with the UMC timings and Speed decoded

CoreFreq_UMC

adatum commented 4 years ago

umc

cyring commented 4 years ago

Thank you for your test. Do all the timings and the DDR speed match the BIOS settings?

adatum commented 4 years ago

Yes, although the BIOS mentions CHA and CHB, which correspond to Cha#1 and Cha#0 from CoreFreq, respectively, based on the dissimilar values for RdWr and WrRd.

cyring commented 4 years ago

> Yes, although the BIOS mentions CHA and CHB, which correspond to Cha#1 and Cha#0 from CoreFreq, respectively, based on the dissimilar values for RdWr and WrRd.

Thanks a lot for your confirmation.

It will be difficult to match every BIOS terminology. I believe timings are named differently among manufacturers. Except for the channel id, I'm doing the same as the ASUS board, but each cell has only 5 characters, space included, to name an item.
Perhaps sticking to the DRAM terminology would be better...

adatum commented 4 years ago

My comment is not about the terminology, but the correspondence between CH A/B in BIOS and Cha 1/0 in CoreFreq.

cyring commented 4 years ago

> My comment is not about the terminology, but the correspondence between CH A/B in BIOS and Cha 1/0 in CoreFreq.

Mine is 0 for channel A and 1 for B, so there's an issue in the topology. I will need more tests from other brands to understand the register encoding.

cyring commented 4 years ago
cyring commented 4 years ago
adatum commented 4 years ago

memcontroller

cyring commented 4 years ago

Thanks for the screenshot. Regarding the different RdWr and WrRd timing values per channel, can you confirm whether this is a CoreFreq bug or just reflects what is set in the BIOS?

adatum commented 4 years ago

> Regarding the different RdWr and WrRd timing values per channel, can you confirm whether this is a CoreFreq bug or just reflects what is set in the BIOS?

They are reflected in BIOS. I was surprised actually. Did not know they can be different, and still don't know if they should. I just have DOCP (XMP) set in BIOS and no manual memory timings.

cyring commented 4 years ago
adatum commented 4 years ago

umc

Ropid commented 4 years ago

The "ECC" reading might be wrong. I use ECC RAM here on a Ryzen 7 2700X and it says "0" in the ECC column. I looked at the latest commit in the "master" branch, not "develop".

$ corefreq-cli -M
                         Zen  [1463]                                  
Controller #0                                           Dual Channel  
 Bus Rate  1566 MHz       Bus Speed 1566 MHz      DRAM Speed 3133 MHz 

 Cha    CL RCD_R RCD_W RP  RAS  RC RRD_S RRD_L FAW WTR_S WTR_L WR clRR
  #0    14   17   14   15   31   50    4    4   16    4   10   10    4
  #1    14   17   14   15   31   50    4    4   16    4   10   10    4
      clWW  CWL  RTP RdWr WrRd scWW sdWW ddWW scRR sdRR ddRR  ECC Rate
  #0     2   14    6    6    3    1    6    5    1    4    3   0    1N
  #1     2   14    6    6    3    1    6    5    1    4    3   0    1N

 DIMM Geometry for channel #0                                         
      Slot Bank Rank     Rows   Columns    Memory Size (MB)           
       #0                                                             
       #1     2   16     65536      1024          16384               
 DIMM Geometry for channel #1                                         
      Slot Bank Rank     Rows   Columns    Memory Size (MB)           
       #0                                                             
       #1     2   16     65536      1024          16384               

ECC is enabled and working. The kernel log has this:

[    5.098744] EDAC amd64: Node 0: DRAM ECC enabled.

And here is an actual error from a few months ago:

$ ras-mc-ctl --errors
...
32 2020-02-10 15:54:39 +0100 error: Corrected error, no action required., CPU 2, bank Unified Memory Controller (bank=16), mcg mcgstatus=0, mci CECC, memory_channel=1,csrow=3, mcgcap=0x00000117, status=0x9c2040000000011b, addr=0x215321700, misc=0xd01a000101000000, walltime=0x5e416eb0, cpuid=0x00800f82, bank=0x00000010
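The Memory Size column in the DIMM geometry table above is consistent with the product of the geometry figures. The C sketch below rests on two assumptions that the output does not confirm: that the values 2 and 16 are the rank and bank counts respectively (the header reads "Bank Rank"), and that each column access moves 8 bytes of data (a 64-bit bus, ECC bits not counted). `dimm_size_mb` is a hypothetical helper.

```c
#include <assert.h>
#include <stdint.h>

/* DIMM capacity in MB from its geometry.
 * bus_bytes is the data width per column access (assumed to be 8). */
uint64_t dimm_size_mb(uint32_t ranks, uint32_t banks,
                      uint32_t rows, uint32_t cols, uint32_t bus_bytes)
{
    uint64_t bytes = (uint64_t)ranks * banks * rows * cols * bus_bytes;

    return bytes >> 20;
}
```

`dimm_size_mb(2, 16, 65536, 1024, 8)` evaluates to 16384, the size reported for slot #1 on each channel.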

cyring commented 4 years ago

> The "ECC" reading might be wrong. I use ECC RAM here on a Ryzen 7 2700X and it says "0" in the ECC column. I looked at the latest commit in the "master" branch, not "develop".


Can you replace the bit layout in this file: https://github.com/cyring/CoreFreq/blob/acf1a732565454e01623a8e5b717b868366d0638/amdmsr.h#L866

with this code:

typedef union
{   /* SMU: address = 0x50100                   */
    unsigned int        value;
    struct
    {
        unsigned int
        ReservedBits1   : 30-0,
        ECC_DIMM_Enable : 31-30,
        ReservedBits2   : 32-31;
    };
} AMD_17_UMC_CFG_ECC;

EDIT: The test change is now available in the current develop branch, so you don't have to edit the code as requested above.

Ropid commented 4 years ago

I get this with the 'develop' branch from right now, the ECC column is still zero:

$ corefreq-cli -M
                              Zen UMC  [1463]                              
Controller #0                                                Dual Channel  
 Bus Rate  1566 MT/s      Bus Speed 1566 MHz           DRAM Speed 3133 MHz 

 Cha    CL RCD_R RCD_W RP  RAS  RC RRD_S RRD_L FAW WTR_S WTR_L WR clRR clWW
  #0    14   17   14   15   31   50    4    4   16    4   10   10    4    2
  #1    14   17   14   15   31   50    4    4   16    4   10   10    4    2
       CWL  RTP RdWr WrRd scWW sdWW ddWW scRR sdRR ddRR dlr[RR WW   WR RRD]
  #0    14    6    6    3    1    6    5    1    4    3    0    0    0    0
  #1    14    6    6    3    1    6    5    1    4    3    0    0    0    0
      REFI  RFC RFC2 RFC4 RCPB RPPB sFAW dFAW Ban RCPage CKE  CMD  GDM  ECC
  #0 12226  400  400  400    0    0    0    0 R1W1    0    8   1T   ON    0
  #1 12226  400  400  400    0    0    0    0 R1W1    0    8   1T   ON    0

 DIMM Geometry for channel #0                                              
      Slot Bank Rank     Rows   Columns    Memory Size (MB)                
       #0                                                                  
       #1     2   16     65536      1024          16384                    
 DIMM Geometry for channel #1                                              
      Slot Bank Rank     Rows   Columns    Memory Size (MB)                
       #0                                                                  
       #1     2   16     65536      1024          16384                    

cyring commented 4 years ago

> I get this with the 'develop' branch from right now, the ECC column is still zero:

Can you try again with the latest commit in develop?

Remark: the driver API has changed, so be sure to rebuild and reload everything.

Thank you

Ropid commented 4 years ago

It works correctly, I think. Using commit 9f68310, I get this output here, which shows ECC = 1:

                              Zen UMC  [1463]                              
Controller #0                                                Dual Channel  
 Bus Rate  1566 MT/s      Bus Speed 1566 MHz           DRAM Speed 3133 MHz 

 Cha    CL RCD_R RCD_W RP  RAS  RC RRD_S RRD_L FAW WTR_S WTR_L WR clRR clWW
  #0    14   17   14   15   31   50    4    4   16    4   10   10    4    2
  #1    14   17   14   15   31   50    4    4   16    4   10   10    4    2
       CWL  RTP RdWr WrRd scWW sdWW ddWW scRR sdRR ddRR dlr[RR WW   WR RRD]
  #0    14    6    6    3    1    6    5    1    4    3    0    0    0    0
  #1    14    6    6    3    1    6    5    1    4    3    0    0    0    0
      REFI  RFC RFC2 RFC4 RCPB RPPB sFAW dFAW Ban RCPage CKE  CMD  GDM  ECC
  #0 12226  400  400  400    0    0    0    0 R1W1    0    8   1T   ON    1
  #1 12226  400  400  400    0    0    0    0 R1W1    0    8   1T   ON    1

 DIMM Geometry for channel #0                                              
      Slot Bank Rank     Rows   Columns    Memory Size (MB)                
       #0                                                                  
       #1     2   16     65536      1024          16384                    
 DIMM Geometry for channel #1                                              
      Slot Bank Rank     Rows   Columns    Memory Size (MB)                
       #0                                                                  
       #1     2   16     65536      1024          16384                    

I'll now try rebooting and disabling ECC in the BIOS menus and see what happens.

EDIT: I managed to find the option to disable ECC in the BIOS menus, and the output of dmesg now has this:

$ dmesg | grep -i '\becc\b'
[    4.382182] EDAC amd64: Node 0: DRAM ECC disabled.
[    4.382889] EDAC amd64: Node 0: DRAM ECC disabled.

And I can now see a "0" in corefreq-cli:

                              Zen UMC  [1463]                              
Controller #0                                                Dual Channel  
 Bus Rate  1566 MT/s      Bus Speed 1566 MHz           DRAM Speed 3133 MHz 

 Cha    CL RCD_R RCD_W RP  RAS  RC RRD_S RRD_L FAW WTR_S WTR_L WR clRR clWW
  #0    14   17   14   15   31   50    4    4   16    4   10   10    4    2
  #1    14   17   14   15   31   50    4    4   16    4   10   10    4    2
       CWL  RTP RdWr WrRd scWW sdWW ddWW scRR sdRR ddRR dlr[RR WW   WR RRD]
  #0    14    6    6    3    1    6    5    1    4    3    0    0    0    0
  #1    14    6    6    3    1    6    5    1    4    3    0    0    0    0
      REFI  RFC RFC2 RFC4 RCPB RPPB sFAW dFAW Ban RCPage CKE  CMD  GDM  ECC
  #0 12226  400  400  400    0    0    0    0 R1W1    0    8   1T   ON    0
  #1 12226  400  400  400    0    0    0    0 R1W1    0    8   1T   ON    0

 DIMM Geometry for channel #0                                              
      Slot Bank Rank     Rows   Columns    Memory Size (MB)                
       #0                                                                  
       #1     2   16     65536      1024          16384                    
 DIMM Geometry for channel #1                                              
      Slot Bank Rank     Rows   Columns    Memory Size (MB)                
       #0                                                                  
       #1     2   16     65536      1024          16384                    

EDIT 2: Everything works great here. I enabled ECC again in the BIOS, and the output in corefreq changed back to "1".

cyring commented 4 years ago

> EDIT 2: Everything works great here. I enabled ECC again in the BIOS, and the output in corefreq changed back to "1".

Thanks for these various tests.

All credit goes to the Linux kernel for unveiling those UMC registers. See amd64_edac.h.