Max value hits 429C after some time on 7950X

tofurky commented 3 weeks ago

After running for a while, several cores will eventually get a bad reading of 429C and stick there until corefreqd is restarted.

Ryzen 7950X w/PBO enabled on ASUS ProArt X670E-Creator, kernel 6.6.32 built with gcc 13.2.0 on Ubuntu 24.04.

CoreFreq 117b8c04ef0ef18f877472c268aadc74dd3cffb4 from Wed Feb 28 02:40:18 2024 +0100. Currently testing with the tip of master; the issue takes some time to show up, and I'll update if/when it does.

System PBO offsets were tuned to undervolt most cores and overvolt 1 for stability. RAM is DDR5-6000 with Infinity Fabric at UCLK/1. UEFI is latest available with AGESA 1.1.7.0. No application crashes or other issues seen in dmesg. This issue has been happening as long as I've been letting corefreqd run continuously, so since last August according to .bash_history.

Unsure if this means there's an occasional bad read from the SMU or there's something else at play? Please let me know if you need any other information, thanks.

Screenshot:

cyring commented 3 weeks ago

Hello,

Thank you for reporting this issue.

Sorry for all the details that follow; I need to walk through the thermal call flow, which starts in this driver function:

https://github.com/cyring/CoreFreq/blob/7c8d354aeed79bcde59d2fcadbab899b60137a74/x86_64/corefreqk.c#L16233

The temperature is indeed read from the SMU; the TCCD_REGISTER address is interlaced per CCD:

https://github.com/cyring/CoreFreq/blob/7c8d354aeed79bcde59d2fcadbab899b60137a74/x86_64/amd_reg.h#L1504

49 is subtracted from the SMU current temperature CurTmp whenever the CurTempRangeSel bit is set to one.

Then thermal data are processed by the Daemon:

https://github.com/cyring/CoreFreq/blob/7c8d354aeed79bcde59d2fcadbab899b60137a74/x86_64/corefreqd.c#L1203

and maths start here:

https://github.com/cyring/CoreFreq/blob/7c8d354aeed79bcde59d2fcadbab899b60137a74/x86_64/corefreqd.c#L316

Temperature is finally computed by this macro:

https://github.com/cyring/CoreFreq/blob/7c8d354aeed79bcde59d2fcadbab899b60137a74/x86_64/coretypes.h#L709

Question: do you have any other thermal agent running in parallel with CoreFreq? For instance, the kernel module k10temp?

tofurky commented 3 weeks ago

Question: do you have any other thermal agent running in parallel of CoreFreq ? For instance, the kernel module k10temp ?

Yep, k10temp is loaded and polled every 5m by munin-node and much more frequently by the MATE sensors applet (Tctl only). Possibly there are 2 calls at once sometimes? There have never been super high values reported by k10temp via munin, though.

k10temp-pci-00c3
Adapter: PCI adapter
Tctl:         +51.5°C  
Tccd1:        +39.4°C  
Tccd2:        +38.8°C

The graph tends to squash any one-time high readings, but each point below represents 15m (3x 5m readings), I think. System uptime in the previous screenshot showing 429C is 18 days.

cyring commented 3 weeks ago

There's an SMU access conflict between corefreqk.ko and k10temp, so k10temp has to be unloaded.

FYI, you can quickly reproduce this conflict by lowering the collect interval to 100ms in corefreq-cli.

tofurky commented 3 weeks ago

I haven't been able to reproduce it yet doing the following:

Running the following to poll k10temp Tctl, Tccd1, Tccd2 as quickly as possible: while true; do cat /sys/class/hwmon/hwmon4/temp{1,3,4}_input ; done
Running corefreq-cli at 100ms interval

Been running for about 5m now, I'll let it run longer and see if I see the high temps.

Edit: after 45m, one core went to 429C. I'll continue testing without k10temp loaded.

cyring commented 3 weeks ago

I did not notice your edit to the comment.

If you confirm no other SMU agent is running in parallel, it may be that the problem only shows up from Raphael onward.

Day to day, I develop on AMD Matisse with particular kernel boot parameters documented in the Wiki. Once in a while, especially before a release, I run full non-regression tests with most kernel modules loaded, except k10temp and occasionally amd_pstate.

Now rebooting with all modules loaded ...

tofurky commented 3 weeks ago

Ah, sorry about the edit, yeah I didn't want to generate too many e-mails.

I've left it running at 100ms overnight and one core has gone to 429C. k10temp isn't loaded.

matt@aquos:~$ lsmod |grep k10temp
matt@aquos:~$

tofurky commented 3 weeks ago

I'm using amd_pstate=active for CPU frequency control; I imagine this might interact with the SMU in some way? It is available in kernel 6.5+ I believe.

tofurky commented 3 weeks ago

Sorry, make that 2 cores. Needed to scroll down. I was thinking it was a bit odd that if there's 16 physical cores there'd be 32 separate temperature readings.

024 in the screenshot is CCD 1, ID 8 and shows 429C, but 008 is also CCD 1, ID 8 and doesn't show the high reading. Maybe this is just a quirk of how the temperature data is retrieved? Seems like they both should be at 429C though, if it's the same core. I guess CCD_AMD_Family_19h_Zen4_Temp() mentioned above doesn't take topology into account and lets the SMU return a separate temperature for each "core" even if it's just a thread versus physical core?

cyring commented 3 weeks ago

I have a 100ms session running on my Ryzen 9 3950X and the issue has not appeared yet.

But looking at your screenshots, can we say that 429 is only displayed in the Max TMP column?

tofurky commented 3 weeks ago

Yes, it's only shown in the max column. I see 429C in max columns for cores 1, 3, 7, 11, 13, 16, 24, 26 now.

cyring commented 3 weeks ago

I'm using amd_pstate=active for CPU frequency control; I imagine this might interact with the SMU in some way? It is available in kernel 6.5+ I believe.

Depending on the firmware implementation, amd-pstate.c uses either ACPI CPPC or MSR registers (similar to Intel HWP). I have not seen any references to SMU functions in that driver, but it may reach the SMU indirectly.

cyring commented 3 weeks ago

Yes, it's only shown in the max column. I see 429C in max columns for cores 1, 3, 7, 11, 13, 16, 24, 26 now.

Great, if I may say so; that narrows the investigation. It could be a math side effect in the computation macros.

https://github.com/cyring/CoreFreq/blob/7c8d354aeed79bcde59d2fcadbab899b60137a74/x86_64/coretypes.h#L631

tofurky commented 3 weeks ago

Well, I haven't been watching it for hours so it's possible that it appears once in the TMP column before appearing in the Max column.

But it's always 429C. Never something in between, e.g. 300C. There are no wrong MIN values, either.

I am running 'corefreq-cli -C > corefreq.out' so I can check if the instantaneous values show 429C.

cyring commented 3 weeks ago

Maybe this is just a quirk of how the temperature data is retrieved?

In fact, sensors are retrieved depending on the sensor scope. In the UI Settings window you can change the thermal scope among 4 possibilities: None, Package, Core or SMT (Thread)

cyring commented 3 weeks ago

Tip: at any time, press the star key * to reset values, including the Min and Max of each CPU sensor, frequencies and so on.

cyring commented 3 weeks ago

With an uptime of 11 hours, I have not been able to reproduce the issue.

I'm taking a close look at the source code for values that your CPU can reach and mine cannot.

tofurky commented 3 weeks ago

Got it, so it's 429(4967283), should've realized it was 2^32. Well, 0xFFFFFFF3.

matt@aquos:~$ grep -C2 '4294967283' corefreq.out 
009    6.92   244  1.5250   36  000000000000000276    0.004211426   0.042114258
010    6.73   244  1.5250   36  000000000000000277    0.004226685   0.042266846
011    9.31   244  1.5250  4294967283  000000000000000317    0.004837036   0.048370361
012   18.27   244  1.5250   36  000000000000000449    0.006851196   0.068511963
013   11.92   244  1.5250   36  000000000000000291    0.004440308   0.044403076
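
The 429 seen in the UI is consistent with a narrow column showing only the leading digits of 4294967283. A minimal C sketch of that truncation follows; the three-character column width and the helper name `format_temp_column` are assumptions made for illustration, not CoreFreq code.

```c
#include <assert.h>
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: render a temperature into a fixed-size column.
 * snprintf() writes at most size-1 characters plus the terminating NUL,
 * so a 4-byte buffer keeps only the first three digits of the value. */
static void format_temp_column(char *column, size_t size, uint32_t temp)
{
    assert(column != NULL && size > 0);
    snprintf(column, size, "%" PRIu32, temp);
}
```

With a 4-byte buffer, a reading of 36 is rendered in full, while 4294967283 is cut down to its first three digits.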

cyring commented 3 weeks ago

Congrats

= (4294967295) - 12
= 0xFFFFFFFF - 0xC
= (-1U) - 0xC
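
Read as a signed 32-bit integer, the same bit pattern is -13. A small C sketch of that reinterpretation, where the helper name `as_signed32` is hypothetical and not taken from CoreFreq:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* The value from the log, written the same way as in the comment above. */
static_assert(UINT32_MAX - 12u == 4294967283u, "0xFFFFFFFF - 0xC");

/* Reinterpret the bits of an unsigned 32-bit value as a signed
 * two's-complement integer. memcpy() avoids relying on the
 * implementation-defined result of an out-of-range conversion. */
static int32_t as_signed32(uint32_t raw)
{
    int32_t value;
    memcpy(&value, &raw, sizeof value);
    return value;
}
```

Here `as_signed32(4294967283u)` evaluates to -13, which points at a small negative temperature stored in an unsigned field.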

https://github.com/cyring/CoreFreq/blob/7c8d354aeed79bcde59d2fcadbab899b60137a74/x86_64/coretypes.h#L709

Temp = ((Sensor * 5 / 40) - P1) - P2

P1 can be 49 or 0 depending on CurTempRangeSel bit
P2 is sourced from a table and, barring a bug, has to remain at 0
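
As a rough illustration of the formula above, here is a C sketch evaluated in signed arithmetic. It is not the CoreFreq macro: the function name `zen_temp_celsius` and the parameter names are assumptions, with `range_sel` standing for the CurTempRangeSel bit and `table_offset` for P2.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Illustrative only: Temp = ((Sensor * 5 / 40) - P1) - P2,
 * computed with a signed result so negative values stay negative. */
static int zen_temp_celsius(unsigned int sensor, bool range_sel, int table_offset)
{
    const int p1 = range_sel ? 49 : 0;   /* P1: 49 when the range bit is set */

    assert(sensor <= (unsigned int) INT_MAX / 5u);  /* keep sensor * 5 in range */
    return (int) (sensor * 5u / 40u) - p1 - table_offset;
}
```

For example, `zen_temp_celsius(680, true, 0)` evaluates to 36, and `zen_temp_celsius(288, true, 0)` evaluates to -13. The sensor values 680 and 288 are picked for illustration; they are not taken from the log.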

https://github.com/cyring/CoreFreq/blob/7c8d354aeed79bcde59d2fcadbab899b60137a74/x86_64/corefreqk.h#L7893

I have not found the cause yet, but somehow an unexpected sensor value entered the formula.

cyring commented 3 weeks ago

I think I got it. Any sensor value lower than 392 combined with a thermal offset of 49 leads to a negative temperature. It never happened with my hardware, but it can with yours.

(Screenshot: 2024-06-24-234622_639x432_scrot)

I have reproduced the issue by injecting those values in driver source code.

(Screenshot: 2024-06-24-235108_639x362_scrot)
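
The diagnosis in this comment can be sketched in C. This is a minimal illustration under stated assumptions, not the CoreFreq source and not necessarily what the actual fix does: all function names are hypothetical, the sensor value 288 is simply one value that reproduces the logged 4294967283, and clamping at zero is only one possible guard.

```c
#include <assert.h>
#include <stdint.h>

/* 392 is the smallest sensor value whose scaled reading reaches the offset. */
static_assert(392u * 5u / 40u == 49u, "threshold from the comment above");

/* Failure mode: the formula evaluated in unsigned arithmetic. For a
 * sensor value below 392 and an offset of 49, the subtraction wraps
 * modulo 2^32 instead of going negative. */
static uint32_t temp_unsigned(uint32_t sensor, uint32_t offset)
{
    return (sensor * 5u / 40u) - offset;
}

/* One possible guard (an assumption): clamp at zero instead of wrapping. */
static uint32_t temp_clamped(uint32_t sensor, uint32_t offset)
{
    const uint32_t scaled = sensor * 5u / 40u;
    return scaled > offset ? scaled - offset : 0u;
}

/* Hypothetical running maximum: a single wrapped sample beats every
 * later reading, so it stays latched until the statistics are reset. */
static uint32_t running_max(uint32_t current_max, uint32_t sample)
{
    return sample > current_max ? sample : current_max;
}
```

With these definitions, `temp_unsigned(288, 49)` gives 4294967283 while `temp_clamped(288, 49)` gives 0. A running maximum that has once seen 4294967283 keeps returning it for every later sample, which would match a Max column that stays stuck while the Min column remains plausible.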

cyring commented 3 weeks ago

@tofurky Hello

Please find a fix for you to test in commit 43b96ae67e05b8d978c72bd0b51f7e261a9f212b of the develop branch.

tofurky commented 3 weeks ago

Thanks very much for looking into this, I will test it now. Yes, the negative number makes a lot of sense.

cyring commented 3 weeks ago

Hi, do you have a status update, and how long has the fix been running without the issue?

tofurky commented 3 weeks ago

The issue hasn't happened again in the past 5 hours since updating.

cyring commented 3 weeks ago

Excellent, thank you

cyring commented 3 weeks ago

Hello,

We have an issue with Zen4 voltage in discussion #439. Could you join it and post the voltage readings from your Raphael 7950X?

First, the output of the general overview, like:

corefreq-cli -s -n -m -n -M -n -B -n -k

Next, for the 3 cases (idle, all cores stressed, single core stressed), the voltage output with at least 2 rounds of monitoring:

corefreq-cli -V 2

Thank you

tofurky commented 3 weeks ago

Still OK so far with the fix in place. I reloaded k10temp module to see if that changes things.

Info is here: https://github.com/cyring/CoreFreq/discussions/439#discussioncomment-9874109

cyring commented 3 weeks ago

Thank you very much. I hope it holds up alongside k10temp.

Feel free to close this issue when you are satisfied.

cyring / CoreFreq

Max value hits 429C after some time on 7950X #496