Closed tofurky closed 2 weeks ago
Hello,
Thank you for reporting this issue.
Sorry for all the details that follow; I need to walk through the thermal call flow, which starts in this Driver function
Temperature is indeed read from the SMU; the TCCD_REGISTER address is interlaced per CCD.
The SMU current temperature CurTmp has 49 subtracted from it whenever the CurTempRangeSel bit is set to one.
Then thermal data are processed by the Daemon:
and maths start here:
Temperature is finally computed by this macro:
Question: do you have any other thermal agent running in parallel with CoreFreq? For instance, the kernel module k10temp?
Yep, k10temp is loaded and polled every 5m by munin-node and much more frequently by MATE sensors applet (Tctl only). Possibly there's 2 calls at once sometimes? There have never been super high values reported by k10temp by munin, though.
k10temp-pci-00c3
Adapter: PCI adapter
Tctl: +51.5°C
Tccd1: +39.4°C
Tccd2: +38.8°C
The graph tends to squash any one time high readings, but each point below represents 15m (3x 5m readings) I think. System uptime from the previous screenshot showing 429C is 18 days uptime.
There's an access conflict to the SMU between corefreqk.ko and k10temp, which has to be unloaded.
FYI, you can quickly reproduce this conflict by lowering the collect interval to 100ms in corefreq-cli.
I haven't been able to reproduce it yet doing the following:
- Running the following to poll k10temp Tctl, Tccd1, Tccd2 as quickly as possible:
while true; do cat /sys/class/hwmon/hwmon4/temp{1,3,4}_input ; done
- Running corefreq-cli at 100ms interval
Been running for about 5m now, I'll let it run longer and see if I see the high temps.
Edit: after 45m, one core went to 429C. I'll continue testing without k10temp loaded.
I did not notice your issue edit.
If you confirm no other SMU agent is running in parallel, it may be that the issue only appears from Raphael onward.
Day to day, I program on AMD Matisse with particular kernel boot parameters documented in the Wiki.
Once in a while, especially before a release, I run full non-regression tests with most kernel modules loaded except k10temp and, occasionally, amd_pstate.
Now rebooting with all modules loaded ...
Ah, sorry about the edit, yeah I didn't want to generate too many e-mails.
I've left it running at 100ms overnight and one core has gone to 429C. k10temp isn't loaded.
matt@aquos:~$ lsmod |grep k10temp
matt@aquos:~$
I'm using amd_pstate=active for CPU frequency control; I imagine this might interact with the SMU in some way? It is available in kernel 6.5+ I believe.
Sorry, make that 2 cores. Needed to scroll down. I was thinking it was a bit odd that if there's 16 physical cores there'd be 32 separate temperature readings.
024 in the screenshot is CCD 1, ID 8 and shows 429C, but 008 is also CCD 1, ID 8 and doesn't show the high reading. Maybe this is just a quirk of how the temperature data is retrieved? Seems like they both should be at 429C though, if it's the same core. I guess CCD_AMD_Family_19h_Zen4_Temp() mentioned above doesn't take topology into account and lets the SMU return a separate temperature for each "core" even if it's just a thread versus physical core?
I have a 100ms session running on my Ryzen 9 3950X and the issue has not appeared yet.
But looking at your screenshots, can we say that 429 is only displayed in the Max TMP column?
Yes, it's only shown in the max column. I see 429C in max columns for cores 1, 3, 7, 11, 13, 16, 24, 26 now.
Depending on the firmware implementation, amd-pstate.c uses either ACPI CPPC or MSR registers (similar to Intel HWP). I have not seen any SMU function references in that driver, but it may reach the SMU indirectly.
Great, if I may say so; that narrows the investigation. It could be a math side effect in the computing macros.
Well, I haven't been watching it for hours so it's possible that it appears once in the TMP column before appearing in the Max column.
But it's always 429C, never anything in between such as 300C. There are no wrong MIN values, either.
I am running 'corefreq-cli -C > corefreq.out' so I can check if the instantaneous values show 429C.
In fact, sensors are retrieved depending on the sensor scope. In the UI Settings window you can change the thermal scope among 4 possibilities: None, Package, Core or SMT (Thread)
Tip: at any time press the star key * to reset values, including the Min and Max of each CPU sensor, frequencies and so on.
With an uptime of 11 hours, I have not been able to reproduce the issue.
I'm taking a closer look at the source code for values that your CPU can reach and mine does not.
Got it, so it's 429(4967283), should've realized it was 2^32. Well, 0xFFFFFFF3.
matt@aquos:~$ grep -C2 '4294967283' corefreq.out
009 6.92 244 1.5250 36 000000000000000276 0.004211426 0.042114258
010 6.73 244 1.5250 36 000000000000000277 0.004226685 0.042266846
011 9.31 244 1.5250 4294967283 000000000000000317 0.004837036 0.048370361
012 18.27 244 1.5250 36 000000000000000449 0.006851196 0.068511963
013 11.92 244 1.5250 36 000000000000000291 0.004440308 0.044403076
Congrats
4294967283
= (4294967295) - 12
= 0xFFFFFFFF - 0xC
= (-1U) - 0xC
Temp = ((Sensor * 5 / 40) - P1) - P2
P1 can be 49 or 0 depending on the CurTempRangeSel bit.
P2 is sourced from a table and, barring a bug, has to remain at 0.
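To make the formula concrete, here is a minimal sketch in signed arithmetic. The function name `zen_temp_celsius` and its parameters are made up for illustration; this is not the CoreFreq macro, only the same expression written out under the assumption that P1 is 49 when CurTempRangeSel is set and 0 otherwise.

```c
#include <stdbool.h>

/*
 * Hypothetical helper (not CoreFreq's macro): evaluates
 *   Temp = ((Sensor * 5 / 40) - P1) - P2
 * with a signed result, so a small sensor value stays a plain
 * negative number.
 */
int zen_temp_celsius(unsigned int sensor, bool range_sel, int p2)
{
    /* P1 is the 49 degree offset applied when CurTempRangeSel is set. */
    const int p1 = range_sel ? 49 : 0;

    /* Integer scaling: Sensor * 5 / 40 is Sensor / 8, truncated. */
    const int scaled = (int)(sensor * 5u / 40u);

    return (scaled - p1) - p2;
}
```

For example, a sensor value of 680 with the range bit set and P2 at 0 gives 680 * 5 / 40 = 85, then 85 - 49 = 36, which matches the 36 shown for the neighbouring CPUs in the output above.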
I have not found it yet, but somehow an unexpected sensor value ended up in the formula.
I think I got it.
Any sensor value lower than 392 combined with a thermal offset of 49 leads to a negative temperature, since 392 * 5 / 40 = 49. It never happened with my hardware but can with yours.
I have reproduced the issue by injecting those values in driver source code.
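A rough sketch of that failure mode, assuming the result is held in a 32-bit unsigned variable. The names `temp_unsigned` and `temp_clamped` are hypothetical, and the clamp is only one possible guard; it is not necessarily what the fix commit does. The sensor value 288 is picked for illustration because it reproduces the 4294967283 seen in the output (any value from 288 to 295 scales to 36).

```c
#include <stdint.h>

/*
 * Same expression as the formula, evaluated in unsigned 32-bit
 * arithmetic: when the scaled sensor is below the offset, the
 * subtraction wraps around modulo 2^32 instead of going negative.
 */
uint32_t temp_unsigned(uint32_t sensor, uint32_t p1, uint32_t p2)
{
    return ((sensor * 5u / 40u) - p1) - p2;
}

/*
 * One possible guard (an assumption, not the actual fix):
 * clamp at zero rather than letting the subtraction wrap.
 */
uint32_t temp_clamped(uint32_t sensor, uint32_t p1, uint32_t p2)
{
    const uint32_t scaled = sensor * 5u / 40u;
    const uint32_t offset = p1 + p2;

    return scaled > offset ? scaled - offset : 0u;
}
```

With these definitions, `temp_unsigned(288, 49, 0)` yields 36 - 49 wrapped to 4294967283 (0xFFFFFFF3), while `temp_unsigned(392, 49, 0)` is exactly 0, which is the boundary mentioned above.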
@tofurky Hello
Please find a fix for you to test in commit 43b96ae67e05b8d978c72bd0b51f7e261a9f212b of the develop branch.
The issue hasn't happened again in the past 5 hours since updating.
Excellent, thank you
Hello,
We have an issue with Zen4 voltage in discussion #439. Could you join it and post the voltage from your Raphael 7950X?
First the output of the general overview like:
corefreq-cli -s -n -m -n -M -n -B -n -k
Next, for each of the 3 cases (idle, all cores stressed, single core stressed), the voltage output with at least 2 rounds of monitoring:
corefreq-cli -V 2
Thank you
Still OK so far with the fix in place. I reloaded k10temp module to see if that changes things.
Info is here: https://github.com/cyring/CoreFreq/discussions/439#discussioncomment-9874109
Thank you very much. I hope it will hold up alongside k10temp.
Feel free to close that issue when satisfied
After running for a while, several cores will eventually get a bad reading of 429C and stick there until corefreqd is restarted.
Ryzen 7950X w/PBO enabled on ASUS ProArt X670E-Creator, kernel 6.6.32 built with gcc 13.2.0 on Ubuntu 24.04.
CoreFreq 117b8c04ef0ef18f877472c268aadc74dd3cffb4 from Wed Feb 28 02:40:18 2024 +0100. Currently testing with the tip of master, but the issue takes some time to show up and I'll update if/when it does.
System PBO offsets were tuned to undervolt most cores and overvolt 1 for stability. RAM is DDR5-6000 with Infinity Fabric at UCLK/1. UEFI is latest available with AGESA 1.1.7.0. No application crashes or other issues seen in dmesg. This issue has been happening as long as I've been letting corefreqd run continuously, so since last August according to .bash_history.
Unsure if this means there's an occasional bad read from the SMU or there's something else at play? Please let me know if you need any other information, thanks.
Screenshot:![image](https://github.com/cyring/CoreFreq/assets/4065513/c53d9f35-94c0-4ee8-9a28-53f96f621e2a)