Open B0ndo2 opened 8 months ago
Which hwmon input is producing those numbers? There was some discussion about blacklisting in #25, but I'd be intrigued to find out which sensor is actually producing these values, especially since it's not static in this case.
I don't know which ones. How do I find out?
@B0ndo2
Try the attached script, maybe it can help: get_temps.txt
Output looks like this:
Directory: /sys/class/hwmon/hwmon0/
Name: BAT0
Directory: /sys/class/hwmon/hwmon1/
Name: nvme
temp1_input: 32850 (Composite)
temp3_input: 67850 (Sensor 2)
Directory: /sys/class/hwmon/hwmon2/
Name: amdgpu
temp1_input: 43000 (edge)
Directory: /sys/class/hwmon/hwmon3/
Name: AC
Directory: /sys/class/hwmon/hwmon4/
Name: acpitz
temp1_input: 44000
Directory: /sys/class/hwmon/hwmon5/
Name: k10temp
temp1_input: 44375 (Tctl)
Directory: /sys/class/hwmon/hwmon6/
Name: thinkpad
temp1_input: 44000 (CPU)
cat: /sys/class/hwmon/hwmon6/temp2_input: No such device or address
temp2_input: (GPU)
temp3_input: 44000
temp4_input: 0
temp5_input: 44000
temp6_input: 44000
temp7_input: 44000
temp8_input: 0
Directory: /sys/class/hwmon/hwmon7/
Name: ath11k_hwmon
temp1_input: 41000
(Huh, it seems like /sys/class/hwmon/hwmon6/temp2_input is not readable on my system...)
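For anyone who cannot open the attachment, here is a minimal sketch of the same idea in Python. It is not the attached get_temps script; the function names `read_text` and `list_hwmon_temps` are made up for illustration. It walks the hwmon directories, reads each `name`, `tempN_input` and `tempN_label` file, and records `None` for any file that cannot be read (such as the "No such device or address" case above).

```python
from pathlib import Path


def read_text(path):
    """Return the stripped contents of a file, or None if it cannot be read."""
    try:
        return path.read_text().strip()
    except OSError:
        # Covers missing files and sensors that refuse to be read.
        return None


def list_hwmon_temps(root="/sys/class/hwmon"):
    """Return [(directory, chip_name, [(input_file, millidegrees, label), ...]), ...]."""
    result = []
    for hwmon in sorted(Path(root).glob("hwmon*")):
        chip_name = read_text(hwmon / "name")
        temps = []
        for temp_file in sorted(hwmon.glob("temp*_input")):
            raw = read_text(temp_file)
            # Values are in millidegrees Celsius; keep None when unreadable.
            is_number = raw is not None and raw.lstrip("-").isdigit()
            value = int(raw) if is_number else None
            label_file = temp_file.with_name(temp_file.name.replace("_input", "_label"))
            temps.append((temp_file.name, value, read_text(label_file)))
        result.append((hwmon.name, chip_name, temps))
    return result
```

Looping over `list_hwmon_temps()` and printing each tuple gives roughly the same listing as the output above.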
Here is the output
Directory: /sys/class/hwmon/hwmon0/
Name: AC
Directory: /sys/class/hwmon/hwmon1/
Name: acpitz
temp1_input: 65000
Directory: /sys/class/hwmon/hwmon2/
Name: BAT0
Directory: /sys/class/hwmon/hwmon3/
Name: nvme
temp1_input: 49850 (Composite)
temp2_input: 49850 (Sensor 1)
temp3_input: 45850 (Sensor 2)
Directory: /sys/class/hwmon/hwmon4/
Name: ucsi_source_psy_USBC000:001
Directory: /sys/class/hwmon/hwmon5/
Name: ucsi_source_psy_USBC000:002
Directory: /sys/class/hwmon/hwmon6/
Name: thinkpad
temp1_input: 65000 (CPU)
cat: /sys/class/hwmon/hwmon6/temp2_input: No such device or address
temp2_input: (GPU)
temp3_input: 47000
temp4_input: 0
temp5_input: 37000
temp6_input: 57000
temp7_input: 54000
cat: /sys/class/hwmon/hwmon6/temp8_input: No such device or address
temp8_input:
Directory: /sys/class/hwmon/hwmon7/
Name: coretemp
temp1_input: 58000 (Package id 0)
temp2_input: 58000 (Core 0)
temp6_input: 58000 (Core 4)
Directory: /sys/class/hwmon/hwmon8/
Name: iwlwifi_1
temp1_input: 53000
I suspect I have the same issue. There seems to be a mysterious high temperature being detected that isn't shown anywhere else. For example:
Jul 26 15:33:40 swarman-ThinkPad-P1-Gen-7 zcfan[129732]: [FAN] Temperature now 79C, fan set to medium
Jul 26 15:34:00 swarman-ThinkPad-P1-Gen-7 zcfan[129732]: [FAN] Temperature now 50C, fan set to low
Jul 26 15:40:52 swarman-ThinkPad-P1-Gen-7 zcfan[129732]: [FAN] Temperature now 72C, fan set to medium
Jul 26 15:45:12 swarman-ThinkPad-P1-Gen-7 zcfan[129732]: [FAN] Temperature now 50C, fan set to low
Jul 26 15:49:47 swarman-ThinkPad-P1-Gen-7 zcfan[129732]: [FAN] Temperature now 73C, fan set to medium
Checking the sensors around the same time as the last entry shows:
Fri 26 Jul 15:52:12 BST 2024
iwlwifi_1-virtual-0
Adapter: Virtual device
temp1: +45.0°C
ucsi_source_psy_USBC000:003-isa-0000
Adapter: ISA adapter
in0: 0.00 V (min = +0.00 V, max = +0.00 V)
curr1: 0.00 A (max = +0.00 A)
ucsi_source_psy_USBC000:001-isa-0000
Adapter: ISA adapter
in0: 0.00 V (min = +0.00 V, max = +0.00 V)
curr1: 0.00 A (max = +0.00 A)
BAT0-acpi-0
Adapter: ACPI interface
in0: 17.74 V
thinkpad-isa-0000
Adapter: ISA adapter
fan1: 3642 RPM
fan2: 3725 RPM
CPU: +51.0°C
GPU: +46.0°C
temp3: +48.0°C
temp4: +0.0°C
temp5: +46.0°C
temp6: +47.0°C
temp7: +38.0°C
temp8: N/A
coretemp-isa-0000
Adapter: ISA adapter
Package id 0: +52.0°C (high = +110.0°C, crit = +110.0°C)
Core 0: +46.0°C (high = +110.0°C, crit = +110.0°C)
Core 1: +46.0°C (high = +110.0°C, crit = +110.0°C)
Core 2: +46.0°C (high = +110.0°C, crit = +110.0°C)
Core 3: +46.0°C (high = +110.0°C, crit = +110.0°C)
Core 4: +45.0°C (high = +110.0°C, crit = +110.0°C)
Core 5: +45.0°C (high = +110.0°C, crit = +110.0°C)
Core 6: +45.0°C (high = +110.0°C, crit = +110.0°C)
Core 7: +45.0°C (high = +110.0°C, crit = +110.0°C)
Core 8: +46.0°C (high = +110.0°C, crit = +110.0°C)
Core 12: +45.0°C (high = +110.0°C, crit = +110.0°C)
Core 16: +46.0°C (high = +110.0°C, crit = +110.0°C)
Core 20: +46.0°C (high = +110.0°C, crit = +110.0°C)
Core 24: +47.0°C (high = +110.0°C, crit = +110.0°C)
Core 28: +47.0°C (high = +110.0°C, crit = +110.0°C)
Core 32: +50.0°C (high = +110.0°C, crit = +110.0°C)
Core 33: +50.0°C (high = +110.0°C, crit = +110.0°C)
ucsi_source_psy_USBC000:002-isa-0000
Adapter: ISA adapter
in0: 0.00 V (min = +0.00 V, max = +0.00 V)
curr1: 0.00 A (max = +0.00 A)
nvme-pci-0400
Adapter: PCI adapter
Composite: +33.9°C (low = -20.1°C, high = +77.8°C)
(crit = +81.8°C)
Sensor 1: +33.9°C (low = -273.1°C, high = +65261.8°C)
acpitz-acpi-0
Adapter: ACPI interface
temp1: +51.0°C (crit = +108.0°C)
Or via cat /proc/acpi/ibm/thermal:
Fri 26 Jul 15:53:57 BST 2024
temperatures: 53 47 49 0 47 49 38 -128
Nothing seems to be close to the ~70 degree temp being reported. Do you have any ideas where it might be coming from?
edit: After watching the output of the script given above, I think what happens is there are very short temperature spikes. I'm not sure if they're real or errors in the sensor data.
Observing just the temp1 of the CPU:
while true; do date && cat /sys/class/hwmon/hwmon7/temp1_input; sleep 0.1; done;
For less than a second the temp seems to jump up and down several degrees:
Fri 26 Jul 17:16:13 BST 2024
67000
Fri 26 Jul 17:16:13 BST 2024
67000
Fri 26 Jul 17:16:13 BST 2024
81000
Fri 26 Jul 17:16:13 BST 2024
81000
Fri 26 Jul 17:16:13 BST 2024
81000
Fri 26 Jul 17:16:13 BST 2024
81000
Fri 26 Jul 17:16:13 BST 2024
81000
Fri 26 Jul 17:16:13 BST 2024
81000
Fri 26 Jul 17:16:13 BST 2024
81000
Fri 26 Jul 17:16:13 BST 2024
81000
Fri 26 Jul 17:16:14 BST 2024
81000
Fri 26 Jul 17:16:14 BST 2024
81000
Fri 26 Jul 17:16:14 BST 2024
67000
If zcfan notices this jump it'll often kick the fan into a higher mode for a while. But it doesn't seem necessary to do so. I wonder if it should keep a running 1-2 second average to avoid spikes?
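To make the running-average idea concrete, here is a minimal sketch in Python. It is not zcfan code, and the class name `RollingAverage` and the `window_s` parameter are hypothetical. It keeps timestamped readings and returns the mean of those seen within the last `window_s` seconds.

```python
from collections import deque


class RollingAverage:
    """Mean of the readings seen within the last `window_s` seconds."""

    def __init__(self, window_s=2.0):
        self.window_s = window_s
        self.samples = deque()  # (timestamp_s, millidegrees)

    def add(self, timestamp_s, value):
        """Record a reading and return the average over the current window."""
        self.samples.append((timestamp_s, value))
        # Drop readings that are older than the window.
        while timestamp_s - self.samples[0][0] > self.window_s:
            self.samples.popleft()
        return sum(v for _, v in self.samples) / len(self.samples)
```

Note that a spike shorter than the window is only attenuated, not removed, so the window length would need tuning against how long the spikes actually last.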
Hey @stefancircuit
Wow, that is strange.
Are you saying you measured a temp difference from 67 to 81, within 100ms? That sounds rather impossible...? I would think the thermal mass...
Were you running any specific load on the system at the time?
What is your hwmon7's name? (you can run my script posted above). Mine is "ath11k_hwmon", which is wifi.
(Lots of docs suggest the number of the "hwmon" sensors can jump between boots... I don't think I've observed that however).
Anyway - if your hwmon7 is your wifi... it might not be covered by the CPU cooling heatpipes... like mine (T16 G1 AMD): https://laptopmedia.com/wp-content/uploads/2022/08/internals-1000x711.jpg
Sooo, if that is the case, you may have a similar problem to what I had: needing to be able to blacklist a sensor from zcfan's fan-management algorithm, which just takes the highest temperature of any sensor and uses that to set the fan speed.
See: https://github.com/cdown/zcfan/issues/25
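A rough sketch of what "highest temperature of any sensor, minus a blacklist" could look like, in Python. This is an illustration only, not zcfan's actual code or an existing zcfan option; the function name `hottest_sensor` and the "chip/input" sensor ids are made up.

```python
def hottest_sensor(readings, excluded=frozenset()):
    """Return (sensor_id, millidegrees) for the hottest non-excluded sensor.

    `readings` maps a sensor id to millidegrees, or to None if unreadable.
    Returns None when no usable reading is left.
    """
    candidates = {
        sensor: value
        for sensor, value in readings.items()
        if value is not None and sensor not in excluded
    }
    if not candidates:
        return None
    sensor = max(candidates, key=candidates.get)
    return sensor, candidates[sensor]
```

With the values from my output above, the nvme "Sensor 2" reading (67850) would win unless it is excluded, in which case k10temp (44375) drives the fan instead.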
I've reached a dead end with my issue... My SSD exposed some additional dud temp sensor, that was always stuck on the same temp, and causing issues in the zcfan algo...
In the end I modified zcfan to hardcode an exclusion on that one sensor on my laptop, and that solved the algo issue, but it exposed a new issue: when setting the desired fan level by writing to /proc/acpi/ibm/fan, the whole system may crash. At random.
It is actually something I observed with zcfan and my experimentation with it... I thought it was my own bad code (but how... zcfan is in userspace...) - but later I wrote my own "zcfan" in python to read sensors, accommodate sensor blacklisting, compute brackets, and set the fan level accordingly. It worked great, except my system would still hard crash - at random. No amount of playing with the intervals of writing the fan level or the watchdog timer, could fix this.
I don't know how to troubleshoot this further. I'm guessing this issue might be unique to the combination of the IBM ACPI driver and my motherboard/BIOS, else zcfan would not work for anyone...
Actually, the original problem I had, that led me to try zcfan, was that the auto fan speed control of the system had an issue... Most of the time it would be fine, but then sometimes it would get into a loop of spinning up and down, over and over, quite fast. Probably up and down in about 5 seconds. Over and over. Even with no load on the system.
I think fundamentally, the IBM ACPI driver, which does run in kernel-space, is not that good, and it leads to these issues we've seen.
Hmmm, come to think of it, the fluctuating fan speed issue I had is kinda similar to what you and @B0ndo2 reported, but maybe at a faster pace? Maybe it has to do with the polling frequency set up in zcfan... maybe it's fundamentally driven by the same issue.
I'm not sure how to troubleshoot this issue further, or where to go for help.
@stefancircuit - have you had any random system crashes while experimenting with zcfan?
If not - you can try that sensor blacklisting route... or if your hwmon7 temp1 is some part of your CPU/GPU, and you do want it to drive your fan speed... maybe you can pre-process those readings via a moving average or a low-pass-filter or something like that to smooth out the bumps...
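As an illustration of the low-pass-filter option, here is a small exponential moving average in Python. It is a sketch under my own assumptions, not something zcfan provides; the class name `LowPassFilter` and the `alpha` parameter are hypothetical.

```python
class LowPassFilter:
    """First-order low-pass filter (exponential moving average)."""

    def __init__(self, alpha=0.2):
        # alpha near 0 smooths heavily; alpha == 1 passes readings through.
        if not 0.0 < alpha <= 1.0:
            raise ValueError("alpha must be in (0, 1]")
        self.alpha = alpha
        self.state = None

    def update(self, value):
        """Feed one reading (millidegrees) and return the smoothed value."""
        if self.state is None:
            self.state = float(value)  # first reading seeds the filter
        else:
            self.state += self.alpha * (value - self.state)
        return self.state
```

Compared with a windowed average, this only needs to remember one number, at the cost of old readings never dropping out completely.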
Last question - which version of the IBM ACPI driver are you running? You can check with: cat /proc/acpi/ibm/driver
I'm on 0.26
Thanks!
Hi, thanks for your response @rudolf81. I was not running any particular load, just idling. hwmon7 is the CPU for me:
Directory: /sys/class/hwmon/hwmon7/
Name: coretemp
temp1_input: 47000 (Package id 0)
temp2_input: 41000 (Core 0)
temp3_input: 41000 (Core 1)
temp4_input: 40000 (Core 2)
temp5_input: 41000 (Core 3)
temp6_input: 40000 (Core 4)
temp7_input: 41000 (Core 5)
temp8_input: 40000 (Core 6)
temp9_input: 41000 (Core 7)
My assumption was that temp1 was the closest thing to the "overall" CPU temperature, however it might just be the max of all the cores or something. So I don't think blacklisting would work in my case.
I don't get any crashes, just these (possibly spurious) sub-second temperature spikes that push the fan speed up randomly.
The acpi version is as follows:
sudo cat /proc/acpi/ibm/driver
driver: ThinkPad ACPI Extras
version: 0.26
But yeah, it seems like a rolling average would help smooth out anomalies.
That being said I'm currently running Ubuntu 22, and I tried 24 over the weekend which seems to just fix the problems. I'm not sure why, or what changed but the temps are down and the fan stays mostly off without extra tools. This is a very new laptop model (P1 Gen7) so maybe it just needs whatever mysterious packages exist in the newer OS. :man_shrugging:
I'll still keep using zcfan for a bit as I cannot upgrade fully yet, but hopefully that will be the longer term solution.
Hi @stefancircuit
Interesting.
You would think that the temperatures you get, via querying sysfs or via procfs, somehow come directly from the actual sensors of the components.
Updating some OS packages is not likely to alter what those readings are? (...unless they already have some smoothing algo applied? But you'd think that would be a concern one layer above, not from the actual sensors themselves...)
Anyway - when you get back into Ubuntu 24 - it would be awesome if you could share the version of the ThinkPad ACPI Extras driver.
Thanks.
I have a ThinkPad T14s Gen 3 where I installed zcfan. The fan is cycling like crazy. I am monitoring the CPU and GPU temperature and they never reached 70 or 61. I also feel that the fan always runs at high speed.
Mar 13 16:21:58 XX zcfan[11590]: [FAN] Temperature now 63C, fan set to low
Mar 13 16:22:26 XX zcfan[11590]: [FAN] Temperature now 50C, fan set to off
Mar 13 16:24:52 XX zcfan[11590]: [FAN] Temperature now 76C, fan set to medium
Mar 13 16:24:56 XX zcfan[11590]: [FAN] Temperature now 47C, fan set to off
Mar 13 16:25:45 XX zcfan[11590]: [FAN] Temperature now 70C, fan set to low
Mar 13 16:25:49 XX zcfan[11590]: [FAN] Temperature now 48C, fan set to off
Mar 13 16:25:59 XX zcfan[11590]: [FAN] Temperature now 61C, fan set to low
Mar 13 16:26:02 XX zcfan[11590]: [FAN] Temperature now 48C, fan set to off
Mar 13 16:26:22 XX zcfan[11590]: [FAN] Temperature now 61C, fan set to low
Mar 13 16:26:25 XX zcfan[11590]: [FAN] Temperature now 47C, fan set to off
zcfan.conf
max_temp 85
med_temp 75
low_temp 60
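For reference, here is how those thresholds appear to map onto the log lines, as a Python sketch. It is a simplification I am assuming from the log, not zcfan's actual implementation: it ignores any hysteresis, the `>=` comparison is a guess, and the function name `fan_level` and the "max" level name are hypothetical.

```python
def fan_level(temp_c, low_temp=60, med_temp=75, max_temp=85):
    """Map a temperature in degrees Celsius to a fan level name."""
    if temp_c >= max_temp:
        return "max"
    if temp_c >= med_temp:
        return "medium"
    if temp_c >= low_temp:
        return "low"
    return "off"
```

Under these assumptions the logged 63C, 70C and 61C readings land in "low", 76C in "medium", and 47C to 50C in "off", which matches the log. So the cycling looks like threshold logic reacting to brief high readings, rather than a misconfiguration.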