Closed adrfantini closed 5 years ago
Is this a new problem with version 1.2.9 or did it already occur before?
The fan set at 100% speed is a safety feature: if the daemon encounters a fatal error it sets maximum speed before exiting.
In your case the daemon does not seem to stop after the error, there are 2 different problems here.
I did not see it before 1.2.9, I believe it has been introduced with it. However, I did some setup changes right around the update time so cannot be 100% certain. I am honestly fine with the daemon being solid against these errors: they are often transient and caused, I guess, by some glitch in hddtemp
.
Can you run nc 127.0.0.1 7634
and post the output?
You may need to install the gnu-netcat package.
|/dev/disk/by-id/ata-HUP722020APA330_BFJ0WS3F|HUP722020APA330|SLP|*||/dev/disk/by-id/ata-HUP722020APA330_BFGWU7WF|HUP722020APA330|35|C||/dev/disk/by-id/ata-WDC_WD10EARS-00Y5B1_WD-WCAV5R693922|WDC WD10EARS-00Y5B1|30|C|
Ok, now can you run:
nc 127.0.0.1 7634; sudo hdparm -C /dev/disk/by-id/ata-HUP722020APA330_BFJ0WS3F
EDIT: And also: sudo hdparm -H /dev/disk/by-id/ata-HUP722020APA330_BFJ0WS3F
> nc 127.0.0.1 7634
|/dev/disk/by-id/ata-HUP722020APA330_BFJ0WS3F|HUP722020APA330|SLP|*||/dev/disk/by-id/ata-HUP722020APA330_BFGWU7WF|HUP722020APA330|SLP|*||/dev/disk/by-id/ata-WDC_WD10EARS-00Y5B1_WD-WCAV5R693922|WDC WD10EARS-00Y5B1|33|C|
> sudo hdparm -C /dev/disk/by-id/ata-HUP722020APA330_BFJ0WS3F
/dev/disk/by-id/ata-HUP722020APA330_BFJ0WS3F:
drive state is: standby
> sudo hdparm -H /dev/disk/by-id/ata-HUP722020APA330_BFJ0WS3F
/dev/disk/by-id/ata-HUP722020APA330_BFJ0WS3F:
drive temperature (celsius) is: 26
drive temperature in range: yes
Thanks, and last, when the daemon starts, do you have a log line with Drive does not allow querying temperature without going out of low power mode.
? You can check with sudo journalctl -u hddfancontrol | grep ...
Thank you. Yes, I get the warning:
WARNING [sda WDC WD10EARS-00Y5B1] Drive does not allow querying temperature without going out of low power mode.
This however happens for the WD green drive only, not for the Hitachi drives you asked me about earlier. Here is the same info for the WD:
> sudo hdparm -CH /dev/disk/by-id/ata-WDC_WD10EARS-00Y5B1_WD-WCAV5R693922
/dev/disk/by-id/ata-WDC_WD10EARS-00Y5B1_WD-WCAV5R693922:
SG_IO: bad/missing sense data, sb[]: 70 00 05 00 00 00 00 0a 04 51 40 00 21 04 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
drive temperature (celsius) is: -18
drive temperature in range: yes
drive state is: active/idle
If I ask hddtemp
directly:
/dev/sda: WDC WD10EARS-00Y5B1: 34°C
/dev/sdb: HUP722020APA330: 28°C
/dev/sdc: HUP722020APA330: 32°C
/dev/sdd: SK hynix SC215 M.2 2242: S.M.A.R.T. not available
WARNING: Drive /dev/sde doesn't seem to have a temperature sensor.
WARNING: This doesn't mean it hasn't got one.
WARNING: If you are sure it has one, please contact me (hddtemp@guzu.net).
WARNING: See --help, --debug and --drivebase options.
/dev/sde: SAMSUNG MMCRE64G5MXP-0VB : no sensor
I did a quick fix in https://github.com/desbma/hddfancontrol/commit/c9530d1903d4c3c59252a1f5089b366a4635b011, can you test it?
A quick and dirty way for you to test:
cp -v /usr/lib/python3.7/site-packages/hddfancontrol/__init__.py{,.orig}
curl https://raw.githubusercontent.com/desbma/hddfancontrol/c9530d1903d4c3c59252a1f5089b366a4635b011/hddfancontrol/__init__.py > /usr/lib/python3.7/site-packages/hddfancontrol/__init__.py
systemctl restart hddfancontrol
Thanks, that's great! I'm testing, let's give it a couple of days to see if the problem reappears or not.
@adrfantini So what do your NAS logs say?
The error did not present itself again, apparently. It might be a statistical fluctuation, but usually in 2.5 days it should have happened a couple of times. I think it's fixed! Feel free to close, I believe, I'll continue monitoring and will reopen if it pops up again. Thanks a lot!
Great!
Installed some new drives, and apparently it happens again. I am using the cpu-temp branch, but this is not related to -c
.
Complete output:
2019-02-06 21:18:48,788 INFO [Main] Process real time scheduler set to 2, priority 49
2019-02-06 21:18:48,898 WARNING [sdb ST3000VX010-2E3166] Drive does not allow querying temperature without going out of low power mode.
2019-02-06 21:18:49,028 WARNING [sdc ST3000VX010-2E3166] Drive does not allow querying temperature without going out of low power mode.
2019-02-06 21:18:49,118 WARNING [sda WDC WD10EARS-00Y5B1] Drive does not allow querying temperature without going out of low power mode.
2019-02-06 21:18:49,133 DEBUG [sdb ST3000VX010-2E3166] Drive state: ACTIVE_IDLE
2019-02-06 21:18:49,134 DEBUG [sdc ST3000VX010-2E3166] Drive state: ACTIVE_IDLE
2019-02-06 21:18:49,134 DEBUG [sdb ST3000VX010-2E3166] Drive state: ACTIVE_IDLE
2019-02-06 21:18:49,136 DEBUG [DriveSpinDownThread-sdb ST3000VX010-2E3166] Sleeping for 60 seconds
2019-02-06 21:18:49,137 DEBUG [DriveSpinDownThread-sdc ST3000VX010-2E3166] Sleeping for 60 seconds
2019-02-06 21:18:49,137 DEBUG [sda WDC WD10EARS-00Y5B1] Drive state: STANDBY
2019-02-06 21:18:49,138 DEBUG [DriveSpinDownThread-sda WDC WD10EARS-00Y5B1] Drive is already sleeping
2019-02-06 21:18:49,138 DEBUG [Main] Drive sdb ST3000VX010-2E3166 is in low power state, unable to query temperature
2019-02-06 21:18:49,138 DEBUG [DriveSpinDownThread-sda WDC WD10EARS-00Y5B1] Sleeping for 60 seconds
2019-02-06 21:18:49,143 DEBUG [sdc ST3000VX010-2E3166] Drive state: ACTIVE_IDLE
2019-02-06 21:18:49,145 ERROR [Main] RuntimeError: hddtemp daemon is not returning temp as Celsius
2019-02-06 21:18:49,145 INFO [Fan #1] Setting fan speed to 100%
2019-02-06 21:18:49,146 DEBUG [DriveSpinDownThread-sdc ST3000VX010-2E3166] Sleep interrupted
2019-02-06 21:18:49,146 INFO [DriveSpinDownThread-sdc ST3000VX010-2E3166] Exiting
2019-02-06 21:18:49,146 DEBUG [DriveSpinDownThread-sda WDC WD10EARS-00Y5B1] Sleep interrupted
2019-02-06 21:18:49,146 DEBUG [DriveSpinDownThread-sdb ST3000VX010-2E3166] Sleep interrupted
2019-02-06 21:18:49,147 INFO [DriveSpinDownThread-sda WDC WD10EARS-00Y5B1] Exiting
2019-02-06 21:18:49,147 INFO [DriveSpinDownThread-sdb ST3000VX010-2E3166] Exiting
2019-02-06 21:18:49,150 WARNING [Fan #1] /run/hddfancontrol_fan_hwmon/pwm2_enable was 0, setting it to 1
2019-02-06 21:18:49,150 DEBUG [Fan #1] Setting PWM value to 255
hddtemp
directly:
> hddtemp /dev/sd?
/dev/sda: WDC WD10EARS-00Y5B1: drive is sleeping
/dev/sdb: ST3000VX010-2E3166: drive is sleeping
/dev/sdc: ST3000VX010-2E3166: drive is sleeping
/dev/sdd: ASMT 2115: drive supported, but it doesn't have a temperature sensor.
hddtemp
socket:
> nc localhost 7634 |sed 's/|//m' | sed 's/||/ \n/g' | awk -F'|' '{print $1 " " $3 " " $4}'
/dev/disk/by-id/ata-ST3000VX010-2E3166_Z730QNWR SLP *
/dev/disk/by-id/ata-ST3000VX010-2E3166_Z730R164 ERR *
/dev/disk/by-id/ata-WDC_WD10EARS-00Y5B1_WD-WCAV5R693922 SLP *
hdparm
:
> hdparm -CH /dev/disk/by-id/ata-ST3000VX010-2E3166_Z???????
/dev/disk/by-id/ata-ST3000VX010-2E3166_Z730QNWR:
SG_IO: bad/missing sense data, sb[]: 70 00 05 00 00 00 00 0a 04 51 40 00 21 04 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
drive temperature (celsius) is: -18
drive temperature in range: yes
drive state is: active/idle
/dev/disk/by-id/ata-ST3000VX010-2E3166_Z730R164:
SG_IO: bad/missing sense data, sb[]: 70 00 05 00 00 00 00 0a 04 51 40 00 21 04 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
drive temperature (celsius) is: -18
drive temperature in range: yes
drive state is: active/idle
Can you run hdparm -C /dev/disk/by-id/ata-ST3000VX010-2E3166_Z730R164
just when you reproduce the issue?
I am trying to reproduce it, but it only seems to happen at times. I'll keep you informed.
Happened again. Right now it is reproducible. All relevant diagnostic commands below, ran one after the other:
> hddfancontrol -d /dev/disk/by-id/ata-ST3000VX010-2E3166_Z730QNWR /dev/disk/by-id/ata-ST3000VX010-2E3166_Z730R164 /dev/disk/by-id/ata-WDC_WD10EARS-00Y5B1_WD-WCAV5R693922 --min-temp 32 --max-temp 42 -p /run/hddfancontrol_fan_hwmon/pwm2 --pwm-start-value 110 --pwm-stop-value 90 --min-fan-speed-prct 0 -i 60 --spin-down-time 600 --hddtemp-daemon -v debug ; hdparm -C /dev/disk/by-id/{ata-ST3000VX010-2E3166_Z730QNWR,ata-WDC_WD10EARS-00Y5B1_WD-WCAV5R693922,ata-ST3000VX010-2E3166_Z730R164} ; hddtemp /dev/sd? ; nc localhost 7634 |sed 's/|//m' | sed 's/||/ \n/g' | awk -F'|' '{print $1 " " $3 " " $4}'
2019-02-07 10:53:34,409 INFO [Main] Process real time scheduler set to 2, priority 49
2019-02-07 10:53:34,616 WARNING [sdb ST3000VX010-2E3166] Drive does not allow querying temperature without going out of low power mode.
2019-02-07 10:53:35,798 WARNING [sdc ST3000VX010-2E3166] Drive does not allow querying temperature without going out of low power mode.
2019-02-07 10:53:35,876 WARNING [sda WDC WD10EARS-00Y5B1] Drive does not allow querying temperature without going out of low power mode.
2019-02-07 10:53:35,894 DEBUG [sdb ST3000VX010-2E3166] Drive state: ACTIVE_IDLE
2019-02-07 10:53:35,894 DEBUG [sda WDC WD10EARS-00Y5B1] Drive state: STANDBY
2019-02-07 10:53:35,895 DEBUG [DriveSpinDownThread-sda WDC WD10EARS-00Y5B1] Drive is already sleeping
2019-02-07 10:53:35,895 DEBUG [DriveSpinDownThread-sda WDC WD10EARS-00Y5B1] Sleeping for 60 seconds
2019-02-07 10:53:35,896 DEBUG [DriveSpinDownThread-sdb ST3000VX010-2E3166] Sleeping for 60 seconds
2019-02-07 10:53:35,896 DEBUG [sdb ST3000VX010-2E3166] Drive state: ACTIVE_IDLE
2019-02-07 10:53:35,897 DEBUG [sdb ST3000VX010-2E3166] Drive temperature: 24 °C
2019-02-07 10:53:35,977 DEBUG [sdc ST3000VX010-2E3166] Drive state: ACTIVE_IDLE
2019-02-07 10:53:35,978 DEBUG [sdc ST3000VX010-2E3166] Drive state: ACTIVE_IDLE
2019-02-07 10:53:35,979 ERROR [Main] RuntimeError: hddtemp daemon is not returning temp as Celsius
2019-02-07 10:53:35,979 INFO [Fan #1] Setting fan speed to 100%
2019-02-07 10:53:35,979 DEBUG [DriveSpinDownThread-sdb ST3000VX010-2E3166] Sleep interrupted
2019-02-07 10:53:35,980 INFO [DriveSpinDownThread-sdb ST3000VX010-2E3166] Exiting
2019-02-07 10:53:35,980 DEBUG [DriveSpinDownThread-sda WDC WD10EARS-00Y5B1] Sleep interrupted
2019-02-07 10:53:35,980 WARNING [Fan #1] /run/hddfancontrol_fan_hwmon/pwm2_enable was 0, setting it to 1
2019-02-07 10:53:35,981 DEBUG [DriveSpinDownThread-sdc ST3000VX010-2E3166] Sleeping for 60 seconds
2019-02-07 10:53:35,981 INFO [DriveSpinDownThread-sda WDC WD10EARS-00Y5B1] Exiting
2019-02-07 10:53:35,982 DEBUG [DriveSpinDownThread-sdc ST3000VX010-2E3166] Sleep interrupted
2019-02-07 10:53:35,983 DEBUG [Fan #1] Setting PWM value to 255
2019-02-07 10:53:35,983 INFO [DriveSpinDownThread-sdc ST3000VX010-2E3166] Exiting
/dev/disk/by-id/ata-ST3000VX010-2E3166_Z730QNWR:
drive state is: active/idle
/dev/disk/by-id/ata-WDC_WD10EARS-00Y5B1_WD-WCAV5R693922:
drive state is: standby
/dev/disk/by-id/ata-ST3000VX010-2E3166_Z730R164:
drive state is: active/idle
/dev/sda: WDC WD10EARS-00Y5B1: drive is sleeping
/dev/sdb: ST3000VX010-2E3166: 24°C
/dev/sdc: ST3000VX010-2E3166: 23°C
/dev/sdd: ASMT 2115: drive supported, but it doesn't have a temperature sensor.
/dev/disk/by-id/ata-ST3000VX010-2E3166_Z730QNWR 24 C
/dev/disk/by-id/ata-ST3000VX010-2E3166_Z730R164 ERR *
/dev/disk/by-id/ata-WDC_WD10EARS-00Y5B1_WD-WCAV5R693922 SLP *
The 2 Seagate drives are currently under heavy use (btrfs scrub
).
The drive /dev/disk/by-id/ata-ST3000VX010-2E3166_Z730R164
looks very suspicious to me.
It reports itself as active/idle, but hddtemp fails to get its temperature.
Does it report any SMART errors? You can check with:
sudo smartctl -t short /dev/disk/by-id/ata-ST3000VX010-2E3166_Z730R164
... wait for 1 or 2 minutes
sudo smartctl -H -l selftest -l error /dev/disk/by-id/ata-ST3000VX010-2E3166_Z730R164
Both Seagate drives are identical and brand new and I tested them with SMART a few times before - no errors (new test in progress, takes some time since the drive is being stressed):
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-4.19.19-1-lts] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Self-test routine in progress 10% 18 -
# 2 Short offline Completed without error 00% 17 -
# 3 Short offline Completed without error 00% 4 -
# 4 Short offline Completed without error 00% 4 -
Running without --hddtemp-daemon works without any issue.
I think I understand what i going on.
When hddfancontrol asks the hddtemp daemon the temperature for that drive, it fails because the drive is very busy (I assume the query times out), because of the running btrfs scrub. When hddfancontrol starts a hddtemp process, it inherits the high priority of hddfancontrol, so that works normally.
So you have several solutions:
btrfs scrub -c 3 ...
)echo '[Service]
CPUSchedulingPolicy=rr
CPUSchedulingPriority=48
IOSchedulingClass=realtime' > /etc/systemd/system/hddtemp.service.d/high_priority.conf
systemctl daemon-reload
systemctl restart hddtemp
--hddtemp-daemon
In the mean time, I'll try to better detect and report such errors.
Also you can try to reproduce the issue when the drive are not busy, to confirm my theory.
Indeed forcing the hddtemp
daemon to run in high-prio seems to fix the ERR
in the netcat
. This in turn of course fixes hddfancontrol
. Nice catch!
Fixed in 14bdf21dcc752035de7f7fef5c955ce51a3ec36a
I have also merged the change in the cpu-temp
branch.
Thank you!
I'm getting the error in the title, which causes
hddfancontrol
to spin its fan to 100%. The error is rare, and seems to happen at random about twice a day. My setup is pretty weird: I have two Hitachi server HDs, one old WD green, an old sensorless SSD and a USB-connected M.2 drive.hddfancontrol
only monitors the hard drives.This is the log from that particular polling moment, running in debug mode:
The
hddtemp
journald log doesn't say anything, is there any other info I can look for to debug this?The issue as I see it is not that
hddfancontrol
seems to be unable to deal with weird data fromhddtemp
(setting fans to 100% if in doubt is good, even though it might be nice to have this % as an optional flag), but thathddfancontrol
seems to be unable to recover from that even when temperatures start to be properly reported again. Right now, I need to manually restart the service to make it work again.EDIT: I am running with: