desbma / hddfancontrol

Regulate fan speed according to hard drive temperature
GNU General Public License v3.0
134 stars 17 forks source link

ERROR [Main] RuntimeError: hddtemp daemon is not returning temp as Celsius #15

Closed adrfantini closed 5 years ago

adrfantini commented 5 years ago

I'm getting the error in the title, which causes hddfancontrol to spin its fan to 100%. The error is rare, and seems to happen at random about twice a day. My setup is pretty weird: I have two Hitachi server HDs, one old WD green, an old sensorless SSD and a USB-connected M.2 drive. hddfancontrol only monitors the hard drives.

This is the log from that particular polling moment, running in debug mode:

gen 22 21:33:20 serverino hddfancontrol[447]: DEBUG [Fan #1] Setting PWM value to 255
gen 22 21:33:20 serverino hddfancontrol[447]: INFO [Fan #1] Setting fan speed to 100%
gen 22 21:33:20 serverino hddfancontrol[447]: ERROR [Main] RuntimeError: hddtemp daemon is not returning temp as Celsius
gen 22 21:33:20 serverino hddfancontrol[447]: DEBUG [sda WDC WD10EARS-00Y5B1] Drive state: ACTIVE_IDLE
gen 22 21:33:20 serverino hddfancontrol[447]: DEBUG [sdb HUP722020APA330] Drive temperature: 30 C
gen 22 21:33:20 serverino hddfancontrol[447]: DEBUG [sdb HUP722020APA330] Drive state: STANDBY
gen 22 21:33:20 serverino hddfancontrol[447]: DEBUG [sdc HUP722020APA330] Drive temperature: 31 C
gen 22 21:33:20 serverino hddfancontrol[447]: DEBUG [sdc HUP722020APA330] Drive state: STANDBY

The hddtemp journald log doesn't say anything, is there any other info I can look for to debug this?

The issue as I see it is not that hddfancontrol seems to be unable to deal with weird data from hddtemp (setting fans to 100% if in doubt is good, even though it might be nice to have this % as an optional flag), but that hddfancontrol seems to be unable to recover from that even when temperatures start to be properly reported again. Right now, I need to manually restart the service to make it work again.

EDIT: I am running with:

HDDFANCONTROL_ARGS="-d /dev/disk/by-id/ata-HUP722020APA330_BFJ0WS3F /dev/disk/by-id/ata-HUP722020APA330_BFGWU7WF /dev/disk/by-id/ata-WDC_WD10EARS-00Y5B1_WD-WCAV5R693922 --min-temp 35 --max-temp 40 -v debug -p /run/hddfancontrol_fan_hwmon/pwm2 --pwm-start-value 120 --pwm-stop-value 90 --min-fan-speed-prct 0 -i 60 --spin-down-time 600 --hddtemp-daemon"
desbma commented 5 years ago

Is this a new problem with version 1.2.9 or did it already occur before?

The fan set at 100% speed is a safety feature: if the daemon encounters a fatal error it sets maximum speed before exiting.

In your case the daemon does not seem to stop after the error, there are 2 different problems here.

adrfantini commented 5 years ago

I did not see it before 1.2.9, I believe it has been introduced with it. However, I did some setup changes right around the update time so cannot be 100% certain. I am honestly fine with the daemon being solid against these errors: they are often transient and caused, I guess, by some glitch in hddtemp.

desbma commented 5 years ago

Can you run nc 127.0.0.1 7634 and post the output? You may need to install the gnu-netcat package.

adrfantini commented 5 years ago

|/dev/disk/by-id/ata-HUP722020APA330_BFJ0WS3F|HUP722020APA330|SLP|*||/dev/disk/by-id/ata-HUP722020APA330_BFGWU7WF|HUP722020APA330|35|C||/dev/disk/by-id/ata-WDC_WD10EARS-00Y5B1_WD-WCAV5R693922|WDC WD10EARS-00Y5B1|30|C|

desbma commented 5 years ago

Ok, now can you run: nc 127.0.0.1 7634; sudo hdparm -C /dev/disk/by-id/ata-HUP722020APA330_BFJ0WS3F

EDIT: And also: sudo hdparm -H /dev/disk/by-id/ata-HUP722020APA330_BFJ0WS3F

adrfantini commented 5 years ago
> nc 127.0.0.1 7634
|/dev/disk/by-id/ata-HUP722020APA330_BFJ0WS3F|HUP722020APA330|SLP|*||/dev/disk/by-id/ata-HUP722020APA330_BFGWU7WF|HUP722020APA330|SLP|*||/dev/disk/by-id/ata-WDC_WD10EARS-00Y5B1_WD-WCAV5R693922|WDC WD10EARS-00Y5B1|33|C|

> sudo hdparm -C /dev/disk/by-id/ata-HUP722020APA330_BFJ0WS3F
/dev/disk/by-id/ata-HUP722020APA330_BFJ0WS3F:
 drive state is:  standby

> sudo hdparm -H /dev/disk/by-id/ata-HUP722020APA330_BFJ0WS3F
/dev/disk/by-id/ata-HUP722020APA330_BFJ0WS3F:
 drive temperature (celsius) is:  26
 drive temperature in range:  yes
desbma commented 5 years ago

Thanks, and last, when the daemon starts, do you have a log line with Drive does not allow querying temperature without going out of low power mode.? You can check with sudo journalctl -u hddfancontrol | grep ...

adrfantini commented 5 years ago

Thank you. Yes, I get the warning: WARNING [sda WDC WD10EARS-00Y5B1] Drive does not allow querying temperature without going out of low power mode. This however happens for the WD green drive only, not for the Hitachi drives you asked me about earlier. Here is the same info for the WD:

> sudo hdparm -CH /dev/disk/by-id/ata-WDC_WD10EARS-00Y5B1_WD-WCAV5R693922

/dev/disk/by-id/ata-WDC_WD10EARS-00Y5B1_WD-WCAV5R693922:
SG_IO: bad/missing sense data, sb[]:  70 00 05 00 00 00 00 0a 04 51 40 00 21 04 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 drive temperature (celsius) is:  -18
 drive temperature in range:  yes
 drive state is:  active/idle

If I ask hddtemp directly:

/dev/sda: WDC WD10EARS-00Y5B1: 34°C
/dev/sdb: HUP722020APA330: 28°C
/dev/sdc: HUP722020APA330: 32°C
/dev/sdd: SK hynix SC215 M.2 2242: S.M.A.R.T. not available
WARNING: Drive /dev/sde doesn't seem to have a temperature sensor.
WARNING: This doesn't mean it hasn't got one.
WARNING: If you are sure it has one, please contact me (hddtemp@guzu.net).
WARNING: See --help, --debug and --drivebase options.
/dev/sde: SAMSUNG MMCRE64G5MXP-0VB                :  no sensor
desbma commented 5 years ago

I did a quick fix in https://github.com/desbma/hddfancontrol/commit/c9530d1903d4c3c59252a1f5089b366a4635b011, can you test it?

A quick and dirty way for you to test:

cp -v /usr/lib/python3.7/site-packages/hddfancontrol/__init__.py{,.orig}
curl https://raw.githubusercontent.com/desbma/hddfancontrol/c9530d1903d4c3c59252a1f5089b366a4635b011/hddfancontrol/__init__.py > /usr/lib/python3.7/site-packages/hddfancontrol/__init__.py
systemctl restart hddfancontrol
adrfantini commented 5 years ago

Thanks, that's great! I'm testing, let's give it a couple of days to see if the problem reappears or not.

desbma commented 5 years ago

@adrfantini So what do your NAS logs say?

adrfantini commented 5 years ago

The error did not present itself again, apparently. It might be a statistical fluctuation, but usually in 2.5 days it should have happened a couple of times. I think it's fixed! Feel free to close, I believe, I'll continue monitoring and will reopen if it pops up again. Thanks a lot!

desbma commented 5 years ago

Great!

adrfantini commented 5 years ago

Installed some new drives, and apparently it happens again. I am using the cpu-temp branch, but this is not related to -c.

Complete output:

2019-02-06 21:18:48,788 INFO [Main] Process real time scheduler set to 2, priority 49
2019-02-06 21:18:48,898 WARNING [sdb ST3000VX010-2E3166] Drive does not allow querying temperature without going out of low power mode.
2019-02-06 21:18:49,028 WARNING [sdc ST3000VX010-2E3166] Drive does not allow querying temperature without going out of low power mode.
2019-02-06 21:18:49,118 WARNING [sda WDC WD10EARS-00Y5B1] Drive does not allow querying temperature without going out of low power mode.
2019-02-06 21:18:49,133 DEBUG [sdb ST3000VX010-2E3166] Drive state: ACTIVE_IDLE
2019-02-06 21:18:49,134 DEBUG [sdc ST3000VX010-2E3166] Drive state: ACTIVE_IDLE
2019-02-06 21:18:49,134 DEBUG [sdb ST3000VX010-2E3166] Drive state: ACTIVE_IDLE
2019-02-06 21:18:49,136 DEBUG [DriveSpinDownThread-sdb ST3000VX010-2E3166] Sleeping for 60 seconds
2019-02-06 21:18:49,137 DEBUG [DriveSpinDownThread-sdc ST3000VX010-2E3166] Sleeping for 60 seconds
2019-02-06 21:18:49,137 DEBUG [sda WDC WD10EARS-00Y5B1] Drive state: STANDBY
2019-02-06 21:18:49,138 DEBUG [DriveSpinDownThread-sda WDC WD10EARS-00Y5B1] Drive is already sleeping
2019-02-06 21:18:49,138 DEBUG [Main] Drive sdb ST3000VX010-2E3166 is in low power state, unable to query temperature
2019-02-06 21:18:49,138 DEBUG [DriveSpinDownThread-sda WDC WD10EARS-00Y5B1] Sleeping for 60 seconds
2019-02-06 21:18:49,143 DEBUG [sdc ST3000VX010-2E3166] Drive state: ACTIVE_IDLE
2019-02-06 21:18:49,145 ERROR [Main] RuntimeError: hddtemp daemon is not returning temp as Celsius
2019-02-06 21:18:49,145 INFO [Fan #1] Setting fan speed to 100%
2019-02-06 21:18:49,146 DEBUG [DriveSpinDownThread-sdc ST3000VX010-2E3166] Sleep interrupted
2019-02-06 21:18:49,146 INFO [DriveSpinDownThread-sdc ST3000VX010-2E3166] Exiting
2019-02-06 21:18:49,146 DEBUG [DriveSpinDownThread-sda WDC WD10EARS-00Y5B1] Sleep interrupted
2019-02-06 21:18:49,146 DEBUG [DriveSpinDownThread-sdb ST3000VX010-2E3166] Sleep interrupted
2019-02-06 21:18:49,147 INFO [DriveSpinDownThread-sda WDC WD10EARS-00Y5B1] Exiting
2019-02-06 21:18:49,147 INFO [DriveSpinDownThread-sdb ST3000VX010-2E3166] Exiting
2019-02-06 21:18:49,150 WARNING [Fan #1] /run/hddfancontrol_fan_hwmon/pwm2_enable was 0, setting it to 1
2019-02-06 21:18:49,150 DEBUG [Fan #1] Setting PWM value to 255

hddtemp directly:

> hddtemp /dev/sd?
/dev/sda: WDC WD10EARS-00Y5B1: drive is sleeping
/dev/sdb: ST3000VX010-2E3166: drive is sleeping
/dev/sdc: ST3000VX010-2E3166: drive is sleeping
/dev/sdd: ASMT 2115:  drive supported, but it doesn't have a temperature sensor.

hddtemp socket:

> nc localhost 7634 |sed 's/|//m' | sed 's/||/ \n/g' | awk -F'|' '{print $1 " " $3 " " $4}'
/dev/disk/by-id/ata-ST3000VX010-2E3166_Z730QNWR SLP * 
/dev/disk/by-id/ata-ST3000VX010-2E3166_Z730R164 ERR * 
/dev/disk/by-id/ata-WDC_WD10EARS-00Y5B1_WD-WCAV5R693922 SLP *

hdparm:

> hdparm -CH /dev/disk/by-id/ata-ST3000VX010-2E3166_Z???????

/dev/disk/by-id/ata-ST3000VX010-2E3166_Z730QNWR:
SG_IO: bad/missing sense data, sb[]:  70 00 05 00 00 00 00 0a 04 51 40 00 21 04 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 drive temperature (celsius) is:  -18
 drive temperature in range:  yes
 drive state is:  active/idle

/dev/disk/by-id/ata-ST3000VX010-2E3166_Z730R164:
SG_IO: bad/missing sense data, sb[]:  70 00 05 00 00 00 00 0a 04 51 40 00 21 04 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 drive temperature (celsius) is:  -18
 drive temperature in range:  yes
 drive state is:  active/idle
desbma commented 5 years ago

Can you run hdparm -C /dev/disk/by-id/ata-ST3000VX010-2E3166_Z730R164 just when you reproduce the issue?

adrfantini commented 5 years ago

I am trying to reproduce it, but it only seems to happen at times. I'll keep you informed.

adrfantini commented 5 years ago

Happened again. Right now it is reproducible. All relevant diagnostic commands below, ran one after the other:

> hddfancontrol -d /dev/disk/by-id/ata-ST3000VX010-2E3166_Z730QNWR /dev/disk/by-id/ata-ST3000VX010-2E3166_Z730R164 /dev/disk/by-id/ata-WDC_WD10EARS-00Y5B1_WD-WCAV5R693922 --min-temp 32 --max-temp 42 -p /run/hddfancontrol_fan_hwmon/pwm2 --pwm-start-value 110 --pwm-stop-value 90 --min-fan-speed-prct 0 -i 60 --spin-down-time 600 --hddtemp-daemon -v debug ; hdparm -C /dev/disk/by-id/{ata-ST3000VX010-2E3166_Z730QNWR,ata-WDC_WD10EARS-00Y5B1_WD-WCAV5R693922,ata-ST3000VX010-2E3166_Z730R164} ;  hddtemp /dev/sd? ; nc localhost 7634 |sed 's/|//m' | sed 's/||/ \n/g' | awk -F'|' '{print $1 " " $3 " " $4}'

2019-02-07 10:53:34,409 INFO [Main] Process real time scheduler set to 2, priority 49
2019-02-07 10:53:34,616 WARNING [sdb ST3000VX010-2E3166] Drive does not allow querying temperature without going out of low power mode.
2019-02-07 10:53:35,798 WARNING [sdc ST3000VX010-2E3166] Drive does not allow querying temperature without going out of low power mode.
2019-02-07 10:53:35,876 WARNING [sda WDC WD10EARS-00Y5B1] Drive does not allow querying temperature without going out of low power mode.
2019-02-07 10:53:35,894 DEBUG [sdb ST3000VX010-2E3166] Drive state: ACTIVE_IDLE
2019-02-07 10:53:35,894 DEBUG [sda WDC WD10EARS-00Y5B1] Drive state: STANDBY
2019-02-07 10:53:35,895 DEBUG [DriveSpinDownThread-sda WDC WD10EARS-00Y5B1] Drive is already sleeping
2019-02-07 10:53:35,895 DEBUG [DriveSpinDownThread-sda WDC WD10EARS-00Y5B1] Sleeping for 60 seconds
2019-02-07 10:53:35,896 DEBUG [DriveSpinDownThread-sdb ST3000VX010-2E3166] Sleeping for 60 seconds
2019-02-07 10:53:35,896 DEBUG [sdb ST3000VX010-2E3166] Drive state: ACTIVE_IDLE
2019-02-07 10:53:35,897 DEBUG [sdb ST3000VX010-2E3166] Drive temperature: 24 °C
2019-02-07 10:53:35,977 DEBUG [sdc ST3000VX010-2E3166] Drive state: ACTIVE_IDLE
2019-02-07 10:53:35,978 DEBUG [sdc ST3000VX010-2E3166] Drive state: ACTIVE_IDLE
2019-02-07 10:53:35,979 ERROR [Main] RuntimeError: hddtemp daemon is not returning temp as Celsius
2019-02-07 10:53:35,979 INFO [Fan #1] Setting fan speed to 100%
2019-02-07 10:53:35,979 DEBUG [DriveSpinDownThread-sdb ST3000VX010-2E3166] Sleep interrupted
2019-02-07 10:53:35,980 INFO [DriveSpinDownThread-sdb ST3000VX010-2E3166] Exiting
2019-02-07 10:53:35,980 DEBUG [DriveSpinDownThread-sda WDC WD10EARS-00Y5B1] Sleep interrupted
2019-02-07 10:53:35,980 WARNING [Fan #1] /run/hddfancontrol_fan_hwmon/pwm2_enable was 0, setting it to 1
2019-02-07 10:53:35,981 DEBUG [DriveSpinDownThread-sdc ST3000VX010-2E3166] Sleeping for 60 seconds
2019-02-07 10:53:35,981 INFO [DriveSpinDownThread-sda WDC WD10EARS-00Y5B1] Exiting
2019-02-07 10:53:35,982 DEBUG [DriveSpinDownThread-sdc ST3000VX010-2E3166] Sleep interrupted
2019-02-07 10:53:35,983 DEBUG [Fan #1] Setting PWM value to 255
2019-02-07 10:53:35,983 INFO [DriveSpinDownThread-sdc ST3000VX010-2E3166] Exiting

/dev/disk/by-id/ata-ST3000VX010-2E3166_Z730QNWR:
 drive state is:  active/idle

/dev/disk/by-id/ata-WDC_WD10EARS-00Y5B1_WD-WCAV5R693922:
 drive state is:  standby

/dev/disk/by-id/ata-ST3000VX010-2E3166_Z730R164:
 drive state is:  active/idle
/dev/sda: WDC WD10EARS-00Y5B1: drive is sleeping
/dev/sdb: ST3000VX010-2E3166: 24°C
/dev/sdc: ST3000VX010-2E3166: 23°C
/dev/sdd: ASMT 2115:  drive supported, but it doesn't have a temperature sensor.
/dev/disk/by-id/ata-ST3000VX010-2E3166_Z730QNWR 24 C 
/dev/disk/by-id/ata-ST3000VX010-2E3166_Z730R164 ERR * 
/dev/disk/by-id/ata-WDC_WD10EARS-00Y5B1_WD-WCAV5R693922 SLP *

The 2 Seagate drives are currently under heavy use (btrfs scrub).

desbma commented 5 years ago

The drive /dev/disk/by-id/ata-ST3000VX010-2E3166_Z730R164 looks very suspicious to me. It reports itself as active/idle, but hddtemp fails to get its temperature.

Does it report any SMART errors? You can check with:

sudo smartctl -t short /dev/disk/by-id/ata-ST3000VX010-2E3166_Z730R164
... wait for 1 or 2 minutes
sudo smartctl -H -l selftest -l error /dev/disk/by-id/ata-ST3000VX010-2E3166_Z730R164
adrfantini commented 5 years ago

Both Seagate drives are identical and brand new and I tested them with SMART a few times before - no errors (new test in progress, takes some time since the drive is being stressed):

smartctl 7.0 2018-12-30 r4883 [x86_64-linux-4.19.19-1-lts] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Self-test routine in progress 10%        18         -
# 2  Short offline       Completed without error       00%        17         -
# 3  Short offline       Completed without error       00%         4         -
# 4  Short offline       Completed without error       00%         4         -

Running without --hddtemp-daemon works without any issue.

desbma commented 5 years ago

I think I understand what i going on.

When hddfancontrol asks the hddtemp daemon the temperature for that drive, it fails because the drive is very busy (I assume the query times out), because of the running btrfs scrub. When hddfancontrol starts a hddtemp process, it inherits the high priority of hddfancontrol, so that works normally.

So you have several solutions:

In the mean time, I'll try to better detect and report such errors.

desbma commented 5 years ago

Also you can try to reproduce the issue when the drive are not busy, to confirm my theory.

adrfantini commented 5 years ago

Indeed forcing the hddtemp daemon to run in high-prio seems to fix the ERR in the netcat. This in turn of course fixes hddfancontrol. Nice catch!

desbma commented 5 years ago

Fixed in 14bdf21dcc752035de7f7fef5c955ce51a3ec36a

I have also merged the change in the cpu-temp branch.

adrfantini commented 5 years ago

Thank you!