hd-idle task gets out of sync with real disk spindown

berndh1979 commented 7 months ago

Sometimes the hd-idle daemon and the real status of the spun-down SATA disks get out of sync. An example is below. Perhaps it would be an idea to check the actual status of the spun-down (according to the hd-idle daemon) disks against its hd-idle admin.

As you can see, spunDown is set for three out of the four disks. I checked one of them (sdc) and the disk is actually in the active/idle state, while the hd-idle daemon still thinks it is spun down.

Feb 6 19:58:55 puma hd-idle[2516]: disk=sdc command=ata spunDown=true reads=541452 writes=33768360 idleTime=600 idleDuration=109865 spindown=2024-02-05T13:37:51 spinup=2024-02-05T13:27:50 lastIO=2024-02-05T13:27:50

Feb 6 19:58:55 puma hd-idle[2516]: disk=sdb command=ata spunDown=false reads=5555573 writes=14767952 idleTime=600 idleDuration=0 spindown=0001-01-01T00:00:00 spinup=2024-02-04T19:42:39 lastIO=2024-02-06T19:58:55

Feb 6 19:58:55 puma hd-idle[2516]: disk=sda command=ata spunDown=true reads=554924 writes=33768024 idleTime=600 idleDuration=109865 spindown=2024-02-05T13:37:51 spinup=2024-02-05T13:27:50 lastIO=2024-02-05T13:27:50

Feb 6 19:58:55 puma hd-idle[2516]: disk=sdd command=ata spunDown=true reads=566436 writes=33768320 idleTime=600 idleDuration=109865 spindown=2024-02-05T13:37:51 spinup=2024-02-05T13:27:50 lastIO=2024-02-05T13:27:50

Here I check the status of the /dev/sdc disk:

root@puma[~]# hdparm -C /dev/sdc

/dev/sdc:
 drive state is:  active/idle

This is the systemd status of the hd-idle service, showing its hd-idle options:

root@puma[~]# systemctl status hd-idle
● hd-idle.service - hd-idle - spin down idle hard disks
     Loaded: loaded (/lib/systemd/system/hd-idle.service; enabled; preset: enabled)
     Active: active (running) since Sun 2024-02-04 19:42:39 CET; 2 days ago
       Docs: man:hd-idle(8)
   Main PID: 2516 (hd-idle)
      Tasks: 7 (limit: 18421)
     Memory: 14.3M
        CPU: 5.872s
     CGroup: /system.slice/hd-idle.service
             └─2516 /usr/sbin/hd-idle -c ata -d -i 600 -l /var/log/hd-idle.log -a /dev/disk/by-id/ata-WDC_WD101EFAX-68LDBN0_xxxxxx -i 600 -a /dev/disk/by-id/ata-WDC_WD101EFAX-68LDBN0_yyyyyyy -i 600 -a /dev/disk/by-id/ata-WDC_W>

Feb 06 19:57:55 puma hd-idle[2516]: disk=sdb command=ata spunDown=false reads=5555573 writes=14758640 idleTime=600 idleDuration=0 spindown=0001-01-01T00:00:00 spinup=2024-02-04T19:42:39 lastIO=2024-02-06T19:57:55
Feb 06 19:57:55 puma hd-idle[2516]: disk=sda command=ata spunDown=true reads=554924 writes=33768024 idleTime=600 idleDuration=109805 spindown=2024-02-05T13:37:51 spinup=2024-02-05T13:27:50 lastIO=2024-02-05T13:27:50
Feb 06 19:58:55 puma hd-idle[2516]: disk=sdc command=ata spunDown=true reads=541452 writes=33768360 idleTime=600 idleDuration=109865 spindown=2024-02-05T13:37:51 spinup=2024-02-05T13:27:50 lastIO=2024-02-05T13:27:50

So it looks like at some point hd-idle stops checking the actual status of the disks and still thinks they are spun down. It would be an idea to check the status (hdparm -C) and then reset the 'idleDuration' to 0 if the disk has actually spun up.

What are your thoughts?
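A minimal sketch of how the daemon's view could be pulled out of the log for a manual cross-check against hdparm -C. The helper name log_field is made up for this illustration, and it assumes the space-separated key=value layout seen in the debug log lines above.

```shell
#!/bin/bash
# Print the value of one key=value field from an hd-idle debug log line
# read on stdin. Hypothetical helper, not part of hd-idle.
log_field() {
    local key=$1
    # Put each space-separated token on its own line, then strip "key=".
    tr ' ' '\n' | sed -n "s/^${key}=//p" | head -n 1
}
```

Something like `journalctl -u hd-idle | grep 'disk=sdc ' | tail -n 1 | log_field spunDown` should then print true or false for the most recent sdc entry, assuming the service logs to the journal as in the status output above.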

adelolmo commented 7 months ago

> Perhaps it would be an idea to check the actual status of the spun-down (according to the hd-idle daemon) disks against its hd-idle admin.

What do you mean by "hd-idle admin"?

> So it looks like at some point hd-idle stops checking the actual status of the disks and still thinks they are spun down.

It's already a known limitation that hd-idle doesn't work as expected if disk scanning tools (like smartmontools) are watching the disks: https://github.com/adelolmo/hd-idle/issues/70

> It would be an idea to check the status (hdparm -C) and then reset the 'idleDuration' to 0 if the disk has actually spun up.

That is not how hd-idle is meant to work. hd-idle checks disk activity by monitoring /proc/diskstats. If other tools spin up disks without changing the read or write counters in /proc/diskstats, then hd-idle will be oblivious to the fact.
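To make the limitation concrete, here is a rough sketch of counter-based activity detection. It is not hd-idle's actual code: the function names are invented, and it assumes the sector counters (fields 6 and 10 of a /proc/diskstats line) are the ones being compared.

```shell
#!/bin/bash
# Print "<sectors_read> <sectors_written>" for one device from a
# diskstats-format file. Returns non-zero if the device is not listed.
# $1: device name such as "sdc"; $2: optional file path (default /proc/diskstats).
disk_counters() {
    local dev=$1 file=${2:-/proc/diskstats}
    awk -v dev="$dev" '$3 == dev { print $6, $10; found = 1 } END { exit !found }' "$file"
}

# Return 0 when two samples from disk_counters differ, i.e. I/O was counted.
had_io() {
    [ "$1" != "$2" ]
}

# Example (commented out so sourcing this file has no side effects):
#   before=$(disk_counters sdc); sleep 60; after=$(disk_counters sdc)
#   had_io "$before" "$after" && echo "sdc had I/O"
```

A drive woken by a command that moves neither counter looks identical in both samples, which is why a daemon built on this comparison alone cannot notice the spin-up.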

EDIT: typo

berndh1979 commented 7 months ago

> > Perhaps it would be an idea to check the actual status of the spun-down (according to the hd-idle daemon) disks against its hd-idle admin.

> What do you mean by "hd-idle admin"?

Well, in the logging I see that the task keeps a status for each disk. I see "spunDown=true" in the logging for each disk. That is what I meant by 'its administration'.

> > So it looks like at some point hd-idle stops checking the actual status of the disks and still thinks they are spun down.

> It's already a known limitation that hd-idle doesn't work as expected if disk scanning tools (like smartmontools) are watching the disks: #70

Indeed, that is known. With these tools the I/O statistics do not change, so the hd-idle daemon does not see any reads or writes and still 'thinks' that the disk is spun down. However, that can only be verified with a command like 'hdparm -C'.

> > It would be an idea to check the status (hdparm -C) and then reset the 'idleDuration' to 0 if the disk has actually spun up.

> That is not how hd-idle is meant to work. hd-idle checks disk activity by monitoring /proc/diskstats. If other tools spin up disks without changing the read or write counters in /proc/diskstats, then hd-idle will be oblivious to the fact.

Correct. I know how the daemon works. However, a once-in-a-while check that verifies whether the disk is really spun down, and restarts the idle tracking if it is not, would be one way to 'sync' again with the real status of the disks.

smgsmagus commented 6 months ago

Faced the same problem today with 1.21.

The main concern here is that hd-idle thinks the disk is spun down when it actually is not, so it will not spin it down again. Not sure what a workaround could be here. A periodic service restart, or...?

adelolmo commented 5 months ago

> Correct. I know how the daemon works. However, a once-in-a-while check that verifies whether the disk is really spun down, and restarts the idle tracking if it is not, would be one way to 'sync' again with the real status of the disks.

Implementing something reliable shouldn't depend on a once-in-a-while check, so the issue I see is how to come up with something bullet-proof that keeps the correct state of the disks after long periods of time. So far this looks to me like a known limitation of the tool.

I wish I could provide a better solution to this problem, but I'm afraid I don't have it yet :-/

berndh1979 commented 5 months ago

Fair comment. I also don't have an easy solution yet. For now I restart the daemon every 24hrs.

What could be done is to run 'hdparm -C' for each of the disks that are, or should be, spun down. The result should then be "standby". If not, the daemon could reset its read/write stats values. Just a thought.
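A rough sketch of the check described above, reading the state from hdparm -C output. The function names are invented for the example, and treating "sleeping" as spun down alongside "standby" is an assumption on my part.

```shell
#!/bin/bash
# Extract the power state from `hdparm -C` output supplied on stdin.
# Prints e.g. "active/idle" or "standby"; prints "unknown" if no state line is found.
parse_hdparm_state() {
    local state
    state=$(sed -n 's/^.*drive state is:[[:space:]]*\([^[:space:]]*\).*$/\1/p' | head -n 1)
    printf '%s\n' "${state:-unknown}"
}

# Return 0 when the reported state means the platters are not spinning.
is_spun_down() {
    case $1 in
        standby|sleeping) return 0 ;;
        *) return 1 ;;
    esac
}
```

With these, `state=$(hdparm -C /dev/sdc | parse_hdparm_state)` followed by `is_spun_down "$state"` answers the question for one disk. Whether the query itself can wake a given drive is a separate concern that varies by model.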

rct commented 2 months ago

I'm not sure if I should open a separate issue or not, but I believe I'm seeing the same issues where hd-idle believes the disks are spun down, but something like smartd or something that ran smartctl (#70 ?) causes the disk to be spun up without performing any actual I/O.

I've spent a bunch of time over the past few days making observations and trying to get (both the old and new versions of) hd-idle to ... keep idle disks spun down.

I thought it had more to do with SCSI vs. ATA command set or the USB<->SATA adapters that WDC uses in their external drives.

I think I now understand that hd-idle, with its current architecture of relying on /proc/diskstats, will not do what I want unless I can eliminate all of the things that use smartmontools (smartd, hddtemp, etc.).

I think it would save others time if this limitation was more prominently documented, particularly in the README.

Thank you.

CRCinAU commented 2 weeks ago

Yeah - I found this repo when debugging this exact issue with the Debian version of hd-idle.

I've been looking at the output of smartctl -i -n standby $i | grep mode - ie the following script:

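#!/bin/bash
# Print the power mode of each drive given on the command line.
# -n standby makes smartctl skip a drive that is in standby rather than query it.
for i in "$@"; do
    echo "Power status for $i:"
    smartctl -i -n standby "$i" | grep mode
done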

So, running watch ./drive-status /dev/sd? gives me:

Power status for /dev/sda:
Device is in STANDBY mode, exit(2)
Power status for /dev/sdb:
Device is in STANDBY mode, exit(2)
Power status for /dev/sdc:
Power mode is:    ACTIVE or IDLE
Power status for /dev/sdd:
Power mode is:    ACTIVE or IDLE
Power status for /dev/sde:
Power mode is:    ACTIVE or IDLE

You also don't have to disable smartd - I change the defaults in /etc/smartd.conf as follows:

#DEVICESCAN -d removable -n standby -m root -M exec /usr/share/smartmontools/smartd-runner
DEVICESCAN -d removable -c interval=7200 -n idle,12 -m root -M exec /usr/share/smartmontools/smartd-runner

adelolmo commented 2 weeks ago

@CRCinAU Thank you for sharing your investigation here. If changing the defaults in /etc/smartd.conf does the trick, it would be a huge help for many users.

I would very much appreciate it if @rct @berndh1979 or @smgsmagus had a chance to test it out :)

berndh1979 commented 2 weeks ago

My current 'smartd.conf' settings are like this for each of the SSDs that I have:

/dev/disk/by-id/ata-Samsung_SSD_870_EVO_4TB_xxxxxxxxxxx -a -o on -S on -n standby,q -s (S/../.././02|L/../../6/03) -W 15,50,60 -m root

So you are saying I have to change the "-n standby,q" to "-n idle,12" ??

In the description the -n is for: "# -n MODE No check. MODE is one of: never, sleep, standby, idle"

So I was assuming that it would not check the disks if the disks are in 'MODE', which in my case is the standby state. Which is OK. If I change that to 'idle', then essentially the disks are never checked for SMART errors, since most of the time the disks are idle or in standby.

The -c interval=7200 means that it will check for SMART errors every 2 hours. In the generic settings in /etc/default/smartmontools I have that set to 3600 (every hour).

Thanks for the continuing support on this topic.

CRCinAU commented 2 weeks ago

> So you are saying I have to change the "-n standby,q" to "-n idle,12" ??

Well - the fun answer is - it depends. Some drives will stay awake if in standby mode and you do a smart query. Some won't.

I'm writing my own script that queries the drive state - as well as the sector counts, as hd-idle does.

idle / standby is probably more suitable for spinning disks - where standby means the drive motor is off. Idle means the drive motor is still on. In an SSD? Who knows.

I have two different spinning drives where one has active and standby states, the other has active, idle_a, idle_b and standby. All mean different things.

On the newer drive, I've had to actually disable EPC states completely on the drive to make it behave well with manual assessments. My script currently outputs stuff like:

Last drive activities: sda 121s ago, sdb 18483s ago,
hdparm results:  sda awake,  sdb standby,
Last drive activities: sda 127s ago, sdb 18489s ago,
...
Last drive activities: sda 175s ago, sdb 18537s ago,
Putting drive sda to sleep
Last drive activities: sda 182s ago, sdb 18544s ago,
hdparm results:  sda standby,  sdb standby,
Last drive activities: sda 188s ago, sdb 18550s ago,
Last drive activities: sda 194s ago, sdb 18556s ago,
Last drive activities: sda 200s ago, sdb 18562s ago,
Last drive activities: sda 206s ago, sdb 18568s ago,

It might well work in a more consistent manner - as the check via hdparm also checks to see if the drive has been spun up by a non-data transfer method - ie a smart check - and treats that as a disk activity too.
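The per-tick decision described here could look roughly like the sketch below. This is not the Perl script mentioned in this comment: next_action and its arguments are hypothetical, and the state strings are assumed to be those printed by hdparm -C.

```shell
#!/bin/bash
# Decide what to do with one drive on each polling tick.
# $1: seconds since the I/O counters last changed
# $2: idle threshold in seconds
# $3: whether the script believes the drive is spun down ("true"/"false")
# $4: actual power state, e.g. "standby" or "active/idle"
# Prints one of: reset-timer, spin-down, wait
next_action() {
    local idle_s=$1 threshold_s=$2 believed_down=$3 state=$4
    if [ "$believed_down" = "true" ]; then
        case $state in
            standby|sleeping) echo wait ;;
            # Awake without counted I/O (e.g. a SMART query): treat as activity.
            *) echo reset-timer ;;
        esac
    elif [ "$idle_s" -ge "$threshold_s" ]; then
        echo spin-down
    else
        echo wait
    fi
}
```

For the sdc entry at the top of this issue (idleDuration=109865, idleTime=600, spunDown=true, drive reporting active/idle), this sketch would answer reset-timer, so the idle countdown starts again and the drive is spun down once the threshold is reached.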

Once I've finished testing stuff, I'll end up publishing it somewhere - but it's all written in Perl :smiley:

EDIT: Oh, and the specific options for smart..... the interval being 7200 = 2 hours, the -n idle,12 actually means:

* Don't query the status if the drive is in standby or idle mode; and
* If the drive has been in a mode we skip for 12 checks (ie 24 hours), spin the drive up to read the stats.

The default of -n standby,q will always skip and not print any error.

I think it's probably a good idea to get at least one smart check complete in a 24 hour window. You can massage these figures how you like - say --interval 3600 -n idle,24 - try every hour, and allow up to 24 failures - and get the same thing. Or even --interval 600 -n idle,144
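The arithmetic behind those three combinations, as a small sketch (max_skip_hours is just an illustrative name): the polling interval in seconds times the number of allowed skips gives the longest a sleeping drive goes unchecked.

```shell
#!/bin/bash
# Longest time, in whole hours, that a drive in a skipped power mode can go
# without a SMART check: polling interval (seconds) times allowed skips.
max_skip_hours() {
    local interval_s=$1 skips=$2
    echo $(( interval_s * skips / 3600 ))
}
```

All three settings from this comment (7200 with 12 skips, 3600 with 24, 600 with 144) come out at 24 hours. Only how often smartd polls the power state in between differs.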

berndh1979 commented 1 week ago

> EDIT: Oh, and the specific options for smart..... the interval being 7200 = 2 hours, the -n idle,12 actually means:
>
> * Don't query the status if the drive is in standby or idle mode; and
> * If the drive has been in a mode we skip for 12 checks (ie 24 hours), spin the drive up to read the stats.
>
> The default of -n standby,q will always skip and not print any error.
>
> I think it's probably a good idea to get at least one smart check complete in a 24 hour window. You can massage these figures how you like - say --interval 3600 -n idle,24 - try every hour, and allow up to 24 failures - and get the same thing. Or even --interval 600 -n idle,144

Ahhh, that -n standby,12 might give me a clue why I see this discrepancy in state after 24 hours most of the time. If the smart check indeed forces a check after 24 hours, it will spin up the disk and the hd-idle task does not 'see' that.

So indeed with that parameter I need to tweak my system a bit more... thanks for the clarification.

adelolmo / hd-idle

hd-idle task gets out of sync with real disk spindown #113