Enhancement: Switch from hdparm to smartctl for some hdd queries

kaffeemonster commented 5 years ago

I have some WD Red and one Seagate IronWolf 8T (ST8000xxxx). hddfancontrol works great with the WD Red, but it's a pain with the IronWolf. This is a real bummer, because the IronWolf is the hottest drive in the stack (7.2k instead of 5.7k).

The base issue seems to be that the IronWolf implements some new Energy-Foo which hddtemp and hdparm can't properly handle.

hdparm -C can not detect the drive state (comes out as unknown). Already installed last 9.58 release. hddtemp says the drive is sleeping and does not give a temp (needs command parameter "-w").

Example from smartctl -i -n:

WD Red ... Power mode is: ACTIVE or IDLE
IronWolf .... Power mode was: IDLE_B

I'm trying to search the web what IDLE_B is or the magic switch for that drive to behave more sane, but that it ignores a lot of hdparm energy management settings is kind of a road block.

smartctl seems to be always a little more on the up-to-date side with these things. Would it make sense to use smartctl for some of the query commands where possible? Maybe switch completely? (See https://www.smartmontools.org/ticket/1017)

desbma commented 5 years ago

Normally I would have told you "not my problem, report the issue to the hddtemp project", but it is unmaintained, so I might consider replacing it.

Can you properly read temperature with smartctl?

Does that work when the drive is spun down? To test manually, you can usually spin it down with hdparm -y xxx, when no disk activity occurs, and confirm by ear.

kaffeemonster commented 5 years ago

The "spinned down test" has to wait for tomorrow. Need to connect the cold spare.

I mean ATM the IronWolf isn't spun down or in standby, it's just in an idle mode hddtemp doesn't understand because too new?

smartctl can read temps. The classic grep out of the SMART-Values (smartctl -A). And there are the (newer?) SMART Command Transport (SCT) commands.

Oh boy:

smartctl -l scttempsts /dev/sdx
...
=== START OF READ SMART DATA SECTION ===
SCT Status Version: 3
SCT Version (vendor specific): 522 (0x020a)
SCT Support Level: 1
Device State: Active (0)
Current Temperature: 41 Celsius
Power Cycle Min/Max Temperature: 30/41 Celsius
Lifetime Min/Max Temperature: 21/55 Celsius
Under/Over Temperature Limit Count: 0/4

There are a lot more infos (complete log with histogram), quite frightening what a modern drive logs...

hddtemp mostly does nothing different than to query the SMART counter for temp. I guess the only problem is:

it condenses different SMART-Value-IDs which have come and gone over the years (via hddtemp.db)
applies quirks to some drive readouts?
it sometimes has special access thingies
it is automatically mindful of the drive sleep state (smartctl can do "-n standby")
has a built in daemon to query, i don't think you can query smartd
much simpler

But hddtemp can only query some drives in spun down mode.

Have to run some tests tomorrow.

Hmmm, have finally found out what IDLE_B is: T10/09-054, T13/452-2008 https://www.seagate.com/files/docs/pdf/whitepaper/tp608-powerchoice-tech-provides-us.pdf

kaffeemonster commented 5 years ago

So, let's see. My IronWolf ST8000 cold spare under test. Drive is spinning and not in standby

Hdparm gives:

hdparm -C /dev/sdc

/dev/sdc: drive state is: unknown

hddtemp gives:

hddtemp /dev/sdc
/dev/sdc: ST8000VN0022-xxxxxx: drive is sleeping

ok, again with permission to wake up:

hddtemp -w /dev/sdc
/dev/sdc: ST8000VN0022-xxxxxx: 32 C

what has smartctl to say?

smartctl -i -n standby /dev/sdc | grep mode
Power mode was: IDLE_B

temps?

smartctl -A -n standby /dev/sdc | grep -i temp
190 Airflow_Temperature_Cel 0x0022 066 066 040 Old_age Always - 34 (Min/Max 25/34)
194 Temperature_Celsius 0x0022 034 040 000 Old_age Always - 34 (0 25 0 0 0)

lets spin the drive down:

hdparm -y /dev/sdc

/dev/sdc: issuing standby command

Drive has spun down. lets check again.

hdparm -C /dev/sdc

/dev/sdc: drive state is: standby

Drive is still spun down. good. Can we read the temp?

hddtemp /dev/sdc
/dev/sdc: ST8000VN0022-xxxxxx: drive is sleeping

Nope, no sneaky standby read capability. If we allow wakeup?

hddtemp -w /dev/sdc
/dev/sdc: Success

Hmmm, nice temps there ;-) Drive has spun up, but that's definitely not an integer... Maybe that's the reason for the not-an-int exceptions in another github-issue i saw? Let's retest:

hddtemp -w /dev/sdc
/dev/sdc: ST8000VN0022-xxxxxx: 35 C

OK. back to standby, hdparm -y, drive has spun down. what does smartctl do?

smartctl -i -n standby /dev/sdc

Device is in STANDBY mode, exit(2)

OK, fair point, we asked smartctl to do nothing if the drive is in standby. Let's test again:

smartctl -i -n sleep /dev/sdc | grep mode
Power mode was: STANDBY

Drive is still spun down. Can we get the temps?

smartctl -A -n sleep /dev/sdc | grep -i temp
190 Airflow_Temperature_Cel 0x0022 065 064 040 Old_age Always - 35 (Min/Max 25/36)
194 Temperature_Celsius 0x0022 035 040 000 Old_age Always - 35 (0 25 0 0 0)

Yes, but the drive has spun up, since we didn't block operation at standby.
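The two kinds of reply seen in this test (a "Power mode" line, or the refusal message when the -n threshold applies) could be told apart roughly as follows. This is only a sketch: `parse_power_mode` is a hypothetical helper name, and both patterns are copied from the smartctl 6.6 output quoted in this thread, so other versions may word these lines differently.

```python
import re
from typing import Optional

# Both patterns are taken from the smartctl 6.6 output quoted in this thread;
# other smartctl versions may phrase these lines differently (assumption).
_MODE_RE = re.compile(r"^Power mode (?:is|was):[ \t]*(.+?)[ \t]*$", re.MULTILINE)
_SKIPPED_RE = re.compile(r"^Device is in (\w+) mode, exit\(\d+\)", re.MULTILINE)


def parse_power_mode(output: str) -> Optional[str]:
    """Return the power mode named in `smartctl -i -n <mode>` output, or None."""
    skipped = _SKIPPED_RE.search(output)
    if skipped:
        # smartctl declined to query the drive because of the -n threshold.
        return skipped.group(1)
    mode = _MODE_RE.search(output)
    return mode.group(1) if mode else None
```

A caller would then decide for itself which of the returned strings (STANDBY, IDLE_B, ACTIVE or IDLE, ...) count as "spun down".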

Let's go for another test, hdparm -y, drive has spun down:

smartctl -n sleep -l scttempsts /dev/sdc
smartctl 6.6 2017-11-05 r4594 (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SCT Status Version: 3
SCT Version (vendor specific): 522 (0x020a)
SCT Support Level: 1
Device State: Active (0)
Current Temperature: 36 Celsius
Power Cycle Min/Max Temperature: 25/36 Celsius
Lifetime Min/Max Temperature: 25/36 Celsius
Under/Over Temperature Limit Count: 0/0

Drive has NOT spun up! I repeat: Drive is still spun down. NICE!!

I marked this issue as enhancement for a reason. At the moment everything mostly works (at least for me), and switching from one program to another always opens up a can of worms. But if hddtemp does not receive any maintenance (and hdparm is also lagging behind, for that matter...), sooner or later this switch may be necessary. I think smartmontools also handles SAS/SCSI drives better (hdparm i believe does not work there, instead you need sdparm). And probably also NVME...

desbma commented 5 years ago

Thanks for the data.

I'll try smartctl with my own drives, and see if the temperature querying seems reliable.

desbma commented 5 years ago

My findings so far:

smartctl -A and the Temperature_Celsius attribute seem to report correct values for my drives (HGST, Seagate and Western Digital)
for some drives, the invocation spins them up, for some it does not
the -n standby parameter only tries to guess if the drive would spin up, but since it varies for drives, it's not that useful (hddfancontrol already stops probing temp if the drive is sleeping)

desbma commented 5 years ago

I have created a smartctl branch, with a new --smartctl switch to force probing using it.

If you could test, and check nothing catches fire, that would be nice.

desbma commented 5 years ago

Also scttempsts is not always supported, but when it is, it never spins drives up.
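For illustration, pulling the temperature out of the `smartctl -l scttempsts` output quoted earlier in this thread might look like the sketch below. `parse_sct_temperature` is a hypothetical name, not the code from the branch, and the line format is assumed from the smartctl 6.6 output shown above. Returning None covers the case where the line is missing, for example when SCT is not supported.

```python
import re
from typing import Optional

# Line format assumed from the smartctl 6.6 output quoted in this thread.
_SCT_TEMP_RE = re.compile(r"^Current Temperature:\s+(\d+)\s+Celsius", re.MULTILINE)


def parse_sct_temperature(output: str) -> Optional[int]:
    """Return the SCT 'Current Temperature' in Celsius, or None if not present."""
    match = _SCT_TEMP_RE.search(output)
    return int(match.group(1)) if match else None
```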

desbma commented 5 years ago

I have just added support for detection and use of SCT query (smartctl -l scttempsts ...), in the smartctl branch.

If that works well, I'll release an official version with it, and then maybe make it the default.

Advantages I see:

the tool is maintained, and possibly supports more devices
SCT temp query allows querying temperature of drive in standby mode, for more models than the use of the HGST specific hdparm method

kaffeemonster commented 5 years ago

Wow, didn't expect such a fast response. Thanks :)

Sorry for the late response, was changing my CPU-Cooler. Downloaded it and am testing right now. But looks good from a first glance.

Some random musing from the top of my head:

I took a peek at your code and the hddtemp.db (It's plain text). SMART-Value-ID 194 is the temp for most drives in there. But there are some, i guess older drives, which use a different ID. It looks like they are legacy and probably dead by now? (cough IBM Death^b^b^bskstar cough). If you want to make it the default, is there a fallback in the code to use hddtemp?

While a majority of drives behave sanely and deliver the "normal" temperature, i looked through the hddtemp code and hddtemp.db. There is one drive which reports its readout in Fahrenheit (Toshiba 20GB IDE 2.5" drive). I also read somewhere that some, i guess older, drives report generally funky stuff. Like a value between 0 and 255 as a percentage of the drive operational value, or everything shifted one digit to the left. Maybe there should be some sanity checks? (values over 150 probably mean: the drive is on fire)

Is the smartctl output stable with respect to LANG? Maybe you need to force LANG like in the call to hddtemp.

I had a look at the smartctl drive db (https://www.smartmontools.org/browser/trunk/smartmontools/drivedb.h). A search for "temp" revealed a lot of variations (like 194 Drive_Temperature, 194 Temperature_Celsius, 194 Primary_Temperature). Maybe the filter needs to be a little more broad?

Maybe some calls to hdparm could also be killed (hdparm -I, hdparm -C). -C also makes problems ATM (2019-04-28 21:06:06,759 DEBUG [sdc ST8000VN0022-2EL112] Drive state: UNKNOWN), and i don't know if this throws off your internal drive state tracking. Plus i don't know how well hdparm works with NVME and SAS/SCSI drives.

Line 923: bin_dep.check_bin_dependency(("hddtemp", "hdparm")) Does this also need smartctl?

desbma commented 5 years ago

If you want to make it the default, is there a fallback in the code to use hddtemp?

For now, it's in a branch, not even in a release, and not enabled unless you use --smartctl. The plan is to push this into a release, and then maybe later invert the behavior so that smartctl is the default, and a switch allows the old way of probing.

While a majority of drives behave sanely and deliver the "normal" temperature, i looked through the hddtemp code and hddtemp.db. There is one drive which reports its readout in Fahrenheit (Toshiba 20GB IDE 2.5" drive). I also read somewhere that some, i guess older, drives report generally funky stuff. Like a value between 0 and 255 as a percentage of the drive operational value, or everything shifted one digit to the left. Maybe there should be some sanity checks? (values over 150 probably mean: the drive is on fire)

That is my biggest fear, that hddtemp does tons of ugly workarounds that may be needed with smartctl. However it seems (I have not looked in details) that they also "interpret" raw values, and have device specific differences.
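The sanity check suggested in the quoted comment could be sketched roughly like this. Everything here is an assumption for illustration: the function name, the unit flag, and the lower bound are made up, and only the upper bound of 150 comes from the comment above.

```python
from typing import Optional

# Plausibility window in Celsius. The upper bound follows the "over 150" remark
# quoted above; the lower bound is an arbitrary assumption for this sketch.
MIN_PLAUSIBLE_C = -40
MAX_PLAUSIBLE_C = 150


def sanitize_temperature(raw: int, unit: str = "C") -> Optional[int]:
    """Convert a raw reading to Celsius and reject implausible values."""
    celsius = round((raw - 32) * 5 / 9) if unit == "F" else raw
    if MIN_PLAUSIBLE_C <= celsius <= MAX_PLAUSIBLE_C:
        return celsius
    # Out of range: more likely a misread attribute than a real temperature.
    return None
```

A reading shifted one digit to the left (340 instead of 34) or a 0 to 255 percentage near the top of its range would be dropped by this check, but a wrong value that happens to land inside the window would still slip through.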

Is the smartctl output stable with respect to LANG?

Yes, unlike hddtemp.

I had a look at the smartctl drive db (https://www.smartmontools.org/browser/trunk/smartmontools/drivedb.h). A search for "temp" revealed a lot of variations (like 194 Drive_Temperature, 194 Temperature_Celsius, 194 Primary_Temperature). Maybe the filter needs to be a little more broad?

Thanks, I'll look at this.

Maybe some calls to hdparm could also be killed (hdparm -I, hdparm -C). -C also makes problems ATM (2019-04-28 21:06:06,759 DEBUG [sdc ST8000VN0022-2EL112] Drive state: UNKNOWN), and i don't know if this throws off your internal drive state tracking.

I have thought about this, and it should work (not crash and continue to track temps and adjust fan speeds), but it will just not detect if your drives are spun down.

Plus i don't know how well hdparm works with NVME and SAS/SCSI drives.

I don't know either because I don't have any to test.

Line 923: bin_dep.check_bin_dependency(("hddtemp", "hdparm")) Does this also need smartctl?

Nope because smartctl is only needed if the switch is used, so other users don't need it.

kaffeemonster commented 5 years ago

That is my biggest fear, that hddtemp does tons of ugly workarounds that may be needed with smartctl. However it seems (I have not looked in details) that they also "interpret" raw values, and have device specific differences.

I quickly read through the hddtemp code a little bit, and from my understanding: There is only a drive quirk for "Drive puts out Fahrenheit". That Toshiba.

The rest really boils down to: Find drive type (ATA, SATA, SCSI).

If SCSI: Read Drive Log Page "Temperature", do stuff.

If ATA/SATA (only different because one is raw ATA commands, the other ATA commands funnelled through the Linux SCSI subsystem):

Is drive in DB? No? no reading
Is SMART-Register in DB 0? no reading
Get SMART Values
return SMART-Register specified in DB
maybe convert this raw value from Celsius to Fahrenheit (or the other way round)
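The ATA/SATA steps listed above can be written out as a short sketch. This is not hddtemp's code (which is C): the names are hypothetical, the database is simplified to an exact-match dictionary, and the SMART read is passed in as a callable so the flow stays visible.

```python
from typing import Callable, Dict, NamedTuple, Optional


class DbEntry(NamedTuple):
    attribute_id: int  # SMART attribute holding the temperature, 0 = none known
    unit: str          # unit of the raw value, "C" or "F"


def read_temperature(
    model: str,
    db: Dict[str, DbEntry],
    read_smart_values: Callable[[], Dict[int, int]],
    want_unit: str = "C",
) -> Optional[float]:
    """Follow the lookup steps described above; None means 'no reading'."""
    entry = db.get(model)
    if entry is None:            # drive not in DB
        return None
    if entry.attribute_id == 0:  # DB says there is no usable SMART register
        return None
    raw = read_smart_values().get(entry.attribute_id)
    if raw is None:
        return None
    if entry.unit == want_unit:  # raw value is passed through unchanged
        return float(raw)
    if entry.unit == "F":        # Fahrenheit drive, Celsius wanted
        return (raw - 32) * 5 / 9
    return raw * 9 / 5 + 32      # Celsius drive, Fahrenheit wanted
```

Note that, as in the description, the only thing standing between a funky raw value and the caller is whether the drive has a DB entry at all.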

I can't see any special handling (scaling/capping/translation). Only unit change (Fahrenheit-Celsius or the other way round). I guess the funky/unsafe value safeguard is: the drive is not and will not get in the DB.

Literally:

scsi.c: dsk->value = buffer[9];
ata.c & sata.c: dsk->value = *(field+3);

The values are read straight from the buffer containing the drive answer and are then later passed to printf.

Or i have the wrong source code...

desbma commented 5 years ago

Regarding others 194 attributes:

Temperature_Centigrade is only shown that way for very old smartctl versions, I am not sure if Drive_Temperature is still used at all, and I am definitely sure Primary_Temperature is not used (the code is commented out).

So for now I'd rather only parse Temperature_Celsius, to avoid taking the risk of parsing the wrong output.
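A name-based match of that single row could look like the sketch below. It is not the code from the branch: `parse_temperature_celsius` is a hypothetical name, and the column layout (raw value in the tenth whitespace-separated field) is assumed from the `smartctl -A` output quoted in this thread.

```python
from typing import Optional


def parse_temperature_celsius(output: str) -> Optional[int]:
    """Return the raw value of the '194 Temperature_Celsius' row, or None."""
    for line in output.splitlines():
        fields = line.split()
        # Columns: ID, name, flag, value, worst, thresh, type, updated,
        # when_failed, raw value (possibly followed by extra detail).
        if len(fields) >= 10 and fields[:2] == ["194", "Temperature_Celsius"]:
            if fields[9].isdigit():
                return int(fields[9])
    return None
```

Matching on both the ID and the name keeps other 194 variants and attribute 190 out, which is the conservative behaviour described above.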

kaffeemonster commented 5 years ago

So, i bought an NVME drive to test, and, yeah finally put my system on solid state.

So, what does Smartctl say?

smartctl -A /dev/nvme0
smartctl 6.6 2017-11-05 r4594 [x86_64-linux-4.20.17-gentoo] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF SMART DATA SECTION ===
SMART/Health Information (NVMe Log 0x02, NSID 0xffffffff)
Critical Warning: 0x00
Temperature: 37 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 0%
Data Units Read: 419.309 [214 GB]
Data Units Written: 379.116 [194 GB]
Host Read Commands: 1.712.794
Host Write Commands: 1.563.538
Controller Busy Time: 10
Power Cycles: 4
Power On Hours: 2
Unsafe Shutdowns: 4
Media and Data Integrity Errors: 0
Error Information Log Entries: 0
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0

So, looks like -A works, only the output looks a lot different...

(after being burned many years ago with an Intel SSD I am still on spinning rust because i don't trust these things...)

desbma commented 5 years ago

Should be parsed with 600e869c44c4bc5c1f13f721b4a57d4dc1b9f85d.

EDIT: The state detection with hdparm is probably not working with NVMe though.

nightah commented 5 years ago

@desbma, it seems some SSD drives can also utilise attribute 190, for example my 2.5" 850 EVO.

Model Family:     Samsung based SSDs
Device Model:     Samsung SSD 850 EVO 500GB
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches

sudo smartctl -A /dev/sdn
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-5.0.13-arch1-1-ARCH] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   097   097   000    Old_age   Always       -       11819
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       74
177 Wear_Leveling_Count     0x0013   068   068   000    Pre-fail  Always       -       677
179 Used_Rsvd_Blk_Cnt_Tot   0x0013   100   100   010    Pre-fail  Always       -       0
181 Program_Fail_Cnt_Total  0x0032   100   100   010    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   100   100   010    Old_age   Always       -       0
183 Runtime_Bad_Block       0x0013   100   100   010    Pre-fail  Always       -       0
187 Uncorrectable_Error_Cnt 0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0032   063   047   000    Old_age   Always       -       37
195 ECC_Error_Rate          0x001a   200   200   000    Old_age   Always       -       0
199 CRC_Error_Count         0x003e   100   100   000    Old_age   Always       -       0
235 POR_Recovery_Count      0x0012   099   099   000    Old_age   Always       -       30
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       88197724304

According to Acronis

Attribute ID: 190 (0xBE)

Hard drives, supporting this attribute
Seagate, Samsung, Western Digital

Description
Temperature Difference from 100 (Airflow Temperature) S.M.A.R.T. parameter indicates the temperature of the air inside the Seagate and Samsung hard disk housing. The value is equal to [100 – specified by manufacturer temperature °C], which allows setting the minimum threshold.

It appears that you could probe for attribute 194, then 190, and then Temperature: for some NVME?

I've checked another Kingston 2.5" and Samsung M.2 SSDs and both of those appear to have attribute 194.

Device Model:     KINGSTON SUV400S37120G
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches

Model Family:     Samsung based SSDs
Device Model:     SAMSUNG MZNTY128HDHP-000L1
Rotation Rate:    Solid State Device
Form Factor:      M.2
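The probing order suggested above (attribute 194, then 190, then the NVMe "Temperature:" line) could be sketched as follows. The names are hypothetical and both patterns are assumptions based only on the `smartctl -A` outputs quoted in this thread.

```python
import re
from typing import Optional

# ATA attribute row: ID, name, flag, six more columns, then the raw value.
_ROW_RE = re.compile(
    r"^\s*(\d+)\s+\S+\s+0x[0-9a-fA-F]{4}\s+(?:\S+\s+){6}(\d+)", re.MULTILINE
)
# NVMe health log line, as printed by the smartctl 6.6 output quoted above.
_NVME_RE = re.compile(r"^Temperature:\s+(\d+)\s+Celsius", re.MULTILINE)


def probe_temperature(output: str) -> Optional[int]:
    """Try attribute 194, then 190, then the NVMe 'Temperature:' line."""
    raw = {int(m.group(1)): int(m.group(2)) for m in _ROW_RE.finditer(output)}
    for attr_id in (194, 190):
        if attr_id in raw:
            return raw[attr_id]
    nvme = _NVME_RE.search(output)
    return int(nvme.group(1)) if nvme else None
```

This reads the RAW_VALUE column, which is 37 for attribute 190 in the 850 EVO output above, rather than the normalized VALUE column (063).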

kaffeemonster commented 5 years ago

Hey @nightah , I'm not very fluent in python, could you check this patch:

https://gist.github.com/kaffeemonster/6983ae3e32bd6edc82a19d39e52e0873

desbma commented 5 years ago

Attribute 190 is now supported with https://github.com/desbma/hddfancontrol/commit/93aeb73d94e457f636214d3eb764e089fd75eda4.

I also have merged all the smartctl code in the master branch, so this will be in the next release.

nightah commented 5 years ago

@desbma I just added a number of drives yesterday and it looks like SCSI/SAS drives report slightly differently, as Current Drive Temperature:

# smartctl -A /dev/sda
smartctl version 5.37 [i686-pc-linux-gnu] Copyright (C) 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

Current Drive Temperature:     42 C
Drive Trip Temperature:        68 C
Elements in grown defect list: 0
Vendor (Seagate) cache information
  Blocks sent to initiator = 1666124337
  Blocks received from initiator = 1517744621
  Blocks read from cache and sent to initiator = 384030649
  Number of read and write commands whose size <= segment size = 21193148
  Number of read and write commands whose size > segment size = 1278317
Vendor (Seagate/Hitachi) factory information
  number of hours powered up = 19.86
  number of minutes until next internal SMART test = 108

https://github.com/smartmontools/smartmontools/blob/1f3ff52f06c2c281f7531a6c4bd7dc32eac00201/smartmontools/scsiprint.cpp#L328-L330

desbma commented 5 years ago

Thanks, that output is now also parsed via 0aa4de19dddfd241e405184c3dd788643fc294e3.
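For reference, matching the SCSI/SAS line from the output quoted above might look like this sketch. It is not the code from that commit: `parse_scsi_temperature` is a hypothetical name and the line format is taken from the smartctl 5.37 sample only.

```python
import re
from typing import Optional

# Line format assumed from the smartctl 5.37 SCSI output quoted above.
_SCSI_TEMP_RE = re.compile(r"^Current Drive Temperature:\s+(\d+)\s+C\b", re.MULTILINE)


def parse_scsi_temperature(output: str) -> Optional[int]:
    """Return the SCSI 'Current Drive Temperature' in Celsius, or None."""
    match = _SCSI_TEMP_RE.search(output)
    return int(match.group(1)) if match else None
```

Anchoring on "Current" keeps the "Drive Trip Temperature" line from being picked up by mistake.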

desbma commented 4 years ago

Good news: new Linux kernels will have a native driver to read drive temperatures: https://www.phoronix.com/scan.php?page=news_item&px=2020-Linux-Kernel-SATA-Temps http://lkml.iu.edu/hypermail/linux/kernel/1912.1/08676.html

On those kernels, we won't need to call hddtemp or smartctl. Reading a simple file in /sys/class/hwmon will do the job.
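Reading such a file could be as small as the sketch below. `read_hwmon_celsius` is a hypothetical helper and the default sensor file name is an assumption. The hwmon sysfs convention is that `temp*_input` files hold millidegrees Celsius.

```python
from pathlib import Path
from typing import Union


def read_hwmon_celsius(hwmon_dir: Union[str, Path], sensor: str = "temp1_input") -> float:
    """Read one hwmon temperature file and convert millidegrees to degrees Celsius."""
    millidegrees = int((Path(hwmon_dir) / sensor).read_text().strip())
    return millidegrees / 1000
```

Called as `read_hwmon_celsius("/sys/class/hwmon/hwmon1")`, this would return 37.0 for a file containing 37000. Which hwmon directory belongs to which drive is a separate question.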

kaffeemonster commented 4 years ago

Cool :) I've read news about the NVME-hwmon driver, but missed the new driver for SATA drives.

Has anyone already had a peek at the code if SCSI is also supported?

Could still be beneficial to first query the drives with smartctl to find their SCT Command Transport capability, so hddfancontrol knows if a read from the hwmon node wakes up the drive.

nightah commented 4 years ago

Considering 5.5 is still in RC it'd be interesting to see how far out 5.6 is but this is great news.

@kaffeemonster the posts seem to suggest that this should work for SCSI drives too if I'm interpreting the information correctly. @desbma happy to support with testing where possible when/if you decide to cut over to this.

zalaare commented 3 years ago

I have one SAS drive that I had to remove from my hddfancontrol -d array because it couldn't read the temperature. I just loaded drivetemp module and it does actually work (5.8.18). However I can't begin to tell which hwmon[0-9]* correlates to which drive :(. I hope @desbma you can figure this out.

desbma commented 3 years ago

@zalaare A small script like this will display the available monitors, their names, and the values and labels (if any) of their sensors:

# List each hwmon device with its name, then its temperature sensors
# (label and raw value; hwmon reports millidegrees Celsius)
for hwmon in $(find /sys/class/hwmon -mindepth 1 -maxdepth 1 -type l)
do
  echo "$(basename "${hwmon}"): $(cat "${hwmon}/name" 2> /dev/null || echo ?)"
  for sensor in $(find "${hwmon}/" -name 'temp*_input')
  do
    echo -e "\t- $(basename "${sensor}") ($(cat "${sensor%_*}_label" 2> /dev/null || echo ?)): $(cat "${sensor}")"
  done
done

You can try the sensors program which also formats and displays sensor data.

The problem with the hwmonX interface is that the numbering can change at every boot...
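One way around the changing numbers might be to resolve each drivetemp monitor back to its block device at startup. The sketch below assumes, without this thread confirming it, that a drivetemp hwmon directory has a `device` link whose `block/` subdirectory names the drive (for example sda). `map_drivetemp_sensors` and the `sysfs_root` parameter are hypothetical.

```python
from pathlib import Path
from typing import Dict


def map_drivetemp_sensors(sysfs_root: str = "/sys") -> Dict[str, Path]:
    """Map block device names (e.g. 'sda') to their drivetemp temp1_input file."""
    mapping: Dict[str, Path] = {}
    for hwmon in sorted(Path(sysfs_root, "class", "hwmon").glob("hwmon*")):
        name_file = hwmon / "name"
        # Skip monitors that are not provided by the drivetemp module.
        if not name_file.is_file() or name_file.read_text().strip() != "drivetemp":
            continue
        # Assumed layout: <hwmon>/device/block/<block device name>
        block_dir = hwmon / "device" / "block"
        if block_dir.is_dir():
            for block in block_dir.iterdir():
                mapping[block.name] = hwmon / "temp1_input"
    return mapping
```

Because the lookup goes through the device link rather than the hwmonX number, it would need to be redone after every boot but would not depend on the numbering staying stable.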

zalaare commented 3 years ago

I know I said that the drivetemp module worked, but honestly it still doesn't report the SAS drive's temperature. I only noticed when I got around to actually counting the drivetemp-* outputs (I should have had 10, but I get 9).

I've been verifying healthy temperatures via


# Find all /dev/sd[x] partitions that are part of the existing array
UUID=16693625290857223179
list=()

for dev in $(blkid -t UUID=${UUID} -o device) ; do
  # keep only device names ending in 1
  if [[ ${dev} =~ 1$ ]]; then
    list+=( "${dev}" )
  fi
done

case $1 in
  watch)
    watch -n 5 "hddtemp --quiet ${list[*]} 2>/dev/null"
    ;;
  *)
    for device in "${list[@]}" ; do
      # silence hddtemp's own errors, then keep the third ':'-separated field
      temperature=$(hddtemp --quiet "${device}" 2>/dev/null | awk -F ':' '{print $3}')
      echo "${device} ${temperature}"
    done
    ;;
esac

for years now. Luckily, having 7 of 8 drives monitored by hddfancontrol, with only a couple of case fans affected by the result, means the SAS drive stays in a similar temperature range anyway.

desbma commented 2 years ago

Temperature querying with smartctl has now been supported as opt-in for a few years in hddfancontrol and seems to work well, so I am closing this. On a related note, I am now testing the native drivetemp kernel module as a temperature source, see https://github.com/desbma/hddfancontrol/issues/34#issuecomment-965782692

desbma / hddfancontrol

Enhancement: Switch from hdparm to smartctl for some hdd queries #21