Closed kaffeemonster closed 2 years ago
Normally I would have told you "not my problem, report the issue to the hddtemp project", but it is unmaintained, so I might consider replacing it.
Can you properly read temperature with smartctl?
Does that work when the drive is spun down? To test manually, you can usually spin it down with hdparm -y xxx when no disk activity occurs, and confirm by ear.
The "spinned down test" has to wait for tomorrow. Need to connect the cold spare.
I mean ATM the IronWolf isn't spun down or in standby, it's just in an idle mode hddtemp doesn't understand because too new?
smartctl can read temps. The classic grep out of the SMART-Values (smartctl -A). And there are the (newer?) SMART Command Transport (SCT) commands.
Oh boy:
smartctl -l scttempsts /dev/sdx
...
=== START OF READ SMART DATA SECTION ===
SCT Status Version: 3
SCT Version (vendor specific): 522 (0x020a)
SCT Support Level: 1
Device State: Active (0)
Current Temperature: 41 Celsius
Power Cycle Min/Max Temperature: 30/41 Celsius
Lifetime Min/Max Temperature: 21/55 Celsius
Under/Over Temperature Limit Count: 0/4
There is a lot more info (complete log with histogram), quite frightening what a modern drive logs...
hddtemp mostly does nothing different than querying the SMART counter for temp. I guess the only problem is: hddtemp can only query some drives in spun-down mode.
Have to run some tests tomorrow.
Hmmm, have finally found out what IDLE_B is: T10/09-054, T13/452-2008 https://www.seagate.com/files/docs/pdf/whitepaper/tp608-powerchoice-tech-provides-us.pdf
So, let's see. My IronWolf ST8000 cold spare under test. Drive is spinning and not in standby
Hdparm gives:
hdparm -C /dev/sdc
/dev/sdc: drive state is: unknown
hddtemp gives:
hddtemp /dev/sdc
/dev/sdc: ST8000VN0022-xxxxxx: drive is sleeping
ok, again with permission to wake up:
hddtemp -w /dev/sdc
/dev/sdc: ST8000VN0022-xxxxxx: 32 C
what has smartctl to say?
smartctl -i -n standby /dev/sdc | grep mode
Power mode was: IDLE_B
temps?
smartctl -A -n standby /dev/sdc | grep -i temp
190 Airflow_Temperature_Cel 0x0022 066 066 040 Old_age Always - 34 (Min/Max 25/34)
194 Temperature_Celsius 0x0022 034 040 000 Old_age Always - 34 (0 25 0 0 0)
lets spin the drive down:
hdparm -y /dev/sdc
/dev/sdc: issuing standby command
Drive has spun down. lets check again.
hdparm -C /dev/sdc
/dev/sdc: drive state is: standby
Drive is still spun down. good. Can we read the temp?
hddtemp /dev/sdc
/dev/sdc: ST8000VN0022-xxxxxx: drive is sleeping
Nope, no sneaky standby read capability. If we allow wakeup?
hddtemp -w /dev/sdc
/dev/sdc: Success
Hmmm, nice temps there ;-) Drive has spun up, but that's definitely not an integer... Maybe that's the reason for the not-an-int exceptions in another GitHub issue i saw? Let's retest:
hddtemp -w /dev/sdc
/dev/sdc: ST8000VN0022-xxxxxx: 35 C
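A caller that expects an integer from hddtemp will trip over replies like "Success" or "drive is sleeping". Below is a minimal Python sketch of a tolerant parser for the line formats shown above. The function name parse_hddtemp_line is made up for illustration and is not taken from hddfancontrol; the regular expression only assumes that a valid reading ends in a number followed by C or F.

```python
import re
from typing import Optional

# Matches the trailing "<number> C" / "<number>°C" (or F) of a normal hddtemp
# line, e.g. "/dev/sdc: ST8000VN0022-xxxxxx: 35 C".
_HDDTEMP_RE = re.compile(r":\s*(-?\d+)\s*°?\s*([CF])\s*$")


def parse_hddtemp_line(line: str) -> Optional[int]:
    """Return the temperature in Celsius, or None if the line has no reading.

    Hypothetical helper for illustration only.
    """
    match = _HDDTEMP_RE.search(line.strip())
    if match is None:
        # Covers "drive is sleeping", "Success" and any other non-numeric reply.
        return None
    value = int(match.group(1))
    if match.group(2) == "F":
        # Convert Fahrenheit readings to Celsius.
        value = round((value - 32) * 5 / 9)
    return value
```

Returning None instead of raising lets the caller decide whether to retry, as was done by hand in the retest above.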
OK. back to standby, hdparm -y, drive has spun down. what does smartctl do?
smartctl -i -n standby /dev/sdc
Device is in STANDBY mode, exit(2)
OK, fair point, we asked smartctl to do nothing if the drive is in standby. Let's test again:
smartctl -i -n sleep /dev/sdc | grep mode
Power mode was: STANDBY
Drive is still spun down. Can we get the temps?
smartctl -A -n sleep /dev/sdc | grep -i temp
190 Airflow_Temperature_Cel 0x0022 065 064 040 Old_age Always - 35 (Min/Max 25/36)
194 Temperature_Celsius 0x0022 035 040 000 Old_age Always - 35 (0 25 0 0 0)
Yes, but the drive has spun up, since we didn't block operation at standby.
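The three kinds of smartctl replies seen so far ("Power mode is:", "Power mode was:" and "Device is in ... mode, exit(2)") can be folded into one power-state lookup. This is only a sketch based on the outputs pasted in this thread; the helper names are hypothetical, and treating IDLE_B as "still spinning" is an assumption drawn from the observation above that the drive was audibly spinning in that mode.

```python
import re
from typing import Optional

# "Power mode is: ACTIVE or IDLE" / "Power mode was: IDLE_B"
_POWER_MODE_RE = re.compile(r"^Power mode (?:is|was):\s*(.+?)\s*$", re.MULTILINE)
# "Device is in STANDBY mode, exit(2)" (printed when -n blocks the query)
_SKIPPED_RE = re.compile(r"^Device is in (\S+) mode, exit\(\d+\)", re.MULTILINE)


def parse_power_mode(output: str) -> Optional[str]:
    """Extract the power mode string from smartctl -i -n <mode> output."""
    match = _SKIPPED_RE.search(output) or _POWER_MODE_RE.search(output)
    return match.group(1) if match else None


def is_spun_down(mode: Optional[str]) -> bool:
    """Assumption: only STANDBY and SLEEP mean the platters have stopped."""
    return mode in ("STANDBY", "SLEEP")
```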
Let's go for another test, hdparm -y, drive has spun down:
smartctl -n sleep -l scttempsts /dev/sdc
smartctl 6.6 2017-11-05 r4594 (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF READ SMART DATA SECTION ===
SCT Status Version: 3
SCT Version (vendor specific): 522 (0x020a)
SCT Support Level: 1
Device State: Active (0)
Current Temperature: 36 Celsius
Power Cycle Min/Max Temperature: 25/36 Celsius
Lifetime Min/Max Temperature: 25/36 Celsius
Under/Over Temperature Limit Count: 0/0
Drive has NOT spun up! I repeat: Drive is still spun down. NICE!!
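For reference, pulling the current temperature out of the SCT status output above takes one regular expression. A minimal Python sketch, with a hypothetical function name, that assumes the "Current Temperature: N Celsius" line looks exactly as in the paste:

```python
import re
from typing import Optional

_SCT_TEMP_RE = re.compile(r"^Current Temperature:\s*(-?\d+)\s+Celsius", re.MULTILINE)


def parse_sct_temperature(output: str) -> Optional[int]:
    """Return the SCT current temperature in Celsius, or None if absent.

    Hypothetical helper; a None result would be the cue to fall back to
    another probing method.
    """
    match = _SCT_TEMP_RE.search(output)
    return int(match.group(1)) if match else None
```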
I marked this issue as enhancement for a reason. At the moment everything mostly works (at least for me), and switching from one program to another is always opening up a can of worms. But if hddtemp does not receive any maintenance (and hdparm also lagging behind for that matter too...), sooner or later this switch may be necessary. I think smartmontools also better handles SAS/SCSI drives (hdparm i believe does not work there, instead you need sdparm). And probably also NVME...
Thanks for the data.
I'll try smartctl with my own drives, and see if the temperature querying seems reliable.
My findings so far:
- smartctl -A and the Temperature_Celsius attribute seem to report correct values for my drives (HGST, Seagate and Western Digital)
- the -n standby parameter only tries to guess if the drive would spin up, but since it varies for drives, it's not that useful (hddfancontrol already stops probing temp if the drive is sleeping)
I have created a smartctl branch, with a new --smartctl switch to force probing using it.
If you could test, and check nothing catches fire, that would be nice.
Also scttempsts is not always supported, but when it is, it never spins drives up.
I have just added support for detection and use of SCT query (smartctl -l scttempsts ...), in the smartctl branch.
If that works well, I'll release an official version with it, and then maybe make it the default.
Advantages I see:
Wow, didn't expect such a fast response. Thanks :)
Sorry for the late response, was changing my CPU-Cooler. Downloaded it and am testing right now. But looks good from a first glance.
Some random musing from the top of my head:
- I took a peek at your code and the hddtemp.db (It's plain text). SMART-Value-ID 194 is the temp for most drives in there. But there are some, i guess older drives, which use a different ID. It looks like they are legacy and probably dead by now? (cough IBM Death^b^b^bskstar cough). If you want to make it the default, is there a fallback in the code to use hddtemp?
- While a majority of drives behave sanely and deliver the "normal" temperature, i looked through the hddtemp code and hddtemp.db. There is one drive which reports its readout in Fahrenheit (Toshiba 20GB IDE 2.5" drive). I also read somewhere that some, i guess older, drives report generally funky stuff. Like a value between 0 and 255 as a percentage of the drive operational value, or everything shifted one digit to the left. Maybe there should be some sanity checks? (values over 150 probably mean: the drive is on fire)
- Is the smartctl output stable with respect to LANG? Maybe you need to force LANG like in the call to hddtemp.
- I had a look at the smartctl drive db (https://www.smartmontools.org/browser/trunk/smartmontools/drivedb.h). A search for "temp" revealed a lot of variations (like 194 Drive_Temperature, 194 Temperature_Celsius, 194 Primary_Temperature). Maybe the filter needs to be a little more broad?
- Maybe some calls to hdparm could also be killed (hdparm -I, hdparm -C). -C also makes problems ATM (2019-04-28 21:06:06,759 DEBUG [sdc ST8000VN0022-2EL112] Drive state: UNKNOWN), and i don't know if this throws off your internal drive state tracking. Plus i don't know how well hdparm works with NVME and SAS/SCSI drives.
- Line 923: bin_dep.check_bin_dependency(("hddtemp", "hdparm")) Does this also need smartctl?
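The sanity check suggested in the list above could look roughly like the following Python sketch. Everything in it is an assumption for illustration: the function names are invented, the upper bound of 150 is taken loosely from the "drive is on fire" remark, and the lower bound of -40 is an arbitrary placeholder.

```python
from typing import Optional

# Hypothetical plausibility window in Celsius (assumed values, not from any spec).
MIN_PLAUSIBLE_C = -40
MAX_PLAUSIBLE_C = 150


def fahrenheit_to_celsius(value_f: float) -> int:
    """Convert a Fahrenheit reading to the nearest whole Celsius degree."""
    return round((value_f - 32) * 5 / 9)


def sanitize_temperature(raw: int, unit: str = "C") -> Optional[int]:
    """Return a plausible Celsius value, or None if the reading looks bogus."""
    celsius = fahrenheit_to_celsius(raw) if unit == "F" else raw
    if MIN_PLAUSIBLE_C <= celsius <= MAX_PLAUSIBLE_C:
        return celsius
    # Out-of-range readings (for example 255 from a drive reporting a
    # percentage) are rejected rather than passed on to the fan logic.
    return None
```

Rejecting a reading is a design choice; a fan controller might prefer to treat an implausible value as "hot" and go to full speed instead.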
> If you want to make it the default, is there a fallback in the code to use hddtemp?
For now, it's in a branch, not even in a release, and not enabled unless you use --smartctl. The plan is to push this into a release, and then maybe later reverse the behavior so that smartctl is the default, and a switch allows the old way of probing.
> While a majority of drives behave sanely and deliver the "normal" temperature, i looked through the hddtemp code and hddtemp.db. There is one drive which reports its readout in Fahrenheit (Toshiba 20GB IDE 2.5" drive). I also read somewhere that some, i guess older, drives report generally funky stuff. Like a value between 0 and 255 as a percentage of the drive operational value, or everything shifted one digit to the left. Maybe there should be some sanity checks? (values over 150 probably mean: the drive is on fire)
That is my biggest fear, that hddtemp does tons of ugly workarounds that may be needed with smartctl. However it seems (I have not looked in detail) that they also "interpret" raw values, and have device-specific differences.
> Is the smartctl output stable with respect to LANG?
Yes, unlike hddtemp.
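For tools whose output does depend on the locale, a common approach is to run them with a fixed C locale. A small Python sketch of that idea follows; the helper names are hypothetical and this is not necessarily how hddfancontrol does it.

```python
import os
import subprocess
from typing import Dict, List, Optional


def c_locale_env(base: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Copy an environment and force the C locale for predictable output."""
    env = dict(os.environ if base is None else base)
    env["LANG"] = "C"
    env["LC_ALL"] = "C"  # LC_ALL overrides every other LC_* setting
    return env


def run_tool(cmd: List[str]) -> str:
    """Run a command with the C locale and return its standard output."""
    result = subprocess.run(
        cmd, env=c_locale_env(), check=True, capture_output=True, text=True
    )
    return result.stdout
```

With this, a call such as run_tool(["hddtemp", "/dev/sdc"]) would produce the same text regardless of the user's language settings.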
> I had a look at the smartctl drive db (https://www.smartmontools.org/browser/trunk/smartmontools/drivedb.h). A search for "temp" revealed a lot of variations (like 194 Drive_Temperature, 194 Temperature_Celsius, 194 Primary_Temperature). Maybe the filter needs to be a little more broad?
Thanks, I'll look at this.
> Maybe some calls to hdparm could also be killed (hdparm -I, hdparm -C). -C also makes problems ATM (2019-04-28 21:06:06,759 DEBUG [sdc ST8000VN0022-2EL112] Drive state: UNKNOWN), and i don't know if this throws off your internal drive state tracking.
I have thought about this, and it should work (not crash and continue to track temps and adjust fan speeds), but it will just not detect if your drives are spun down.
> Plus i don't know how well hdparm works with NVME and SAS/SCSI drives.
I don't know either because I don't have any to test.
> Line 923: bin_dep.check_bin_dependency(("hddtemp", "hdparm")) Does this also need smartctl?
Nope, because smartctl is only needed if the switch is used, so other users don't need it.
> That is my biggest fear, that hddtemp does tons of ugly workarounds that may be needed with smartctl. However it seems (I have not looked in details) that they also "interpret" raw values, and have device specific differences.
I quickly read through the hddtemp code, and from my understanding: there is only one drive quirk, for "drive puts out Fahrenheit". That Toshiba.
The rest really boils down to: Find drive type (ATA, SATA, SCSI). If SCSI: Read Drive Log Page "Temperature", do stuff. If ATA/SATA (only different because one is raw ATA commands, the other ATA commands funnelled through the Linux SCSI subsystem):
I can't see any special handling (scaling/capping/translation). Only unit change (Fahrenheit-Celsius or the other way round). I guess the safeguard against funky/unsafe values is: the drive is not and will not get in the DB.
Literally:
scsi.c: dsk->value = buffer[9];
ata.c & sata.c: dsk->value = *(field+3);
The values are read straight from the buffer containing the drive answer and are then later passed to printf.
Or i have the wrong source code...
Regarding other 194 attributes:
Temperature_Centigrade is only shown that way for very old smartctl versions, I am not sure if Drive_Temperature is still used at all, and I am definitely sure Primary_Temperature is not used (the code is commented out).
So for now I'd rather only parse Temperature_Celsius, to avoid taking the risk of parsing the wrong output.
So, i bought an NVME drive to test, and, yeah finally put my system on solid state.
So, what does Smartctl say?
smartctl -A /dev/nvme0
smartctl 6.6 2017-11-05 r4594 [x86_64-linux-4.20.17-gentoo] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF SMART DATA SECTION ===
SMART/Health Information (NVMe Log 0x02, NSID 0xffffffff)
Critical Warning: 0x00
Temperature: 37 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 0%
Data Units Read: 419.309 [214 GB]
Data Units Written: 379.116 [194 GB]
Host Read Commands: 1.712.794
Host Write Commands: 1.563.538
Controller Busy Time: 10
Power Cycles: 4
Power On Hours: 2
Unsafe Shutdowns: 4
Media and Data Integrity Errors: 0
Error Information Log Entries: 0
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
So, looks like -A works, only the output looks a lot different...
(after being burned many years ago with an Intel SSD I am still on spinning rust because i don't trust these things...)
Should be parsed with 600e869c44c4bc5c1f13f721b4a57d4dc1b9f85d.
EDIT: The state detection with hdparm is probably not working with NVMe though.
@desbma, it seems some SSD drives can also utilise attribute 190, for example my 2.5" 850 EVO.
Model Family: Samsung based SSDs
Device Model: Samsung SSD 850 EVO 500GB
Rotation Rate: Solid State Device
Form Factor: 2.5 inches
sudo smartctl -A /dev/sdn
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-5.0.13-arch1-1-ARCH] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
9 Power_On_Hours 0x0032 097 097 000 Old_age Always - 11819
12 Power_Cycle_Count 0x0032 099 099 000 Old_age Always - 74
177 Wear_Leveling_Count 0x0013 068 068 000 Pre-fail Always - 677
179 Used_Rsvd_Blk_Cnt_Tot 0x0013 100 100 010 Pre-fail Always - 0
181 Program_Fail_Cnt_Total 0x0032 100 100 010 Old_age Always - 0
182 Erase_Fail_Count_Total 0x0032 100 100 010 Old_age Always - 0
183 Runtime_Bad_Block 0x0013 100 100 010 Pre-fail Always - 0
187 Uncorrectable_Error_Cnt 0x0032 100 100 000 Old_age Always - 0
190 Airflow_Temperature_Cel 0x0032 063 047 000 Old_age Always - 37
195 ECC_Error_Rate 0x001a 200 200 000 Old_age Always - 0
199 CRC_Error_Count 0x003e 100 100 000 Old_age Always - 0
235 POR_Recovery_Count 0x0012 099 099 000 Old_age Always - 30
241 Total_LBAs_Written 0x0032 099 099 000 Old_age Always - 88197724304
According to Acronis
Attribute ID: 190 (0xBE)
Hard drives, supporting this attribute
Seagate, Samsung, Western Digital
Description
Temperature Difference from 100 (Airflow Temperature) S.M.A.R.T. parameter indicates the temperature of the air inside the Seagate and Samsung hard disk housing. The value is equal to [100 – specified by manufacturer temperature °C], which allows setting the minimum threshold.
It appears that you could probe for attributes 194, 190 and then Temperature: for some NVMe?
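The probing order proposed here (194, then 190, then the NVMe "Temperature:" line) can be sketched in Python as below. This is an illustration built only from the smartctl -A outputs pasted in this thread; the function name is hypothetical, and matching on both ID and attribute name is an assumption meant to follow the cautious approach of not parsing unknown variants.

```python
import re
from typing import Optional

# ATA attribute rows: ID, name, flag, value, worst, thresh, type, updated,
# when_failed, then the raw value (only its leading integer is captured).
_ATA_ROW_RE = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]+\s+\d+\s+\d+\s+\d+\s+\S+\s+\S+\s+\S+\s+(\d+)",
    re.MULTILINE,
)
# NVMe health log line, e.g. "Temperature: 37 Celsius"
_NVME_RE = re.compile(r"^Temperature:\s*(\d+)\s+Celsius", re.MULTILINE)

# Probed in this order; 194 wins when both attributes are present.
_TEMPERATURE_ATTRIBUTES = (
    (194, "Temperature_Celsius"),
    (190, "Airflow_Temperature_Cel"),
)


def parse_smartctl_attributes(output: str) -> Optional[int]:
    """Return a Celsius reading from smartctl -A output, or None."""
    rows = {
        (int(m.group(1)), m.group(2)): int(m.group(3))
        for m in _ATA_ROW_RE.finditer(output)
    }
    for key in _TEMPERATURE_ATTRIBUTES:
        if key in rows:
            return rows[key]
    match = _NVME_RE.search(output)
    return int(match.group(1)) if match else None
```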
I've checked another Kingston 2.5" and a Samsung M.2 SSD and both of those appear to have attribute 194.
Device Model: KINGSTON SUV400S37120G
Rotation Rate: Solid State Device
Form Factor: 2.5 inches
Model Family: Samsung based SSDs
Device Model: SAMSUNG MZNTY128HDHP-000L1
Rotation Rate: Solid State Device
Form Factor: M.2
Hey @nightah , I'm not very fluent in python, could you check this patch:
https://gist.github.com/kaffeemonster/6983ae3e32bd6edc82a19d39e52e0873
Attribute 190 is now supported with https://github.com/desbma/hddfancontrol/commit/93aeb73d94e457f636214d3eb764e089fd75eda4.
I also have merged all the smartctl code in the master branch, so this will be in the next release.
@desbma I just added a number of drives yesterday and it looks like SCSI/SAS drives report slightly differently, as Current Drive Temperature:
# smartctl -A /dev/sda
smartctl version 5.37 [i686-pc-linux-gnu] Copyright (C) 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
Current Drive Temperature: 42 C
Drive Trip Temperature: 68 C
Elements in grown defect list: 0
Vendor (Seagate) cache information
Blocks sent to initiator = 1666124337
Blocks received from initiator = 1517744621
Blocks read from cache and sent to initiator = 384030649
Number of read and write commands whose size <= segment size = 21193148
Number of read and write commands whose size > segment size = 1278317
Vendor (Seagate/Hitachi) factory information
number of hours powered up = 19.86
number of minutes until next internal SMART test = 108
Thanks, that output is now also parsed via 0aa4de19dddfd241e405184c3dd788643fc294e3.
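For illustration, the SCSI/SAS lines above can be picked up with a pattern like the one in this Python sketch. It is not the code from the commit; the function name and the returned keys are made up, and the pattern only assumes the "Current Drive Temperature: N C" and "Drive Trip Temperature: N C" wording from the paste.

```python
import re
from typing import Dict

_SCSI_TEMP_RE = re.compile(
    r"^(Current Drive|Drive Trip) Temperature:\s*(\d+)\s+C\b", re.MULTILINE
)

# Hypothetical short keys for the two readings.
_KEYS = {"Current Drive": "current", "Drive Trip": "trip"}


def parse_scsi_temperatures(output: str) -> Dict[str, int]:
    """Return the current and trip temperatures found in smartctl -A output."""
    return {
        _KEYS[m.group(1)]: int(m.group(2)) for m in _SCSI_TEMP_RE.finditer(output)
    }
```

The trip temperature is the drive's own over-temperature limit, so it could serve as an upper bound when picking a fan target.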
Good news: new Linux kernels will have a native driver to read drive temperatures: https://www.phoronix.com/scan.php?page=news_item&px=2020-Linux-Kernel-SATA-Temps http://lkml.iu.edu/hypermail/linux/kernel/1912.1/08676.html
On those kernels, we won't need to call hddtemp or smartctl. Reading a simple file in /sys/class/hwmon will do the job.
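Reading those files could look like this Python sketch. It relies on the standard hwmon sysfs convention that temp*_input holds millidegrees Celsius, and it assumes the driver's name file contains "drivetemp"; the function name and the hwmon_root parameter are made up for illustration.

```python
from pathlib import Path
from typing import Dict


def read_drivetemp_sensors(
    hwmon_root: Path = Path("/sys/class/hwmon"),
) -> Dict[str, float]:
    """Map each drivetemp hwmon directory name to its temperature in Celsius."""
    readings: Dict[str, float] = {}
    for hwmon in sorted(hwmon_root.glob("hwmon*")):
        name_file = hwmon / "name"
        temp_file = hwmon / "temp1_input"
        if not (name_file.is_file() and temp_file.is_file()):
            continue
        # Skip other sensor drivers (CPU, motherboard, ...).
        if name_file.read_text().strip() != "drivetemp":
            continue
        # The file holds an integer in millidegrees Celsius.
        readings[hwmon.name] = int(temp_file.read_text().strip()) / 1000.0
    return readings
```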
Cool :) I've read news about the NVME-hwmon driver, but missed the new driver for SATA drives.
Has anyone already had a peek at the code if SCSI is also supported?
Could still be beneficial to first query the drives with smartctl to find their SCT Command Transport capability, so hddfancontrol knows if a read from the hwmon node wakes up the drive.
Considering 5.5 is still in RC it'd be interesting to see how far out 5.6 is but this is great news.
@kaffeemonster the posts seem to suggest that this should work for SCSI drives too if I'm interpreting the information correctly. @desbma happy to support with testing where possible when/if you decide to cut over to this.
I have one SAS drive that I had to remove from my hddfancontrol -d array because it couldn't read the temperature. I just loaded drivetemp module and it does actually work (5.8.18). However I can't begin to tell which hwmon[0-9]* correlates to which drive :(. I hope @desbma you can figure this out.
@zalaare A small script like this will display the available monitors, their names, the values of their sensors and the sensor labels if they have any:
# List every hwmon device with its name, then each temperature sensor with its label and value
for hwmon in $(find /sys/class/hwmon -mindepth 1 -maxdepth 1 -type l)
do
    echo "$(basename "${hwmon}"): $(cat "${hwmon}/name" 2> /dev/null || echo ?)"
    for sensor in $(find "${hwmon}/" -name 'temp*_input')
    do
        # temp*_input values are in millidegrees Celsius
        echo -e "\t- $(basename "${sensor}") ($(cat "${sensor%_*}_label" 2> /dev/null || echo ?)): $(cat "${sensor}")"
    done
done
You can also try the sensors program, which formats and displays sensor data.
The problem with the hwmonX names is that they can change at every boot...
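One way around the unstable hwmonX numbering is to start from the block device instead of from /sys/class/hwmon. The Python sketch below assumes that the drivetemp hwmon node is registered as a child of the drive's device, so that it shows up under /sys/block/<dev>/device/hwmon/; that layout is an assumption to verify on your own system, and the function name and sys_block parameter are hypothetical.

```python
from pathlib import Path
from typing import Optional


def find_drivetemp_input(
    device: str, sys_block: Path = Path("/sys/block")
) -> Optional[Path]:
    """Return the temp1_input path for a block device such as "sda", if any."""
    hwmon_dir = sys_block / device / "device" / "hwmon"
    if not hwmon_dir.is_dir():
        # No hwmon child: drivetemp not loaded or the drive is unsupported.
        return None
    candidates = sorted(hwmon_dir.glob("hwmon*/temp1_input"))
    return candidates[0] if candidates else None
```

Because the lookup is keyed on the device name, it does not matter which hwmonX number the kernel hands out on a given boot.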
I know I said that the drivetemp module worked, but honestly it still doesn't report the SAS drive's temperature. I only noticed when I got around to actually counting the drivetemp-* outputs (I should have had 10, but I get 9).
I've been verifying healthy temperatures via
# find all /dev/sd[x] partitions that are part of the existing array
UUID=16693625290857223179
list=()
for dev in $(blkid -t UUID=${UUID} -o device) ; do
    # keep only first partitions (device names ending in 1)
    if [[ ${dev} =~ 1$ ]]; then
        list+=( "${dev}" )
    fi
done
case $1 in
    watch)
        watch -n 5 "hddtemp --quiet ${list[*]} 2>/dev/null"
        ;;
    *)
        for device in "${list[@]}" ; do
            # silence hddtemp itself, then take the third colon-separated field
            temperature=$(hddtemp --quiet "${device}" 2>/dev/null | awk -F ':' '{print $3}')
            echo "${device} ${temperature}"
        done
        ;;
esac
for years now. Luckily, having 7 of 8 drives monitored by hddfancontrol, with only a couple of case fans affected by the result, means the SAS drive stays in a similar temperature range anyway.
Temperature querying with smartctl has now been supported as opt-in for a few years in hddfancontrol and seems to work well, so I am closing this.
On a related note, I am now testing the native drivetemp kernel module as a temperature source, see https://github.com/desbma/hddfancontrol/issues/34#issuecomment-965782692
I have some WD Red and one Seagate IronWolf 8T (ST8000xxxx). hddfancontrol works great with the WD Red, but it's a pain with the IronWolf. This is a real bummer, because the IronWolf is the hottest drive in the stack (7.2k instead of 5.7k).
The base issue seems to be that the IronWolf implements some new Energy-Foo which hddtemp and hdparm can't properly handle.
hdparm -C cannot detect the drive state (comes out as unknown). Already installed the latest 9.58 release. hddtemp says the drive is sleeping and does not give a temp (needs command parameter "-w").
Example from smartctl -i -n:
WD Red ... Power mode is: ACTIVE or IDLE
IronWolf .... Power mode was: IDLE_B
I'm trying to search the web for what IDLE_B is, or for the magic switch that makes the drive behave more sanely, but the fact that it ignores a lot of hdparm energy management settings is kind of a roadblock.
smartctl seems to be always a little more on the up-to-date side with these things. Would it make sense to use smartctl for some of the query commands where possible? Maybe switch completely? (See https://www.smartmontools.org/ticket/1017)