Napsty / check_smart

Monitoring Plugin to check hard drives, solid state drives and NVMe drives using SMART
https://www.claudiokuenzler.com/monitoring-plugins/check_smart.php
GNU General Public License v3.0

Warning thresholds do NOT give the expected result. #60

Closed. tachtler closed this issue 4 years ago

tachtler commented 4 years ago

Hi,

First of all, thank you for writing check_smart.pl.

When I try to check the temperature, the given argument is not honored:

# ./check_smart.pl -d /dev/sda -i megaraid,11 -w 'Temperature_Celsius=35' --debug

I would expect the temperature reported by smartctl (37 °C) to exceed the limit of 35 °C, so a WARNING should be displayed, but OK is shown instead.

See the --debug output below:

# ./check_smart.pl -d /dev/sda -i megaraid,11 -w 'Temperature_Celsius=35' --debug
Found /dev/sda
###########################################################
(debug) CHECK 1: getting overall SMART health status for /dev/sda 
###########################################################

(debug) executing:
sudo smartctl -d megaraid,11 -Hi /dev/sda

(debug) output:
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1127.19.1.el7.x86_64] (local build)
 Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

 === START OF INFORMATION SECTION ===
 Model Family:     Western Digital RE4
 Device Model:     WDC WD1003FBYX-01Y7B1
 Serial Number:    WD-WMAW31105150
 LU WWN Device Id: 5 0014ee 206d2bb92
 Firmware Version: 01.01V02
 User Capacity:    1.000.204.886.016 bytes [1,00 TB]
 Sector Size:      512 bytes logical/physical
 Rotation Rate:    7200 rpm
 Device is:        In smartctl database [for details use: -P show]
 ATA Version is:   ATA8-ACS (minor revision not indicated)
 SATA Version is:  SATA 3.0, 3.0 Gb/s (current: 3.0 Gb/s)
 Local Time is:    Mon Oct  5 08:04:52 2020 CEST
 SMART support is: Available - device has SMART capability.
 SMART support is: Enabled

 === START OF READ SMART DATA SECTION ===
 SMART Status not supported: ATA return descriptor not supported by controller firmware
 SMART overall-health self-assessment test result: PASSED
 Warning: This result is based on an Attribute check.

 Last login: Mo Okt  5 08:04:40 CEST 2020 on pts/0

(debug) parsing line:
Device Model:     WDC WD1003FBYX-01Y7B1

(debug) found model:  WDC WD1003FBYX-01Y7B1

(debug) parsing line:
Serial Number:    WD-WMAW31105150

(debug) found serial number WD-WMAW31105150

(debug) parsing line:
SMART overall-health self-assessment test result: PASSED

(debug) found string 'PASSED'; status OK

###########################################################
(debug) CHECK 2: getting silent SMART health check
###########################################################

(debug) executing:
sudo smartctl -d megaraid,11 -q silent -A /dev/sda

Last login: Mo Okt  5 08:04:52 CEST 2020 on pts/0
(debug) exit code:
0

(debug) zero exit code, status OK

###########################################################
(debug) CHECK 3: getting detailed statistics from attributes
(debug) information contains a few more potential trouble spots
(debug) plus, we can also use the information for perfdata/graphing
###########################################################

(debug) executing:
sudo smartctl -d megaraid,11 -A /dev/sda

(debug) output:
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1127.19.1.el7.x86_64] (local build)
 Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

 === START OF READ SMART DATA SECTION ===
 SMART Attributes Data Structure revision number: 16
 Vendor Specific SMART Attributes with Thresholds:
 ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
   1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       3
   3 Spin_Up_Time            0x0027   186   175   021    Pre-fail  Always       -       3675
   4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       874
   5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
   7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
   9 Power_On_Hours          0x0032   017   017   000    Old_age   Always       -       60999
  10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
  11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
  12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       45
 192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       35
 193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       838
 194 Temperature_Celsius     0x0022   110   107   000    Old_age   Always       -       37
 196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
 197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
 198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
 199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
 200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0

 Last login: Mo Okt  5 08:04:52 CEST 2020 on pts/0

(debug) Raw Check List: Current_Pending_Sector,Reallocated_Sector_Ct,Program_Fail_Cnt_Total,Uncorrectable_Error_Cnt,Offline_Uncorrectable,Runtime_Bad_Block,Reported_Uncorrect,Reallocated_Event_Count
(debug) Exclude List for Checks: 
(debug) Exclude List for Perfdata: 
(debug) Warning Thresholds:
Temperature_Celsius=35

(debug) Raw_Read_Error_Rate not in raw check list (raw value: 3)

(debug) Spin_Up_Time not in raw check list (raw value: 3675)

(debug) Start_Stop_Count not in raw check list (raw value: 874)

(debug) Reallocated_Sector_Ct is OK (0)

(debug) Seek_Error_Rate not in raw check list (raw value: 0)

(debug) Power_On_Hours not in raw check list (raw value: 60999)

(debug) Spin_Retry_Count not in raw check list (raw value: 0)

(debug) Calibration_Retry_Count not in raw check list (raw value: 0)

(debug) Power_Cycle_Count not in raw check list (raw value: 45)

(debug) Power-Off_Retract_Count not in raw check list (raw value: 35)

(debug) Load_Cycle_Count not in raw check list (raw value: 838)

(debug) Temperature_Celsius not in raw check list (raw value: 37)

(debug) Reallocated_Event_Count is OK (0)

(debug) Current_Pending_Sector is OK (0)

(debug) Offline_Uncorrectable is OK (0)

(debug) UDMA_CRC_Error_Count not in raw check list (raw value: 0)

(debug) Multi_Zone_Error_Rate not in raw check list (raw value: 0)

(debug) gathered perfdata:
Raw_Read_Error_Rate=3 Spin_Up_Time=3675 Start_Stop_Count=874 Reallocated_Sector_Ct=0 Seek_Error_Rate=0 Power_On_Hours=60999 Spin_Retry_Count=0 Calibration_Retry_Count=0 Power_Cycle_Count=45 Power-Off_Retract_Count=35 Load_Cycle_Count=838 Temperature_Celsius=37 Reallocated_Event_Count=0 Current_Pending_Sector=0 Offline_Uncorrectable=0 UDMA_CRC_Error_Count=0 Multi_Zone_Error_Rate=0

###########################################################
(debug) LOCAL STATUS: OK, FINAL STATUS: OK
###########################################################

(debug) final status/output: OK
(debug) drives  ok: 
(debug) drives nok: 
(debug)   msg_list: Drive  WDC WD1003FBYX-01Y7B1 S/N WD-WMAW31105150: no SMART errors detected. 

OK: Drive  WDC WD1003FBYX-01Y7B1 S/N WD-WMAW31105150: no SMART errors detected. |Raw_Read_Error_Rate=3 Spin_Up_Time=3675 Start_Stop_Count=874 Reallocated_Sector_Ct=0 Seek_Error_Rate=0 Power_On_Hours=60999 Spin_Retry_Count=0 Calibration_Retry_Count=0 Power_Cycle_Count=45 Power-Off_Retract_Count=35 Load_Cycle_Count=838 Temperature_Celsius=37 Reallocated_Event_Count=0 Current_Pending_Sector=0 Offline_Uncorrectable=0 UDMA_CRC_Error_Count=0 Multi_Zone_Error_Rate=0

Did I misunderstand something?

Thank you! Klaus.

Napsty commented 4 years ago

The reason is that the Temperature_Celsius attribute is not checked by default. You can see this in the debug output:

(debug) Temperature_Celsius not in raw check list (raw value: 37)
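
To illustrate why the threshold has no effect, here is a rough shell sketch of that gating behaviour. It is a simplified model, not the plugin's actual Perl code: the function name check_attr, the shortened raw list and the ">=" comparison are assumptions made for illustration only.

```shell
#!/bin/sh
# Simplified model (assumption): a warning threshold is only evaluated
# for attributes that appear in the raw check list.
raw_list='Current_Pending_Sector,Reallocated_Sector_Ct,Offline_Uncorrectable,Reallocated_Event_Count'
warn_name='Temperature_Celsius'
warn_value=35

# check_attr NAME RAW_VALUE: print the status this model assigns.
check_attr() {
    case ",$raw_list," in
        *",$1,"*) ;;   # attribute is in the raw check list, keep going
        *) echo "$1 not in raw check list (raw value: $2)"; return 0 ;;
    esac
    # ">=" is an assumption; the real comparison may differ.
    if [ "$1" = "$warn_name" ] && [ "$2" -ge "$warn_value" ]; then
        echo "$1 WARNING ($2)"
    else
        echo "$1 is OK ($2)"
    fi
}

# Default list: the threshold is never looked at.
check_attr Temperature_Celsius 37

# Same value once the attribute has been added to the list.
raw_list="$raw_list,Temperature_Celsius"
check_attr Temperature_Celsius 37
```

In this model the first call only reports that the attribute is skipped, while the second call compares 37 against 35 and prints a warning.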

So you first need to add that attribute to the raw check list by replacing the default list with -r. The -w threshold is only evaluated for attributes in that list.
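
A possible invocation could be assembled like this. The default list is copied from the "Raw Check List" line of the debug output above; that -r accepts the same comma-separated format is an assumption based on that output, so check the plugin's help text before relying on it. The variable names are purely illustrative.

```shell
#!/bin/sh
# Default raw check list, copied from the --debug output in this issue.
DEFAULT_RAW='Current_Pending_Sector,Reallocated_Sector_Ct,Program_Fail_Cnt_Total,Uncorrectable_Error_Cnt,Offline_Uncorrectable,Runtime_Bad_Block,Reported_Uncorrect,Reallocated_Event_Count'

# -r replaces the default list, so keep the defaults and append the
# attribute that should be checked against the -w threshold.
RAW_LIST="${DEFAULT_RAW},Temperature_Celsius"

# Build the command line (assumed -r format: comma-separated names).
CMD="./check_smart.pl -d /dev/sda -i megaraid,11 -r $RAW_LIST -w Temperature_Celsius=35"

# Print the command for review; run it manually once it looks right.
echo "$CMD"
```

Keeping the default attributes in the list means the existing checks (pending and reallocated sectors and so on) stay active alongside the temperature threshold.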

Note that the SMART self-assessment test itself reports when the drive exceeds its "healthy" temperature. The plugin picks this up and reports CRITICAL. See https://www.claudiokuenzler.com/blog/881/hard-drive-high-temperature-causes-application-issues-latency for a real-life example.

tachtler commented 4 years ago

Hi Claudio,

thank you!

Klaus.