Napsty / check_smart

Monitoring Plugin to check hard drives, solid state drives and NVMe drives using SMART
https://www.claudiokuenzler.com/monitoring-plugins/check_smart.php
GNU General Public License v3.0

Fix nvme attribute check-list when auto interface is given and device… #97

Closed ymartin-ovh closed 8 months ago

ymartin-ovh commented 8 months ago

… is nvme

ymartin-ovh commented 8 months ago

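I get this on an NVMe device with -i auto (the NVMe attributes are never checked in CHECK 3, so the result is OK despite 114 media and data integrity errors):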


/usr/lib/nagios/ovh/check_smart -i auto -g /dev/nvme0 --debug
Found /dev/nvme0
###########################################################
(debug) CHECK 1: getting overall SMART health status for  
###########################################################

(debug) executing:
sudo /usr/sbin/smartctl -d auto -Hi /dev/nvme0

(debug) output:
smartctl 6.6 2017-11-05 r4594 [x86_64-linux-5.15.41-ovh-vps-grsec-zfs-classid] (local build)
 Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

 === START OF INFORMATION SECTION ===
 Model Number:                       SAMSUNG MZVL2512HCJQ-00B07
 Serial Number:                      S63CNF0R415493
 Firmware Version:                   GXA7302Q
 PCI Vendor/Subsystem ID:            0x144d
 IEEE OUI Identifier:                0x002538
 Total NVM Capacity:                 512,110,190,592 [512 GB]
 Unallocated NVM Capacity:           0
 Controller ID:                      6
 Number of Namespaces:               1
 Namespace 1 Size/Capacity:          512,110,190,592 [512 GB]
 Namespace 1 Utilization:            462,648,926,208 [462 GB]
 Namespace 1 Formatted LBA Size:     512
 Namespace 1 IEEE EUI-64:            002538 b411b778d4
 Local Time is:                      Wed Mar  6 17:27:20 2024 UTC

 === START OF SMART DATA SECTION ===
 SMART overall-health self-assessment test result: PASSED

(debug) parsing line:
Model Number:                       SAMSUNG MZVL2512HCJQ-00B07

(debug) found model:  SAMSUNG MZVL2512HCJQ-00B07

(debug) parsing line:
Serial Number:                      S63CNF0R415493

(debug) found serial number S63CNF0R415493

(debug) parsing line:
SMART overall-health self-assessment test result: PASSED

(debug) found string 'PASSED'; status OK
###########################################################
(debug) CHECK 2: getting silent SMART health check
###########################################################

(debug) executing:
sudo /usr/sbin/smartctl -d auto -q silent -A /dev/nvme0

(debug) exit code:
0

(debug) zero exit code, status OK

###########################################################
(debug) CHECK 3: getting detailed statistics from attributes
(debug) information contains a few more potential trouble spots
(debug) plus, we can also use the information for perfdata/graphing
###########################################################

(debug) executing:
sudo /usr/sbin/smartctl -d auto -A /dev/nvme0

(debug) output:
smartctl 6.6 2017-11-05 r4594 [x86_64-linux-5.15.41-ovh-vps-grsec-zfs-classid] (local build)
 Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

 === START OF SMART DATA SECTION ===
 SMART/Health Information (NVMe Log 0x02, NSID 0xffffffff)
 Critical Warning:                   0x00
 Temperature:                        38 Celsius
 Available Spare:                    83%
 Available Spare Threshold:          10%
 Percentage Used:                    19%
 Data Units Read:                    83,833,423 [42.9 TB]
 Data Units Written:                 69,316,785 [35.4 TB]
 Host Read Commands:                 1,241,781,735
 Host Write Commands:                1,632,519,014
 Controller Busy Time:               36,946
 Power Cycles:                       40
 Power On Hours:                     48,708
 Unsafe Shutdowns:                   26
 Media and Data Integrity Errors:    114
 Error Information Log Entries:      114
 Warning  Comp. Temperature Time:    0
 Critical Comp. Temperature Time:    0
 Temperature Sensor 1:               38 Celsius
 Temperature Sensor 2:               48 Celsius

(debug) Raw Check List ATA: Current_Pending_Sector,Reallocated_Sector_Ct,Program_Fail_Cnt_Total,Uncorrectable_Error_Cnt,Offline_Uncorrectable,Runtime_Bad_Block,Reported_Uncorrect,Reallocated_Event_Count,Erase_Fail_Count_Total
(debug) Raw Check List NVMe: Media_and_Data_Integrity_Errors
(debug) Exclude List for Checks: 
(debug) Exclude List for Perfdata: 
(debug) Warning Thresholds:

(debug) gathered perfdata:

###########################################################
(debug) LOCAL STATUS: OK, FINAL STATUS: OK
###########################################################

(debug) final status/output: OK
(debug) drives  ok: [/dev/nvme0] - Device is clean
(debug) drives nok: 
(debug)   msg_list: [/dev/nvme0] - Device is clean

OK: [/dev/nvme0] - Device is clean|
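
To illustrate the behaviour being asked for, here is a minimal Python sketch of choosing the raw check list when the interface is auto. It is not the plugin's actual code: the function names, the constants, and the detection heuristic (looking for NVMe-specific strings in the smartctl output) are all assumptions made for this example. The two list values are copied from the debug output above.

```python
import re

# Default raw check lists, as printed in the debug output above.
ATA_RAW_LIST = (
    "Current_Pending_Sector,Reallocated_Sector_Ct,Program_Fail_Cnt_Total,"
    "Uncorrectable_Error_Cnt,Offline_Uncorrectable,Runtime_Bad_Block,"
    "Reported_Uncorrect,Reallocated_Event_Count,Erase_Fail_Count_Total"
)
NVME_RAW_LIST = "Media_and_Data_Integrity_Errors"


def looks_like_nvme(smartctl_output: str) -> bool:
    """Hypothetical heuristic: NVMe output contains these NVMe-only strings."""
    return bool(re.search(r"NVMe Log 0x02|Total NVM Capacity", smartctl_output))


def pick_raw_check_list(interface: str, smartctl_output: str) -> list[str]:
    """Return the attribute names to check for the given interface.

    With an explicit "nvme" interface the NVMe list is used directly; with
    "auto" the decision is based on what smartctl actually reported.
    """
    if interface == "nvme" or (
        interface == "auto" and looks_like_nvme(smartctl_output)
    ):
        return NVME_RAW_LIST.split(",")
    return ATA_RAW_LIST.split(",")
```

With the information section shown above as input, `pick_raw_check_list("auto", output)` returns only the NVMe attribute, so the later attribute loop has something to match against.
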
ymartin-ovh commented 8 months ago

I expect the NVMe attribute checks to run when the device is NVMe and -i auto is given:


Found /dev/nvme0
###########################################################
(debug) CHECK 1: getting overall SMART health status for  
###########################################################

(debug) executing:
sudo /usr/sbin/smartctl -d auto -Hi /dev/nvme0

(debug) output:
smartctl 6.6 2017-11-05 r4594 [x86_64-linux-5.15.41-ovh-vps-grsec-zfs-classid] (local build)
 Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

 === START OF INFORMATION SECTION ===
 Model Number:                       SAMSUNG MZVL2512HCJQ-00B07
 Serial Number:                      S63CNF0R415493
 Firmware Version:                   GXA7302Q
 PCI Vendor/Subsystem ID:            0x144d
 IEEE OUI Identifier:                0x002538
 Total NVM Capacity:                 512,110,190,592 [512 GB]
 Unallocated NVM Capacity:           0
 Controller ID:                      6
 Number of Namespaces:               1
 Namespace 1 Size/Capacity:          512,110,190,592 [512 GB]
 Namespace 1 Utilization:            462,648,926,208 [462 GB]
 Namespace 1 Formatted LBA Size:     512
 Namespace 1 IEEE EUI-64:            002538 b411b778d4
 Local Time is:                      Wed Mar  6 17:36:56 2024 UTC

 === START OF SMART DATA SECTION ===
 SMART overall-health self-assessment test result: PASSED

(debug) parsing line:
Model Number:                       SAMSUNG MZVL2512HCJQ-00B07

(debug) found model:  SAMSUNG MZVL2512HCJQ-00B07

(debug) parsing line:
Serial Number:                      S63CNF0R415493

(debug) found serial number S63CNF0R415493

(debug) parsing line:
SMART overall-health self-assessment test result: PASSED

(debug) found string 'PASSED'; status OK
###########################################################
(debug) CHECK 2: getting silent SMART health check
###########################################################

(debug) executing:
sudo /usr/sbin/smartctl -d auto -q silent -A /dev/nvme0

(debug) exit code:
0

(debug) zero exit code, status OK

###########################################################
(debug) CHECK 3: getting detailed statistics from attributes
(debug) information contains a few more potential trouble spots
(debug) plus, we can also use the information for perfdata/graphing
###########################################################

(debug) executing:
sudo /usr/sbin/smartctl -d auto -A /dev/nvme0

(debug) output:
smartctl 6.6 2017-11-05 r4594 [x86_64-linux-5.15.41-ovh-vps-grsec-zfs-classid] (local build)
 Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

 === START OF SMART DATA SECTION ===
 SMART/Health Information (NVMe Log 0x02, NSID 0xffffffff)
 Critical Warning:                   0x00
 Temperature:                        38 Celsius
 Available Spare:                    83%
 Available Spare Threshold:          10%
 Percentage Used:                    19%
 Data Units Read:                    83,833,423 [42.9 TB]
 Data Units Written:                 69,317,103 [35.4 TB]
 Host Read Commands:                 1,241,781,735
 Host Write Commands:                1,632,532,652
 Controller Busy Time:               36,946
 Power Cycles:                       40
 Power On Hours:                     48,708
 Unsafe Shutdowns:                   26
 Media and Data Integrity Errors:    114
 Error Information Log Entries:      114
 Warning  Comp. Temperature Time:    0
 Critical Comp. Temperature Time:    0
 Temperature Sensor 1:               38 Celsius
 Temperature Sensor 2:               47 Celsius

(debug) Raw Check List ATA: Current_Pending_Sector Reallocated_Sector_Ct Program_Fail_Cnt_Total Uncorrectable_Error_Cnt Offline_Uncorrectable Runtime_Bad_Block Reported_Uncorrect Reallocated_Event_Count Erase_Fail_Count_Total
(debug) Raw Check List NVMe: Media_and_Data_Integrity_Errors
(debug) Exclude List for Checks: 
(debug) Exclude List for Perfdata: 
(debug) Warning Thresholds:

(debug) Critical_Warning not in raw check list (raw value: 0x00)

(debug) Temperature not in raw check list (raw value: 38)

(debug) Available_Spare not in raw check list (raw value: 83)

(debug) Available_Spare_Threshold not in raw check list (raw value: 10)

(debug) Percentage_Used not in raw check list (raw value: 19)

(debug) Data_Units_Read not in raw check list (raw value: 83833423)

(debug) Data_Units_Written not in raw check list (raw value: 69317103)

(debug) Host_Read_Commands not in raw check list (raw value: 1241781735)

(debug) Host_Write_Commands not in raw check list (raw value: 1632532652)

(debug) Controller_Busy_Time not in raw check list (raw value: 36946)

(debug) Power_Cycles not in raw check list (raw value: 40)

(debug) Power_On_Hours not in raw check list (raw value: 48708)

(debug) Unsafe_Shutdowns not in raw check list (raw value: 26)

(debug) Media_and_Data_Integrity_Errors is non-zero (114)

(debug) Error_Information_Log_Entries not in raw check list (raw value: 114)

(debug) Warning__Comp_Temperature_Time not in raw check list (raw value: 0)

(debug) Critical_Comp_Temperature_Time not in raw check list (raw value: 0)

(debug) Temperature_Sensor_1 not in raw check list (raw value: 38)

(debug) Temperature_Sensor_2 not in raw check list (raw value: 47)

(debug) gathered perfdata:

###########################################################
(debug) LOCAL STATUS: WARNING, FINAL STATUS: WARNING
###########################################################

(debug) final status/output: WARNING
(debug) drives  ok: 
(debug) drives nok: [/dev/nvme0] - [/dev/nvme0] - Media_and_Data_Integrity_Errors is non-zero (114)[/dev/nvme0] - 
(debug)   msg_list: [/dev/nvme0] - [/dev/nvme0] - Media_and_Data_Integrity_Errors is non-zero (114)[/dev/nvme0] - 

WARNING: [/dev/nvme0] - [/dev/nvme0] - Media_and_Data_Integrity_Errors is non-zero (114)[/dev/nvme0] - |
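
The attribute pass in CHECK 3 above can be sketched in Python as follows. This is an illustration only, not the plugin's implementation: `parse_nvme_attributes` and `find_nonzero` are hypothetical names, and the name normalisation (drop dots, turn each space into an underscore) is inferred from the attribute names printed in the debug lines above.

```python
def parse_nvme_attributes(output: str) -> dict[str, int]:
    """Parse the NVMe SMART/Health section of `smartctl -A` into name -> int."""
    attrs: dict[str, int] = {}
    in_section = False
    for line in output.splitlines():
        if "SMART/Health Information" in line:
            in_section = True  # attributes start after this header line
            continue
        if not in_section:
            continue
        name, sep, value = line.strip().partition(":")
        if not sep or not value.strip():
            continue
        # "Warning  Comp. Temperature Time" -> "Warning__Comp_Temperature_Time"
        key = name.replace(".", "").replace(" ", "_")
        # Keep the first token only: "83,833,423 [42.9 TB]" -> "83833423"
        token = value.split()[0].replace(",", "").rstrip("%")
        try:
            attrs[key] = int(token, 16) if token.startswith("0x") else int(token)
        except ValueError:
            continue  # skip values that are not plain numbers
    return attrs


def find_nonzero(attrs: dict[str, int], check_list: list[str]) -> list[str]:
    """Return one message per checked attribute whose raw value is non-zero."""
    return [
        f"{name} is non-zero ({attrs[name]})"
        for name in check_list
        if attrs.get(name, 0) != 0
    ]
```

Attributes that are not in the check list are parsed but never produce a message, which corresponds to the "not in raw check list" debug lines above.
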
Napsty commented 8 months ago

Awesome find, thanks! Successfully tested on a server with NVMe (and ATA) drives.