Closed deric closed 2 months ago
@deric PR itself looks good to me, thanks. Would you mind showing the smartctl -a output of this NVMe drive? It would be interesting to see an actual NVMe drive with an "alert" state.
@Napsty Sure, how many samples do you need? :slightly_smiling_face:
$ smartctl -a /dev/nvme2n1
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-4.19.0-16-amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Number: SAMSUNG MZQLB1T9HAJR-00007
Serial Number: S439NE0N301755
Firmware Version: EDA5202Q
PCI Vendor/Subsystem ID: 0x144d
IEEE OUI Identifier: 0x002538
Total NVM Capacity: 1,920,383,410,176 [1.92 TB]
Unallocated NVM Capacity: 0
Controller ID: 4
NVMe Version: 1.2
Number of Namespaces: 1
Namespace 1 Size/Capacity: 1,920,383,410,176 [1.92 TB]
Namespace 1 Utilization: 1,920,008,859,648 [1.92 TB]
Namespace 1 Formatted LBA Size: 512
Local Time is: Wed Sep 11 13:23:44 2024 UTC
Firmware Updates (0x17): 3 Slots, Slot 1 R/O, no Reset required
Optional Admin Commands (0x000f): Security Format Frmw_DL NS_Mngmt
Optional NVM Commands (0x001f): Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat
Log Page Attributes (0x03): S/H_per_NS Cmd_Eff_Lg
Maximum Data Transfer Size: 512 Pages
Warning Comp. Temp. Threshold: 87 Celsius
Critical Comp. Temp. Threshold: 88 Celsius
Namespace 1 Features (0x02): NA_Fields
Supported Power States
St Op Max Active Idle RL RT WL WT Ent_Lat Ex_Lat
0 + 10.60W - - 0 0 0 0 0 0
Supported LBA Sizes (NSID 0x1)
Id Fmt Data Metadt Rel_Perf
0 + 512 0 0
1 - 4096 0 0
=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
- NVM subsystem reliability has been degraded
SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x04
Temperature: 39 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 137%
Data Units Read: 22,961,145,011 [11.7 PB]
Data Units Written: 15,320,817,631 [7.84 PB]
Host Read Commands: 241,161,914,731
Host Write Commands: 129,187,226,793
Controller Busy Time: 244,416,397,603
Power Cycles: 10
Power On Hours: 36,302
Unsafe Shutdowns: 7
Media and Data Integrity Errors: 0
Error Information Log Entries: 0
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Temperature Sensor 1: 39 Celsius
Temperature Sensor 2: 41 Celsius
Temperature Sensor 3: 46 Celsius
Error Information (NVMe Log 0x01, 16 of 64 entries)
No Errors Logged
check_smart output with the new flag:
$ /usr/lib/nagios/plugins/check_smart -i nvme -d /dev/nvme2n1 -O --skip-self-assessment
OK: Drive SAMSUNG MZQLB1T9HAJR-00007 S/N S439NE0N301755: no SMART errors detected. |Temperature=38;;;; Available_Spare=100;;;; Available_Spare_Threshold=10;;;; Percentage_Used=137;;;; Data_Units_Read=22961189161;;;; Data_Units_Written=15320833530;;;; Host_Read_Commands=241162294156;;;; Host_Write_Commands=129187328004;;;; Controller_Busy_Time=244424414876;;;; Power_Cycles=10;;;; Power_On_Hours=36302;;;; Unsafe_Shutdowns=7;;;; Media_and_Data_Integrity_Errors=0;;;; Error_Information_Log_Entries=0;;;; Warning__Comp_Temperature_Time=0;;;; Critical_Comp_Temperature_Time=0;;;; Temperature_Sensor_1=38;;;; Temperature_Sensor_2=41;;;; Temperature_Sensor_3=45;;;;
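For anyone graphing these values, the performance data after the | can be split into label/value pairs. A minimal Python sketch, assuming labels contain no spaces or quotes and values carry no unit suffix (true for the line above, not for Nagios perfdata in general). parse_perfdata is a hypothetical helper, not part of check_smart, and the sample string is a shortened copy of the output above:

```python
def parse_perfdata(plugin_output):
    """Return {label: value} for the perfdata after the first '|'.

    Assumes unquoted labels and plain numeric values (no unit suffix).
    """
    _, _, perfdata = plugin_output.partition("|")
    metrics = {}
    for token in perfdata.split():
        label, _, rest = token.partition("=")
        # Format is value;warn;crit;min;max, only the value is kept here.
        value = rest.split(";", 1)[0]
        metrics[label] = float(value)
    return metrics


# Shortened copy of the check_smart line shown above.
sample = (
    "OK: Drive SAMSUNG MZQLB1T9HAJR-00007 S/N S439NE0N301755: "
    "no SMART errors detected. |Temperature=38;;;; Available_Spare=100;;;; "
    "Percentage_Used=137;;;; Media_and_Data_Integrity_Errors=0;;;;"
)
metrics = parse_perfdata(sample)
print(metrics["Percentage_Used"])
```

The plugin reports OK here, but Percentage_Used is still present in the perfdata, so it can be alerted on separately.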
percentage used might be about 200%
$ smartctl -a /dev/nvme0n1
smartctl 6.6 2017-11-05 r4594 [x86_64-linux-4.19.0-24-amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Number: SAMSUNG MZVLB512HAJQ-00000
Serial Number: S3W8NB0K408420
Firmware Version: EXA7301Q
PCI Vendor/Subsystem ID: 0x144d
IEEE OUI Identifier: 0x002538
Total NVM Capacity: 512,110,190,592 [512 GB]
Unallocated NVM Capacity: 0
Controller ID: 4
Number of Namespaces: 1
Namespace 1 Size/Capacity: 512,110,190,592 [512 GB]
Namespace 1 Utilization: 511,971,831,808 [511 GB]
Namespace 1 Formatted LBA Size: 512
Namespace 1 IEEE EUI-64: 002538 8481b9d2d7
Local Time is: Wed Sep 11 13:28:44 2024 UTC
Firmware Updates (0x16): 3 Slots, no Reset required
Optional Admin Commands (0x0017): Security Format Frmw_DL Self_Test
Optional NVM Commands (0x001f): Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat
Maximum Data Transfer Size: 512 Pages
Warning Comp. Temp. Threshold: 81 Celsius
Critical Comp. Temp. Threshold: 82 Celsius
Supported Power States
St Op Max Active Idle RL RT WL WT Ent_Lat Ex_Lat
0 + 7.02W - - 0 0 0 0 0 0
1 + 6.30W - - 1 1 1 1 0 0
2 + 3.50W - - 2 2 2 2 0 0
3 - 0.0760W - - 3 3 3 3 210 1200
4 - 0.0050W - - 4 4 4 4 2000 8000
Supported LBA Sizes (NSID 0x1)
Id Fmt Data Metadt Rel_Perf
0 + 512 0 0
=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
- NVM subsystem reliability has been degraded
SMART/Health Information (NVMe Log 0x02, NSID 0x1)
Critical Warning: 0x04
Temperature: 46 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 255%
Data Units Read: 960,228,433 [491 TB]
Data Units Written: 8,062,584,209 [4.12 PB]
Host Read Commands: 18,540,409,765
Host Write Commands: 18,576,613,368
Controller Busy Time: 86,782,398,880
Power Cycles: 31
Power On Hours: 37,636
Unsafe Shutdowns: 19
Media and Data Integrity Errors: 0
Error Information Log Entries: 12
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Temperature Sensor 1: 46 Celsius
Temperature Sensor 2: 64 Celsius
Error Information (NVMe Log 0x01, max 64 entries)
No Errors Logged
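A side note on reading the counters above: NVMe reports Data Units Read/Written in units of 1000 × 512 bytes, which is how smartctl gets the bracketed sizes. A small Python sketch of that arithmetic (data_units_to_bytes is just an illustrative name):

```python
BYTES_PER_DATA_UNIT = 1000 * 512  # one NVMe "data unit" = 1000 blocks of 512 bytes


def data_units_to_bytes(data_units):
    """Convert an NVMe Data Units counter to bytes."""
    return data_units * BYTES_PER_DATA_UNIT


# "Data Units Written" of the first drive (nvme2n1) from the output above.
written_bytes = data_units_to_bytes(15_320_817_631)
written_pb = round(written_bytes / 1e15, 2)  # decimal petabytes, like smartctl
print(written_pb)
```

This agrees with the 7.84 PB that smartctl prints for that drive.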
Thanks!
so just to be sure... the new -O output is to deliberately ignore a possible defect (or soon going to be bonkers) drive, similar to --skip-self-assessment, right? Even the internal SMART health check says the drive is FAILED.
Just wanna be on the same path here. Most users will want to know about this and will (hopefully?) switch the drive after getting the first alerts.
This only means that the drive's warranty from the manufacturer is over. But the disk can still keep running for thousands of hours.
Is this similar to the TBW "warranty" level? I can only speak for SATA SSDs (not NVMe), but even before reaching that level there is a significant risk of drive failure, or at least of performance issues (I recently saw this with a WD Red at 224 TBW, even though it's supposed to hold at least 600 TBW).
IMHO we can push this through, but I'll add a usage warning in --help and documentation.
The disk is in FAILED state because Percentage Used > 100%. And this threshold is not very well defined. I've discussed this with server providers and they are refusing to replace NVMe drives with Critical_warning 0x04 because the drive is still healthy, typically without any performance degradation.
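For reference, Critical Warning is a bit field in the NVMe SMART/Health log, and 0x04 is the single bit that smartctl reports as "NVM subsystem reliability has been degraded" in the output above. A small Python sketch of the decoding, with messages modelled on what smartctl prints. decode_critical_warning is an illustrative name, and only the five bits I'm sure about are listed:

```python
# Bits of the NVMe "Critical Warning" byte (SMART/Health log, 0x02).
CRITICAL_WARNING_BITS = {
    0x01: "available spare has fallen below threshold",
    0x02: "temperature is above or below threshold",
    0x04: "NVM subsystem reliability has been degraded",
    0x08: "media has been placed in read only mode",
    0x10: "volatile memory backup device has failed",
}


def decode_critical_warning(value):
    """Return the list of conditions flagged in a Critical Warning byte."""
    return [msg for bit, msg in CRITICAL_WARNING_BITS.items() if value & bit]


for line in decode_critical_warning(0x04):
    print("-", line)
```

Both drives above report exactly 0x04, so none of the other conditions (spare, temperature, read-only media, backup device) is flagged.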
the new -O output is to deliberately ignore a possible defect (or soon going to be bonkers) drive
Yes. But it depends on the definition of "soon". This warning means that there are only ~20k hours of uptime left for your NVMe drive, so you have only 2-3 years to replace it before it actually fails.
TBW "warranty" level
Yes, that might be similar.
This warning means that there are only ~20k hours of uptime left for your NVMe drive, so you have only 2-3 years to replace it before it actually fails.
:rofl:
Anyway, thanks for the PR.
Currently it's possible to ignore a whole attribute, e.g. 'Critical_Warning', but not only a certain value. 0x04 gives us a lot of false positive alerts. It's typically triggered when Percentage_Used is above 100%. This only means that the drive's warranty from the manufacturer is over. But the disk can still keep running for thousands of hours.
This PR adds the possibility to ignore old age warnings with the -O flag.
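To make the intent concrete, here is a rough Python sketch of the kind of decision described above. It is not the code from this PR (check_smart is a Perl script), and the function names, the ignore_old_age parameter and the exact conditions are assumptions for illustration only:

```python
OLD_AGE_BIT = 0x04  # "NVM subsystem reliability has been degraded"


def is_old_age_only(critical_warning, percentage_used, media_errors):
    """True if the only flagged condition looks like wear-out (assumed rule)."""
    return (
        critical_warning == OLD_AGE_BIT
        and percentage_used > 100
        and media_errors == 0
    )


def plugin_state(critical_warning, percentage_used, media_errors,
                 ignore_old_age=False):
    """Return a Nagios-style state string for one NVMe drive (hypothetical)."""
    if critical_warning == 0:
        return "OK"
    if ignore_old_age and is_old_age_only(
        critical_warning, percentage_used, media_errors
    ):
        return "OK"
    return "CRITICAL"


# Values of nvme2n1 from the conversation above.
print(plugin_state(0x04, 137, 0))                       # without the flag
print(plugin_state(0x04, 137, 0, ignore_old_age=True))  # with the flag
```

In this sketch any other warning bit, or any media error, still results in CRITICAL even when the old age warning is ignored.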