Ignore old age attributes

deric commented 2 months ago

Currently it's possible to ignore a whole attribute, e.g. 'Critical_Warning', but not just one specific value of it.

Critical_Warning 0x04 gives us a lot of false-positive alerts. It's typically triggered when Percentage_Used is above 100%. This only means that the drive's warranty from the manufacturer is over, but the disk can still keep running for thousands of hours.

This PR adds the possibility to ignore old-age warnings with the -O flag.
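
To make the "ignore only one value" idea concrete, here is a minimal Python sketch of masking a single bit out of the NVMe Critical Warning byte. It is an illustration only, not code from this PR or from check_smart; the function name `active_warnings` and the `ignore_old_age` parameter are made up for the example, and the bit descriptions are paraphrased from the NVMe SMART/Health log definition.

```python
# Bit meanings of the NVMe "Critical Warning" byte (SMART/Health log 0x02).
# Descriptions are paraphrased; names below are hypothetical, not plugin code.
CRITICAL_WARNING_BITS = {
    0x01: "available spare below threshold",
    0x02: "temperature outside threshold",
    0x04: "NVM subsystem reliability degraded",
    0x08: "media placed in read-only mode",
    0x10: "volatile memory backup device failed",
}

OLD_AGE_BIT = 0x04  # the value discussed in this PR


def active_warnings(value, ignore_old_age=False):
    """Return descriptions of the set bits, optionally masking out 0x04."""
    if ignore_old_age:
        value &= ~OLD_AGE_BIT
    return [text for bit, text in CRITICAL_WARNING_BITS.items() if value & bit]


print(active_warnings(0x04))                       # wear bit reported
print(active_warnings(0x04, ignore_old_age=True))  # nothing left to report
print(active_warnings(0x06, ignore_old_age=True))  # temperature still reported
```

The point of masking rather than skipping the whole attribute is that any other bit (temperature, spare, read-only) would still raise an alert.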

Napsty commented 2 months ago

@deric PR itself looks good to me, thanks. Would you mind showing the smartctl -a output of this NVMe drive? It would be interesting to see an actual NVMe drive with an "alert" state.

deric commented 2 months ago

@Napsty Sure, how many samples do you need? :slightly_smiling_face:

$ smartctl -a /dev/nvme2n1
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-4.19.0-16-amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       SAMSUNG MZQLB1T9HAJR-00007
Serial Number:                      S439NE0N301755
Firmware Version:                   EDA5202Q
PCI Vendor/Subsystem ID:            0x144d
IEEE OUI Identifier:                0x002538
Total NVM Capacity:                 1,920,383,410,176 [1.92 TB]
Unallocated NVM Capacity:           0
Controller ID:                      4
NVMe Version:                       1.2
Number of Namespaces:               1
Namespace 1 Size/Capacity:          1,920,383,410,176 [1.92 TB]
Namespace 1 Utilization:            1,920,008,859,648 [1.92 TB]
Namespace 1 Formatted LBA Size:     512
Local Time is:                      Wed Sep 11 13:23:44 2024 UTC
Firmware Updates (0x17):            3 Slots, Slot 1 R/O, no Reset required
Optional Admin Commands (0x000f):   Security Format Frmw_DL NS_Mngmt
Optional NVM Commands (0x001f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat
Log Page Attributes (0x03):         S/H_per_NS Cmd_Eff_Lg
Maximum Data Transfer Size:         512 Pages
Warning  Comp. Temp. Threshold:     87 Celsius
Critical Comp. Temp. Threshold:     88 Celsius
Namespace 1 Features (0x02):        NA_Fields

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +    10.60W       -        -    0  0  0  0        0       0

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0
 1 -    4096       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
- NVM subsystem reliability has been degraded

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x04
Temperature:                        39 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    137%
Data Units Read:                    22,961,145,011 [11.7 PB]
Data Units Written:                 15,320,817,631 [7.84 PB]
Host Read Commands:                 241,161,914,731
Host Write Commands:                129,187,226,793
Controller Busy Time:               244,416,397,603
Power Cycles:                       10
Power On Hours:                     36,302
Unsafe Shutdowns:                   7
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               39 Celsius
Temperature Sensor 2:               41 Celsius
Temperature Sensor 3:               46 Celsius

Error Information (NVMe Log 0x01, 16 of 64 entries)
No Errors Logged

check_smart output with the new flag:

$ /usr/lib/nagios/plugins/check_smart -i nvme -d /dev/nvme2n1 -O --skip-self-assessment
OK: Drive  SAMSUNG MZQLB1T9HAJR-00007 S/N S439NE0N301755: no SMART errors detected. |Temperature=38;;;; Available_Spare=100;;;; Available_Spare_Threshold=10;;;; Percentage_Used=137;;;; Data_Units_Read=22961189161;;;; Data_Units_Written=15320833530;;;; Host_Read_Commands=241162294156;;;; Host_Write_Commands=129187328004;;;; Controller_Busy_Time=244424414876;;;; Power_Cycles=10;;;; Power_On_Hours=36302;;;; Unsafe_Shutdowns=7;;;; Media_and_Data_Integrity_Errors=0;;;; Error_Information_Log_Entries=0;;;; Warning__Comp_Temperature_Time=0;;;; Critical_Comp_Temperature_Time=0;;;; Temperature_Sensor_1=38;;;; Temperature_Sensor_2=41;;;; Temperature_Sensor_3=45;;;;
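
Since -O silences the alert, one way to keep an eye on wear is to read Percentage_Used from the performance data after the `|`. The following is a rough Python sketch under the assumption that labels contain no spaces and all values are plain integers, as in the output above; `parse_perfdata` is a hypothetical helper and the sample string is a shortened copy of that output.

```python
def parse_perfdata(plugin_output):
    """Split 'STATUS text |label=value;warn;crit;min;max ...' into a dict."""
    _, _, perfdata = plugin_output.partition("|")
    metrics = {}
    for item in perfdata.split():
        label, _, rest = item.partition("=")
        # The value is the first of the semicolon-separated fields.
        metrics[label] = int(rest.split(";")[0])
    return metrics


# Shortened copy of the check_smart output shown above.
sample = (
    "OK: Drive  SAMSUNG MZQLB1T9HAJR-00007 S/N S439NE0N301755: "
    "no SMART errors detected. |Temperature=38;;;; Available_Spare=100;;;; "
    "Percentage_Used=137;;;; Power_On_Hours=36302;;;;"
)

metrics = parse_perfdata(sample)
print(metrics["Percentage_Used"])
```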

deric commented 2 months ago

Percentage Used might even be above 200%:

$ smartctl -a /dev/nvme0n1
smartctl 6.6 2017-11-05 r4594 [x86_64-linux-4.19.0-24-amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       SAMSUNG MZVLB512HAJQ-00000
Serial Number:                      S3W8NB0K408420
Firmware Version:                   EXA7301Q
PCI Vendor/Subsystem ID:            0x144d
IEEE OUI Identifier:                0x002538
Total NVM Capacity:                 512,110,190,592 [512 GB]
Unallocated NVM Capacity:           0
Controller ID:                      4
Number of Namespaces:               1
Namespace 1 Size/Capacity:          512,110,190,592 [512 GB]
Namespace 1 Utilization:            511,971,831,808 [511 GB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            002538 8481b9d2d7
Local Time is:                      Wed Sep 11 13:28:44 2024 UTC
Firmware Updates (0x16):            3 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x001f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat
Maximum Data Transfer Size:         512 Pages
Warning  Comp. Temp. Threshold:     81 Celsius
Critical Comp. Temp. Threshold:     82 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     7.02W       -        -    0  0  0  0        0       0
 1 +     6.30W       -        -    1  1  1  1        0       0
 2 +     3.50W       -        -    2  2  2  2        0       0
 3 -   0.0760W       -        -    3  3  3  3      210    1200
 4 -   0.0050W       -        -    4  4  4  4     2000    8000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
- NVM subsystem reliability has been degraded

SMART/Health Information (NVMe Log 0x02, NSID 0x1)
Critical Warning:                   0x04
Temperature:                        46 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    255%
Data Units Read:                    960,228,433 [491 TB]
Data Units Written:                 8,062,584,209 [4.12 PB]
Host Read Commands:                 18,540,409,765
Host Write Commands:                18,576,613,368
Controller Busy Time:               86,782,398,880
Power Cycles:                       31
Power On Hours:                     37,636
Unsafe Shutdowns:                   19
Media and Data Integrity Errors:    0
Error Information Log Entries:      12
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               46 Celsius
Temperature Sensor 2:               64 Celsius

Error Information (NVMe Log 0x01, max 64 entries)
No Errors Logged
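
Both samples show the same pattern: only bit 0x04 is set, Available Spare is well above its threshold, and there are no media errors. A small Python sketch of checking for that pattern in `smartctl -a` text output follows; the helper names `parse_health` and `only_wear_flagged` are hypothetical, and the sample is a trimmed copy of the output above.

```python
def parse_health(smartctl_text):
    """Collect 'Key:   value' lines from smartctl text output into a dict."""
    fields = {}
    for line in smartctl_text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def only_wear_flagged(fields):
    """True if 0x04 is the only critical warning and nothing else looks bad."""
    critical = int(fields["Critical Warning"], 16)
    spare = int(fields["Available Spare"].rstrip("%"))
    threshold = int(fields["Available Spare Threshold"].rstrip("%"))
    media_errors = int(fields["Media and Data Integrity Errors"].replace(",", ""))
    return critical == 0x04 and spare > threshold and media_errors == 0


# Trimmed copy of the SMART data section shown above.
sample = """\
Critical Warning:                   0x04
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    255%
Media and Data Integrity Errors:    0
"""

fields = parse_health(sample)
print(fields["Percentage Used"], only_wear_flagged(fields))
```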

Napsty commented 2 months ago

Thanks!

So just to be sure... the new -O output is to deliberately ignore a possibly defective (or soon going to be bonkers) drive, similar to --skip-self-assessment, right? Even the internal SMART health check says the drive is FAILED.

Just wanna be on the same page here. Most users will want to know about this and will (hopefully?) switch the drive after getting the first alerts.

> This only means that the drive's warranty from the manufacturer is over. But the disk can still keep running for thousands of hours.

Is this similar to the TBW "warranty" level? I can only speak for SATA SSDs (not NVMe), where even before reaching that level there is a significant risk of drive failure, or at least of performance issues (recently seen this with a WD Red at 224 TBW, even though it's supposed to hold at least 600 TBW).

IMHO we can push this through, but I'll add a usage warning in --help and documentation.

deric commented 2 months ago

The disk is in the FAILED state because Percentage Used > 100%, and this threshold is not very well defined. I've discussed this with server providers and they refuse to replace NVMe drives with Critical_Warning 0x04 because the drive is still healthy, typically without any performance degradation.

> the new -O output is to deliberately ignore a possible defect (or soon going to be bonkers) drive

Yes, but it depends on the definition of "soon". This warning means that there are only ~20k hours of uptime left for your NVMe drive, so you have only 2-3 years to replace it before it actually fails.

> TBW "warranty" level

Yes, that might be similar.
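
For reference, the "2-3 years" figure is just the ~20k hours converted to years of continuous uptime. A tiny Python check, assuming the drive is powered on 24/7 (the 20,000-hour value is the rough estimate from this comment, not a measured number):

```python
HOURS_PER_YEAR = 24 * 365  # continuous operation, ignoring leap days

remaining_hours = 20_000   # rough estimate quoted in the comment above
remaining_years = remaining_hours / HOURS_PER_YEAR

print(f"{remaining_years:.1f} years")  # 2.3 years
```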

Napsty commented 2 months ago

> This warning means that there are only ~20k hours of uptime left for your NVMe drive, so you have only 2-3 years to replace it before it actually fails.

:rofl:

Anyway, thanks for the PR.

Napsty / check_smart

Ignore old age attributes #101