Napsty / check_smart

Monitoring Plugin to check hard drives, solid state drives and NVMe drives using SMART
https://www.claudiokuenzler.com/monitoring-plugins/check_smart.php
GNU General Public License v3.0

Percent_Lifetime_Remain usage #81

Closed by eLvErDe 2 years ago

eLvErDe commented 2 years ago

Hello,

I'm not sure I understand how to deal with Percent_Lifetime_Remain; the threshold seems to work upside down:

./check_smart.pl --device=/dev/sdb --interface=megaraid,14 --selftest --ssd-lifetime --warn Percent_Lifetime_Remain=2
WARNING: Drive  CT120BX300SSD1 S/N 1745E10657F2:  Percent_Lifetime_Remain is non-zero (4), |Raw_Read_Error_Rate=0 Reallocate_NAND_Blk_Cnt=0 Power_On_Hours=31882 Power_Cycle_Count=201 Program_Fail_Count=0 Erase_Fail_Count=0 Ave_Block-Erase_Count=137 Unexpect_Power_Loss_Ct=89 Unused_Reserve_NAND_Blk=44 SATA_Interfac_Downshift=0 Error_Correction_Count=0 Reported_Uncorrect=0 Temperature_Celsius=24 Reallocated_Event_Count=0 Current_Pending_ECC_Cnt=0 Offline_Uncorrectable=0 UDMA_CRC_Error_Count=154 Percent_Lifetime_Remain=4 Write_Error_Rate=0 Success_RAIN_Recov_Cnt=0 Total_LBAs_Written=7218163800 Host_Program_Page_Count=85117225 FTL_Program_Page_Count=131236333
 ./check_smart.pl --device=/dev/sdb --interface=megaraid,14 --selftest --ssd-lifetime --warn Percent_Lifetime_Remain=10
OK: Drive  CT120BX300SSD1 S/N 1745E10657F2: no SMART errors detected.  Percent_Lifetime_Remain is non-zero (4) (but less than threshold 10)|Raw_Read_Error_Rate=0 Reallocate_NAND_Blk_Cnt=0 Power_On_Hours=31882 Power_Cycle_Count=201 Program_Fail_Count=0 Erase_Fail_Count=0 Ave_Block-Erase_Count=137 Unexpect_Power_Loss_Ct=89 Unused_Reserve_NAND_Blk=44 SATA_Interfac_Downshift=0 Error_Correction_Count=0 Reported_Uncorrect=0 Temperature_Celsius=24 Reallocated_Event_Count=0 Current_Pending_ECC_Cnt=0 Offline_Uncorrectable=0 UDMA_CRC_Error_Count=154 Percent_Lifetime_Remain=4 Write_Error_Rate=0 Success_RAIN_Recov_Cnt=0 Total_LBAs_Written=7218163800 Host_Program_Page_Count=85117225 FTL_Program_Page_Count=131236338

and it seems there is no support for regular Nagios thresholds using a colon. Can you help me with that?

Regards, Adam.

Napsty commented 2 years ago

Hi Adam

Take a look at the article https://www.claudiokuenzler.com/blog/1077/when-is-solid-state-drive-ssd-dead-analysis-crucial-mx500-1tb. It has an example of how Percent_Lifetime_Remain works. The name might be confusing in this situation because smartctl "translates" the raw values into increasing counters.

You want to set the warning threshold to a higher value, for example 90. That's what the -l parameter does in the background (it adds Percent_Lifetime_Remain=90 to the warning list).

Note that this attribute only seems to exist on Crucial SSD drives.
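To make the threshold direction concrete, here is a minimal Python sketch of the comparison as it can be inferred from the two outputs earlier in this thread. It is not the plugin's actual Perl code: the function name `evaluate` and the constant `ATTRIBUTE` are hypothetical, and treating a value equal to the threshold as a warning is an assumption based on the "less than threshold" wording in the OK message.

```python
from typing import Optional, Tuple

ATTRIBUTE = "Percent_Lifetime_Remain"  # attribute name taken from the thread


def evaluate(value: int, warn_threshold: Optional[int] = None) -> Tuple[str, str]:
    """Return (status, message) for an attribute that counts up from zero.

    Assumption: behaviour inferred from the outputs in this thread,
    not copied from check_smart.pl.
    """
    if value == 0:
        return "OK", "no SMART errors detected."
    if warn_threshold is not None and value < warn_threshold:
        # Non-zero but still below the configured warning threshold
        return "OK", (
            f"{ATTRIBUTE} is non-zero ({value}) "
            f"(but less than threshold {warn_threshold})"
        )
    # At or above the threshold (assumed), or no threshold given
    return "WARNING", f"{ATTRIBUTE} is non-zero ({value})"


if __name__ == "__main__":
    # Value 4 as reported by the drive in this thread
    for threshold in (2, 10, 90):
        status, message = evaluate(4, threshold)
        print(f"--warn {ATTRIBUTE}={threshold}: {status}: {message}")
```

With the drive's value of 4, a threshold of 2 yields WARNING while 10 or 90 yields OK, which matches the outputs shown above: because the counter increases, the threshold marks the point at which the value has climbed too high.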

eLvErDe commented 2 years ago

Oh, okay, so it actually works in reverse. You should write this down somewhere ;) So basically, with this attribute at 4, the SSD is still brand new.

Napsty commented 2 years ago

It's actually documented somewhere in Smartmontools (that the counters all count up), but I will add some more info to the documentation. Thanks for the hint. Additional information can also be found here: https://www.claudiokuenzler.com/blog/1056/check-smart-6.9.0-pci-device-name-percent-lifetime-remain-ssd-attribute

Yes, value 4 indicates a fairly new drive.

eLvErDe commented 2 years ago

Thanks a lot for all the details and for this great check :)

Napsty commented 2 years ago

Documentation on https://www.claudiokuenzler.com/monitoring-plugins/check_smart.php updated.