Napsty / check_smart

Monitoring Plugin to check hard drives, solid state drives and NVMe drives using SMART
https://www.claudiokuenzler.com/monitoring-plugins/check_smart.php
GNU General Public License v3.0

Wear_Leveling_Count is not reported as CRIT when disk is almost dead #73

Closed pschonmann closed 2 years ago

pschonmann commented 2 years ago

We found some disks that are almost dead because the Wear_Leveling_Count normalized value is 1. The raw data is probably useful only for vendor software, see https://superuser.com/questions/1037644/samsung-ssd-wear-leveling-count-meaning

The disk's Device Model is: Samsung SSD 850 EVO 500GB

and the check is reporting OK :(

OK: Drive Samsung SSD 850 EVO 500GB S/N S3R3NF0J: no SMART errors detected. |Reallocated_Sector_Ct=0 Power_On_Hours=20837 Power_Cycle_Count=2 Wear_Leveling_Count=2981 Used_Rsvd_Blk_Cnt_Tot=0 Program_Fail_Cnt_Total=0 Erase_Fail_Count_Total=0 Runtime_Bad_Block=0 Uncorrectable_Error_Cnt=0 Airflow_Temperature_Cel=24 ECC_Error_Rate=0 CRC_Error_Count=0 POR_Recovery_Count=0 Total_LBAs_Written=572549452098

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   095   095   000    Old_age   Always       -       20837
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       2
177 Wear_Leveling_Count     0x0013   001   001   000    Pre-fail  Always       -       2981
179 Used_Rsvd_Blk_Cnt_Tot   0x0013   100   100   010    Pre-fail  Always       -       0
181 Program_Fail_Cnt_Total  0x0032   100   100   010    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   100   100   010    Old_age   Always       -       0
183 Runtime_Bad_Block       0x0013   100   099   010    Pre-fail  Always       -       0
187 Uncorrectable_Error_Cnt 0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0032   076   060   000    Old_age   Always       -       24
195 ECC_Error_Rate          0x001a   200   200   000    Old_age   Always       -       0
199 CRC_Error_Count         0x003e   100   100   000    Old_age   Always       -       0
235 POR_Recovery_Count      0x0012   100   100   000    Old_age   Always       -       0
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       572549492484
SMART Error Log Version: 1
No Errors Logged
Napsty commented 2 years ago

Thanks for reporting this. In #36 I tried to determine which attributes should be added to the default raw (check) list. It seems that attribute 177 Wear_Leveling_Count is not used by all SSD models: I can see this attribute on my Samsung SSDs (Samsung SSD 850 EVO 500GB), but not on SanDisk or Western Digital SSDs.

Now the big question is whether this Wear_Leveling_Count attribute is really a strong/important indicator of pending drive failure. Do you have any official Samsung documentation at hand?

(A good comparison is the Total_BadBlock attribute, which is used by some SSD models. The name itself sounds alarming, yet the values can vary a lot, even for brand new drives, and they don't really show a pending failure.)

Also in the superuser link you posted, someone mentions:

All of your drives are at between 95 and 100, and will eventually drop to 0.

I'm not sure that this is correct. The attribute counters shown by smartctl usually start at 0 and increase to 100. This can be seen on Crucial MX SSDs with the Percent_Lifetime_Remain attribute (see https://www.claudiokuenzler.com/blog/1077/when-is-solid-state-drive-ssd-dead-analysis-crucial-mx500-1tb for a detailed analysis). Although the name indicates "remain", the counter actually starts at 0. The same could be the case for the Wear_Leveling_Count attribute (TBV!).

In my own Samsung SSDs, I can see the following values:

ckadm@mintp ~ $ sudo smartctl -a /dev/sda | grep "Wear_Leveling_Count"
177 Wear_Leveling_Count     0x0013   099   099   000    Pre-fail  Always       -       19
ckadm@mintp ~ $ sudo smartctl -a /dev/sdb | grep "Wear_Leveling_Count"
177 Wear_Leveling_Count     0x0013   098   098   000    Pre-fail  Always       -       22

I personally interpret this as 19% and 22%, which would still be largely OK if 100% is the assumed maximum value.

Now the big question is the following: do we find proof somewhere that Wear_Leveling_Count really is an important indicator of a pending failure? If yes, from which value on should this be considered CRITICAL? Above 90? I see your drive has a raw value of 2981, whatever that means.

Until this is discussed and solved, you can use the following workaround (append Wear_Leveling_Count to the raw list):

$ ./check_smart.pl -r "Current_Pending_Sector,Reallocated_Sector_Ct,Program_Fail_Cnt_Total,Uncorrectable_Error_Cnt,Offline_Uncorrectable,Runtime_Bad_Block,Reported_Uncorrect,Reallocated_Event_Count,Wear_Leveling_Count" -d /dev/sda -i ata
WARNING: Drive  Samsung SSD 850 EVO 500GB S/N XXX:  Wear_Leveling_Count is non-zero (19), |Reallocated_Sector_Ct=0 Power_On_Hours=19456 Power_Cycle_Count=435 Wear_Leveling_Count=19 Used_Rsvd_Blk_Cnt_Tot=0 Program_Fail_Cnt_Total=0 Erase_Fail_Count_Total=0 Runtime_Bad_Block=0 Uncorrectable_Error_Cnt=0 Airflow_Temperature_Cel=34 ECC_Error_Rate=0 CRC_Error_Count=0 POR_Recovery_Count=8 Total_LBAs_Written=24168613381
Napsty commented 2 years ago

I found an interesting Samsung (official) document. https://image-us.samsung.com/SamsungUS/b2b/resource/2016/05/31/WHP-SSD-SSDSMARTATTRIBUTES-APR16J.pdf

The raw value of Wear Leveling Count reports the amount of NAND writes as a function of consumed P/E cycles, meaning that an increment of 1 corresponds to one full drive write. It should be noted that one full drive write in this context means the physical, raw NAND capacity of the drive, so in case of a 960GB SM863 for example, an increase of 1 in Wear Leveling Count translates to 1,024GiB of NAND writes.

This indicates that the Wear_Leveling_Count raw value means the number of full drive writes. So if you have a 500GB drive and the Wear_Leveling_Count raw value is 19 (as in my case), this would mean that roughly 19 * 500GB has been written to the drive.

This can help you calculate an estimated remaining lifetime (see the Samsung document for the formula), but it does not indicate a pending failure.
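
For illustration, a rough back-of-the-envelope version of that calculation could look like the shell sketch below. The device path, the capacity and the 150 TBW endurance figure are assumptions for the example; check your drive's datasheet for the real rated endurance.

# Rough estimate of total NAND writes derived from the Wear_Leveling_Count raw value.
WLC_RAW=$(sudo smartctl -A /dev/sda | awk '$2 == "Wear_Leveling_Count" {print $10}')   # raw value = full drive writes
CAPACITY_GB=500     # drive capacity in GB (assumption for this example)
RATED_TBW=150       # rated endurance in TB written (assumption, take it from your datasheet)

awk -v wlc="$WLC_RAW" -v cap="$CAPACITY_GB" -v tbw="$RATED_TBW" 'BEGIN {
    written_tb = wlc * cap / 1000                  # full drive writes -> TB written
    printf "~%.1f TB written, ~%.1f%% of rated endurance consumed\n",
           written_tb, (written_tb / tbw) * 100
}'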

The document mentions the following attributes to be considered critical for drive health:

The four SMART attributes listed in the table below are the most important indicators of drive health. If any of the normalized values drop below the 10% threshold, it’s recommended to replace the drive as soon as possible because it’s approaching the end of its life and may become unreliable if used longer.

179 Unused Reserved Block Count (Used_Rsvd_Blk_Cnt_Tot)
181 Program Fail Count (Program_Fail_Cnt_Total) -> already part of default raw list
182 Erase Fail Count (Erase_Fail_Count_Total)
183 Runtime Bad Count (Runtime_Bad_Block) -> already part of default raw list

So I suggest adding Erase_Fail_Count_Total to the default raw list.
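
For reference, a quick way to eyeball where a drive stands against that 10% rule is to read the normalized values straight from smartctl. A minimal sketch, assuming smartctl's standard -A table layout (column 4 is the normalized VALUE):

$ sudo smartctl -A /dev/sda | awk '
    $2 ~ /^(Used_Rsvd_Blk_Cnt_Tot|Program_Fail_Cnt_Total|Erase_Fail_Count_Total|Runtime_Bad_Block)$/ {
        # the Samsung paper suggests replacing the drive once the normalized value drops below 10
        printf "%-24s normalized=%s %s\n", $2, $4, ($4 + 0 < 10 ? "-> replace soon" : "ok")
    }'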

pschonmann commented 2 years ago

I found some info https://web.archive.org/web/20150310051031/http://www.samsung.com/global/business/semiconductor/minisite/SSD/global/html/whitepaper/whitepaper07.html

This attribute represents the number of media program and erase operations (the number of times a block has been erased). This value is directly related to the lifetime of the SSD. The raw value of this attribute shows the total count of P/E Cycles.

SRC: https://newbedev.com/how-to-check-the-life-left-in-ssd-or-the-medium-s-wear-level

pschonmann commented 2 years ago

OK, now I run...

check_smart.pl -g '/dev/sd[a-z] /dev/sd[abc][a-z]' -i 'auto' -E Airflow_Temperature_Cel -w 'Reallocated_Sector_Ct=15,Current_Pending_Sector=100,Reallocated_Event_Count=100,Runtime_Bad_Block=100,Uncorrectable_Error_Cnt=100,Wear_Leveling_Count=300,Erase_Fail_Count_Total=1' --debug

But in the debug output I see: (debug) Erase_Fail_Count_Total not in raw check list (raw value: 0). Is that value monitored? I have no disk with a value > 0 to test (unfortunately :) )

EDIT: Oh, I had the old version 6.9.0. Updated, and now it seems OK.

Napsty commented 2 years ago

@pschonmann https://raw.githubusercontent.com/Napsty/check_smart/6.11.1/check_smart.pl now contains Erase_Fail_Count_Total in the default raw list. This will be released in the next version, 6.12.0.
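
(If you want to verify that a locally installed copy already contains the new default, a simple grep will do; the path below is just an example:)

$ grep -n "Erase_Fail_Count_Total" /usr/lib/nagios/plugins/check_smart.pl

A non-empty result suggests the installed copy already references the attribute.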

pschonmann commented 2 years ago

Thanks. And would it be possible to monitor the Wear_Leveling_Count normalized value? The normalized value decrements from 100 to 0, so it would be nice to be informed during the last 10%, when a change of disk is recommended.

Napsty commented 2 years ago

Unfortunately we cannot. We can only read the raw values from smartctl. Unless you know how to?
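
(For illustration only, not something check_smart currently does: the normalized VALUE column is present in smartctl's -A table, so a manual spot check of it could look like the sketch below; attribute name and column positions assume the standard ATA attribute table.)

$ sudo smartctl -A /dev/sda | awk '$2 == "Wear_Leveling_Count" {print "normalized=" $4, "worst=" $5, "threshold=" $6, "raw=" $10}'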