Spearfoot / FreeNAS-scripts

Handy shell scripts for use on FreeNAS servers

Error on calculation of "Last Test Age" in smart_report.sh #25

Open flederohr opened 1 year ago

flederohr commented 1 year ago

The Last Test Age display had been working for years without any issues. In the latest SMART report I got this output:

+-------+------------------------+----+------+-----+-----+-------+-------+--------+------+----------+------+-----------+----+
|Device |Serial                  |Temp| Power|Start|Spin |ReAlloc|Current|Offline |Seek  |Total     |High  |    Command|Last|
|       |Number                  |    | On   |Stop |Retry|Sectors|Pending|Uncorrec|Errors|Seeks     |Fly   |    Timeout|Test|
|       |                        |    | Hours|Count|Count|       |Sectors|Sectors |      |          |Writes|    Count  |Age |
+-------+------------------------+----+------+-----+-----+-------+-------+--------+------+----------+------+-----------+----+
|ada0 ? |WD-************         |39  | 65620|  186|    0|      0|      0|       0|   N/A|       N/A|   N/A|        N/A|2732*|

...

########## SATA drive /dev/ada0 Serial: WD-************
########## Western Digital Red (WDC ************)

SMART overall-health self-assessment test result: PASSED

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   182   173   021    Pre-fail  Always       -       3900
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       188
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   011   011   000    Old_age   Always       -       65620
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       186
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       154
193 Load_Cycle_Count        0x0032   199   199   000    Old_age   Always       -       3283
194 Temperature_Celsius     0x0022   108   094   000    Old_age   Always       -       39
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

No Errors Logged

Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
Short offline       Completed without error       00%        61         -

On further analysis I found that the S.M.A.R.T. LifeTime(hours) counter in the self-test log seems to have reset itself, while Power_On_Hours kept counting (65620):

 /usr/local/sbin/smartctl -l selftest /dev/ada0
smartctl 7.2 2021-09-14 r5236 [FreeBSD 13.1-RELEASE-p2 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%        61         -
# 2  Extended offline    Completed without error       00%     65483         -
# 3  Short offline       Completed without error       00%     65334         -
# 4  Short offline       Completed without error       00%     65214         -
# 5  Extended offline    Completed without error       00%     65099         -
# 6  Short offline       Completed without error       00%     64974         -
# 7  Short offline       Completed without error       00%     64854         -
# 8  Extended offline    Completed without error       00%     64739         -
# 9  Short offline       Completed without error       00%     64590         -

According to this resource, the counter is normally stored in a 16-bit field, although this can differ between HDD vendors: https://serverfault.com/questions/1041661/s-m-a-r-t-lifetime-hours-resetting-to-zero

I fixed the issue on my system by adding a modulo operation to the calculation: testAge=sprintf("%.0f", ((onHours % 65535) - lastTestHours) / 24); https://github.com/Spearfoot/FreeNAS-scripts/blob/06ccffb9710b3d372ccefe0de4b093e00cb2a00c/smart_report.sh#L131
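
For anyone checking the arithmetic: without the modulo, the script computes (65620 - 61) / 24 ≈ 2731.6, which rounds to the 2732 days shown in the report. The sketch below is one possible way to write the wrap-aware calculation as a standalone helper. It is not the code in smart_report.sh: the function name test_age_days is hypothetical, and it assumes the self-test log stores LifeTime(hours) in a 16-bit field, so it wraps at 65536 (2^16) rather than the 65535 used in the fix above. It also applies the modulo to the difference instead of to onHours, so a test logged shortly before the wrap still yields a small positive age.

```shell
#!/bin/sh
# Hypothetical helper, not part of smart_report.sh.
# Assumption: the drive logs LifeTime(hours) in a 16-bit field (wraps at 65536).
test_age_days() {
  # $1 = Power_On_Hours raw value, $2 = LifeTime(hours) of the latest self-test
  awk -v onHours="$1" -v lastTestHours="$2" 'BEGIN {
    wrap = 65536                          # 2^16, assumed counter size
    diff = (onHours - lastTestHours) % wrap
    if (diff < 0) diff += wrap            # awk % keeps the sign of the dividend
    printf "%.0f\n", diff / 24            # hours -> days, rounded
  }'
}

test_age_days 65620 61       # values from the report above; prints 1
test_age_days 65540 65483    # test logged just before the wrap; prints 2
```

With the values from this drive, both this sketch and the posted one-line fix report a test age of 1 day. Drives that use a wider field for the log counter would need a different wrap value.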

SavageCore commented 1 year ago

Oh my, thank you! Thought the tests hadn't been running and I was panicking. Will PR this change.