dak180 / FreeNAS-Report

SMART & ZPool Status Report for FreeNAS/TrueNAS
GNU General Public License v3.0

Feature Request - Indication of Failed Selftest #9

Closed: mth309 closed this issue 2 years ago

mth309 commented 2 years ago

Currently the script collects the type of the last self-test and how long ago it ran, but not whether that self-test passed or failed. I believe this is an important oversight: you could be running daily self-tests that all report bad LBAs or other failures, and you would never know it with the current implementation.

I don't think another field needs to be added to the summary table for self-test status. Instead, I would recommend using the warning color or critical color on the 'last test type' field if the last self-test failed. The current script uses the critical color on the 'last test time' field if it exceeds the threshold time period, but it does not colorize the test type for any reason, so this seemed like a good use for color on that field.

Below is example JSON output for a SATA drive with several failing self-tests. I show the non-JSON version as well because it is easier to read, but from a script perspective, parsing the 'status.passed' JSON attribute as true/false seems to be the way to go for SATA.

root@backup[~]# smartctl -l selftest /dev/da9
smartctl 7.2 2020-12-30 r5155 [FreeBSD 12.2-RELEASE-p12 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%     57164         4184288
# 2  Short offline       Completed: read failure       90%     57163         4193464
# 3  Extended offline    Completed: read failure       90%     57115         4200736
# 4  Short offline       Completed: read failure       60%     57114         3907009672
# 5  Short offline       Completed: read failure       90%     57058         4200736
# 6  Short offline       Completed: read failure       90%     57009         4200736
# 7  Short offline       Completed: read failure       90%     56961         4200736
# 8  Short offline       Completed: read failure       90%     56913         4200736
# 9  Short offline       Completed: read failure       40%     56864         4200736
#10  Short offline       Completed: read failure       90%     56816         4200736
#11  Short offline       Completed: read failure       90%     56768         4200736
#12  Short offline       Completed: read failure       30%     56720         4200736
#13  Short offline       Completed: read failure       50%     56671         4200736
#14  Short offline       Completed without error       00%     56623         -
#15  Short offline       Completed without error       00%     56575         -
#16  Short offline       Completed without error       00%     56506         -
#17  Short offline       Completed without error       00%     56458         -
#18  Short offline       Completed without error       00%     56409         -
#19  Short offline       Completed without error       00%     56361         -
#20  Short offline       Completed without error       00%     56312         -
#21  Short offline       Completed without error       00%     56264         -

root@backup[~]# smartctl -lj selftest /dev/da9
smartctl 7.2 2020-12-30 r5155 [FreeBSD 12.2-RELEASE-p12 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=======> INVALID ARGUMENT TO -l: j
=======> VALID ARGUMENTS ARE: error, selftest, selective, directory[,g|s], xerror[,N][,error], xselftest[,N][,selftest], background, sasphy[,reset], sataphy[,reset], scttemp[sts,hist], scttempint,N[,p], scterc[,N,M], devstat[,N], defects[,N], ssd, gplog,N[,RANGE], smartlog,N[,RANGE], nvmelog,N,SIZE <=======

Use smartctl -h to get a usage summary

root@backup[~]# smartctl -jl selftest /dev/da9
{
  "json_format_version": [
    1,
    0
  ],
  "smartctl": {
    "version": [
      7,
      2
    ],
    "svn_revision": "5155",
    "platform_info": "FreeBSD 12.2-RELEASE-p12 amd64",
    "build_info": "(local build)",
    "argv": [
      "smartctl",
      "-jl",
      "selftest",
      "/dev/da9"
    ],
    "exit_status": 128
  },
  "device": {
    "name": "/dev/da9",
    "info_name": "/dev/da9 [SAT]",
    "type": "sat",
    "protocol": "ATA"
  },
  "ata_smart_self_test_log": {
    "standard": {
      "revision": 1,
      "table": [
        {
          "type": {
            "value": 2,
            "string": "Extended offline"
          },
          "status": {
            "value": 121,
            "string": "Completed: read failure",
            "remaining_percent": 90,
            "passed": false
          },
          "lifetime_hours": 57164,
          "lba": 4184288
        },
        {
          "type": {
            "value": 1,
            "string": "Short offline"
          },
          "status": {
            "value": 121,
            "string": "Completed: read failure",
            "remaining_percent": 90,
            "passed": false
          },
          "lifetime_hours": 57163,
          "lba": 4193464
        },
        {
          "type": {
            "value": 2,
            "string": "Extended offline"
          },
          "status": {
            "value": 121,
            "string": "Completed: read failure",
            "remaining_percent": 90,
            "passed": false
          },
          "lifetime_hours": 57115,
          "lba": 4200736
        },
        {
          "type": {
            "value": 1,
            "string": "Short offline"
          },
          "status": {
            "value": 118,
            "string": "Completed: read failure",
            "remaining_percent": 60,
            "passed": false
          },
          "lifetime_hours": 57114,
          "lba": 3907009672
        },
        {
          "type": {
            "value": 1,
            "string": "Short offline"
          },
          "status": {
            "value": 121,
            "string": "Completed: read failure",
            "remaining_percent": 90,
            "passed": false
          },
          "lifetime_hours": 57058,
          "lba": 4200736
        },
        {
          "type": {
            "value": 1,
            "string": "Short offline"
          },
          "status": {
            "value": 121,
            "string": "Completed: read failure",
            "remaining_percent": 90,
            "passed": false
          },
          "lifetime_hours": 57009,
          "lba": 4200736
        },
        {
          "type": {
            "value": 1,
            "string": "Short offline"
          },
          "status": {
            "value": 121,
            "string": "Completed: read failure",
            "remaining_percent": 90,
            "passed": false
          },
          "lifetime_hours": 56961,
          "lba": 4200736
        },
        {
          "type": {
            "value": 1,
            "string": "Short offline"
          },
          "status": {
            "value": 121,
            "string": "Completed: read failure",
            "remaining_percent": 90,
            "passed": false
          },
          "lifetime_hours": 56913,
          "lba": 4200736
        },
        {
          "type": {
            "value": 1,
            "string": "Short offline"
          },
          "status": {
            "value": 116,
            "string": "Completed: read failure",
            "remaining_percent": 40,
            "passed": false
          },
          "lifetime_hours": 56864,
          "lba": 4200736
        },
        {
          "type": {
            "value": 1,
            "string": "Short offline"
          },
          "status": {
            "value": 121,
            "string": "Completed: read failure",
            "remaining_percent": 90,
            "passed": false
          },
          "lifetime_hours": 56816,
          "lba": 4200736
        },
        {
          "type": {
            "value": 1,
            "string": "Short offline"
          },
          "status": {
            "value": 121,
            "string": "Completed: read failure",
            "remaining_percent": 90,
            "passed": false
          },
          "lifetime_hours": 56768,
          "lba": 4200736
        },
        {
          "type": {
            "value": 1,
            "string": "Short offline"
          },
          "status": {
            "value": 115,
            "string": "Completed: read failure",
            "remaining_percent": 30,
            "passed": false
          },
          "lifetime_hours": 56720,
          "lba": 4200736
        },
        {
          "type": {
            "value": 1,
            "string": "Short offline"
          },
          "status": {
            "value": 117,
            "string": "Completed: read failure",
            "remaining_percent": 50,
            "passed": false
          },
          "lifetime_hours": 56671,
          "lba": 4200736
        },
        {
          "type": {
            "value": 1,
            "string": "Short offline"
          },
          "status": {
            "value": 0,
            "string": "Completed without error",
            "passed": true
          },
          "lifetime_hours": 56623
        },
        {
          "type": {
            "value": 1,
            "string": "Short offline"
          },
          "status": {
            "value": 0,
            "string": "Completed without error",
            "passed": true
          },
          "lifetime_hours": 56575
        },
        {
          "type": {
            "value": 1,
            "string": "Short offline"
          },
          "status": {
            "value": 0,
            "string": "Completed without error",
            "passed": true
          },
          "lifetime_hours": 56506
        },
        {
          "type": {
            "value": 1,
            "string": "Short offline"
          },
          "status": {
            "value": 0,
            "string": "Completed without error",
            "passed": true
          },
          "lifetime_hours": 56458
        },
        {
          "type": {
            "value": 1,
            "string": "Short offline"
          },
          "status": {
            "value": 0,
            "string": "Completed without error",
            "passed": true
          },
          "lifetime_hours": 56409
        },
        {
          "type": {
            "value": 1,
            "string": "Short offline"
          },
          "status": {
            "value": 0,
            "string": "Completed without error",
            "passed": true
          },
          "lifetime_hours": 56361
        },
        {
          "type": {
            "value": 1,
            "string": "Short offline"
          },
          "status": {
            "value": 0,
            "string": "Completed without error",
            "passed": true
          },
          "lifetime_hours": 56312
        },
        {
          "type": {
            "value": 1,
            "string": "Short offline"
          },
          "status": {
            "value": 0,
            "string": "Completed without error",
            "passed": true
          },
          "lifetime_hours": 56264
        }
      ],
      "count": 21,
      "error_count_total": 13,
      "error_count_outdated": 0
    }
  }
}
root@backup[~]#
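The 'status.passed' check described above could be sketched roughly as follows. This is an illustration in Python, not the script's own code. The function name `last_selftest_passed` is hypothetical, and the sample is trimmed from the output above to two entries. It assumes the first entry of `table` is the most recent test, as in the listing above. Returning `None` when the key is missing is a defensive assumption for cases such as an empty log.

```python
import json
from typing import Optional

# Trimmed from the smartctl JSON output above (newest entry first).
SAMPLE = """
{
  "ata_smart_self_test_log": {
    "standard": {
      "revision": 1,
      "table": [
        {
          "type": {"value": 2, "string": "Extended offline"},
          "status": {"value": 121, "string": "Completed: read failure",
                     "remaining_percent": 90, "passed": false},
          "lifetime_hours": 57164,
          "lba": 4184288
        },
        {
          "type": {"value": 1, "string": "Short offline"},
          "status": {"value": 0, "string": "Completed without error",
                     "passed": true},
          "lifetime_hours": 56623
        }
      ]
    }
  }
}
"""


def last_selftest_passed(report: dict) -> Optional[bool]:
    """Return status.passed for the newest ATA self-test entry.

    Returns None when there is no self-test table or the newest entry
    has no 'passed' key (hypothetical helper, defensive assumption).
    """
    table = (
        report.get("ata_smart_self_test_log", {})
        .get("standard", {})
        .get("table", [])
    )
    if not table:
        return None
    return table[0].get("status", {}).get("passed")


if __name__ == "__main__":
    result = last_selftest_passed(json.loads(SAMPLE))
    if result is False:
        print("last self-test FAILED: use the warning/critical color")
    elif result is True:
        print("last self-test passed")
    else:
        print("no self-test result available")
```

A report script would feed in the real `smartctl -jl selftest` output instead of `SAMPLE` and choose the color for the 'last test type' field from the returned value.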

Unfortunately, smartctl (at least the 7.2 build shown here) does not support JSON output for SCSI drive self-test logs, so parsing test results from the non-JSON format is a bit more complicated.

root@freenas:~ # smartctl -l selftest /dev/da12
smartctl 7.2 2020-12-30 r5155 [FreeBSD 12.2-RELEASE-p14 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background short  Completed                   -    7195                 - [-   -    -]
# 2  Background short  Completed                   -    7027                 - [-   -    -]
# 3  Background short  Completed                   -    6811                 - [-   -    -]
# 4  Background short  Completed                   -    6644                 - [-   -    -]
# 5  Background short  Completed                   -    6476                 - [-   -    -]
# 6  Background short  Completed                   -    6308                 - [-   -    -]
# 7  Background short  Completed                   -    6068                 - [-   -    -]
# 8  Background short  Completed                   -    5900                 - [-   -    -]
# 9  Background short  Completed                   -    5733                 - [-   -    -]
#10  Background short  Completed                   -    5565                 - [-   -    -]
#11  Background short  Completed                   -    5397                 - [-   -    -]
#12  Background short  Completed                   -    5229                 - [-   -    -]
#13  Background short  Completed                   -    5061                 - [-   -    -]
#14  Background short  Completed                   -    4893                 - [-   -    -]
#15  Background short  Completed                   -    4669                 - [-   -    -]
#16  Background short  Completed                   -    4501                 - [-   -    -]
#17  Background short  Completed                   -    4333                 - [-   -    -]
#18  Background short  Completed                   -    4165                 - [-   -    -]
#19  Background short  Completed                   -    3925                 - [-   -    -]
#20  Background short  Completed                   -    3757                 - [-   -    -]

From experience, any value other than a hyphen in any of the final four positions should be considered a test failure. Usually all four fields change to a non-hyphen value at the same time, but the drive might report a Key Code Qualifier (KCQ) in the final three fields without reporting an LBA where the error took place, or vice versa. In any case, if anything shows up in any of those four fields, it is worth flagging the last test in a bad color so the user knows to look into it.
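The "anything other than a hyphen in the final four fields" rule could be sketched as below. This is a Python illustration only, not the script's own code. The function names are hypothetical, and `FAILED_LINE` is an invented example of what a failing entry might look like, not output captured from a real drive. It assumes the newest entry is the first `#` line, as in the listing above.

```python
import re
from typing import Optional

# Header and first two entries copied from the SCSI log above.
SAMPLE_LOG = """\
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background short  Completed                   -    7195                 - [-   -    -]
# 2  Background short  Completed                   -    7027                 - [-   -    -]
"""

# HYPOTHETICAL failing entry, invented for illustration only.
FAILED_LINE = (
    "# 1  Background long   Failed in segment -->       -    7200"
    "            123456 [0x3 0x11 0x0]"
)

# Captures LBA_first_err plus the bracketed SK, ASC and ASQ fields.
TAIL_RE = re.compile(r"(\S+)\s+\[(\S+)\s+(\S+)\s+(\S+)\]\s*$")
# Entry lines look like "# 1 ..." or "#10 ...".
ENTRY_RE = re.compile(r"#\s*\d+\s")


def entry_failed(line: str) -> Optional[bool]:
    """True if any of the final four fields is not a hyphen."""
    match = TAIL_RE.search(line)
    if match is None:
        return None  # line does not look like a log entry
    return any(field != "-" for field in match.groups())


def last_scsi_selftest_failed(log_text: str) -> Optional[bool]:
    """Check the first (newest) entry line of a SCSI self-test log."""
    for line in log_text.splitlines():
        if ENTRY_RE.match(line):
            return entry_failed(line)
    return None  # no entries found


if __name__ == "__main__":
    print(last_scsi_selftest_failed(SAMPLE_LOG))
    print(entry_failed(FAILED_LINE))
```

Matching from the end of the line avoids depending on the column widths, which vary with the length of the status text.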

mth309 commented 2 years ago

Just wanted to let you know that, depending on how this weekend goes, I might find time to write the code and send you a pull request implementing the above behavior. It should only be a few lines of code for each drive type (SAS/SATA). I posted the request here first so you could see what I'm thinking and let me know if you don't agree. If you have an opinion on colorizing the test type instead of adding a new pass/fail field, or on whether you prefer the warning or critical color for that, please let me know. If you want to implement the feature yourself rather than wait on me, by all means go for it. If you're not in a rush, I'll be adding it to my own server at some point and will send it to you after. Thanks!

dak180 commented 2 years ago

@mth309 Since I do not have SAS drives and you do, I will let you take the first pass at them.

mth309 commented 2 years ago

@dak180 I just submitted a pull request for the SAS version of the code.