AnalogJ / scrutiny

Hard Drive S.M.A.R.T Monitoring, Historical Trends & Real World Failure Thresholds
MIT License

[Feature] Add support for additional arguments when smartctl is executed - Seagate drives use 48 bit raw values and only the first 16 bits are the error data #255

Closed Parlane closed 2 years ago

Parlane commented 2 years ago

Describe the bug

Seagate IronWolf drives show as FAILED with high seek and read error counts.

Expected behavior

Some way to configure, per drive, extra arguments for smartctl calls.

Seagate IronWolf drives use a 48-bit raw value made up of a 16-bit error count and a 32-bit total count of read or seek events.

For smartctl I have to manually specify the correct bits to read from: smartctl /dev/sdb -a -v 1,raw48:54 -v 7,raw48:54

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   083   067   044    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0003   085   080   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       112
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   071   060   045    Pre-fail  Always       -       0

And smartctl without the specification:

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   083   067   044    Pre-fail  Always       -       200450784
  3 Spin_Up_Time            0x0003   085   080   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       112
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   071   060   045    Pre-fail  Always       -       12399940

The 200450784 value above is 0xBF2A2E0, which is only 28 bits of data, so it is only part of the total count and not the error count. The full 48-bit hex value would be 00000BF2A2E0, which splits as [0000][0BF2A2E0], so the actual value of Raw_Read_Error_Rate is 0.
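
The split described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, assuming the layout stated in this issue (upper 16 bits are the error count, lower 32 bits are the total count); the function name `split_seagate_raw48` is made up for this example and is not part of smartctl or Scrutiny.

```python
def split_seagate_raw48(raw: int) -> tuple:
    """Split a 48-bit raw SMART value into (error_count, total_count).

    Assumes the layout described in this issue: bytes 5 and 4 (the upper
    16 bits) hold the error count, bytes 3..0 (the lower 32 bits) hold the
    total number of read or seek events.
    """
    if not 0 <= raw < (1 << 48):
        raise ValueError("raw value does not fit in 48 bits")
    error_count = raw >> 32         # upper 16 bits
    total_count = raw & 0xFFFFFFFF  # lower 32 bits
    return error_count, total_count


# Raw values taken from the smartctl output above (without -v ...,raw48:54).
for name, raw in [("Raw_Read_Error_Rate", 200450784),
                  ("Seek_Error_Rate", 12399940)]:
    errors, total = split_seagate_raw48(raw)
    print(f"{name}: raw=0x{raw:012X} errors={errors} total={total}")
```

For both raw values from the output above the upper 16 bits are zero, which matches the 0 that smartctl reports when `-v 1,raw48:54 -v 7,raw48:54` is given.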

Screenshots

image

image

AnalogJ commented 2 years ago

Sounds like a worthwhile enhancement. I'll need to take a closer look at the smartctl documentation, but this should be easy enough to implement by expanding the collector.yaml config file.

somebody-somewhere-over-the-rainbow commented 2 years ago

This is also true for Seagate Exos X18 (18TB) drives. I would assume that this is true for most - if not all - modern Seagate drives...

AnalogJ commented 2 years ago

Hey @Parlane @alexw1982, I made some changes to the collector and the collector config file so that it supports overriding the smartctl --info and smartctl --xall commands that Scrutiny uses for data collection.

Once the beta branch finishes building, can you pull the docker images and test out the changes?

Here are the relevant config file changes:

https://github.com/AnalogJ/scrutiny/blob/beta/example.collector.yaml#L57-L61

https://github.com/AnalogJ/scrutiny/blob/beta/example.collector.yaml#L74-L78

Parlane commented 2 years ago

I had to specify 'ata' for each device otherwise it did not try to find them?

# Commented Scrutiny Configuration File
#
# The default location for this file is /opt/scrutiny/config/collector.yaml.
# In some cases to improve clarity default values are specified,
# uncommented. Other example values are commented out.
#
# When this file is parsed by Scrutiny, all configuration file keys are
# lowercased automatically. As such, Configuration keys are case-insensitive,
# and should be lowercase in this file to be consistent with usage.

######################################################################
# Version
#
# version specifies the version of this configuration file schema, not
# the scrutiny binary. There is only 1 version available at the moment
version: 1

# The host id is a label used for identifying groups of disks running on the same host
# Primarily used for hub/spoke deployments (can be left empty if using all-in-one image).
host:
  id: ""

# This block allows you to override/customize the settings for devices detected by
# Scrutiny via `smartctl --scan`
# See the "--device=TYPE" section of https://linux.die.net/man/8/smartctl
# type can be a 'string' or a 'list'
devices:
  # example to show how to override the smartctl command args (per device), see below for how to override these globally.
  - device: /dev/sda
    type: 'ata'
    commands:
      metrics_info_args: '--info --json -T permissive' # used to determine device unique ID & register device with Scrutiny
      metrics_smart_args: '-v 1,raw48:54 -v 7,raw48:54 --xall --json -T permissive' # used to retrieve smart data for each device.
  - device: /dev/sdb
    type: 'ata'
    commands:
      metrics_info_args: '--info --json -T permissive' # used to determine device unique ID & register device with Scrutiny
      metrics_smart_args: '-v 1,raw48:54 -v 7,raw48:54 --xall --json -T permissive' # used to retrieve smart data for each device.
  - device: /dev/sdc
    type: 'ata'
    commands:
      metrics_info_args: '--info --json -T permissive' # used to determine device unique ID & register device with Scrutiny
      metrics_smart_args: '-v 1,raw48:54 -v 7,raw48:54 --xall --json -T permissive' # used to retrieve smart data for each device.
  - device: /dev/sdd
    type: 'ata'
    commands:
      metrics_info_args: '--info --json -T permissive' # used to determine device unique ID & register device with Scrutiny
      metrics_smart_args: '-v 1,raw48:54 -v 7,raw48:54 --xall --json -T permissive' # used to retrieve smart data for each device.
  - device: /dev/sde
    type: 'ata'
    commands:
      metrics_info_args: '--info --json -T permissive' # used to determine device unique ID & register device with Scrutiny
      metrics_smart_args: '-v 1,raw48:54 -v 7,raw48:54 --xall --json -T permissive' # used to retrieve smart data for each device.
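
As a rough illustration of how per-device settings like the ones above could turn into a smartctl command line, here is a minimal Python sketch. It is not Scrutiny's actual implementation; the function name `build_smartctl_argv` and its parameters are hypothetical, and it only builds the argument list without running anything.

```python
import shlex


def build_smartctl_argv(device, smart_args, device_type=None):
    """Build a smartctl argument list from a per-device config entry.

    `smart_args` is the string from `metrics_smart_args`; `device_type`
    is the optional `type` value (for example 'ata'). Hypothetical helper,
    shown only to illustrate the idea.
    """
    argv = ["smartctl", *shlex.split(smart_args)]
    if device_type:
        # smartctl accepts the device type as --device=TYPE (or -d TYPE).
        argv.append(f"--device={device_type}")
    argv.append(device)
    return argv


argv = build_smartctl_argv(
    "/dev/sda",
    "-v 1,raw48:54 -v 7,raw48:54 --xall --json -T permissive",
    device_type="ata",
)
print(" ".join(shlex.quote(part) for part in argv))
```

Using `shlex.split` keeps each `-v` option and its value as separate arguments, so the resulting list can be passed to a process launcher without going through a shell.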

It still shows as failed. I guess you are using the normalized "value" field and not the raw value:


      {
        "id": 1,
        "name": "Raw_Read_Error_Rate",
        "value": 82,
        "worst": 64,
        "thresh": 44,
        "raw": {
          "value": 0,
          "string": "0"
        }
      },
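
To make the distinction concrete, the following Python sketch pulls both the normalized value and the raw value out of smartctl's JSON for each attribute. It is only an illustration and does not reflect how Scrutiny decides on a status; the helper name `summarize_attributes` is made up, and the `below_thresh` field applies the conventional SMART rule (normalized value at or below a non-zero threshold), which is an assumption about what a monitoring tool might check.

```python
import json

# Trimmed copy of the attribute shown above.
SAMPLE = """
{"ata_smart_attributes": {"table": [
  {"id": 1, "name": "Raw_Read_Error_Rate",
   "value": 82, "worst": 64, "thresh": 44,
   "raw": {"value": 0, "string": "0"}}
]}}
"""


def summarize_attributes(report):
    """Return one row per ATA attribute with normalized and raw values."""
    rows = []
    for attr in report.get("ata_smart_attributes", {}).get("table", []):
        rows.append({
            "id": attr["id"],
            "name": attr["name"],
            "normalized": attr["value"],  # vendor-normalized value
            "thresh": attr["thresh"],
            "raw": attr["raw"]["value"],  # raw value, affected by -v ...,raw48:54
            # Conventional SMART check: failing when normalized <= threshold.
            "below_thresh": attr["thresh"] > 0 and attr["value"] <= attr["thresh"],
        })
    return rows


for row in summarize_attributes(json.loads(SAMPLE)):
    print(row)
```

For the attribute above, the raw value is 0 and the normalized value (82) is above the threshold (44), so neither field on its own indicates a failure by this rule.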

image

~# smartctl /dev/sda -v 1,raw48:54 -v 7,raw48:54 --xall --json -T permissive
{
  "json_format_version": [
    1,
    0
  ],
  "smartctl": {
    "version": [
      7,
      3
    ],
    "svn_revision": "5338",
    "platform_info": "x86_64-linux-5.17.0-2-amd64",
    "build_info": "(local build)",
    "argv": [
      "smartctl",
      "-v",
      "1,raw48:54",
      "-v",
      "7,raw48:54",
      "--xall",
      "/dev/sda",
      "--json",
      "-T",
      "permissive"
    ],
    "drive_database_version": {
      "string": "7.3/5319"
    },
    "exit_status": 0
  },
  "local_time": {
    "time_t": 1653787546,
    "asctime": "Sun May 29 13:25:46 2022 NZST"
  },
  "device": {
    "name": "/dev/sda",
    "info_name": "/dev/sda [SAT]",
    "type": "sat",
    "protocol": "ATA"
  },
  "model_family": "Seagate IronWolf",
  "model_name": "ST8000VN004-3CP101",
  "serial_number": "WP00C97D",
  "wwn": {
    "naa": 5,
    "oui": 3152,
    "id": 3773008640
  },
  "firmware_version": "SC60",
  "user_capacity": {
    "blocks": 15628053168,
    "bytes": 8001563222016
  },
  "logical_block_size": 512,
  "physical_block_size": 4096,
  "rotation_rate": 7200,
  "form_factor": {
    "ata_value": 2,
    "name": "3.5 inches"
  },
  "trim": {
    "supported": false
  },
  "in_smartctl_database": true,
  "ata_version": {
    "string": "ACS-4 (minor revision not indicated)",
    "major_value": 4064,
    "minor_value": 65535
  },
  "sata_version": {
    "string": "SATA 3.3",
    "value": 511
  },
  "interface_speed": {
    "max": {
      "sata_value": 14,
      "string": "6.0 Gb/s",
      "units_per_second": 60,
      "bits_per_unit": 100000000
    },
    "current": {
      "sata_value": 3,
      "string": "6.0 Gb/s",
      "units_per_second": 60,
      "bits_per_unit": 100000000
    }
  },
  "smart_support": {
    "available": true,
    "enabled": true
  },
  "read_lookahead": {
    "enabled": true
  },
  "write_cache": {
    "enabled": true
  },
  "ata_dsn": {
    "enabled": false
  },
  "ata_security": {
    "state": 41,
    "string": "Disabled, frozen [SEC2]",
    "enabled": false,
    "frozen": true
  },
  "smart_status": {
    "passed": true
  },
  "ata_smart_data": {
    "offline_data_collection": {
      "status": {
        "value": 130,
        "string": "was completed without error",
        "passed": true
      },
      "completion_seconds": 567
    },
    "self_test": {
      "status": {
        "value": 0,
        "string": "completed without error",
        "passed": true
      },
      "polling_minutes": {
        "short": 1,
        "extended": 728,
        "conveyance": 2
      }
    },
    "capabilities": {
      "values": [
        123,
        3
      ],
      "exec_offline_immediate_supported": true,
      "offline_is_aborted_upon_new_cmd": false,
      "offline_surface_scan_supported": true,
      "self_tests_supported": true,
      "conveyance_self_test_supported": true,
      "selective_self_test_supported": true,
      "attribute_autosave_enabled": true,
      "error_logging_supported": true,
      "gp_logging_supported": true
    }
  },
  "ata_sct_capabilities": {
    "value": 20669,
    "error_recovery_control_supported": true,
    "feature_control_supported": true,
    "data_table_supported": true
  },
  "ata_smart_attributes": {
    "revision": 10,
    "table": [
      {
        "id": 1,
        "name": "Raw_Read_Error_Rate",
        "value": 82,
        "worst": 64,
        "thresh": 44,
        "when_failed": "",
        "flags": {
          "value": 15,
          "string": "POSR-- ",
          "prefailure": true,
          "updated_online": true,
          "performance": true,
          "error_rate": true,
          "event_count": false,
          "auto_keep": false
        },
        "raw": {
          "value": 0,
          "string": "0"
        }
      },
      {
        "id": 3,
        "name": "Spin_Up_Time",
        "value": 99,
        "worst": 99,
        "thresh": 0,
        "when_failed": "",
        "flags": {
          "value": 3,
          "string": "PO---- ",
          "prefailure": true,
          "updated_online": true,
          "performance": false,
          "error_rate": false,
          "event_count": false,
          "auto_keep": false
        },
        "raw": {
          "value": 0,
          "string": "0"
        }
      },
      {
        "id": 4,
        "name": "Start_Stop_Count",
        "value": 100,
        "worst": 100,
        "thresh": 20,
        "when_failed": "",
        "flags": {
          "value": 50,
          "string": "-O--CK ",
          "prefailure": false,
          "updated_online": true,
          "performance": false,
          "error_rate": false,
          "event_count": true,
          "auto_keep": true
        },
        "raw": {
          "value": 1,
          "string": "1"
        }
      },
      {
        "id": 5,
        "name": "Reallocated_Sector_Ct",
        "value": 100,
        "worst": 100,
        "thresh": 10,
        "when_failed": "",
        "flags": {
          "value": 51,
          "string": "PO--CK ",
          "prefailure": true,
          "updated_online": true,
          "performance": false,
          "error_rate": false,
          "event_count": true,
          "auto_keep": true
        },
        "raw": {
          "value": 0,
          "string": "0"
        }
      },
      {
        "id": 7,
        "name": "Seek_Error_Rate",
        "value": 100,
        "worst": 253,
        "thresh": 45,
        "when_failed": "",
        "flags": {
          "value": 15,
          "string": "POSR-- ",
          "prefailure": true,
          "updated_online": true,
          "performance": true,
          "error_rate": true,
          "event_count": false,
          "auto_keep": false
        },
        "raw": {
          "value": 0,
          "string": "0"
        }
      },
      {
        "id": 9,
        "name": "Power_On_Hours",
        "value": 100,
        "worst": 100,
        "thresh": 0,
        "when_failed": "",
        "flags": {
          "value": 50,
          "string": "-O--CK ",
          "prefailure": false,
          "updated_online": true,
          "performance": false,
          "error_rate": false,
          "event_count": true,
          "auto_keep": true
        },
        "raw": {
          "value": 21,
          "string": "21"
        }
      },
      {
        "id": 10,
        "name": "Spin_Retry_Count",
        "value": 100,
        "worst": 100,
        "thresh": 97,
        "when_failed": "",
        "flags": {
          "value": 19,
          "string": "PO--C- ",
          "prefailure": true,
          "updated_online": true,
          "performance": false,
          "error_rate": false,
          "event_count": true,
          "auto_keep": false
        },
        "raw": {
          "value": 0,
          "string": "0"
        }
      },
      {
        "id": 12,
        "name": "Power_Cycle_Count",
        "value": 100,
        "worst": 100,
        "thresh": 20,
        "when_failed": "",
        "flags": {
          "value": 50,
          "string": "-O--CK ",
          "prefailure": false,
          "updated_online": true,
          "performance": false,
          "error_rate": false,
          "event_count": true,
          "auto_keep": true
        },
        "raw": {
          "value": 1,
          "string": "1"
        }
      },
      {
        "id": 18,
        "name": "Head_Health",
        "value": 100,
        "worst": 100,
        "thresh": 50,
        "when_failed": "",
        "flags": {
          "value": 11,
          "string": "PO-R-- ",
          "prefailure": true,
          "updated_online": true,
          "performance": false,
          "error_rate": true,
          "event_count": false,
          "auto_keep": false
        },
        "raw": {
          "value": 0,
          "string": "0"
        }
      },
      {
        "id": 187,
        "name": "Reported_Uncorrect",
        "value": 100,
        "worst": 100,
        "thresh": 0,
        "when_failed": "",
        "flags": {
          "value": 50,
          "string": "-O--CK ",
          "prefailure": false,
          "updated_online": true,
          "performance": false,
          "error_rate": false,
          "event_count": true,
          "auto_keep": true
        },
        "raw": {
          "value": 0,
          "string": "0"
        }
      },
      {
        "id": 188,
        "name": "Command_Timeout",
        "value": 100,
        "worst": 100,
        "thresh": 0,
        "when_failed": "",
        "flags": {
          "value": 50,
          "string": "-O--CK ",
          "prefailure": false,
          "updated_online": true,
          "performance": false,
          "error_rate": false,
          "event_count": true,
          "auto_keep": true
        },
        "raw": {
          "value": 0,
          "string": "0"
        }
      },
      {
        "id": 190,
        "name": "Airflow_Temperature_Cel",
        "value": 61,
        "worst": 51,
        "thresh": 0,
        "when_failed": "",
        "flags": {
          "value": 34,
          "string": "-O---K ",
          "prefailure": false,
          "updated_online": true,
          "performance": false,
          "error_rate": false,
          "event_count": false,
          "auto_keep": true
        },
        "raw": {
          "value": 689438759,
          "string": "39 (Min/Max 24/41)"
        }
      },
      {
        "id": 192,
        "name": "Power-Off_Retract_Count",
        "value": 100,
        "worst": 100,
        "thresh": 0,
        "when_failed": "",
        "flags": {
          "value": 50,
          "string": "-O--CK ",
          "prefailure": false,
          "updated_online": true,
          "performance": false,
          "error_rate": false,
          "event_count": true,
          "auto_keep": true
        },
        "raw": {
          "value": 1,
          "string": "1"
        }
      },
      {
        "id": 193,
        "name": "Load_Cycle_Count",
        "value": 100,
        "worst": 100,
        "thresh": 0,
        "when_failed": "",
        "flags": {
          "value": 50,
          "string": "-O--CK ",
          "prefailure": false,
          "updated_online": true,
          "performance": false,
          "error_rate": false,
          "event_count": true,
          "auto_keep": true
        },
        "raw": {
          "value": 2,
          "string": "2"
        }
      },
      {
        "id": 194,
        "name": "Temperature_Celsius",
        "value": 39,
        "worst": 41,
        "thresh": 0,
        "when_failed": "",
        "flags": {
          "value": 34,
          "string": "-O---K ",
          "prefailure": false,
          "updated_online": true,
          "performance": false,
          "error_rate": false,
          "event_count": false,
          "auto_keep": true
        },
        "raw": {
          "value": 103079215143,
          "string": "39 (0 24 0 0 0)"
        }
      },
      {
        "id": 197,
        "name": "Current_Pending_Sector",
        "value": 100,
        "worst": 100,
        "thresh": 0,
        "when_failed": "",
        "flags": {
          "value": 18,
          "string": "-O--C- ",
          "prefailure": false,
          "updated_online": true,
          "performance": false,
          "error_rate": false,
          "event_count": true,
          "auto_keep": false
        },
        "raw": {
          "value": 0,
          "string": "0"
        }
      },
      {
        "id": 198,
        "name": "Offline_Uncorrectable",
        "value": 100,
        "worst": 100,
        "thresh": 0,
        "when_failed": "",
        "flags": {
          "value": 16,
          "string": "----C- ",
          "prefailure": false,
          "updated_online": false,
          "performance": false,
          "error_rate": false,
          "event_count": true,
          "auto_keep": false
        },
        "raw": {
          "value": 0,
          "string": "0"
        }
      },
      {
        "id": 199,
        "name": "UDMA_CRC_Error_Count",
        "value": 200,
        "worst": 200,
        "thresh": 0,
        "when_failed": "",
        "flags": {
          "value": 62,
          "string": "-OSRCK ",
          "prefailure": false,
          "updated_online": true,
          "performance": true,
          "error_rate": true,
          "event_count": true,
          "auto_keep": true
        },
        "raw": {
          "value": 0,
          "string": "0"
        }
      },
      {
        "id": 240,
        "name": "Head_Flying_Hours",
        "value": 100,
        "worst": 253,
        "thresh": 0,
        "when_failed": "",
        "flags": {
          "value": 0,
          "string": "------ ",
          "prefailure": false,
          "updated_online": false,
          "performance": false,
          "error_rate": false,
          "event_count": false,
          "auto_keep": false
        },
        "raw": {
          "value": 10408848946888725,
          "string": "21h+40m+23.499s"
        }
      },
      {
        "id": 241,
        "name": "Total_LBAs_Written",
        "value": 100,
        "worst": 253,
        "thresh": 0,
        "when_failed": "",
        "flags": {
          "value": 0,
          "string": "------ ",
          "prefailure": false,
          "updated_online": false,
          "performance": false,
          "error_rate": false,
          "event_count": false,
          "auto_keep": false
        },
        "raw": {
          "value": 861495235,
          "string": "861495235"
        }
      },
      {
        "id": 242,
        "name": "Total_LBAs_Read",
        "value": 100,
        "worst": 253,
        "thresh": 0,
        "when_failed": "",
        "flags": {
          "value": 0,
          "string": "------ ",
          "prefailure": false,
          "updated_online": false,
          "performance": false,
          "error_rate": false,
          "event_count": false,
          "auto_keep": false
        },
        "raw": {
          "value": 14887785,
          "string": "14887785"
        }
      }
    ]
  },
  "power_on_time": {
    "hours": 21
  },
  "power_cycle_count": 1,
  "temperature": {
    "current": 39,
    "power_cycle_min": 24,
    "power_cycle_max": 41,
    "lifetime_min": 31,
    "lifetime_max": 41,
    "op_limit_min": 5,
    "op_limit_max": 70,
    "limit_min": 5,
    "limit_max": 70,
    "lifetime_over_limit_minutes": 0,
    "lifetime_under_limit_minutes": 0
  },
  "ata_log_directory": {
    "gp_dir_version": 1,
    "smart_dir_version": 1,
    "smart_dir_multi_sector": true,
    "table": [
      {
        "address": 0,
        "name": "Log Directory",
        "read": true,
        "write": false,
        "gp_sectors": 1,
        "smart_sectors": 1
      },
      {
        "address": 1,
        "name": "Summary SMART error log",
        "read": true,
        "write": false,
        "smart_sectors": 1
      },
      {
        "address": 2,
        "name": "Comprehensive SMART error log",
        "read": true,
        "write": false,
        "smart_sectors": 5
      },
      {
        "address": 3,
        "name": "Ext. Comprehensive SMART error log",
        "read": true,
        "write": false,
        "gp_sectors": 5
      },
      {
        "address": 4,
        "name": "Device Statistics log",
        "read": true,
        "write": false,
        "gp_sectors": 256,
        "smart_sectors": 8
      },
      {
        "address": 6,
        "name": "SMART self-test log",
        "read": true,
        "write": false,
        "smart_sectors": 1
      },
      {
        "address": 7,
        "name": "Extended self-test log",
        "read": true,
        "write": false,
        "gp_sectors": 1
      },
      {
        "address": 8,
        "name": "Power Conditions log",
        "read": true,
        "write": false,
        "gp_sectors": 2
      },
      {
        "address": 9,
        "name": "Selective self-test log",
        "read": true,
        "write": true,
        "smart_sectors": 1
      },
      {
        "address": 10,
        "name": "Device Statistics Notification",
        "read": true,
        "write": true,
        "gp_sectors": 8
      },
      {
        "address": 12,
        "name": "Pending Defects log",
        "read": true,
        "write": false,
        "gp_sectors": 2048
      },
      {
        "address": 16,
        "name": "NCQ Command Error log",
        "read": true,
        "write": false,
        "gp_sectors": 1
      },
      {
        "address": 17,
        "name": "SATA Phy Event Counters log",
        "read": true,
        "write": false,
        "gp_sectors": 1
      },
      {
        "address": 19,
        "name": "SATA NCQ Send and Receive log",
        "read": true,
        "write": false,
        "gp_sectors": 1
      },
      {
        "address": 33,
        "name": "Write stream error log",
        "read": true,
        "write": false,
        "gp_sectors": 1
      },
      {
        "address": 34,
        "name": "Read stream error log",
        "read": true,
        "write": false,
        "gp_sectors": 1
      },
      {
        "address": 36,
        "name": "Current Device Internal Status Data log",
        "read": true,
        "write": false,
        "gp_sectors": 768
      },
      {
        "address": 47,
        "name": "Set Sector Configuration",
        "gp_sectors": 1
      },
      {
        "address": 48,
        "name": "IDENTIFY DEVICE data log",
        "read": true,
        "write": false,
        "gp_sectors": 9,
        "smart_sectors": 9
      },
      {
        "address": 128,
        "name": "Host vendor specific log",
        "read": true,
        "write": true,
        "gp_sectors": 16,
        "smart_sectors": 16
      },
      {
        "address": 129,
        "name": "Host vendor specific log",
        "read": true,
        "write": true,
        "gp_sectors": 16,
        "smart_sectors": 16
      },
      {
        "address": 130,
        "name": "Host vendor specific log",
        "read": true,
        "write": true,
        "gp_sectors": 16,
        "smart_sectors": 16
      },
      {
        "address": 131,
        "name": "Host vendor specific log",
        "read": true,
        "write": true,
        "gp_sectors": 16,
        "smart_sectors": 16
      },
      {
        "address": 132,
        "name": "Host vendor specific log",
        "read": true,
        "write": true,
        "gp_sectors": 16,
        "smart_sectors": 16
      },
      {
        "address": 133,
        "name": "Host vendor specific log",
        "read": true,
        "write": true,
        "gp_sectors": 16,
        "smart_sectors": 16
      },
      {
        "address": 134,
        "name": "Host vendor specific log",
        "read": true,
        "write": true,
        "gp_sectors": 16,
        "smart_sectors": 16
      },
      {
        "address": 135,
        "name": "Host vendor specific log",
        "read": true,
        "write": true,
        "gp_sectors": 16,
        "smart_sectors": 16
      },
      {
        "address": 136,
        "name": "Host vendor specific log",
        "read": true,
        "write": true,
        "gp_sectors": 16,
        "smart_sectors": 16
      },
      {
        "address": 137,
        "name": "Host vendor specific log",
        "read": true,
        "write": true,
        "gp_sectors": 16,
        "smart_sectors": 16
      },
      {
        "address": 138,
        "name": "Host vendor specific log",
        "read": true,
        "write": true,
        "gp_sectors": 16,
        "smart_sectors": 16
      },
      {
        "address": 139,
        "name": "Host vendor specific log",
        "read": true,
        "write": true,
        "gp_sectors": 16,
        "smart_sectors": 16
      },
      {
        "address": 140,
        "name": "Host vendor specific log",
        "read": true,
        "write": true,
        "gp_sectors": 16,
        "smart_sectors": 16
      },
      {
        "address": 141,
        "name": "Host vendor specific log",
        "read": true,
        "write": true,
        "gp_sectors": 16,
        "smart_sectors": 16
      },
      {
        "address": 142,
        "name": "Host vendor specific log",
        "read": true,
        "write": true,
        "gp_sectors": 16,
        "smart_sectors": 16
      },
      {
        "address": 143,
        "name": "Host vendor specific log",
        "read": true,
        "write": true,
        "gp_sectors": 16,
        "smart_sectors": 16
      },
      {
        "address": 144,
        "name": "Host vendor specific log",
        "read": true,
        "write": true,
        "gp_sectors": 16,
        "smart_sectors": 16
      },
      {
        "address": 145,
        "name": "Host vendor specific log",
        "read": true,
        "write": true,
        "gp_sectors": 16,
        "smart_sectors": 16
      },
      {
        "address": 146,
        "name": "Host vendor specific log",
        "read": true,
        "write": true,
        "gp_sectors": 16,
        "smart_sectors": 16
      },
      {
        "address": 147,
        "name": "Host vendor specific log",
        "read": true,
        "write": true,
        "gp_sectors": 16,
        "smart_sectors": 16
      },
      {
        "address": 148,
        "name": "Host vendor specific log",
        "read": true,
        "write": true,
        "gp_sectors": 16,
        "smart_sectors": 16
      },
      {
        "address": 149,
        "name": "Host vendor specific log",
        "read": true,
        "write": true,
        "gp_sectors": 16,
        "smart_sectors": 16
      },
      {
        "address": 150,
        "name": "Host vendor specific log",
        "read": true,
        "write": true,
        "gp_sectors": 16,
        "smart_sectors": 16
      },
      {
        "address": 151,
        "name": "Host vendor specific log",
        "read": true,
        "write": true,
        "gp_sectors": 16,
        "smart_sectors": 16
      },
      {
        "address": 152,
        "name": "Host vendor specific log",
        "read": true,
        "write": true,
        "gp_sectors": 16,
        "smart_sectors": 16
      },
      {
        "address": 153,
        "name": "Host vendor specific log",
        "read": true,
        "write": true,
        "gp_sectors": 16,
        "smart_sectors": 16
      },
      {
        "address": 154,
        "name": "Host vendor specific log",
        "read": true,
        "write": true,
        "gp_sectors": 16,
        "smart_sectors": 16
      },
      {
        "address": 155,
        "name": "Host vendor specific log",
        "read": true,
        "write": true,
        "gp_sectors": 16,
        "smart_sectors": 16
      },
      {
        "address": 156,
        "name": "Host vendor specific log",
        "read": true,
        "write": true,
        "gp_sectors": 16,
        "smart_sectors": 16
      },
      {
        "address": 157,
        "name": "Host vendor specific log",
        "read": true,
        "write": true,
        "gp_sectors": 16,
        "smart_sectors": 16
      },
      {
        "address": 158,
        "name": "Host vendor specific log",
        "read": true,
        "write": true,
        "gp_sectors": 16,
        "smart_sectors": 16
      },
      {
        "address": 159,
        "name": "Host vendor specific log",
        "read": true,
        "write": true,
        "gp_sectors": 16,
        "smart_sectors": 16
      },
      {
        "address": 161,
        "name": "Device vendor specific log",
        "gp_sectors": 160,
        "smart_sectors": 160
      },
      {
        "address": 162,
        "name": "Device vendor specific log",
        "gp_sectors": 16320
      },
      {
        "address": 164,
        "name": "Device vendor specific log",
        "gp_sectors": 160,
        "smart_sectors": 160
      },
      {
        "address": 166,
        "name": "Device vendor specific log",
        "gp_sectors": 192
      },
      {
        "address": 168,
        "name": "Device vendor specific log",
        "gp_sectors": 136,
        "smart_sectors": 136
      },
      {
        "address": 169,
        "name": "Device vendor specific log",
        "gp_sectors": 136,
        "smart_sectors": 136
      },
      {
        "address": 171,
        "name": "Device vendor specific log",
        "gp_sectors": 1
      },
      {
        "address": 173,
        "name": "Device vendor specific log",
        "gp_sectors": 16
      },
      {
        "address": 177,
        "name": "Device vendor specific log",
        "gp_sectors": 160,
        "smart_sectors": 160
      },
      {
        "address": 182,
        "name": "Device vendor specific log",
        "gp_sectors": 1920
      },
      {
        "address": 190,
        "name": "Device vendor specific log",
        "gp_sectors": 65535
      },
      {
        "address": 191,
        "name": "Device vendor specific log",
        "gp_sectors": 65535
      },
      {
        "address": 193,
        "name": "Device vendor specific log",
        "gp_sectors": 8,
        "smart_sectors": 8
      },
      {
        "address": 195,
        "name": "Device vendor specific log",
        "gp_sectors": 24,
        "smart_sectors": 24
      },
      {
        "address": 198,
        "name": "Device vendor specific log",
        "gp_sectors": 5184
      },
      {
        "address": 199,
        "name": "Device vendor specific log",
        "gp_sectors": 8,
        "smart_sectors": 8
      },
      {
        "address": 201,
        "name": "Device vendor specific log",
        "gp_sectors": 8,
        "smart_sectors": 8
      },
      {
        "address": 202,
        "name": "Device vendor specific log",
        "gp_sectors": 16,
        "smart_sectors": 16
      },
      {
        "address": 205,
        "name": "Device vendor specific log",
        "gp_sectors": 1,
        "smart_sectors": 1
      },
      {
        "address": 206,
        "name": "Device vendor specific log",
        "gp_sectors": 1
      },
      {
        "address": 207,
        "name": "Device vendor specific log",
        "gp_sectors": 512
      },
      {
        "address": 209,
        "name": "Device vendor specific log",
        "gp_sectors": 656
      },
      {
        "address": 210,
        "name": "Device vendor specific log",
        "gp_sectors": 10256
      },
      {
        "address": 212,
        "name": "Device vendor specific log",
        "gp_sectors": 2048
      },
      {
        "address": 218,
        "name": "Device vendor specific log",
        "gp_sectors": 1,
        "smart_sectors": 1
      },
      {
        "address": 224,
        "name": "SCT Command/Status",
        "read": true,
        "write": true,
        "gp_sectors": 1,
        "smart_sectors": 1
      },
      {
        "address": 225,
        "name": "SCT Data Transfer",
        "read": true,
        "write": true,
        "gp_sectors": 1,
        "smart_sectors": 1
      }
    ]
  },
  "ata_smart_error_log": {
    "extended": {
      "revision": 1,
      "sectors": 5,
      "count": 0
    }
  },
  "ata_smart_self_test_log": {
    "extended": {
      "revision": 1,
      "sectors": 1,
      "count": 0
    }
  },
  "ata_smart_selective_self_test_log": {
    "revision": 1,
    "table": [
      {
        "lba_min": 0,
        "lba_max": 0,
        "status": {
          "value": 0,
          "string": "Not_testing"
        }
      },
      {
        "lba_min": 0,
        "lba_max": 0,
        "status": {
          "value": 0,
          "string": "Not_testing"
        }
      },
      {
        "lba_min": 0,
        "lba_max": 0,
        "status": {
          "value": 0,
          "string": "Not_testing"
        }
      },
      {
        "lba_min": 0,
        "lba_max": 0,
        "status": {
          "value": 0,
          "string": "Not_testing"
        }
      },
      {
        "lba_min": 0,
        "lba_max": 0,
        "status": {
          "value": 0,
          "string": "Not_testing"
        }
      }
    ],
    "flags": {
      "value": 0,
      "remainder_scan_enabled": false
    },
    "power_up_scan_resume_minutes": 0
  },
  "ata_sct_status": {
    "format_version": 3,
    "sct_version": 522,
    "device_state": {
      "value": 0,
      "string": "Active"
    },
    "temperature": {
      "current": 38,
      "power_cycle_min": 24,
      "power_cycle_max": 41,
      "lifetime_min": 24,
      "lifetime_max": 49,
      "under_limit_count": 0,
      "over_limit_count": 22
    },
    "smart_status": {
      "passed": true
    },
    "vendor_specific": [
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      3,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0,
      0
    ]
  },
  "ata_sct_temperature_history": {
    "version": 2,
    "sampling_period_minutes": 4,
    "logging_interval_minutes": 59,
    "temperature": {
      "op_limit_min": 10,
      "op_limit_max": 25,
      "limit_min": 5,
      "limit_max": 70
    },
    "size": 128,
    "index": 24,
    "table": [
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      null,
      49,
      null,
      24,
      39,
      38,
      38,
      38,
      38,
      38,
      38,
      37,
      37,
      36,
      36,
      36,
      36,
      35,
      35,
      35,
      35,
      35,
      35,
      37,
      38,
      38
    ]
  },
  "ata_sct_erc": {
    "read": {
      "enabled": true,
      "deciseconds": 70
    },
    "write": {
      "enabled": true,
      "deciseconds": 70
    }
  },
  "ata_device_statistics": {
    "pages": [
      {
        "number": 1,
        "name": "General Statistics",
        "revision": 1,
        "table": [
          {
            "offset": 8,
            "name": "Lifetime Power-On Resets",
            "size": 4,
            "value": 1,
            "flags": {
              "value": 192,
              "string": "V--- ",
              "valid": true,
              "normalized": false,
              "supports_dsn": false,
              "monitored_condition_met": false
            }
          },
          {
            "offset": 16,
            "name": "Power-on Hours",
            "size": 4,
            "value": 21,
            "flags": {
              "value": 192,
              "string": "V--- ",
              "valid": true,
              "normalized": false,
              "supports_dsn": false,
              "monitored_condition_met": false
            }
          },
          {
            "offset": 24,
            "name": "Logical Sectors Written",
            "size": 6,
            "value": 861495235,
            "flags": {
              "value": 192,
              "string": "V--- ",
              "valid": true,
              "normalized": false,
              "supports_dsn": false,
              "monitored_condition_met": false
            }
          },
          {
            "offset": 32,
            "name": "Number of Write Commands",
            "size": 6,
            "value": 3228990,
            "flags": {
              "value": 192,
              "string": "V--- ",
              "valid": true,
              "normalized": false,
              "supports_dsn": false,
              "monitored_condition_met": false
            }
          },
          {
            "offset": 40,
            "name": "Logical Sectors Read",
            "size": 6,
            "value": 14887785,
            "flags": {
              "value": 192,
              "string": "V--- ",
              "valid": true,
              "normalized": false,
              "supports_dsn": false,
              "monitored_condition_met": false
            }
          },
          {
            "offset": 48,
            "name": "Number of Read Commands",
            "size": 6,
            "value": 159012,
            "flags": {
              "value": 192,
              "string": "V--- ",
              "valid": true,
              "normalized": false,
              "supports_dsn": false,
              "monitored_condition_met": false
            }
          },
          {
            "offset": 56,
            "name": "Date and Time TimeStamp",
            "size": 6,
            "flags": {
              "value": 128,
              "string": "---- ",
              "valid": false,
              "normalized": false,
              "supports_dsn": false,
              "monitored_condition_met": false
            }
          }
        ]
      },
      {
        "number": 3,
        "name": "Rotating Media Statistics",
        "revision": 1,
        "table": [
          {
            "offset": 8,
            "name": "Spindle Motor Power-on Hours",
            "size": 4,
            "value": 21,
            "flags": {
              "value": 192,
              "string": "V--- ",
              "valid": true,
              "normalized": false,
              "supports_dsn": false,
              "monitored_condition_met": false
            }
          },
          {
            "offset": 16,
            "name": "Head Flying Hours",
            "size": 4,
            "value": 21,
            "flags": {
              "value": 192,
              "string": "V--- ",
              "valid": true,
              "normalized": false,
              "supports_dsn": false,
              "monitored_condition_met": false
            }
          },
          {
            "offset": 24,
            "name": "Head Load Events",
            "size": 4,
            "value": 2,
            "flags": {
              "value": 192,
              "string": "V--- ",
              "valid": true,
              "normalized": false,
              "supports_dsn": false,
              "monitored_condition_met": false
            }
          },
          {
            "offset": 32,
            "name": "Number of Reallocated Logical Sectors",
            "size": 4,
            "value": 0,
            "flags": {
              "value": 192,
              "string": "V--- ",
              "valid": true,
              "normalized": false,
              "supports_dsn": false,
              "monitored_condition_met": false
            }
          },
          {
            "offset": 40,
            "name": "Read Recovery Attempts",
            "size": 4,
            "value": 0,
            "flags": {
              "value": 192,
              "string": "V--- ",
              "valid": true,
              "normalized": false,
              "supports_dsn": false,
              "monitored_condition_met": false
            }
          },
          {
            "offset": 48,
            "name": "Number of Mechanical Start Failures",
            "size": 4,
            "value": 0,
            "flags": {
              "value": 192,
              "string": "V--- ",
              "valid": true,
              "normalized": false,
              "supports_dsn": false,
              "monitored_condition_met": false
            }
          },
          {
            "offset": 56,
            "name": "Number of Realloc. Candidate Logical Sectors",
            "size": 4,
            "value": 0,
            "flags": {
              "value": 192,
              "string": "V--- ",
              "valid": true,
              "normalized": false,
              "supports_dsn": false,
              "monitored_condition_met": false
            }
          },
          {
            "offset": 64,
            "name": "Number of High Priority Unload Events",
            "size": 4,
            "value": 1,
            "flags": {
              "value": 192,
              "string": "V--- ",
              "valid": true,
              "normalized": false,
              "supports_dsn": false,
              "monitored_condition_met": false
            }
          }
        ]
      },
      {
        "number": 4,
        "name": "General Errors Statistics",
        "revision": 1,
        "table": [
          {
            "offset": 8,
            "name": "Number of Reported Uncorrectable Errors",
            "size": 4,
            "value": 0,
            "flags": {
              "value": 192,
              "string": "V--- ",
              "valid": true,
              "normalized": false,
              "supports_dsn": false,
              "monitored_condition_met": false
            }
          },
          {
            "offset": 16,
            "name": "Resets Between Cmd Acceptance and Completion",
            "size": 4,
            "value": 0,
            "flags": {
              "value": 192,
              "string": "V--- ",
              "valid": true,
              "normalized": false,
              "supports_dsn": false,
              "monitored_condition_met": false
            }
          },
          {
            "offset": 24,
            "name": "Physical Element Status Changed",
            "size": 4,
            "value": 0,
            "flags": {
              "value": 208,
              "string": "V-D- ",
              "valid": true,
              "normalized": false,
              "supports_dsn": true,
              "monitored_condition_met": false
            }
          }
        ]
      },
      {
        "number": 5,
        "name": "Temperature Statistics",
        "revision": 1,
        "table": [
          {
            "offset": 8,
            "name": "Current Temperature",
            "size": 1,
            "value": 39,
            "flags": {
              "value": 192,
              "string": "V--- ",
              "valid": true,
              "normalized": false,
              "supports_dsn": false,
              "monitored_condition_met": false
            }
          },
          {
            "offset": 16,
            "name": "Average Short Term Temperature",
            "size": 1,
            "flags": {
              "value": 128,
              "string": "---- ",
              "valid": false,
              "normalized": false,
              "supports_dsn": false,
              "monitored_condition_met": false
            }
          },
          {
            "offset": 24,
            "name": "Average Long Term Temperature",
            "size": 1,
            "flags": {
              "value": 128,
              "string": "---- ",
              "valid": false,
              "normalized": false,
              "supports_dsn": false,
              "monitored_condition_met": false
            }
          },
          {
            "offset": 32,
            "name": "Highest Temperature",
            "size": 1,
            "value": 41,
            "flags": {
              "value": 192,
              "string": "V--- ",
              "valid": true,
              "normalized": false,
              "supports_dsn": false,
              "monitored_condition_met": false
            }
          },
          {
            "offset": 40,
            "name": "Lowest Temperature",
            "size": 1,
            "value": 31,
            "flags": {
              "value": 192,
              "string": "V--- ",
              "valid": true,
              "normalized": false,
              "supports_dsn": false,
              "monitored_condition_met": false
            }
          },
          {
            "offset": 48,
            "name": "Highest Average Short Term Temperature",
            "size": 1,
            "flags": {
              "value": 128,
              "string": "---- ",
              "valid": false,
              "normalized": false,
              "supports_dsn": false,
              "monitored_condition_met": false
            }
          },
          {
            "offset": 56,
            "name": "Lowest Average Short Term Temperature",
            "size": 1,
            "flags": {
              "value": 128,
              "string": "---- ",
              "valid": false,
              "normalized": false,
              "supports_dsn": false,
              "monitored_condition_met": false
            }
          },
          {
            "offset": 64,
            "name": "Highest Average Long Term Temperature",
            "size": 1,
            "flags": {
              "value": 128,
              "string": "---- ",
              "valid": false,
              "normalized": false,
              "supports_dsn": false,
              "monitored_condition_met": false
            }
          },
          {
            "offset": 72,
            "name": "Lowest Average Long Term Temperature",
            "size": 1,
            "flags": {
              "value": 128,
              "string": "---- ",
              "valid": false,
              "normalized": false,
              "supports_dsn": false,
              "monitored_condition_met": false
            }
          },
          {
            "offset": 80,
            "name": "Time in Over-Temperature",
            "size": 4,
            "value": 0,
            "flags": {
              "value": 192,
              "string": "V--- ",
              "valid": true,
              "normalized": false,
              "supports_dsn": false,
              "monitored_condition_met": false
            }
          },
          {
            "offset": 88,
            "name": "Specified Maximum Operating Temperature",
            "size": 1,
            "value": 70,
            "flags": {
              "value": 192,
              "string": "V--- ",
              "valid": true,
              "normalized": false,
              "supports_dsn": false,
              "monitored_condition_met": false
            }
          },
          {
            "offset": 96,
            "name": "Time in Under-Temperature",
            "size": 4,
            "value": 0,
            "flags": {
              "value": 192,
              "string": "V--- ",
              "valid": true,
              "normalized": false,
              "supports_dsn": false,
              "monitored_condition_met": false
            }
          },
          {
            "offset": 104,
            "name": "Specified Minimum Operating Temperature",
            "size": 1,
            "value": 5,
            "flags": {
              "value": 192,
              "string": "V--- ",
              "valid": true,
              "normalized": false,
              "supports_dsn": false,
              "monitored_condition_met": false
            }
          }
        ]
      },
      {
        "number": 6,
        "name": "Transport Statistics",
        "revision": 1,
        "table": [
          {
            "offset": 8,
            "name": "Number of Hardware Resets",
            "size": 4,
            "value": 2,
            "flags": {
              "value": 192,
              "string": "V--- ",
              "valid": true,
              "normalized": false,
              "supports_dsn": false,
              "monitored_condition_met": false
            }
          },
          {
            "offset": 16,
            "name": "Number of ASR Events",
            "size": 4,
            "value": 0,
            "flags": {
              "value": 192,
              "string": "V--- ",
              "valid": true,
              "normalized": false,
              "supports_dsn": false,
              "monitored_condition_met": false
            }
          },
          {
            "offset": 24,
            "name": "Number of Interface CRC Errors",
            "size": 4,
            "value": 0,
            "flags": {
              "value": 192,
              "string": "V--- ",
              "valid": true,
              "normalized": false,
              "supports_dsn": false,
              "monitored_condition_met": false
            }
          }
        ]
      },
      {
        "number": 255,
        "name": "Vendor Specific Statistics",
        "revision": 1,
        "table": [
          {
            "offset": 16,
            "name": "Vendor Specific",
            "size": 7,
            "value": 0,
            "flags": {
              "value": 192,
              "string": "V--- ",
              "valid": true,
              "normalized": false,
              "supports_dsn": false,
              "monitored_condition_met": false
            }
          },
          {
            "offset": 24,
            "name": "Vendor Specific",
            "size": 7,
            "value": 0,
            "flags": {
              "value": 192,
              "string": "V--- ",
              "valid": true,
              "normalized": false,
              "supports_dsn": false,
              "monitored_condition_met": false
            }
          }
        ]
      }
    ]
  },
  "ata_pending_defects_log": {
    "size": 65535,
    "count": 0
  },
  "sata_phy_event_counters": {
    "table": [
      {
        "id": 10,
        "name": "Device-to-host register FISes sent due to a COMRESET",
        "size": 2,
        "value": 3,
        "overflow": false
      },
      {
        "id": 1,
        "name": "Command failed due to ICRC error",
        "size": 2,
        "value": 0,
        "overflow": false
      },
      {
        "id": 3,
        "name": "R_ERR response for device-to-host data FIS",
        "size": 2,
        "value": 0,
        "overflow": false
      },
      {
        "id": 4,
        "name": "R_ERR response for host-to-device data FIS",
        "size": 2,
        "value": 0,
        "overflow": false
      },
      {
        "id": 6,
        "name": "R_ERR response for device-to-host non-data FIS",
        "size": 2,
        "value": 0,
        "overflow": false
      },
      {
        "id": 7,
        "name": "R_ERR response for host-to-device non-data FIS",
        "size": 2,
        "value": 0,
        "overflow": false
      }
    ],
    "reset": false
  }
}

AnalogJ commented 2 years ago

Yeah, in some cases the normalized data is what is used for the Backblaze reliability comparison:

https://www.backblaze.com/blog-smart-stats-2014-8.html#S1R

Though in this case, while the raw graph has "nicer" failure thresholds, the normalized data was chosen:

(Vendor specific raw value.) Stores data related to the rate of hardware read errors that occurred when reading data from a disk surface. The raw value has different structure for different vendors and is often not meaningful as a decimal number.

https://github.com/AnalogJ/scrutiny/blob/master/webapp/backend/pkg/thresholds/ata_attribute_metadata.go#L38

somebody-somewhere-over-the-rainbow commented 2 years ago

So, how do I make Scrutiny use the raw value so that the drive is shown as healthy? Or is it simply not possible at this time?

ChromoX commented 2 years ago

@Parlane I followed your example and that fixed most of the fields except for the 0x07/Seek Error Rate field. Therefore Scrutiny still reports the drive as "Failed".

It seems there might be a bug where it's using the VALUE column returned by smartctl instead of the RAW_VALUE column?

Parlane commented 2 years ago

Yes, sorry, mine also still shows as failed. I think the real problem might now be in how smartctl calculates VALUE. Alternatively, we could ask @AnalogJ to allow a config choice to use the raw value; for the raw read or seek errors, any value above 0 is bad IMO. I would be happy to mark a drive as bad simply because the raw value was not 0 in the case of my Seagate IronWolfs.

NemesisRE commented 2 years ago

That is not completely true. Values above 0 can be bad but don't have to be. There is also a temporal component: if the values are not too high and are stable (they do not increase), there is no need to replace the drive, because it has not failed. It is fully operational, which would mean it is only in a warning or error state.

I am speaking from experience with ~9000 drives.
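The "temporal component" can be sketched in a few lines of Python. This is only an illustration of the idea that a non-zero but stable raw value is different from a growing one; the function name, the labels, and the `max_acceptable` ceiling are all hypothetical and are not taken from Scrutiny, smartctl, or Backblaze.

```python
from typing import Sequence


def raw_value_trend(samples: Sequence[int], max_acceptable: int) -> str:
    """Classify a series of raw readings for one attribute, oldest first.

    max_acceptable is a hypothetical, operator-chosen ceiling.
    """
    if not samples:
        raise ValueError("need at least one sample")
    latest = samples[-1]
    if latest == 0:
        return "ok"
    # True if any reading is higher than the one before it.
    increased = any(b > a for a, b in zip(samples, samples[1:]))
    if latest > max_acceptable or increased:
        return "investigate"
    # Non-zero, but below the ceiling and not growing.
    return "stable"


print(raw_value_trend([8, 8, 8], max_acceptable=50))
print(raw_value_trend([2, 5, 9], max_acceptable=50))
```

A drive whose counter sits at 8 for weeks lands in "stable", while one that climbs from 2 to 9 is flagged for a closer look.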

AnalogJ commented 2 years ago

I think this is working correctly now, but it requires some steps & an explanation (and possibly a Seagate specific troubleshooting guide).

I don't have time to write that up right now, but I'll re-open this issue so I don't forget.

adhawkins commented 2 years ago

I have just installed scrutiny, and am seeing the same issue with similar drives. Will follow this issue in the hope the documentation arrives.

AnalogJ commented 2 years ago

TL;DR;

  1. Upgrade to v0.4.13+
  2. Reset your drive status using the SQLite script in #device-failed-but-smart--scrutiny-passed
  3. Wait for (or manually start) the collector.

Please try these steps and comment below if they work for you. Thanks! 🙏


The following explanation is documented here

As thoroughly discussed in #255, Seagate (Ironwolf & others) drives are almost always marked as failed by Scrutiny.

The Seek Error Rate & Read Error Rate attribute raw values are typically very high, and the normalised values (Current / Worst / Threshold) are usually quite low. Despite this, the numbers in most cases are perfectly OK.

The anxiety arises because we intuitively expect that the normalised values should reflect a "health" score, with 100 being the ideal value. Similarly, we would expect that the raw values should reflect an error count, in which case a value of 0 would be most desirable. However, Seagate calculates and applies these attribute values in a counterintuitive way.

http://www.users.on.net/~fzabkar/HDD/Seagate_SER_RRER_HEC.html

Some analysis has been done which shows that Seagate drives break the common SMART conventions, which also causes Scrutiny's comparison against BackBlaze data to detect these drives as failed.
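To make the encoding concrete, here is a minimal Python sketch. It assumes the layout described in the original report at the top of this issue: a 48-bit raw value whose upper 16 bits are the error count and whose lower 32 bits are the total number of read or seek operations. The class and function names are made up for illustration.

```python
from typing import NamedTuple


class SeagateRaw(NamedTuple):
    errors: int      # upper 16 bits of the 48-bit raw value
    operations: int  # lower 32 bits of the 48-bit raw value


def split_seagate_raw(raw: int) -> SeagateRaw:
    """Split a 48-bit raw value under the assumed 16/32-bit layout."""
    if not 0 <= raw < (1 << 48):
        raise ValueError("raw value does not fit in 48 bits")
    return SeagateRaw(errors=raw >> 32, operations=raw & 0xFFFFFFFF)


# Raw_Read_Error_Rate as reported without any -v option in the original post.
print(split_seagate_raw(200450784))
# prints: SeagateRaw(errors=0, operations=200450784)
```

Under that assumption, the large decimal number is just the operation counter, and the error count is 0, which matches what smartctl printed in the original post once `-v 1,raw48:54` was supplied.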

So what's the Solution?

After taking a look at the BackBlaze data for the relevant Attributes (Seek Error Rate & Read Error Rate), I've decided to disable Scrutiny analysis for them. Both are non-critical, and have low correlation with failure.

Please note: SMART failures for these attributes will still cause the drive to be marked as failed. Only BackBlaze analysis has been disabled.

If this is affecting your drives, you'll need to do the following:

  1. Upgrade to v0.4.13+
  2. Reset your drive status using the SQLite script in #device-failed-but-smart--scrutiny-passed
  3. Wait for (or manually start) the collector.
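For readers unsure what a "SMART failure" means here as opposed to the BackBlaze analysis, the conventional SMART rule can be sketched as follows. This is an assumption about the usual convention (the normalized value dropping to or below a non-zero threshold), not a copy of Scrutiny's code, and the function name is hypothetical.

```python
def smart_attribute_failed(value: int, thresh: int) -> bool:
    """Conventional SMART check on the normalized columns.

    A threshold of 0 is treated as "never fails".
    """
    return thresh > 0 and value <= thresh


# VALUE / THRESH pairs from the smartctl table in the original post.
print(smart_attribute_failed(83, 44))  # Raw_Read_Error_Rate -> False
print(smart_attribute_failed(71, 45))  # Seek_Error_Rate -> False
```

Both attributes on the reporter's drive pass this check, so with the BackBlaze comparison disabled for them there is nothing left to flag the drive.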

If you'd like to learn more about how the Seagate Ironwolf SMART attributes work under the hood, and how they differ from other drives, please read the following:

fightforlife commented 2 years ago

Is there maybe a very similar issue with the "Command timeout" attribute on Seagate drives? Both my Seagate drives report unrealistic values, while the other drives report normal values.

It seems the solution is reported here: https://forums.tomshardware.com/threads/very-high-command-timeout.648978/

In fact the actual value is 4, not 262148.
The 48-bit raw value is often composed of three 16-bit components, ie ...
0x000000040004 = 0x0000 0x0004 0x0004
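The decomposition quoted above can be reproduced with a short Python sketch (the function name is hypothetical):

```python
from typing import Tuple


def split_words16(raw: int) -> Tuple[int, int, int]:
    """Return the (high, middle, low) 16-bit words of a 48-bit raw value."""
    if not 0 <= raw < (1 << 48):
        raise ValueError("raw value does not fit in 48 bits")
    return (raw >> 32) & 0xFFFF, (raw >> 16) & 0xFFFF, raw & 0xFFFF


# 262148 decimal is 0x000000040004.
print(split_words16(262148))
# prints: (0, 4, 4)
```

Going by the smartctl man page, the `:54` byte order in `raw48:54` selects bytes 5 and 4, which is the high word in this sketch, so it is worth checking against a drive's own output which word carries the count of interest.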
Parlane commented 2 years ago

I am running: beta#7a6c94a (docker ghcr.io/analogj/scrutiny:beta-omnibus)

I used sqlite to update the status to not failed, then reran "scrutiny-collector-metrics run" manually. All Seagate drives are still marked as failed.

Parlane commented 2 years ago

Oops I see beta is actually behind master now... I will try master instead.

Parlane commented 2 years ago

Yay it works :) With master#145c819

Thank you @AnalogJ

adhawkins commented 2 years ago

Mine appears to work correctly too, thanks.

AnalogJ commented 2 years ago

Fantastic, closing this out as fixed in v0.4.13 (#301).

Thanks for all your help, everyone!! 🥳


@fightforlife similar issue, but not quite as dire. For the command timeout attribute, Scrutiny is checking the RAW value, and since the number is so absurdly high, it doesn't fit into any of the buckets that we're looking for, so Scrutiny just marks it as warn. If you want to fix that attribute, I'm guessing you can add the following line to the collector config file for your Seagate drive.

      metrics_smart_args: '-v 188,raw48:54 --xall --json -T permissive' 

@Parlane yeah I developed this change on a different branch, glad you figured it out!

tadly commented 2 years ago

one last question @AnalogJ

Is the Backblaze comparison for Seagate coming back at all, or will this stay disabled now? My understanding is that the raw value could be used to compare against Backblaze, or am I misunderstanding this? When I say raw, I mean as in the OP: smartctl /dev/sdb -a -v 1,raw48:54 -v 7,raw48:54

Thanks a lot for the quick fix though. Really appreciate how much time and effort you put into this project.

AnalogJ commented 2 years ago

@tadly Backblaze comparison is still enabled for Seagate; it's just disabled for these 2 attributes. I'm working on a larger project to allow users to customize how Scrutiny (not SMART) analysis is done on a drive-by-drive basis. Unfortunately it's going to take a bit of time to roll out.

Regarding the RAW attribute values, the issue is that the relevant attributes are vendor-specific, so I decided to use the normalized value in the hope that I wouldn't need to worry about how the vendor decided to encode the data. Unfortunately Seagate decided to muck around with the normalized data as well (100 & 60 are both healthy values). Using the RAW value for those attributes would require a lot more data analysis, and I'd probably need to complete #10 or something similar first.

Glad everything is working for you.

MattKobayashi commented 2 years ago

Hi @AnalogJ, sorry to bring this one back up, but I seem to be having issues passing the command timeout value override as a smartctl argument in collector.yaml. I add the following to collector.yaml below devices:

  - device: /dev/sdb
    commands:
      metrics_smart_args: "--vendorattribute=188,raw48:54 --xall --json -T permissive"
  - device: /dev/sdc
    commands:
      metrics_smart_args: "--vendorattribute=188,raw48:54 --xall --json -T permissive"
  - device: /dev/sdd
    commands:
      metrics_smart_args: "--vendorattribute=188,raw48:54 --xall --json -T permissive"
  - device: /dev/sde
    commands:
      metrics_smart_args: "--vendorattribute=188,raw48:54 --xall --json -T permissive"
  - device: /dev/sdf
    commands:
      metrics_smart_args: "--vendorattribute=188,raw48:54 --xall --json -T permissive"
  - device: /dev/sdg
    commands:
      metrics_smart_args: "--vendorattribute=188,raw48:54 --xall --json -T permissive"

All that seems to happen when I do is Scrutiny skips those drives entirely during a scan run. Do you have any ideas on what might be going wrong here?

Parlane commented 2 years ago

Specify the device type like this; you may also need to specify both the smart args and the info args:

  - device: /dev/sdb
    type: 'ata'
    commands:
      metrics_info_args: '--info --json -T permissive' # used to determine device unique ID & register device with Scrutiny
      metrics_smart_args: '--vendorattribute=188,raw48:54 --xall --json -T permissive' # used to retrieve smart data for each device.

MattKobayashi commented 2 years ago

That fixed it, thank you @Parlane!
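Since the same block has to be repeated for every drive, the entries could be generated rather than typed by hand. The Python sketch below is only a convenience under stated assumptions: it emits the working form from this thread (with `type: 'ata'` and both args lines) for /dev/sdb through /dev/sdg, the helper name is hypothetical, and the output is meant to be pasted into collector.yaml, replacing any existing `devices:` section.

```python
def collector_device_entry(device: str, attribute_id: int = 188) -> str:
    """Render one collector.yaml device entry as YAML text."""
    smart_args = (
        f"--vendorattribute={attribute_id},raw48:54 --xall --json -T permissive"
    )
    return "\n".join([
        f"  - device: {device}",
        "    type: 'ata'",
        "    commands:",
        "      metrics_info_args: '--info --json -T permissive'",
        f"      metrics_smart_args: '{smart_args}'",
    ])


# The six drives from the config posted above.
devices = [f"/dev/sd{letter}" for letter in "bcdefg"]

print("devices:")
print("\n".join(collector_device_entry(d) for d in devices))
```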

AnalogJ commented 2 years ago

@Parlane @MattKobayashi that's definitely a bug (missing deviceType should not cause the device to be skipped).

I've made a fix (and associated tests) in the beta branch, sorry about that. I'm going to close this issue again.

Parlane commented 2 years ago

Haha, when I added the device type to fix my config, I assumed it was me who had done it wrong 🤣