AnalogJ / scrutiny

Hard Drive S.M.A.R.T Monitoring, Historical Trends & Real World Failure Thresholds
MIT License

[BUG] Lots of disks marked as "failed" #336

Open joe-eklund opened 2 years ago

joe-eklund commented 2 years ago

Describe the bug

I have 24 Seagate Exos 10 TB drives. 11 of the 24 are marked as "failed" in the Scrutiny dashboard. When inspected, none of the 11 have any critical attributes marked as failed. Each of them has Hardware ECC Recovered, High Fly Writes, or both marked as failed.

I have extensively read through https://github.com/AnalogJ/scrutiny/issues/255, https://github.com/AnalogJ/scrutiny/blob/master/docs/TROUBLESHOOTING_DEVICE_COLLECTOR.md#seagate-drives-failing, and some other issues that referenced similar things. Looks like Seagate has been a problem child.

This makes me question whether these are "incorrectly" marked as failed or not. I did follow the troubleshooting recommendations at https://github.com/AnalogJ/scrutiny/blob/master/docs/TROUBLESHOOTING_DEVICE_COLLECTOR.md#seagate-drives-failing: I started out with 12 disks marked as failed, and that dropped to 11 afterwards.

So my two questions are:

  1. Are these two values being reported correctly? It may well be that the values are correct and the "failed" status is legitimate; I just want to verify that.
  2. Can I have Scrutiny only mark disks as failed if they fail the "critical" attribute list? I don't care about any of the other ones for the dashboard status.

Screenshots:

[Three screenshots attached: Screen Shot 2022-07-13 at 6 51 53 PM, 6 52 12 PM, and 6 52 20 PM]

My collector YAML looks like:

# Commented Scrutiny Configuration File
#
# The default location for this file is /opt/scrutiny/config/collector.yaml.
# In some cases to improve clarity default values are specified,
# uncommented. Other example values are commented out.
#
# When this file is parsed by Scrutiny, all configuration file keys are
# lowercased automatically. As such, Configuration keys are case-insensitive,
# and should be lowercase in this file to be consistent with usage.

######################################################################
# Version
#
# version specifies the version of this configuration file schema, not
# the scrutiny binary. There is only 1 version available at the moment
version: 1

# The host id is a label used for identifying groups of disks running on the same host
# Primarily used for hub/spoke deployments (can be left empty if using all-in-one image).
host:
  id: "sauron"

# This block allows you to override/customize the settings for devices detected by
# Scrutiny via `smartctl --scan`
# See the "--device=TYPE" section of https://linux.die.net/man/8/smartctl
# type can be a 'string' or a 'list'
devices:
  - device: /dev/sda
    type: 'ata'
    commands:
      metrics_info_args: '--info --json -T permissive' # used to determine device unique ID & register device with Scrutiny
      metrics_smart_args: '--vendorattribute=188,raw48:54 --xall --json -T permissive' # used to retrieve smart data for each device.
  - device: /dev/sdb
    type: 'ata'
    commands:
      metrics_info_args: '--info --json -T permissive' # used to determine device unique ID & register device with Scrutiny
      metrics_smart_args: '--vendorattribute=188,raw48:54 --xall --json -T permissive' # used to retrieve smart data for each device.
  - device: /dev/sdc
    type: 'ata'
    commands:
      metrics_info_args: '--info --json -T permissive' # used to determine device unique ID & register device with Scrutiny
      metrics_smart_args: '--vendorattribute=188,raw48:54 --xall --json -T permissive' # used to retrieve smart data for each device.
  - device: /dev/sdd
    type: 'ata'
    commands:
      metrics_info_args: '--info --json -T permissive' # used to determine device unique ID & register device with Scrutiny
      metrics_smart_args: '--vendorattribute=188,raw48:54 --xall --json -T permissive' # used to retrieve smart data for each device.
  - device: /dev/sde
    type: 'ata'
    commands:
      metrics_info_args: '--info --json -T permissive' # used to determine device unique ID & register device with Scrutiny
      metrics_smart_args: '--vendorattribute=188,raw48:54 --xall --json -T permissive' # used to retrieve smart data for each device.
  - device: /dev/sdf
    type: 'ata'
    commands:
      metrics_info_args: '--info --json -T permissive' # used to determine device unique ID & register device with Scrutiny
      metrics_smart_args: '--vendorattribute=188,raw48:54 --xall --json -T permissive' # used to retrieve smart data for each device.
  - device: /dev/sdg
    type: 'ata'
    commands:
      metrics_info_args: '--info --json -T permissive' # used to determine device unique ID & register device with Scrutiny
      metrics_smart_args: '--vendorattribute=188,raw48:54 --xall --json -T permissive' # used to retrieve smart data for each device.
  - device: /dev/sdh
    type: 'ata'
    commands:
      metrics_info_args: '--info --json -T permissive' # used to determine device unique ID & register device with Scrutiny
      metrics_smart_args: '--vendorattribute=188,raw48:54 --xall --json -T permissive' # used to retrieve smart data for each device.
  - device: /dev/sdi
    type: 'ata'
    commands:
      metrics_info_args: '--info --json -T permissive' # used to determine device unique ID & register device with Scrutiny
      metrics_smart_args: '--vendorattribute=188,raw48:54 --xall --json -T permissive' # used to retrieve smart data for each device.
  - device: /dev/sdj
    type: 'ata'
    commands:
      metrics_info_args: '--info --json -T permissive' # used to determine device unique ID & register device with Scrutiny
      metrics_smart_args: '--vendorattribute=188,raw48:54 --xall --json -T permissive' # used to retrieve smart data for each device.
  - device: /dev/sdk
    type: 'ata'
    commands:
      metrics_info_args: '--info --json -T permissive' # used to determine device unique ID & register device with Scrutiny
      metrics_smart_args: '--vendorattribute=188,raw48:54 --xall --json -T permissive' # used to retrieve smart data for each device.
  - device: /dev/sdl
    type: 'ata'
    commands:
      metrics_info_args: '--info --json -T permissive' # used to determine device unique ID & register device with Scrutiny
      metrics_smart_args: '--vendorattribute=188,raw48:54 --xall --json -T permissive' # used to retrieve smart data for each device.
  - device: /dev/sdm
    type: 'ata'
    commands:
      metrics_info_args: '--info --json -T permissive' # used to determine device unique ID & register device with Scrutiny
      metrics_smart_args: '--vendorattribute=188,raw48:54 --xall --json -T permissive' # used to retrieve smart data for each device.
  - device: /dev/sdn
    type: 'ata'
    commands:
      metrics_info_args: '--info --json -T permissive' # used to determine device unique ID & register device with Scrutiny
      metrics_smart_args: '--vendorattribute=188,raw48:54 --xall --json -T permissive' # used to retrieve smart data for each device.
  - device: /dev/sdo
    type: 'ata'
    commands:
      metrics_info_args: '--info --json -T permissive' # used to determine device unique ID & register device with Scrutiny
      metrics_smart_args: '--vendorattribute=188,raw48:54 --xall --json -T permissive' # used to retrieve smart data for each device.
  - device: /dev/sdp
    type: 'ata'
    commands:
      metrics_info_args: '--info --json -T permissive' # used to determine device unique ID & register device with Scrutiny
      metrics_smart_args: '--vendorattribute=188,raw48:54 --xall --json -T permissive' # used to retrieve smart data for each device.
  - device: /dev/sdq
    type: 'ata'
    commands:
      metrics_info_args: '--info --json -T permissive' # used to determine device unique ID & register device with Scrutiny
      metrics_smart_args: '--vendorattribute=188,raw48:54 --xall --json -T permissive' # used to retrieve smart data for each device.
  - device: /dev/sdr
    type: 'ata'
    commands:
      metrics_info_args: '--info --json -T permissive' # used to determine device unique ID & register device with Scrutiny
      metrics_smart_args: '--vendorattribute=188,raw48:54 --xall --json -T permissive' # used to retrieve smart data for each device.
  - device: /dev/sds
    type: 'ata'
    commands:
      metrics_info_args: '--info --json -T permissive' # used to determine device unique ID & register device with Scrutiny
      metrics_smart_args: '--vendorattribute=188,raw48:54 --xall --json -T permissive' # used to retrieve smart data for each device.
  - device: /dev/sdt
    type: 'ata'
    commands:
      metrics_info_args: '--info --json -T permissive' # used to determine device unique ID & register device with Scrutiny
      metrics_smart_args: '--vendorattribute=188,raw48:54 --xall --json -T permissive' # used to retrieve smart data for each device.
  - device: /dev/sdu
    type: 'ata'
    commands:
      metrics_info_args: '--info --json -T permissive' # used to determine device unique ID & register device with Scrutiny
      metrics_smart_args: '--vendorattribute=188,raw48:54 --xall --json -T permissive' # used to retrieve smart data for each device.
  - device: /dev/sdv
    type: 'ata'
    commands:
      metrics_info_args: '--info --json -T permissive' # used to determine device unique ID & register device with Scrutiny
      metrics_smart_args: '--vendorattribute=188,raw48:54 --xall --json -T permissive' # used to retrieve smart data for each device.
  - device: /dev/sdw
    type: 'ata'
    commands:
      metrics_info_args: '--info --json -T permissive' # used to determine device unique ID & register device with Scrutiny
      metrics_smart_args: '--vendorattribute=188,raw48:54 --xall --json -T permissive' # used to retrieve smart data for each device.
  - device: /dev/sdx
    type: 'ata'
    commands:
      metrics_info_args: '--info --json -T permissive' # used to determine device unique ID & register device with Scrutiny
      metrics_smart_args: '--vendorattribute=188,raw48:54 --xall --json -T permissive' # used to retrieve smart data for each device.

#  # example for forcing device type detection for a single disk
#  - device: /dev/sda
#    type: 'sat'
#
#  # example to show how to ignore a specific disk/device.
#  - device: /dev/sda
#    ignore: true
#
#  # examples showing how to force smartctl to detect disks inside a raid array/virtual disk
#  - device: /dev/bus/0
#    type:
#      - megaraid,14
#      - megaraid,15
#      - megaraid,18
#      - megaraid,19
#      - megaraid,20
#      - megaraid,21
#
#  - device: /dev/twa0
#    type:
#      - 3ware,0
#      - 3ware,1
#      - 3ware,2
#      - 3ware,3
#      - 3ware,4
#      - 3ware,5
#
#  # example to show how to override the smartctl command args (per device), see below for how to override these globally.
#  - device: /dev/sda
#    commands:
#      metrics_info_args: '--info --json -T permissive' # used to determine device unique ID & register device with Scrutiny
#      metrics_smart_args: '--xall --json -T permissive' # used to retrieve smart data for each device.

#log:
#  file: '' #absolute or relative paths allowed, eg. web.log
#  level: INFO
#
#api:
#  endpoint: 'http://localhost:8080'
#  endpoint: 'http://localhost:8080/custombasepath'
# if you need to use a custom base path (for a reverse proxy), you can add a suffix to the endpoint.
#  See docs/TROUBLESHOOTING_REVERSE_PROXY.md for more info,

# example to show how to override the smartctl command args globally
#commands:
#  metrics_smartctl_bin: 'smartctl' # change to provide custom `smartctl` binary path, eg. `/usr/sbin/smartctl`
#  metrics_scan_args: '--scan --json' # used to detect devices
#  metrics_info_args: '--info --json' # used to determine device unique ID & register device with Scrutiny
#  metrics_smart_args: '--xall --json' # used to retrieve smart data for each device.

########################################################################################################################
# FEATURES COMING SOON
#
# The following commented out sections are a preview of additional configuration options that will be available soon.
#
########################################################################################################################

#collect:
#  long:
#    enable: false
#    command: ''
#  short:
#    enable: false
#    command: ''

I can provide a log file(s) if needed. Thanks!
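
Every device entry in the config above repeats the same two argument strings, and the same file documents a global `commands:` override in its commented-out section. A minimal sketch of how the repetition might be collapsed, assuming the global override accepts the same argument strings as the per-device one and that applying the attribute 188 override to every detected drive is acceptable. Whether `type: 'ata'` still needs to be set per device depends on what `smartctl --scan` reports on the host, so the single `devices` entry below is only illustrative:

```yaml
version: 1

host:
  id: "sauron"

# Global override (assumption: it takes the same strings as the per-device
# form). It applies to every detected device, so use it only if all drives
# should be queried with identical smartctl arguments.
commands:
  metrics_info_args: '--info --json -T permissive'
  metrics_smart_args: '--vendorattribute=188,raw48:54 --xall --json -T permissive'

# Per-device entries are then only needed for exceptions,
# for example forcing a device type on one disk.
devices:
  - device: /dev/sda
    type: 'ata'
```

This does not change how Scrutiny evaluates the attributes; it only shortens the file and keeps the smartctl arguments in one place.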

AnalogJ commented 2 years ago

Hey @joe-eklund

Yeah, Seagate drives seem to report some of their SMART data in a non-standard way, which definitely causes issues for some users.

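In your case it looks like:

  • High Fly Writes is actually unusually high (at least when compared to Backblaze's data).

    • Is this value unusually high for all your Seagate drives? I wonder if this may be another non-standard attribute.

  • Hardware ECC Recovered is reported in a non-standard format. If it's not failing SMART, you can safely ignore it for now. Click the attribute row in the table to get extended details for the attribute.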

Can I have Scrutiny only mark disks as failed if they fail the "critical" attribute list? I don't care about any of the other ones for the dashboard status.

This is definitely a common request, and something I'm working on (as I find time). It's currently tracked in #275

While you cannot yet configure the failure status in the dashboard, you can configure how/when you get notified -- limiting to only critical attributes: https://github.com/AnalogJ/scrutiny/issues/300#issuecomment-1155984708

joe-eklund commented 2 years ago
  • High Fly Writes is actually unusually high (at least when compared to Backblaze's data).

    • Is this value unusually high for all your Seagate drives? I wonder if this may be another non-standard attribute.

I went and looked, and a handful of drives don't report that attribute at all (I guess they must be a different model or have different firmware, even though they are all Exos 10 TB drives). Others report a value of 0, and some show a WARN with a smaller number, like ~24. Then there are the ones marked as failed that I already discussed: 8 drives have this attribute marked as failed, and Scrutiny marks those drives as failed. So it seems like this is a legitimate value that Scrutiny is marking as failing, unlike the problematic Seagate values?

  • Hardware ECC Recovered is reported in a non-standard format. If it's not failing SMART, you can safely ignore it for now. Click the attribute row in the table to get extended details for the attribute.

Looks like I have three drives marked as failed in Scrutiny that have Hardware ECC Recovered marked as failed and High Fly Writes marked as warn. All the others that have Hardware ECC Recovered marked as failed also have the High Fly Writes marked as failed. So I guess I can just ignore it then...? I will say none of them are marked as SMART failed for this value.

Can I have Scrutiny only mark disks as failed if they fail the "critical" attribute list? I don't care about any of the other ones for the dashboard status.

This is definitely a common request, and something I'm working on (as I find time). It's currently tracked in #275

While you cannot yet configure the failure status in the dashboard, you can configure how/when you get notified -- limiting to only critical attributes: #300 (comment)

I see. I will at least go turn on failure notifications for critical attributes only. That is definitely an improvement. I will keep an eye on #275 for disabling Scrutiny analysis on non-critical attributes.

DanAE111 commented 1 year ago

I've also noticed that some of my disks are reported as failed, and they are exclusively Seagate.

Would it be possible to implement a warning status that would be raised for non-critical metrics that are above the thresholds?

I really would only want to see disks marked as failed when they are having data integrity issues or have stopped working.