AnalogJ / scrutiny

Hard Drive S.M.A.R.T Monitoring, Historical Trends & Real World Failure Thresholds
MIT License

[BUG] Cannot pull metrics from aacraid raid #608

Open wjbridge opened 3 months ago

wjbridge commented 3 months ago

Describe the bug

Scrutiny can no longer pull metrics from my aacraid RAID. The collector runs smartctl --xall --json --device sat /dev/sda instead of smartctl --xall --json --device aacraid,0,0,0 /dev/sda.

The command smartctl --xall --json --device sat /dev/sda returns an error, and I can reproduce this inside the container (i.e. via docker exec). If I run smartctl --xall --json --device aacraid,0,0,0 /dev/sda1 instead, it works.

Expected behavior

Metrics are pulled from the aacraid RAID.

Screenshots

time="2024-03-18T20:06:00-04:00" level=info msg="Verifying required tools" type=metrics
time="2024-03-18T20:06:00-04:00" level=info msg="Executing command: smartctl --scan --json" type=metrics
time="2024-03-18T20:06:00-04:00" level=info msg="Checking Influxdb & Sqlite health" type=web
time="2024-03-18T20:06:00-04:00" level=info msg="127.0.0.1 - 4f824e999d62 [18/Mar/2024:20:06:00 -0400] \"GET /api/health\" 200 16 \"\" \"curl/7.88.1\" (1ms)" clientIP=127.0.0.1 hostname=4f824e999d62 latency=1 method=GET path=/api/health referer= respLength=16 statusCode=200 type=web userAgent=curl/7.88.1
time="2024-03-18T20:06:00-04:00" level=info msg="Executing command: smartctl --info --json --device aacraid,0,0,0 /dev/sda" type=metrics
time="2024-03-18T20:06:00-04:00" level=info msg="Generating WWN" type=metrics
time="2024-03-18T20:06:00-04:00" level=info msg="Executing command: smartctl --info --json --device aacraid,0,0,1 /dev/sda" type=metrics
time="2024-03-18T20:06:00-04:00" level=info msg="Generating WWN" type=metrics
time="2024-03-18T20:06:00-04:00" level=info msg="Executing command: smartctl --info --json --device aacraid,0,0,2 /dev/sda" type=metrics
time="2024-03-18T20:06:00-04:00" level=info msg="Generating WWN" type=metrics
time="2024-03-18T20:06:00-04:00" level=info msg="Executing command: smartctl --info --json --device aacraid,0,0,3 /dev/sda" type=metrics
time="2024-03-18T20:06:00-04:00" level=info msg="Generating WWN" type=metrics
time="2024-03-18T20:06:00-04:00" level=info msg="Executing command: smartctl --info --json --device auto /dev/nvme0n1" type=metrics
time="2024-03-18T20:06:00-04:00" level=info msg="Using WWN Fallback" type=metrics
time="2024-03-18T20:06:00-04:00" level=info msg="Sending detected devices to API, for filtering & validation" type=metrics
time="2024-03-18T20:06:00-04:00" level=info msg="127.0.0.1 - 4f824e999d62 [18/Mar/2024:20:06:00 -0400] \"POST /api/devices/register\" 200 2827 \"\" \"Go-http-client/1.1\" (1ms)" clientIP=127.0.0.1 hostname=4f824e999d62 latency=1 method=POST path=/api/devices/register referer= respLength=2827 statusCode=200 type=web userAgent=Go-http-client/1.1
time="2024-03-18T20:06:00-04:00" level=info msg="Collecting smartctl results for sda\n" type=metrics
time="2024-03-18T20:06:00-04:00" level=info msg="Executing command: smartctl --xall --json --device sat /dev/sda" type=metrics
time="2024-03-18T20:06:00-04:00" level=error msg="smartctl returned an error code (2) while processing sda\n" type=metrics
time="2024-03-18T20:06:00-04:00" level=error msg="smartctl could not open device" type=metrics
time="2024-03-18T20:06:00-04:00" level=info msg="Publishing smartctl results for 0x5000cca2abeb2d07\n" type=metrics
time="2024-03-18T20:06:01-04:00" level=info msg="Successfully sent notifications. Check logs for more information." type=web
time="2024-03-18T20:06:01-04:00" level=info msg="Collecting smartctl results for sda\n" type=metrics
time="2024-03-18T20:06:01-04:00" level=info msg="Executing command: smartctl --xall --json --device sat /dev/sda" type=metrics
time="2024-03-18T20:06:01-04:00" level=error msg="smartctl returned an error code (2) while processing sda\n" type=metrics
time="2024-03-18T20:06:01-04:00" level=error msg="smartctl could not open device" type=metrics
time="2024-03-18T20:06:01-04:00" level=info msg="Publishing smartctl results for 0x5000cca2b6f39395\n" type=metrics
time="2024-03-18T20:06:01-04:00" level=info msg="Successfully sent notifications. Check logs for more information." type=web
time="2024-03-18T20:06:01-04:00" level=info msg="Collecting smartctl results for sda\n" type=metrics
time="2024-03-18T20:06:01-04:00" level=info msg="Executing command: smartctl --xall --json --device sat /dev/sda" type=metrics
time="2024-03-18T20:06:01-04:00" level=error msg="smartctl returned an error code (2) while processing sda\n" type=metrics
time="2024-03-18T20:06:01-04:00" level=error msg="smartctl could not open device" type=metrics
time="2024-03-18T20:06:01-04:00" level=info msg="Publishing smartctl results for 0x5000cca2eccb1d14\n" type=metrics
time="2024-03-18T20:06:01-04:00" level=info msg="Successfully sent notifications. Check logs for more information." type=web
time="2024-03-18T20:06:01-04:00" level=info msg="Collecting smartctl results for sda\n" type=metrics
time="2024-03-18T20:06:01-04:00" level=info msg="Executing command: smartctl --xall --json --device sat /dev/sda" type=metrics
time="2024-03-18T20:06:02-04:00" level=error msg="smartctl returned an error code (2) while processing sda\n" type=metrics
time="2024-03-18T20:06:02-04:00" level=error msg="smartctl could not open device" type=metrics
time="2024-03-18T20:06:02-04:00" level=info msg="Publishing smartctl results for 0x5000cca2b6f61045\n" type=metrics
time="2024-03-18T20:06:02-04:00" level=info msg="Successfully sent notifications. Check logs for more information." type=web

Log Files

Docker Config

  #-------------------------------------------
  # Scrutiny - WebUI for smartd S.M.A.R.T monitoring
  # https://github.com/AnalogJ/scrutiny/blob/master/docker/example.omnibus.docker-compose.yml
  #-------------------------------------------
  scrutiny:
    image: ghcr.io/analogj/scrutiny:master-omnibus
    container_name: scrutiny
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/api/health"]
      interval: 5s
      timeout: 10s
      retries: 20
      start_period: 10s
    networks:
      - $T2_NETWORK
    depends_on:
      - $PROXY
    cap_add:
      - SYS_RAWIO
      - SYS_ADMIN
    environment:
      - PUID=$PUID
      - PGID=$PGID
      - TZ=$TZ
    volumes:
      - $DOCKERCFG/Scrutiny/config:/opt/scrutiny/config
      - $DOCKERCFG/Scrutiny/influxdb:/opt/scrutiny/influxdb
      - /run/udev:/run/udev:ro
    devices:
      - /dev/nvme0n1:/dev/nvme0n1
      - /dev/sda:/dev/sda
      - /dev/aac0:/dev/aac0
    labels:
      - com.centurylinklabs.watchtower.enable=true

AnalogJ commented 3 months ago

Have you created a collector config file?

https://github.com/AnalogJ/scrutiny/blob/master/example.collector.yaml#L42-L50

If smartctl is returning an error, you need to provide a config file to override/configure the smartctl command for your disks.
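For reference, smartctl reports failures through a bitmask exit status, so the "error code (2)" in the log above can be decoded bit by bit. The sketch below is a small Python helper with bit meanings paraphrased from the smartctl(8) man page; the names `EXIT_STATUS_BITS` and `decode_exit_status` are hypothetical and are not part of Scrutiny or smartmontools.

```python
# Meaning of each bit in smartctl's exit status, paraphrased from smartctl(8).
EXIT_STATUS_BITS = {
    0: "command line did not parse",
    1: "device open failed, no IDENTIFY DEVICE structure returned, "
       "or device is in a low-power mode",
    2: "a SMART or other ATA command failed, or a SMART data structure "
       "had a checksum error",
    3: 'SMART status check returned "DISK FAILING"',
    4: "prefail attributes found at or below threshold",
    5: "SMART status is OK, but some attributes were at or below "
       "threshold at some time in the past",
    6: "device error log contains records of errors",
    7: "device self-test log contains records of errors",
}


def decode_exit_status(status):
    """Return the descriptions of all bits set in a smartctl exit status."""
    return [text for bit, text in EXIT_STATUS_BITS.items() if status & (1 << bit)]


# Exit status 2 has only bit 1 set.
for reason in decode_exit_status(2):
    print(reason)
```

An exit status of 2 has only bit 1 set, which is consistent with the "smartctl could not open device" message logged next to it.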

thomashilzendegen commented 3 months ago

I have the same problem. I tracked it down to changes in release 0.7.3; with 0.7.2 it still works. As a workaround, I will stay on that version.

wjbridge commented 3 months ago

Yes, I have created the config file. Here is my collector file. I can also confirm everything works correctly with v0.7.2.

######################################################################
# Version
#
# version specifies the version of this configuration file schema, not
# the scrutiny binary. There is only 1 version available at the moment
version: 1

# The host id is a label used for identifying groups of disks running on the same host
# Primarily used for hub/spoke deployments (can be left empty if using all-in-one image).
host:
  id: ""

# This block allows you to override/customize the settings for devices detected by
# Scrutiny via `smartctl --scan`
# See the "--device=TYPE" section of https://linux.die.net/man/8/smartctl
# type can be a 'string' or a 'list'
devices:
# examples showing how to force smartctl to detect disks inside a raid array/virtual disk
  - device: /dev/sda
    type:
      - aacraid,0,0,0
      - aacraid,0,0,1
      - aacraid,0,0,2
      - aacraid,0,0,3

# example for forcing device type detection for a single disk
  - device: /dev/nvme0n1
    type: 'auto'

########################################################################################################################
# FEATURES COMING SOON
#
# The following commented out sections are a preview of additional configuration options that will be available soon.
#
########################################################################################################################

I am not sure how to override/configure the smartctl command for my disks so that --device sat is replaced with --device aacraid,0,0,0. I thought that value was coming from the collector file.
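To make the expectation concrete, here is a minimal Python sketch of how the `devices` block above would map to metric-collection commands if each configured type were carried through to the --xall call. This illustrates the expected behavior only and is not Scrutiny's actual implementation: the helper names are hypothetical, the config is written as an already-parsed dict, and falling back to `auto` when `type` is missing is an assumption.

```python
def expand_device_overrides(devices):
    """Yield one (device, type) pair per configured type.

    'type' may be a single string or a list, as the config comments describe.
    """
    for entry in devices:
        types = entry.get("type", "auto")  # assumption: default to 'auto'
        if isinstance(types, str):
            types = [types]
        for dev_type in types:
            yield entry["device"], dev_type


def xall_command(device, dev_type):
    """Build the smartctl argument list the reporter expects to be run."""
    return ["smartctl", "--xall", "--json", "--device", dev_type, device]


# The 'devices' block from the collector file above, as parsed data.
devices = [
    {
        "device": "/dev/sda",
        "type": ["aacraid,0,0,0", "aacraid,0,0,1", "aacraid,0,0,2", "aacraid,0,0,3"],
    },
    {"device": "/dev/nvme0n1", "type": "auto"},
]

commands = [xall_command(dev, typ) for dev, typ in expand_device_overrides(devices)]
for command in commands:
    print(" ".join(command))
```

Under these assumptions, /dev/sda would yield four --xall commands, one per aacraid slot, instead of the repeated --device sat calls seen in the log.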