AnalogJ / scrutiny

Hard Drive S.M.A.R.T Monitoring, Historical Trends & Real World Failure Thresholds

[BUG] smartctl returned an error on QNAP #560

Closed paulmorabito closed 8 months ago

paulmorabito commented 9 months ago

Running the latest version of scrutiny on my QNAP TS453BE, with the latest QNAP firmware, is once again giving smartctl errors. My compose file and logs are below:

services:
  scrutiny:
    image: ghcr.io/analogj/scrutiny:master-omnibus
    container_name: scrutiny
    privileged: true
    cap_add:
      - SYS_RAWIO
    volumes:
      - /run/udev:/run/udev:ro
      - /share/persistent/scrutiny:/scrutiny/config
      - ./influxdb:/opt/scrutiny/influxdb
    devices:
      - /dev/sda
      - /dev/sdb
      - /dev/sdc
      - /dev/sdd
    ports:
      - 8180:8080
      - "8086:8086" 
    restart: always

logs:

__   ___  ____  __  __  ____  ____  _  _  _  _
/ __) / __)(  _ \(  )(  )(_  _)(_  _)( \( )( \/ )
\__ \( (__  )   / )(__)(   )(   _)(_  )  (  \  /
(___/ \___)(_)\_)(______) (__) (____)(_)\_) (__)
github.com/AnalogJ/scrutiny                             dev-0.7.2
Start the scrutiny server
time="2023-12-29T14:38:53Z" level=info msg="Trying to connect to scrutiny sqlite db: /opt/scrutiny/config/scrutiny.db\n" type=web
time="2023-12-29T14:38:53Z" level=info msg="Successfully connected to scrutiny sqlite db: /opt/scrutiny/config/scrutiny.db\n" type=web
time="2023-12-29T14:38:53Z" level=info msg="InfluxDB certificate verification: true\n" type=web
time="2023-12-29T14:38:53Z" level=info msg="Database migration starting. Please wait, this process may take a long time...." type=web
time="2023-12-29T14:38:53Z" level=info msg="Database migration completed successfully" type=web
time="2023-12-29T14:38:53Z" level=info msg="SQLite global configuration migrations starting. Please wait...." type=web
2023/12/29 14:38:53 /go/src/github.com/analogj/scrutiny/vendor/github.com/go-gormigrate/gormigrate/v2/gormigrate.go:443 SLOW SQL >= 200ms
[422.567ms] [rows:1] INSERT INTO migrations (id) VALUES ("g20220802211500")
time="2023-12-29T14:38:53Z" level=info msg="SQLite global configuration migrations completed successfully" type=web
time="2023-12-29T14:38:58Z" level=info msg="127.0.0.1 - 7e29e6c5589c [29/Dec/2023:14:38:58 +0000] \"HEAD /api/health\" 200 0 \"\" \"curl/7.74.0\" (1ms)" clientIP=127.0.0.1 hostname=7e29e6c5589c latency=1 method=HEAD path=/api/health referer= respLength=0 statusCode=200 type=web userAgent=curl/7.74.0
starting scrutiny collector (run-once mode. subsequent calls will be triggered via cron service)
2023/12/29 14:38:58 No configuration file found at /opt/scrutiny/config/collector.yaml. Using Defaults.
 ___   ___  ____  __  __  ____  ____  _  _  _  _
/ __) / __)(  _ \(  )(  )(_  _)(_  _)( \( )( \/ )
\__ \( (__  )   / )(__)(   )(   _)(_  )  (  \  /
(___/ \___)(_)\_)(______) (__) (____)(_)\_) (__)
AnalogJ/scrutiny/metrics                                dev-0.7.2
time="2023-12-29T14:38:58Z" level=info msg="Verifying required tools" type=metrics
time="2023-12-29T14:38:58Z" level=info msg="Executing command: smartctl --scan --json" type=metrics
time="2023-12-29T14:38:58Z" level=info msg="Executing command: smartctl --info --json /dev/sda" type=metrics
time="2023-12-29T14:38:58Z" level=info msg="Using WWN Fallback" type=metrics
time="2023-12-29T14:38:58Z" level=info msg="Executing command: smartctl --info --json /dev/sdb" type=metrics
time="2023-12-29T14:38:58Z" level=info msg="Using WWN Fallback" type=metrics
time="2023-12-29T14:38:58Z" level=info msg="Executing command: smartctl --info --json /dev/sdc" type=metrics
time="2023-12-29T14:38:58Z" level=info msg="Using WWN Fallback" type=metrics
time="2023-12-29T14:38:58Z" level=info msg="Executing command: smartctl --info --json /dev/sdd" type=metrics
time="2023-12-29T14:38:58Z" level=info msg="Using WWN Fallback" type=metrics
time="2023-12-29T14:38:58Z" level=info msg="Sending detected devices to API, for filtering & validation" type=metrics
time="2023-12-29T14:38:58Z" level=info msg="127.0.0.1 - 7e29e6c5589c [29/Dec/2023:14:38:58 +0000] \"POST /api/devices/register\" 200 2075 \"\" \"Go-http-client/1.1\" (408ms)" clientIP=127.0.0.1 hostname=7e29e6c5589c latency=408 method=POST path=/api/devices/register referer= respLength=2075 statusCode=200 type=web userAgent=Go-http-client/1.1
time="2023-12-29T14:38:58Z" level=info msg="Collecting smartctl results for sda\n" type=metrics
time="2023-12-29T14:38:58Z" level=info msg="Executing command: smartctl --xall --json /dev/sda" type=metrics
time="2023-12-29T14:38:58Z" level=error msg="smartctl returned an error code (4) while processing sda\n" type=metrics
time="2023-12-29T14:38:58Z" level=error msg="smartctl detected a checksum error" type=metrics
time="2023-12-29T14:38:58Z" level=info msg="Publishing smartctl results for 2yjdn6sd\n" type=metrics
ts=2023-12-29T14:38:58.926265Z lvl=info msg="index opened with 8 partitions" log_id=0mPW3M0l000 service=storage-engine index=tsi
ts=2023-12-29T14:38:58.927330Z lvl=info msg="Reindexing TSM data" log_id=0mPW3M0l000 service=storage-engine engine=tsm1 db_shard_id=1
ts=2023-12-29T14:38:58.927369Z lvl=info msg="Reindexing WAL data" log_id=0mPW3M0l000 service=storage-engine engine=tsm1 db_shard_id=1
time="2023-12-29T14:38:58Z" level=info msg="No notification endpoints configured. Skipping failure notification." type=web
time="2023-12-29T14:38:58Z" level=info msg="127.0.0.1 - 7e29e6c5589c [29/Dec/2023:14:38:58 +0000] \"POST /api/device/2yjdn6sd/smart\" 200 16 \"\" \"Go-http-client/1.1\" (140ms)" clientIP=127.0.0.1 hostname=7e29e6c5589c latency=140 method=POST path=/api/device/2yjdn6sd/smart referer= respLength=16 statusCode=200 type=web userAgent=Go-http-client/1.1
time="2023-12-29T14:38:58Z" level=info msg="Collecting smartctl results for sdb\n" type=metrics
time="2023-12-29T14:38:58Z" level=info msg="Executing command: smartctl --xall --json /dev/sdb" type=metrics
time="2023-12-29T14:38:59Z" level=error msg="smartctl returned an error code (4) while processing sdb\n" type=metrics
time="2023-12-29T14:38:59Z" level=error msg="smartctl detected a checksum error" type=metrics
time="2023-12-29T14:38:59Z" level=info msg="Publishing smartctl results for 2yj8s5bd\n" type=metrics
time="2023-12-29T14:38:59Z" level=info msg="No notification endpoints configured. Skipping failure notification." type=web
time="2023-12-29T14:38:59Z" level=info msg="127.0.0.1 - 7e29e6c5589c [29/Dec/2023:14:38:59 +0000] \"POST /api/device/2yj8s5bd/smart\" 200 16 \"\" \"Go-http-client/1.1\" (337ms)" clientIP=127.0.0.1 hostname=7e29e6c5589c latency=337 method=POST path=/api/device/2yj8s5bd/smart referer= respLength=16 statusCode=200 type=web userAgent=Go-http-client/1.1
time="2023-12-29T14:38:59Z" level=info msg="Collecting smartctl results for sdc\n" type=metrics
time="2023-12-29T14:38:59Z" level=info msg="Executing command: smartctl --xall --json /dev/sdc" type=metrics
time="2023-12-29T14:38:59Z" level=error msg="smartctl returned an error code (4) while processing sdc\n" type=metrics
time="2023-12-29T14:38:59Z" level=error msg="smartctl detected a checksum error" type=metrics
time="2023-12-29T14:38:59Z" level=info msg="Publishing smartctl results for jehn4m1n\n" type=metrics
time="2023-12-29T14:38:59Z" level=info msg="No notification endpoints configured. Skipping failure notification." type=web
time="2023-12-29T14:38:59Z" level=info msg="127.0.0.1 - 7e29e6c5589c [29/Dec/2023:14:38:59 +0000] \"POST /api/device/jehn4m1n/smart\" 200 16 \"\" \"Go-http-client/1.1\" (204ms)" clientIP=127.0.0.1 hostname=7e29e6c5589c latency=204 method=POST path=/api/device/jehn4m1n/smart referer= respLength=16 statusCode=200 type=web userAgent=Go-http-client/1.1
time="2023-12-29T14:38:59Z" level=info msg="Collecting smartctl results for sdd\n" type=metrics
time="2023-12-29T14:38:59Z" level=info msg="Executing command: smartctl --xall --json /dev/sdd" type=metrics
time="2023-12-29T14:38:59Z" level=error msg="smartctl returned an error code (4) while processing sdd\n" type=metrics
time="2023-12-29T14:38:59Z" level=error msg="smartctl detected a checksum error" type=metrics
time="2023-12-29T14:38:59Z" level=info msg="Publishing smartctl results for 2yjdutkd\n" type=metrics
time="2023-12-29T14:38:59Z" level=info msg="No notification endpoints configured. Skipping failure notification." type=web
time="2023-12-29T14:38:59Z" level=info msg="127.0.0.1 - 7e29e6c5589c [29/Dec/2023:14:38:59 +0000] \"POST /api/device/2yjdutkd/smart\" 200 16 \"\" \"Go-http-client/1.1\" (188ms)" clientIP=127.0.0.1 hostname=7e29e6c5589c latency=188 method=POST path=/api/device/2yjdutkd/smart referer= respLength=16 statusCode=200 type=web userAgent=Go-http-client/1.1

Please let me know if there is any further debugging or logging needed and I'll get to it.

Thanks,

mcarbonne commented 9 months ago

Maybe you can run smartctl from inside the scrutiny Docker container to obtain detailed logs:

sudo docker exec -it scrutiny /bin/sh

(replace scrutiny with the name of your running container) and then execute smartctl --xall /dev/sda.
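For context on the "error code (4)" in the logs above: smartctl's exit status is a bitmask, and according to the smartctl man page bit 2 (value 4) means that a SMART or other ATA command to the disk failed, or that a SMART data structure had a checksum error. A minimal sketch of a decoder follows. The function name decode_smartctl_status is made up for illustration, and the bit descriptions are paraphrased from the man page rather than quoted.

```shell
# Hypothetical helper: print the meaning of every bit set in a smartctl
# exit status. Descriptions are paraphrased from the EXIT STATUS section
# of the smartctl man page. Uses global variables (plain POSIX sh).
decode_smartctl_status() {
  status=$1
  bit=0
  # The here-document lists one description per bit, from bit 0 to bit 7.
  while IFS= read -r meaning; do
    if [ $(( (status >> bit) & 1 )) -eq 1 ]; then
      echo "bit $bit: $meaning"
    fi
    bit=$((bit + 1))
  done <<'EOF'
command line did not parse
device could not be opened
a SMART or ATA command failed, or a SMART data structure had a checksum error
SMART status check returned DISK FAILING
prefail attributes found at or below threshold
some attributes were at or below threshold in the past
device error log contains error records
self-test log contains errors
EOF
}
```

Inside the container this could be used as smartctl --xall /dev/sda; decode_smartctl_status $? to see which bits are set.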

paulmorabito commented 9 months ago

Here you go:

root@7e29e6c5589c:/opt/scrutiny# smartctl --xall /dev/sda
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.10.60-qnap] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               WDC
Product:              WD100EMAZ-00WJTA
Revision:             83.H
Compliance:           SPC-3
User Capacity:        10,000,831,348,736 bytes [10.0 TB]
Logical block size:   512 bytes
Physical block size:  4096 bytes
LU is fully provisioned
Rotation Rate:        5400 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x5000cca273e1ef5b
Serial number:        2YJDN6SD
Device type:          disk
Local Time is:        Wed Jan  3 08:51:52 2024 UTC
SMART support is:     Unavailable - device lacks SMART capability.
Read Cache is:        Enabled
Writeback Cache is:   Enabled

=== START OF READ SMART DATA SECTION ===
Current Drive Temperature:     0 C
Drive Trip Temperature:        0 C

Error Counter logging not supported

[GLTSD (Global Logging Target Save Disable) set. Enable Save with '-S on']
Device does not support Self Test logging
Device does not support Background scan results logging

mcarbonne commented 9 months ago

The issue is on the smartmontools side. I don't know what kind of hardware your NAS uses, but it may be preventing smartctl's device auto-detection. Nevertheless, you can try to find a working configuration manually. From inside the scrutiny container, run smartctl --xall --device=XXX /dev/sda (replace XXX with sat, scsi, ...). The man page lists all available options (https://linux.die.net/man/8/smartctl).

If you succeed, have a look at the metrics_smart_args parameter. The default value is --xall --json, but you can add any extra parameters your drives require (--xall --json --device sat, for example).
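The trial-and-error step could be scripted along these lines. This is only a sketch: probe_device_types and the SMARTCTL override variable are names invented here, and treating bits 0-2 of the exit status as "the command itself failed" is an interpretation of the smartctl man page, not something scrutiny does.

```shell
# Hypothetical helper: try several --device types against one drive and
# report which ones let smartctl read it. SMARTCTL may be set to another
# command (an assumption made here so the loop can be tried without real
# hardware); it defaults to the smartctl binary.
probe_device_types() {
  device=$1
  shift
  for type in "$@"; do
    rc=0
    "${SMARTCTL:-smartctl}" --xall --device="$type" "$device" >/dev/null 2>&1 || rc=$?
    # Bits 0-2 signal that the command itself failed; the higher bits
    # describe the health of a drive that was read successfully.
    if [ $((rc & 7)) -eq 0 ]; then
      echo "$type: ok (exit status $rc)"
    else
      echo "$type: failed (exit status $rc)"
    fi
  done
}
```

For example, probe_device_types /dev/sda ata scsi sat run inside the container prints one line per type, and any type reported as ok is a candidate for the collector configuration.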

paulmorabito commented 9 months ago

I found the missing parameter (--device=sat) and can run the command successfully from the command line in the container. When I update the config, though, the collector still runs the command without --device. My config is below:

# Commented Scrutiny Configuration File
#
# The default location for this file is /scrutiny/config/collector.yaml.
# In some cases to improve clarity default values are specified,
# uncommented. Other example values are commented out.
#
# When this file is parsed by Scrutiny, all configuration file keys are
# lowercased automatically. As such, Configuration keys are case-insensitive,
# and should be lowercase in this file to be consistent with usage.

######################################################################
# Version
#
# version specifies the version of this configuration file schema, not
# the scrutiny binary. There is only 1 version available at the moment
version: 1

# This block allows you to override/customize the settings for devices detected by
# Scrutiny via `smartctl --scan`
# See the "--device=TYPE" section of https://linux.die.net/man/8/smartctl
# type can be a 'string' or a 'list'
devices:
  # example for forcing device type detection for a single disk
  - device: /dev/sda
    type: 'sat'
  - device: /dev/sdb
    type: 'sat'
  - device: /dev/sdc
    type: 'sat'
  - device: /dev/sdd
    type: 'sat'
commands:
  #  metrics_scan_args: '--scan --json' # used to detect devices
  #  metrics_info_args: '--info --json' # used to determine device unique ID & register device with Scrutiny
  metrics_smart_args: '--xall --device=sat --json' # used to retrieve smart data for each device.

Error from the logs below:

time="2024-01-04T14:06:58Z" level=info msg="Collecting smartctl results for sda\n" type=metrics
time="2024-01-04T14:06:58Z" level=info msg="Executing command: smartctl --xall --json /dev/sda" type=metrics
time="2024-01-04T14:06:58Z" level=error msg="smartctl returned an error code (4) while processing sda\n" type=metrics
time="2024-01-04T14:06:58Z" level=error msg="smartctl detected a checksum error" type=metrics
time="2024-01-04T14:06:58Z" level=info msg="Publishing smartctl results for 2yjdn6sd\n" type=metrics
time="2024-01-04T14:06:58Z" level=info msg="No notification endpoints configured. Skipping failure notification." type=web

I'm running the latest container version etc. Is there anything I am missing?

chrisuhg commented 9 months ago

Many thanks for your comment :)

I got the same error for all of my SATA SSDs in the collector.log file: level=error msg="smartctl returned an error code (4) while processing sdf\n" type=metrics

So I ran docker exec -it scrutiny /bin/sh and then smartctl --xall /dev/sde, which showed:

smartctl 7.2 2020-12-30 r5155 [x86_64-linux-4.4.302+] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               WDC
Product:              WDS250G2B0B-00YS
Revision:             20WD
Compliance:           SPC-3
User Capacity:        250,059,350,016 bytes [250 GB]
Logical block size:   512 bytes
LU is fully provisioned
Rotation Rate:        Solid State Device
Logical Unit id:      0x5001b444a773882b
Serial number:        202********10
Device type:          disk
Local Time is:        Thu Jan  4 15:35:31 2024 UTC
SMART support is:     Unavailable - device lacks SMART capability.
Read Cache is:        Enabled
Writeback Cache is:   Enabled

=== START OF READ SMART DATA SECTION ===
Current Drive Temperature:     0 C
Drive Trip Temperature:        0 C

Error Counter logging not supported

[GLTSD (Global Logging Target Save Disable) set. Enable Save with '-S on']
Device does not support Self Test logging
Device does not support Background scan results logging

I tried different values for --device=TYPE:

smartctl --xall --device=ata /dev/sdj
smartctl --xall --device=scsi /dev/sdj
smartctl --xall --device=sat /dev/sdj

Only smartctl --xall --device=sat /dev/sdj showed all of the SMART information, so sat is clearly the appropriate type for this drive!

So I split the configuration into compose.yaml and collector.yaml.

compose.yaml:

version: '3.5'

services:
  scrutiny:
    container_name: scrutiny
    image: ghcr.io/analogj/scrutiny:master-omnibus
    privileged: true    # !!PLEASE REMOVE WHEN WORKING!!
    cap_add:
      - SYS_RAWIO
      - SYS_ADMIN
    ports:
      - "8080:8080" # webapp
    environment:
      - PUID=1000
      - PGID=1000
      - DEBUG=true
      - COLLECTOR_LOG_FILE=/opt/scrutiny/config/collector.log
      - SCRUTINY_LOG_FILE=/opt/scrutiny/config/web.log
    volumes:
      - /run/udev:/run/udev:ro
      - ./config:/opt/scrutiny/config
      - ./influxdb:/opt/scrutiny/influxdb
    devices:    # if you will always run in "privileged" mode, you can remove this section
      - /dev/sda
      - /dev/nvme0

collector.yaml (located at ./config/collector.yaml):

# Commented Scrutiny Configuration File
#
# The default location for this file is /opt/scrutiny/config/collector.yaml.
# In some cases to improve clarity default values are specified,
# uncommented. Other example values are commented out.
#
# When this file is parsed by Scrutiny, all configuration file keys are
# lowercased automatically. As such, Configuration keys are case-insensitive,
# and should be lowercase in this file to be consistent with usage.
######################################################################

# Version
# version specifies the version of this configuration file schema, not
# the scrutiny binary. There is only 1 version available at the moment
version: 1

# The host id is a label used for identifying groups of disks running on the same host
# Primarily used for hub/spoke deployments (can be left empty if using all-in-one image).
host:
  id: ""

# This block allows you to override/customize the settings for devices detected by
# Scrutiny via `smartctl --scan`
# See the "--device=TYPE" section of https://linux.die.net/man/8/smartctl
# type can be a 'string' or a 'list'
devices:
  - device: /dev/sda
    type: 'ata'

  - device: /dev/sdb
    type: 'sat'

  - device: /dev/sde
    type: 'sat'

  - device: /dev/sdf
    type: 'ata'

  - device: /dev/sdj
    type: 'sat'

  - device: /dev/nvme0
    type: 'nvme'

remove files or folder to clean the cache:

Rebuild your scrutiny container and open your config/collector.log file. When you search for smartctl --info --json --device sat, you should see a line like level=info msg="Executing command: smartctl --info --json --device sat /dev/sdf" type=metrics
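That log check can also be done from a shell rather than by eye. A small sketch, assuming the log format shown in this thread (count_device_flag is a made-up name):

```shell
# Hypothetical helper: count the collector log lines in which smartctl was
# invoked with the given --device type. The pattern follows the
# "Executing command: smartctl ... --device TYPE /dev/..." lines quoted in
# this thread.
count_device_flag() {
  logfile=$1
  type=$2
  # grep -c prints 0 and exits non-zero when nothing matches, so the
  # status is ignored to keep the helper usable under "set -e".
  grep -c -e "Executing command: smartctl .*--device $type " "$logfile" || true
}
```

For example, count_device_flag ./config/collector.log sat prints how many smartctl invocations carried --device sat. A result of 0 suggests that the collector is not picking up the devices block.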

Now everything works normally 👯

I spent a lot of effort searching the issues and the official documentation for anything on collector.yaml or metrics_smart_args, but I couldn't find (or missed) any guide for the Synology Container Manager environment. I hope what I shared is helpful :D

paulmorabito commented 9 months ago

@chrisuhg Thanks for the info. Although, I'm not sure why you need to rebuild the container when my config files are stored outside of it and the container reads the config upon every start/restart?

chrisuhg commented 9 months ago

I know the container reads the config on every start, but the Container Manager app on the Synology NAS does not pick up the changes after I edit the compose.yaml file.

So I assume the Start/Stop buttons in the GUI are equivalent to docker run/stop, and the option > build button is equivalent to docker-compose -config compose.yml

Maybe I should attach more screenshots to make "build" or "re-build" easier to understand~

Anyway, thanks for asking :D

paulmorabito commented 9 months ago

Ah yes, you make a good point. I use Portainer for managing my containers instead of QNAP's Container Manager.

In any case, the issue remains for me, as it seems the global smartctl command arguments are not being read from the config.

pheetr commented 8 months ago

I also have a QNAP and had the same issue as you, @paulmorabito. The advice from @chrisuhg helped me get it working. I just created the collector.yaml file in the config folder (where scrutiny.db is located) and specified all the drive types individually. In my case the type is 'sat', as that was what yielded results in scrutiny's Docker console with the command smartctl --xall --device=sat /dev/sda. Restarting the container afterwards was enough to get things going in my case.

Note: I initially attempted to set the global command arguments in the config, since all my drives are SATA, but that didn't work for me.

collector.yaml:

# Commented Scrutiny Configuration File
#
# The default location for this file is /opt/scrutiny/config/collector.yaml.
# In some cases to improve clarity default values are specified,
# uncommented. Other example values are commented out.
#
# When this file is parsed by Scrutiny, all configuration file keys are
# lowercased automatically. As such, Configuration keys are case-insensitive,
# and should be lowercase in this file to be consistent with usage.

######################################################################
# Version
#
# version specifies the version of this configuration file schema, not
# the scrutiny binary. There is only 1 version available at the moment
version: 1

# The host id is a label used for identifying groups of disks running on the same host
# Primarily used for hub/spoke deployments (can be left empty if using all-in-one image).
host:
  id: ""

# This block allows you to override/customize the settings for devices detected by
# Scrutiny via `smartctl --scan`
# See the "--device=TYPE" section of https://linux.die.net/man/8/smartctl
# type can be a 'string' or a 'list'
devices:
#  # example for forcing device type detection for a single disk
  - device: /dev/sda
    type: 'sat'
  - device: /dev/sdb
    type: 'sat'
  - device: /dev/sdc
    type: 'sat'
  - device: /dev/sdd
    type: 'sat'
  - device: /dev/sde
    type: 'sat'
  - device: /dev/sdf
    type: 'sat'
  - device: /dev/sdg
    type: 'sat'
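
With many drives of the same type, the devices block above gets repetitive to write by hand. A minimal sketch that prints it from a list of device paths (print_devices_block is a made-up name; the output format mirrors the config above):

```shell
# Hypothetical helper: print a collector.yaml "devices" block that assigns
# the same type to every listed device.
print_devices_block() {
  type=$1
  shift
  echo "devices:"
  for device in "$@"; do
    echo "  - device: $device"
    echo "    type: '$type'"
  done
}
```

For example, print_devices_block sat /dev/sda /dev/sdb prints a two-entry block that can be pasted into collector.yaml. The device paths still have to be checked against what smartctl --scan reports on the host.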

paulmorabito commented 8 months ago

Thanks for the reply, @pheetr. I've taken a look with fresh eyes, and my config was pointing to the wrong location. I don't check Scrutiny very often, and perhaps at some point the location was changed to /opt/scrutiny/config?
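
The original logs in this issue already hinted at this: they contain the line "No configuration file found at /opt/scrutiny/config/collector.yaml. Using Defaults.", while the compose file mounted the config directory at /scrutiny/config. A small sketch for spotting that message in collector output (missing_config_path is a made-up name, and the pattern assumes the exact message wording shown in the logs above):

```shell
# Hypothetical helper: read collector output on stdin and print the path
# where the collector looked for collector.yaml and did not find it.
# Prints nothing if the "No configuration file found" message is absent.
missing_config_path() {
  sed -n 's/.*No configuration file found at \(.*\)\. Using Defaults\..*/\1/p'
}
```

For example, docker logs scrutiny 2>&1 | missing_config_path prints the expected path if the collector fell back to defaults, and that path can then be compared with the container side of the volume mapping in the compose file.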

In any case, I already had "type: sat" set, so pointing to the correct config sorted that. I also noticed on restarting that I now have 4 "[/DEV/ -" devices that can't be clicked on or deleted. There were quite a few DB migrations on restart, so it's perhaps a side effect of that?

I also noted that the global "commands" section doesn't seem to work, but that's a separate issue from this one. I'll close this for now, as the reported issue is fixed.