[BUG] Incorrect drive report

boomam commented 12 months ago

Describe the bug Understanding the guidance here, the Crucial MX500 array I am monitoring reads as failing SMART, while a manual run of smartctl on the host being monitored via the collector shows the drive as passed/working.

The collector reports 'exit code 4', but the drives otherwise report stats. Running the command referenced in the collector log manually, smartctl --xall --json /dev/sdg" type=metrics, does not actually work; it hangs on a > prompt instead.

Net result is incorrect status in GUI, and thus incorrect reporting via notification methods.

Expected behavior Correct reporting of SMART state/attributes.

Screenshots Not a screenshot, but localized output of smartctl -

user@Server:~# smartctl --all /dev/sdg
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.1.36-Unraid] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Crucial/Micron Client SSDs
Device Model:     CT1000MX500SSD1
Serial Number:    XTZ123456
LU WWN Device Id: XTZ123456
Firmware Version: M3CR046
User Capacity:    1,000,204,886,016 bytes [1.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
TRIM Command:     Available
Device is:        In smartctl database 7.3/5440
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed Jul  5 10:10:49 2023 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: **PASSED**

Log Files From collector.

time="2023-07-05T10:10:10-04:00" level=info msg="Collecting smartctl results for sdg\n" type=metrics
time="2023-07-05T10:10:10-04:00" level=info msg="Executing command: smartctl --xall --json /dev/sdg" type=metrics
time="2023-07-05T10:10:10-04:00" level=error msg="smartctl returned an error code (4) while processing sdg\n" type=metrics
time="2023-07-05T10:10:10-04:00" level=error msg="smartctl detected a checksum error" type=metrics

in another terminal trigger the collector
docker exec scrutiny scrutiny-collector-metrics run

INFO[0000] Verifying required tools                      type=metrics
INFO[0000] Executing command: smartctl --scan --json     type=metrics
INFO[0000] Executing command: smartctl --info --json /dev/sdg  type=metrics
INFO[0000] Generating WWN                                type=metrics
INFO[0003] Collecting smartctl results for sdg           type=metrics
INFO[0003] Executing command: smartctl --xall --json /dev/sdg  type=metrics
ERRO[0003] smartctl returned an error code (4) while processing sdg  type=metrics
ERRO[0003] smartctl detected a checksum error            type=metrics
INFO[0003] Publishing smartctl results for XYZ12345  type=metrics
INFO[0014] Main: Completed                               type=metrics
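
For reference, the error code (4) in the collector logs above is a bitmask rather than a single error number. The sketch below decodes it using bit meanings paraphrased from the EXIT STATUS section of the smartctl man page. The names `SMARTCTL_EXIT_BITS` and `decode_smartctl_exit` are made up for illustration and are not part of smartctl or Scrutiny.

```python
from typing import List

# Bit meanings paraphrased from the EXIT STATUS section of the smartctl man page.
SMARTCTL_EXIT_BITS = {
    0: "command line did not parse",
    1: "device open failed, or device did not return an IDENTIFY DEVICE structure",
    2: "a SMART or other ATA command failed, or a SMART data structure had a checksum error",
    3: "SMART status check returned DISK FAILING",
    4: "prefail attributes found at or below threshold",
    5: "SMART status is DISK OK, but some attributes were at or below threshold in the past",
    6: "device error log contains records of errors",
    7: "device self-test log contains records of errors",
}


def decode_smartctl_exit(code: int) -> List[str]:
    """Return the description of every bit that is set in the exit status."""
    return [text for bit, text in SMARTCTL_EXIT_BITS.items() if code & (1 << bit)]


if __name__ == "__main__":
    # Exit status 4 means only bit 2 is set.
    for reason in decode_smartctl_exit(4):
        print(reason)
```

An exit status of 4 sets only bit 2, which matches the "smartctl detected a checksum error" line in the log and does not by itself mean the drive failed its SMART health check (that would be bit 3).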

The log files will be available on your host in the config directory. Please attach them to this issue. no logs exist

Please also provide the output of docker info
n/a

AnalogJ commented 12 months ago

hi @boomam when running smartctl within the container manually (for testing) you should use smartctl --xall --json /dev/sdg not smartctl --xall --json /dev/sdg" type=metrics
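
To illustrate the point above: in the collector's log line, the closing `"` and `type=metrics` belong to the log format, not to the command. The sketch below pulls the command out of the `msg="..."` field and uses Python's `shlex` to show why the mis-copied version leaves a shell waiting at a `>` continuation prompt (an unclosed double quote). The helper names are hypothetical, and the regular expression assumes the key="value" log layout shown in this issue.

```python
import re
import shlex
from typing import Optional

LOG_LINE = (
    'time="2023-07-05T10:10:10-04:00" level=info '
    'msg="Executing command: smartctl --xall --json /dev/sdg" type=metrics'
)

# Matches the quoted msg field, allowing backslash escapes such as \n inside it.
MSG_FIELD = re.compile(r'msg="((?:[^"\\]|\\.)*)"')
PREFIX = "Executing command: "


def command_from_log_line(line: str) -> Optional[str]:
    """Return the command inside msg="Executing command: ...", or None."""
    match = MSG_FIELD.search(line)
    if match is None or not match.group(1).startswith(PREFIX):
        return None
    return match.group(1)[len(PREFIX):]


def has_unbalanced_quote(command: str) -> bool:
    """True when shell-style tokenizing fails because a quote is never closed."""
    try:
        shlex.split(command)
    except ValueError:
        return True
    return False


if __name__ == "__main__":
    print(command_from_log_line(LOG_LINE))
    print(has_unbalanced_quote('smartctl --xall --json /dev/sdg" type=metrics'))
```

Copying only what `command_from_log_line` returns gives a command that can be pasted into the container shell as-is.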

boomam commented 12 months ago

> hi @boomam when running smartctl within the container manually (for testing) you should use smartctl --xall --json /dev/sdg not smartctl --xall --json /dev/sdg" type=metrics

Was just running what was in the log output ;-)

Running the new command generates an extensive output of smart stats - do you want them for diagnosis?

AnalogJ commented 12 months ago

just so I understand the issue a bit better, can you clarify a couple of things?

what is the error in the UI exactly? Can you include screenshots?
checksum errors in smartctl are usually handled correctly (ignored) as long as the JSON payload is correct (and SMART status is "passed"). Are you seeing a failed SMART attribute in the UI?
- this could be due to Backblaze analysis
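
The rule described above (tolerate a checksum error as long as the JSON payload is valid and SMART status is "passed") could be sketched roughly as follows. This is not Scrutiny's actual implementation; the function name is hypothetical, and the sample payload is a trimmed-down stand-in that assumes the `smartctl.exit_status` and `smart_status.passed` fields of smartctl's JSON output.

```python
import json

CHECKSUM_BIT = 1 << 2  # exit-status bit 2: command failure or checksum error


def smart_passed_despite_checksum(raw_json: str) -> bool:
    """True when the payload parses, SMART status is passed, and no exit-status
    bit other than bit 2 is set."""
    try:
        report = json.loads(raw_json)
    except json.JSONDecodeError:
        return False
    if not isinstance(report, dict):
        return False
    passed = report.get("smart_status", {}).get("passed") is True
    exit_status = report.get("smartctl", {}).get("exit_status", 0)
    # Mask out bit 2 and require that nothing else is flagged.
    return passed and (exit_status & ~CHECKSUM_BIT) == 0


if __name__ == "__main__":
    # Hypothetical, heavily trimmed payload for illustration only.
    sample = '{"smartctl": {"exit_status": 4}, "smart_status": {"passed": true}}'
    print(smart_passed_despite_checksum(sample))
```

Under this rule a drive in the state reported here (exit status 4, SMART passed) would not be flagged by the SMART check itself, which is why a separate analysis step is the more likely source of a "failed" label.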

boomam commented 12 months ago

> what is the error in the UI exactly? Can you include screenshots?

Attached.

> Are you seeing a failed SMART attribute in the UI? this could be due to Backblaze analysis.

Yes, most of the attributes show.
Capacity doesn't show, however, but I'm not too concerned with that right now.

Re: Backblaze -
Possibly. I know the MX500 v1 got a bad rap for failures because one of its attributes reported incorrectly; however, a few firmware updates and a new revision later, it's all good.
If Backblaze is used as the source of truth, then that could explain flagging it as failed, I guess?

Although if that is the case, I would suggest we look at changing the status from 'failed' to something that shows the drive has not failed, but is likely to, based on historical failure data.
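
A minimal sketch of that suggestion, assuming the SMART result and the failure-statistics flag are available as two separate booleans. The `DriveStatus` names and the `classify` function are hypothetical and do not reflect how Scrutiny models status today.

```python
from enum import Enum


class DriveStatus(Enum):
    PASSED = "passed"
    AT_RISK = "at risk"  # not failed, but flagged by historical failure data
    FAILED = "failed"


def classify(smart_passed: bool, flagged_by_failure_stats: bool) -> DriveStatus:
    """Keep 'failed' for drives that fail SMART; use a softer state otherwise."""
    if not smart_passed:
        return DriveStatus.FAILED
    if flagged_by_failure_stats:
        return DriveStatus.AT_RISK
    return DriveStatus.PASSED


if __name__ == "__main__":
    # The situation suspected in this issue: SMART passes, statistics flag it.
    print(classify(smart_passed=True, flagged_by_failure_stats=True).value)
```

Keeping the two inputs separate means notifications could treat "at risk" differently from an outright SMART failure.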

AnalogJ / scrutiny

[BUG] Incorrect drive report #496