[BUG] Checksum errors

tommyalatalo commented 2 years ago

Describe the bug: scrutiny 0.4.9 reports checksum errors. This happens for me on several disks across multiple hosts.

Expected behavior: No checksum errors should be reported. A couple of the disks in question are brand new and highly unlikely to have any actual error of this kind.

Logs

root@e221414acf14:/scrutiny# scrutiny-collector-metrics run
2022/06/06 10:30:32 Loading configuration file: /opt/scrutiny/config/collector.yaml

 ___   ___  ____  __  __  ____  ____  _  _  _  _
/ __) / __)(  _ \(  )(  )(_  _)(_  _)( \( )( \/ )
\__ \( (__  )   / )(__)(   )(   _)(_  )  (  \  /
(___/ \___)(_)\_)(______) (__) (____)(_)\_) (__)
AnalogJ/scrutiny/metrics                                dev-0.4.6

INFO[0000] Verifying required tools                      type=metrics
INFO[0000] Executing command: smartctl --scan -j         type=metrics
INFO[0000] Executing command: smartctl --info -j /dev/sda  type=metrics
INFO[0000] Generating WWN                                type=metrics
INFO[0000] Executing command: smartctl --info -j /dev/sdb  type=metrics
INFO[0000] Generating WWN                                type=metrics
INFO[0000] Executing command: smartctl --info -j -d nvme /dev/nvme0  type=metrics
INFO[0000] Using WWN Fallback                            type=metrics
INFO[0000] Sending detected devices to API, for filtering & validation  type=metrics
INFO[0000] Collecting smartctl results for sda           type=metrics
INFO[0000] Executing command: smartctl -x -j /dev/sda    type=metrics
ERRO[0000] smartctl returned an error code (4) while processing sda  type=metrics
ERRO[0000] smartctl detected a checksum error            type=metrics
INFO[0000] Publishing smartctl results for 0x5000c500e0ec6daa  type=metrics
INFO[0003] Collecting smartctl results for sdb           type=metrics
INFO[0003] Executing command: smartctl -x -j /dev/sdb    type=metrics
ERRO[0004] smartctl returned an error code (4) while processing sdb  type=metrics
ERRO[0004] smartctl detected a checksum error            type=metrics
INFO[0004] Publishing smartctl results for 0x5000c500e0ec8167  type=metrics
INFO[0006] Collecting smartctl results for nvme0         type=metrics
INFO[0006] Executing command: smartctl -x -j -d nvme /dev/nvme0  type=metrics
INFO[0006] Publishing smartctl results for s64bnj0r201688p  type=metrics
INFO[0006] Main: Completed                               type=metrics

I've tried setting my devices to both ata and sat types in the collector.yaml file, following the discussion in #251, but it doesn't seem to make any difference in my case.

This issue is essentially the same as #251, which was closed prematurely and no longer seems to receive replies.

AnalogJ commented 2 years ago

Unfortunately, those checksum error codes come from smartctl, not scrutiny, so it's not something I can control:

https://github.com/AnalogJ/scrutiny/blob/master/docs/TROUBLESHOOTING_DEVICE_COLLECTOR.md#exit-codes

You may want to try running the smartctl tool directly within the container, and debug that way to figure out what's going on. Another (less secure) option would be to use --privileged and -v /dev:/dev to work around device permissions when running in a container:

https://github.com/AnalogJ/scrutiny/blob/master/docs/TROUBLESHOOTING_DEVICE_COLLECTOR.md#volume-mount-all-devices-dev---privileged

Can you try those and get back to me @altosys ?

kjames2001 commented 2 years ago

I have the same issue. I tried --privileged and -v /dev:/dev and got the same result.

AnalogJ commented 2 years ago

@kjames2001 have you been able to get smartctl working successfully on your host? (not inside the scrutiny container?)

kjames2001 commented 2 years ago

it seems to be working, output is too long to fit the screen though.

tommyalatalo commented 2 years ago

Okay, I've done some testing, @AnalogJ .

I have three hosts, arch, nas and backup. On arch I have one disk failing the tests (smartctl -x -j /dev/sdc) with error 64, which indicates an actual disk failure. I can see this as possible since the disk is very old, so probably correct.

On backup I have two disks that are about two months or so old, where sda exits with 0 and sdb exits with code 4. I'm not expecting either of these disks to have actual errors, so how do I find out why I'm getting error 4? And also, since sda is exiting with 0 I don't see why scrutiny is reporting errors for it.

On nas I have an error coming from an nvme disk, which I can't see with lsblk in the container, though I've mounted it like below. I have both sys_admin and sys_raw added to the container as well. Scrutiny however reports errors for this device and seems to be able to read its size etc.

        devices = [
          {
            host_path      = "/dev/nvme0"
            container_path = "/dev/nvme0"
          },
        ]

AnalogJ commented 2 years ago

@altosys I put together a quick table mapping the smartctl exit codes to their explanations: https://github.com/AnalogJ/scrutiny/blob/master/docs/TROUBLESHOOTING_DEVICE_COLLECTOR.md#exit-codes

64 - not a catastrophic disk failure; it only means there are some errors in the SMART log (which scrutiny doesn't parse yet)
4 - usually means a command failed (could be due to permission errors), or the SMART response was corrupted.

Regarding why /dev/sda has an error in scrutiny but not from SMART: this could be related to Scrutiny's predictive analysis of your disk using Backblaze failure data.
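
The exit status that smartctl returns is a bitmask, so a value such as 4 or 64 corresponds to a single set bit, and several conditions can be combined in one number. A minimal Python sketch of decoding it follows. The bit meanings are paraphrased from the EXIT STATUS section of the smartctl man page; the names `SMARTCTL_EXIT_BITS` and `decode_exit_status` are made up for this illustration and are not part of scrutiny or smartctl.

```python
from __future__ import annotations

# Bit meanings paraphrased from the smartctl(8) man page, "EXIT STATUS" section.
# The dictionary and function names are hypothetical helpers for this sketch.
SMARTCTL_EXIT_BITS = {
    0: "command line did not parse",
    1: "device open failed, no IDENTIFY DEVICE structure returned, or device in low-power mode",
    2: "a SMART or other ATA command failed, or a SMART data structure had a checksum error",
    3: "SMART status check returned 'DISK FAILING'",
    4: "prefail attributes found at or below threshold",
    5: "SMART status OK, but attributes were at or below threshold in the past",
    6: "device error log contains records of errors",
    7: "device self-test log contains records of errors",
}


def decode_exit_status(status: int) -> list[str]:
    """Return the meaning of every bit that is set in a smartctl exit status."""
    return [
        meaning
        for bit, meaning in SMARTCTL_EXIT_BITS.items()
        if status & (1 << bit)
    ]


if __name__ == "__main__":
    # 4 and 64 are the codes discussed in this thread; 68 shows two bits combined.
    for code in (0, 4, 64, 68):
        print(code, decode_exit_status(code))
```

Read this way, exit code 4 is bit 2 on its own and exit code 64 is bit 6 on its own, which matches the two explanations given above.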

tommyalatalo commented 2 years ago

Focusing on one thing: looking at the failures on my backup host, I get error 4 from smartctl --xall --json /dev/sda both in the container (running in privileged mode) and when running the command with sudo on the host itself. How do I find out why this is not returning 0, given that scrutiny still seems to read the values?
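
One possible way to investigate a nonzero exit status is to look at the metadata smartctl places in its own JSON output. The sketch below assumes the report has a top-level `smartctl` object with an `exit_status` field and a `messages` list whose entries carry `string` and `severity`, which is how smartctl's JSON mode normally reports warnings and errors. The function name `summarize_smartctl_json` and the sample report are hypothetical and were not taken from any disk in this thread.

```python
import json


def summarize_smartctl_json(raw: str) -> dict:
    """Extract the exit status and any messages from smartctl JSON output."""
    report = json.loads(raw)
    meta = report.get("smartctl", {})
    return {
        "exit_status": meta.get("exit_status"),
        # Each message is reduced to a (severity, text) pair.
        "messages": [
            (m.get("severity"), m.get("string"))
            for m in meta.get("messages", [])
        ],
    }


# Hypothetical sample, invented for illustration only.
SAMPLE_REPORT = """
{
  "smartctl": {
    "exit_status": 4,
    "messages": [
      {"string": "hypothetical message text", "severity": "warning"}
    ]
  }
}
"""

if __name__ == "__main__":
    print(summarize_smartctl_json(SAMPLE_REPORT))
```

To use it on a real disk, the output of `smartctl --xall --json /dev/sda` could be saved to a file and its contents passed to the function. Any message listed there may point at the specific sub-command that failed.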

tommyalatalo commented 2 years ago

So I've managed to correct all errors for my disks apart from my two newest devices which are on my backup server. The disks on this host are two Seagate 2.5" mechanical drives.

I managed to get rid of the checksum error on these disks by replacing the --xall option with its expanded list of flags:

-H -i -g all -g wcreorder -c -A -f brief -l xerror,error -l xselftest,selftest -l selective -l directory -l scttemp -l devstat -l defects --json /dev/sda;

There was another flag included in --xall, -l scterc, which was the one causing the checksum error. Once I removed it from the smartctl command, I no longer got those errors.
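
The workaround described above can be sketched in Python as follows. The flag list is copied from the command in this comment, with `-l scterc` left out; the names `XALL_FLAGS_WITHOUT_SCTERC`, `build_smartctl_command`, and `run_smartctl` are hypothetical helpers, and this is not how the scrutiny collector itself builds its command.

```python
import subprocess

# Expanded replacement for --xall as quoted in this thread, without "-l scterc".
XALL_FLAGS_WITHOUT_SCTERC = [
    "-H", "-i",
    "-g", "all",
    "-g", "wcreorder",
    "-c", "-A",
    "-f", "brief",
    "-l", "xerror,error",
    "-l", "xselftest,selftest",
    "-l", "selective",
    "-l", "directory",
    "-l", "scttemp",
    "-l", "devstat",
    "-l", "defects",
]


def build_smartctl_command(device: str) -> list:
    """Build the smartctl argument list for one device, with JSON output."""
    return ["smartctl", *XALL_FLAGS_WITHOUT_SCTERC, "--json", device]


def run_smartctl(device: str):
    """Run smartctl and return (exit status, stdout). Usually needs root."""
    proc = subprocess.run(
        build_smartctl_command(device),
        capture_output=True,
        text=True,
    )
    return proc.returncode, proc.stdout
```

Calling `run_smartctl("/dev/sda")` on the affected host and comparing the returned exit status with the one from `--xall` would show whether dropping `-l scterc` is what clears bit 2 on a given disk.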

However, I'm still getting errors registered for these two disks, which is quite odd, as they are less than 3 months old (uptime is 78 days). Both of them have SMART tests failing on Seek Error Rate according to scrutiny. I read a bit about this, and it seems that Seagate may report these values differently from other manufacturers, but I'm not entirely sure about this.

In the scrutiny UI I currently have this line showing a failed test for both disks:

Status | ID       | Name            | Value | Worst | Threshold | Ideal | Failure Rate
FAILED | 7 (0x07) | Seek Error Rate | 75    | 60    | 45        |       | 20%

AnalogJ / scrutiny

[BUG] Checksum errors #284