tommyalatalo opened this issue 2 years ago
Unfortunately those checksum error codes are from `smartctl`, not Scrutiny, so it's not something I can control: https://github.com/AnalogJ/scrutiny/blob/master/docs/TROUBLESHOOTING_DEVICE_COLLECTOR.md#exit-codes

You may want to try running the `smartctl` tool directly within the container and debug that way to figure out what's going on. Another (less secure) option would be to use `--privileged` and `-v /dev:/dev` to work around device permissions when running in a container.

Can you try those and get back to me @altosys?
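If it helps, a minimal sketch of that first suggestion (the container name `scrutiny` and the device path are examples only, adjust to your setup):

```bash
# Run smartctl inside the already-running collector container:
docker exec -it scrutiny smartctl -x -j /dev/sda
echo $?   # non-zero exit statuses map to the codes linked above
```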
I have the same issue; I tried `--privileged` and `-v /dev:/dev` with the same result.
@kjames2001 have you been able to get `smartctl` working successfully on your host (not inside the Scrutiny container)?
It seems to be working; the output is too long to fit on the screen though.
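For reference, the long report can be paged or saved instead of scrolled past (device and file names are examples):

```bash
smartctl -x /dev/sda | less          # page through the report
smartctl -x -j /dev/sda > sda.json   # or keep it for later comparison
```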
Okay, I've done some testing, @AnalogJ. I have three hosts: `arch`, `nas`, and `backup`.

On `arch` I have one disk failing the tests (`smartctl -x -j /dev/sdc`) with exit code 64, which I take to indicate an actual disk failure. That seems plausible since the disk is very old, so it's probably correct.

On `backup` I have two disks that are about two months old, where `sda` exits with 0 and `sdb` exits with code 4. I'm not expecting either of these disks to have actual errors, so how do I find out why I'm getting code 4? And since `sda` exits with 0, I don't see why Scrutiny is reporting errors for it.

On `nas` I have an error coming from an NVMe disk, which I can't see with `lsblk` in the container, though I've mounted it as below. I have both `sys_admin` and `sys_rawio` added to the container as well. Scrutiny nevertheless reports errors for this device and seems to be able to read its size etc.
```hcl
devices = [
  {
    host_path      = "/dev/nvme0"
    container_path = "/dev/nvme0"
  },
]
```
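One thing worth checking here: `/dev/nvme0` is the controller's character device, so `lsblk` (which lists block devices such as `/dev/nvme0n1`) will never show it. A quick sketch for verifying the passthrough, assuming the container is named `scrutiny`:

```bash
# The controller node is a character device ("c" in the mode bits),
# which is why lsblk does not list it:
docker exec scrutiny ls -l /dev/nvme0
# Query the controller directly instead:
docker exec scrutiny smartctl -d nvme -x -j /dev/nvme0
```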
@altosys I put together a quick table mapping the `smartctl` exit codes to their explanations: https://github.com/AnalogJ/scrutiny/blob/master/docs/TROUBLESHOOTING_DEVICE_COLLECTOR.md#exit-codes

- `64` - isn't a catastrophic disk failure, just an indication that there are some errors in the SMART error log (which Scrutiny doesn't parse yet)
- `4` - usually means the command failed (possibly due to permission errors), or that the SMART response was corrupted

Regarding why `/dev/sda` has an error in Scrutiny but not from SMART: this could be related to Scrutiny's predictive analysis of your disk using Backblaze failure data.
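Since `smartctl`'s exit status is a bitmask, a single code can combine several conditions. A small sketch that decodes it, with the bit meanings paraphrased from the smartmontools man page (device path is an example):

```bash
#!/usr/bin/env bash
# Decode smartctl's exit status bit by bit.
smartctl -x -j /dev/sda > /dev/null
rc=$?
(( rc & 1 ))   && echo "bit 0: command line did not parse"
(( rc & 2 ))   && echo "bit 1: device open failed or identification error"
(( rc & 4 ))   && echo "bit 2: a SMART/ATA command failed, or a checksum error in a data structure"
(( rc & 8 ))   && echo "bit 3: SMART status check reported 'disk failing'"
(( rc & 16 ))  && echo "bit 4: prefail attributes at or below threshold"
(( rc & 32 ))  && echo "bit 5: attributes were at or below threshold at some time in the past"
(( rc & 64 ))  && echo "bit 6: device error log contains records of errors"
(( rc & 128 )) && echo "bit 7: self-test log contains records of errors"
```

The codes 4 and 64 from this thread are single bits (bit 2 and bit 6 respectively), matching the explanations above.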
So, focusing on one thing: looking at the failures on my `backup` host, I get exit code 4 from `smartctl --xall --json /dev/sda` both in the container (running in privileged mode) and when running the command with `sudo` on the host itself. How do I find out why this is not returning 0, given that Scrutiny still seems to read the values?
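One place to look: with `--json`, smartctl records the reason for a non-zero exit status in the output itself. A sketch using `jq`:

```bash
# The smartctl object carries the exit status and any warning/error messages:
smartctl --xall --json /dev/sda | jq '.smartctl.exit_status, .smartctl.messages'
```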
So I've managed to resolve all errors for my disks apart from the two newest devices, which are on my `backup` server. The disks on this host are two Seagate 2.5" mechanical drives.

I managed to get rid of the checksum error on these disks by replacing the `--xall` option with its full expansion:

```bash
smartctl -H -i -g all -g wcreorder -c -A -f brief -l xerror,error -l xselftest,selftest -l selective -l directory -l scttemp -l devstat -l defects --json /dev/sda
```

There was one more flag included in `--xall`, namely `-l scterc`; this was the one causing the checksum error, and once I removed it from the `smartctl` command I am no longer getting those errors.
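To pin down which log section trips the error without hand-expanding `--xall`, each section can also be queried on its own (a sketch; `/dev/sda` is an example):

```bash
# Exit status 4 (bit 2) on a single section points at the culprit:
for section in error xerror selftest xselftest selective directory scttemp scterc devstat defects; do
  smartctl -l "$section" /dev/sda > /dev/null 2>&1
  echo "$section -> exit $?"
done
```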
However, I'm still getting errors registered for these two disks, which is quite odd as they are less than 3 months old (uptime is 78 days), yet both of them have SMART tests failing on `Seek Error Rate` according to Scrutiny. I've read a bit about this and it seems Seagate may report these errors differently than other manufacturers, but I'm not entirely sure about that.

In the Scrutiny UI I currently have this line showing a failed test for both disks:
| Status | ID | Name | Value | Worst | Threshold | Ideal | Failure Rate |
|--------|----|------|-------|-------|-----------|-------|--------------|
| FAILED | 7 (0x07) | Seek Error Rate | 75 | 60 | 45 | | 20% |
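On the Seagate point: a commonly cited explanation (not verified here) is that these drives pack two counters into the 48-bit raw value of `Seek Error Rate`, so a huge raw number does not necessarily mean actual errors. A sketch for splitting such a raw value, using a made-up number:

```bash
# Hypothetical RAW_VALUE from `smartctl -A`; on many Seagate drives the
# upper 16 bits are said to hold the error count and the lower 32 bits
# the total number of seeks.
raw=81234567   # example only
echo "errors=$(( raw >> 32 )) seeks=$(( raw & 0xFFFFFFFF ))"
```

Note that the FAILED row above is based on the normalized value (75) rather than the raw counter, so this decoding only helps sanity-check the raw column.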
**Describe the bug**
Scrutiny 0.4.9 reports checksum errors; this happens for me on several disks across multiple hosts.

**Expected behavior**
No checksum errors should be reported; a couple of the disks in question are brand new and highly unlikely to have any actual error of this kind.

**Logs**
I've tried setting my devices to both `ata` and `sat` types in the collector.yaml file according to the discussion in #251; it doesn't seem to make any difference in my case. This issue is essentially the same as #251, which was closed prematurely and has received no further replies.