Open tonyzzz321 opened 11 months ago
This commit looks to try to fix it, but perhaps it's what broke it? I'm getting the same error on my end now.
I also read here that this project is more or less unmaintained, so if you want it fixed, you might have to submit your own PR or fork it.
Changing the thresholds didn't cause any problems. More likely what happened is you had timeouts before, but they just weren't >5 or >7.5s. Then when you got those longer timeouts, the incorrectly decoded value went above the thresholds, causing the error. I'm seeing the same thing on one of my drives so I might tackle this when I have some time.
I'm also seeing some sector errors on a drive with <1yr runtime. I'm wondering if there's some decoding error on those too? or maybe I'm just on the front end of the bathtub curve...
To answer your two questions:
1. No, this was not the behavior before. The Command Timeouts were warnings, not errors. They only recently started turning into errors and marking the drive as failed. Your drive, in the screenshot, has reported 1 command timeout in 65 616 operations.
2. Your drive is failing, and you should replace it. Those values do not need additional parsing to be accurate; that's only for attributes 1, 7, 188, and 195.
Thanks. For point 1, the Command Timeout was giving me an error with a raw value of ~8 before I submitted the threshold change. So the behavior must've been changed in between. Regardless, with the decoding corrected, this should go away.
Also - where are you seeing the total number of operations?
You can also use the `-v` flag to tell smartctl to parse the value as three raw 16-bit values to get an accurate result:

```
sudo smartctl -xv 188,raw16 /path/to/disk
```
@goproslowyo Good tip there! I used this to customise the `metrics_smart_args` command for my Seagate drive:

```yaml
- device: /dev/sde
  type: 'sat'
  commands:
    metrics_smart_args: '-xv 188,raw16 --xall --json -T permissive'
```
Which produces the JSON output:

```json
"raw": {
  "value": 8590065666,
  "string": "2 2 2"
}
```

However, it would seem Scrutiny still parses the raw value.
Yep, I tried to do exactly the same thing @itsthejb, but it seems something interprets the value incorrectly anyway.
@AnalogJ - I think this would be a pretty quick and easy fix for someone who actually knows Go... can you take a look? I'm also happy to write the code and test it if you tell me what to do.
In the meantime you can set "Device status - thresholds" to just "SMART" instead of "Scrutiny" or "both", to ignore Scrutiny's interpretation. Note that this will ignore it for all attributes...
@zuavra This is what I have done so far, but I would love to revert back to "both" when this is fixed.
I took a quick look at the code, and this should be a super easy fix. Just need to add a Transform() function to the ata attribute here that looks at the string value. If it has 3 parts, then you just grab the last one. smartctl itself already sets `-v 188,raw16` for many Seagate drives.
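As a rough illustration of that idea, such a transform might look like the following sketch (`transformCommandTimeout` and its signature are hypothetical, not Scrutiny's actual API):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// transformCommandTimeout sketches the Transform() idea described above.
// If smartctl already decoded the raw value into three space-separated
// counters (because -v 188,raw16 was in effect), grab the last part as
// suggested; otherwise fall back to the packed raw value.
func transformCommandTimeout(rawValue int64, rawString string) int64 {
	parts := strings.Fields(rawString)
	if len(parts) == 3 {
		if v, err := strconv.ParseInt(parts[2], 10, 64); err == nil {
			return v
		}
	}
	return rawValue
}

func main() {
	fmt.Println(transformCommandTimeout(4295032912, "1 1 80")) // prints: 80
	fmt.Println(transformCommandTimeout(42, "42"))             // prints: 42
}
```

Non-Seagate drives whose raw string is a single number would pass through unchanged.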
I’d love to see it too! Since Scrutiny output has convinced me that one of my other drives is definitely expiring, be nice to see green on the seagate which is in fact still ok
@itsthejb @firasdib @tonyzzz321 - I've fixed this in my fork: https://github.com/kaysond/scrutiny/tree/master
Can you please help test this?
@kaysond Looking good here! Good job
Hey everyone, thanks for collaborating and figuring this one out. I don't have any Seagate drives affected by this issue, so I was depending on the community to help figure out what's going on -- and you delivered!
I'll be merging the PR momentarily
I've updated Scrutiny web, as well as the manual collector install I have on TrueNAS that has some drives affected by this issue, and Scrutiny is still reporting the raw value and showing 188 as failed. The raw value on two of my drives is 8590065666.
Or are these drives still failing?
Same here, still getting the error with the omnibus docker image (sha256:d45a226d02eb38f82574a552299eb3440c3f398674e92d596e0051e85b2bab48).
> Or are these drives still failing?
There doesn't seem to be any transformation done between the raw value and the value marked "Scrutiny".
With the latest version of the beta:omnibus image, the attribute shows as a warning rather than an error, but there still doesn't seem to be any transformation from the raw value, and it still causes the overall drive status to be "failed".
I also have several Seagate drives with a high value for attribute 188. In my case this was caused by a problem with the HBA card when the server was built; the card was replaced and the problem was eliminated. The drives are functional, but the attribute is still high on the drives that were running and being tested while the HBA problem occurred (example value: 17180131333).
It seems that you should simply monitor the rate of increase of this attribute; if it remains stagnant, the problem should be considered solved.
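That "watch the rate of increase" check can be sketched as a trivial comparison between two successive readings of the counter (`timeoutDelta` is a hypothetical helper for illustration, not part of Scrutiny):

```go
package main

import "fmt"

// timeoutDelta compares the command-timeout counter between two collector
// runs. A zero delta suggests the old errors were a one-off event (such as
// a faulty HBA) rather than an ongoing drive problem.
func timeoutDelta(previous, current uint64) (delta uint64, stagnant bool) {
	delta = current - previous
	return delta, delta == 0
}

func main() {
	// Counter stuck at the same value across two scans: stagnant.
	fmt.Println(timeoutDelta(17180131333, 17180131333)) // prints: 0 true
}
```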
re-opening this since there still seems to be a problem.
One thing to note: beta-omnibus is 12 commits behind main. The "fix" for this issue should be in main already; I'll be updating beta momentarily to alleviate any confusion.
just to confirm, @Brandoskey @zuavra are you running the scrutiny collector with a config file containing:
```yaml
- device: /dev/sd[X]
  type: 'sat'
  commands:
    metrics_smart_args: '-xv 188,raw16 --xall --json -T permissive'
```
If you just run smartctl manually on the drive, what does it show for attribute 188? If it shows something like `1 2 30`, then it should get parsed correctly. If not, you'll need to add the config file as above.
keep in mind there's a raw value and a raw string value, and they may not be the same 😵💫
> keep in mind there's a raw value and a raw string value, and they may not be the same 😵💫
Yes. If you just run the command, it shows the string value:
And the json gives both:
```json
{
  "id": 188,
  "name": "Command_Timeout",
  "value": 100,
  "worst": 43,
  "thresh": 0,
  "when_failed": "",
  "flags": {
    "value": 50,
    "string": "-O--CK ",
    "prefailure": false,
    "updated_online": true,
    "performance": false,
    "error_rate": false,
    "event_count": true,
    "auto_keep": true
  },
  "raw": {
    "value": 4295032912,
    "string": "1 1 80"
  }
},
```
> just to confirm, @Brandoskey @zuavra are you running the scrutiny collector with a config file containing:
> - device: /dev/sd[X] type: 'sat' commands: metrics_smart_args: '-xv 188,raw16 --xall --json -T permissive'
I do not run with a config file. I just run the command with host-id and api-endpoint params set. This is on TrueNAS so I was attempting to change as little of the underlying system as possible.
I take it a config and the arguments you added are required for the fix to work?
> just to confirm, @Brandoskey @zuavra are you running the scrutiny collector with a config file containing:
> - device: /dev/sd[X] type: 'sat' commands: metrics_smart_args: '-xv 188,raw16 --xall --json -T permissive'
Thank you, that's what was missing, I can now see the transformed values.
By the way, since device letters change on reboot... should I add these commands to all drives regardless of make, and rely on the fact that non-Seagate drives don't have attribute 188?
Also, I notice that the attribute is still marked as "warning" (for Seagate drives) even when the raw value is zero.
After adding a collector.yaml with `metrics_smart_args: '-xv 188,raw16 --xall --json -T permissive'`, it does indeed work. Is there no way to bake this in, though? I have a lot of Seagate drives on a lot of machines; this could get very tedious.
I did find that adding the arguments globally to all drives didn't seem to break anything on my non-Seagate drives.
Edit: after setting alerts back to "both", I also see the drives are still marked as failed — basically everything @zuavra reported.
@zuavra - you're seeing it as a warning because you're right on the edge of the arbitrary threshold: https://github.com/AnalogJ/scrutiny/blob/c3a0fb7fb526d3e74218a5b04e9c6685007411bf/webapp/backend/pkg/thresholds/ata_attribute_metadata.go#L696-L708
The problem is that the Backblaze data shows a 2% failure rate for drives with 0 command timeouts, but >10% for anything between 1 and 13 billion. I picked 100 arbitrarily, because a failure rate of 10% or higher causes Scrutiny to consider it an error.
I think Backblaze has their data divided the way they do because the high end is so high. The correct thing to do here is either analyze the raw data, or email them to find out exactly how many command timeouts correspond to a 10% failure rate. With that number we can easily update the thresholds.
> After adding collector.yaml with metrics_smart_args: '-xv 188,raw16 --xall --json -T permissive' it does indeed work. Is there no way to bake this in though? I have a lot of seagate drives on a lot of machines, this could get very tedious. I did find that adding the arguments globally to all drives didn't seem to break anything on my non seagate drives.
> Edit: after setting alerts back to both, I also see the drives are still marked as failed, basically everything @zuavra reported
@Brandoskey - I think the right thing to do is raise this as an issue with smartmontools. They already have some code to enable this automatically for Seagate drives, but for some reason it's not catching your drives: https://github.com/smartmontools/smartmontools/blob/a03301953f292c54116642c683cd8e4fcd43f0b6/smartmontools/drivedb.h#L4256
If you tell them your drive model and firmware, they should be able to add it to the list; then Scrutiny can use the latest release of smartmontools once it's added.
@kaysond Yes, it seems that the 1-13G interval is statistically significant and distinct enough from 13G-26G and the other slices.
It's all for the best though, I wouldn't have paid attention to 188 if it weren't for Scrutiny!
One more question, can Scrutiny plot 188 on the graph or does it only plot temperature?
So I just deployed this via docker across my systems, and sure enough 3 of my drives on one system are showing these errors. Are these all failing, or is it still part of the bug? I have a few systems with Seagates and the others are fine, so I'm wondering if it is my 3 drives.
Pastebin dumps of each below:
https://paste.debian.net/hidden/020d1648/ https://paste.debian.net/hidden/441bee49/ https://paste.debian.net/hidden/e70718aa/
Thanks for the help!
Can you share a screenshot? Are you using the latest image?
@pyrodex - not showing up. I think you can't do images via email
> So I just deployed this via docker across my systems and sure enough 3 of my drives on one system are showing these errors. Are these all failing or is it still part of the bug? I have a few systems with Seagates and the others are fine so wondering if it is my 3 drives. Pastebin dumps of each below: https://paste.debian.net/hidden/020d1648/ https://paste.debian.net/hidden/441bee49/ https://paste.debian.net/hidden/e70718aa/ Thanks for the help!
> Can you share a screenshot? Are you using the latest image?
I am using the master collector image, and here is the screenshot:
I was able to fix my drives by passing the collect.yaml through to the container with the fixed settings suggested above.
> I was able to fix my drives with the collect.yaml passed through to the container with the fixed settings above suggested.
Could you share how you did that? I am using Scrutiny in the hub-and-spoke model, where the hub is in docker and the spokes are on different servers. What do I need to do to get this reported properly? I do not have any files on the spokes except the script that runs on the crontab.
I got the same issue on a Seagate ST4000VN008, which really scared me. 😂
I got three red alerts 😂
Maybe I will try to read the source code this weekend.
> I got the same issue on Seagate ST4000VN008 which really scared me.😂 I got three red alerts😂 Maybe I will try read source code in this weekend.
You just have 1 timeout error that was >5s. If you use the collector yaml mentioned above, it will parse correctly.
> I was able to fix my drives with the collect.yaml passed through to the container with the fixed settings above suggested.
> Could you share how you did that? I am using scrutiny in the hub-and-spoke model where the hub is in docker and the spokes on different servers. What do I need to do to get this reported properly? I do not have any files on the spokes except the script that runs on the crontab.
So I run the collectors as containers too, since most of my hosts that aren't the hub are physical. Here is that compose:
```yaml
version: "3.0"
services:
  collector:
    container_name: scrutiny-collector
    image: 'ghcr.io/analogj/scrutiny:master-collector'
    cap_add:
      - SYS_RAWIO
      - SYS_ADMIN
    volumes:
      - '/run/udev:/run/udev:ro'
      - '/apps/scrutiny-collector:/opt/scrutiny/config'
    environment:
      COLLECTOR_API_ENDPOINT: http://<ip>:8080
      COLLECTOR_RUN_STARTUP: true
    restart: unless-stopped
```
Then I also created the mapped volume to /apps/scrutiny-collector, where I placed the collector.yaml, since I couldn't find a way to do the customization via the container's ENV variables.
Here is what my collector.yaml looks like:
```yaml
version: 1
host:
  id: "hostname"
devices:
  - device: /dev/sda
    type: 'sat'
    commands:
      metrics_smart_args: '-xv 188,raw16 --xall --json -T permissive'
  - device: /dev/sdb
    type: 'sat'
    commands:
      metrics_smart_args: '-xv 188,raw16 --xall --json -T permissive'
  - device: /dev/sdc
    type: 'sat'
    commands:
      metrics_smart_args: '-xv 188,raw16 --xall --json -T permissive'
api:
  endpoint: 'http://<ip>:8080'
```
Works! Thanks a lot. I had to remove the disk from Scrutiny once, but now it shows correctly. Appreciate your help.
Since the manual install doesn't persist on TrueNAS through upgrades, I've switched to the TrueCharts install. I have Scrutiny reporting to my main docker instance; however, I can't find a way to pass the arguments to report 188 correctly.
I tried mounting the directory my collector.yaml is in to /opt/scrutiny/config, but the container refuses to start. Also, the container is not technically a collector to begin with, so I doubt this would work.
Is there a way to get these arguments implemented in the TrueCharts version that I can't figure out?
Same here with a WDC WUH721414ALE6L4. There are 4 HDDs with two different firmware versions and different results. collector.yaml changed:
```yaml
devices:
  - device: /dev/sda
    type: 'sat'
    commands:
      metrics_smart_args: '-xv 188,raw16 --xall --json -T permissive'
  - device: /dev/sdb
    type: 'sat'
    commands:
      metrics_smart_args: '-xv 188,raw16 --xall --json -T permissive'
```
I have HDSentinel report:
I don't think this should be fixed in Scrutiny but in smartmontools. If you have a similar issue, it should be reported and eventually fixed inside drivedb.h. This can take a while, though.
So here is how to change the default configuration for your drive in smartmontools and apply that to Scrutiny as well. This means both `smartctl -a /dev/sda` on your system and Scrutiny inside the container will give you correct results.
The file to edit is `/etc/smartmontools/smart_drivedb.h`:

```c
// Get the default path for this file on your system: smartctl -h | grep 'default is'
// Get the device model: smartctl -i /dev/sda
//   (also shows whether the device is in the database, which can be used to check that your custom config is working)
// Find your drive's entry in the database, if any: rg -C 6 '$DRIVE_MODEL' /usr/share/smartmontools/drivedb.h
// If you doubt whether a value from a Seagate drive is correct, compare it with:
//   openSeaChest_SMART -d /dev/sda --smartAttributes analyzed
// The first line is the family, the second line the model.
// This entry uses a regex to match several models, but you could replace it with your specific model.
// I'm only adding -v 1,7,188; the other configs were already in the built-in drivedb.
{ "Seagate Exos X16", // tested with ST10000NM001G-2MW103/SN02,
  // ST14000NM001G-2KJ103/SN02, ST16000NM001G-2KK103/SN02, ST16000NM001G-2KK103/SN03
  "ST1[0246]000NM00[13][GJ]-.*",
  "", "",
  "-v 1,raw24/raw32,Raw_Read_Error_Rate "
  "-v 7,raw24/raw32,Seek_Error_Rate "
  "-v 18,raw48,Head_Health "
  "-v 188,raw16,Command_Timeout "
  "-v 200,raw48,Pressure_Limit "
  "-v 240,msec24hour32"
},
```
Now `smartctl -a /dev/sda | grep 188` should interpret the value correctly; in my case:

```
188 Command_Timeout 0x0032 100 099 000 Old_age Always - 1 1 4
```

If you have a Seagate drive you should see three values at the end, indicating the different types of command timeouts.
Now let's fix it in Scrutiny (I'm using a podman quadlet file, but it should be easy to translate to docker):
```ini
# Make scrutiny use your custom drivedb.h.
# This can be used to fix drive configs system-wide instead of drive by drive.
# Get the default path used inside the container:
#   podman exec -it scrutiny /bin/sh   # AFAIK it's the same command on docker (my container is named scrutiny; change it if necessary)
#   smartctl -h | grep 'default is'
#   /etc/smart_drivedb.h
# Mount the path from your system onto the container's default path.
# Remove the `z,` at the end if you don't use selinux.
Volume=/etc/smartmontools/smart_drivedb.h:/etc/smart_drivedb.h:z,ro
```
If you restart the container now, the current value will be correct, but it will still remember the old incorrect values and mark the drive as failed. If you don't want to lose the entire history for your drive, it might be possible to edit the container's influxdb. I didn't bother; just remove the drive using the web UI and restart the container.
> @zuavra - you're seeing it as a warning because you're right on the edge of the arbitrary threshold:
> The problem is that the backblaze data shows a 2% failure rate for drives with 0 command timeouts, but >10% for anything between 1 and 13 billion. I picked 100 arbitrarily because 10% or higher causes scrutiny to consider it as an error.
> I think backblaze has their data divided the way they do because the high end is so high. The correct thing to do here is either analyze the raw data, or email them to find out exactly how many command timeouts corresponds to 10% failure rate. With that number we can easily update the thresholds.
So for this case, as long as my raw value doesn't change and stays 0 (or lower than 10), it should be fine? I have 7 different drives which all show the same warning.
I got the same problem:
Based on this, it's just 2 errors.
Additionally, `smartctl -xv 188,raw16 /dev/sdc | grep 188` gives me:

```
188 Command_Timeout -O--CK 100 099 000 - 2 2 2
```
Are there any updates to this, or should I go with @unai-ndz 's solution?
> I got the same Problem:
> Based on this it's just 2 errors.
> Additionally, smartctl -xv 188,raw16 /dev/sdc | grep 188 gives me: 188 Command_Timeout -O--CK 100 099 000 - 2 2 2
> Are there any updates to this, or should I go with @unai-ndz 's solution?
I'd say this is working as intended. If you want to have the formatting applied automatically, you should submit an issue/pull request to the smartmontools repo.
Describe the bug
I have a couple of Seagate drives showing a huge raw value for 188 Command Timeout, and it is marked as failed in Scrutiny. Please see the screenshot below.
Seagate drives use this field's raw value to represent a combination of 3 integers (total command timeouts, commands completed between 5s and 7.5s, commands completed >7.5s). Therefore, the raw value needs to be decoded before being used to determine the drive's failure.
In my case, the raw value of "4295032833" represents 1 command timeout, 1 command >5s and <7.5s, and 1 command >7.5s. This does not cross the threshold to be considered a failure.
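That decoding can be sketched in a few lines of Go (a standalone illustration, not Scrutiny's actual code; it assumes the high 16-bit word is printed first, which matches smartctl's "2 2 2" string for raw value 8590065666):

```go
package main

import "fmt"

// decodeRaw16 splits the packed 48-bit raw value of attribute 188 into
// three 16-bit counters: total command timeouts, commands completed
// between 5s and 7.5s, and commands completed >7.5s.
func decodeRaw16(raw uint64) (timeouts, over5s, over7s5 uint16) {
	return uint16(raw >> 32), uint16(raw >> 16), uint16(raw)
}

func main() {
	fmt.Println(decodeRaw16(4295032833)) // prints: 1 1 1
}
```

So 4295032833 (0x100010001) decodes to the three small counters above, rather than a multi-billion timeout count.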
Please see related answer at https://superuser.com/a/1747851 and Seagate SMART Attribute Spec documentation.
Expected behavior
Raw value should be decoded before being used to determine the drive's failure.
Screenshots