AnalogJ / scrutiny

Hard Drive S.M.A.R.T Monitoring, Historical Trends & Real World Failure Thresholds
MIT License

[BUG] Seagate Drive Command Timeout with Huge Raw Value #522

Open tonyzzz321 opened 11 months ago

tonyzzz321 commented 11 months ago

Describe the bug: I have a couple of Seagate drives showing a huge raw value for 188 Command Timeout, and it is marked as failed in Scrutiny. Please see the screenshot below.

Seagate drives use this field's raw value to represent a combination of three integers (total command timeouts, commands completed between 5s and 7.5s, commands completed >7.5s). Therefore, the raw value needs to be decoded before being used to determine the drive's failure status.

In my case, the raw value of "4295032833" represents 1 command timeout, 1 command >5s and <7.5s, and 1 command >7.5s. This does not cross the threshold to be considered a failure.
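
To illustrate the encoding, here is a minimal Go sketch (not part of Scrutiny) that splits the 48-bit raw value into its three 16-bit counters, assuming the low word holds the total timeout count as described in the answer linked below. For 4295032833 (hex 0x100010001) every counter is 1.

package main

import "fmt"

// decodeSeagate188 splits the 48-bit raw value of attribute 188 into the
// three 16-bit counters Seagate packs into it (assumed layout: low word is
// the total number of command timeouts).
func decodeSeagate188(raw uint64) (totalTimeouts, over5s, over7s5 uint16) {
    totalTimeouts = uint16(raw & 0xFFFF)
    over5s = uint16((raw >> 16) & 0xFFFF)
    over7s5 = uint16((raw >> 32) & 0xFFFF)
    return
}

func main() {
    fmt.Println(decodeSeagate188(4295032833)) // 1 1 1
}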

Please see related answer at https://superuser.com/a/1747851 and Seagate SMART Attribute Spec documentation.

Expected behavior: The raw value should be decoded before being used to determine the drive's failure status.

Screenshots: [screenshots]

firasdib commented 11 months ago

This commit looks to try to fix it, but perhaps it's what broke it? I'm getting the same error on my end now.

I also read here that this project is more or less unmaintained, so if you want it fixed, you might have to submit your own PR or fork it.

kaysond commented 11 months ago

This commit looks to try to fix it, but perhaps it's what broke it? I'm getting the same error on my end now.

I also read here that this project is more or less unmaintained, so if you want it fixed, you might have to submit your own PR or fork it.

Changing the thresholds didn't cause any problems. More likely, what happened is that you had timeouts before, but they just weren't >5s or >7.5s. Then when you got those longer timeouts, the incorrectly decoded value went above the thresholds, causing the error. I'm seeing the same thing on one of my drives, so I might tackle this when I have some time.

kaysond commented 11 months ago

I'm also seeing some sector errors on a drive with <1yr runtime. I'm wondering if there's some decoding error on those too? Or maybe I'm just on the front end of the bathtub curve...

[screenshot]

firasdib commented 11 months ago

To answer your two questions:

  1. No, this was not the behavior before. The Command Timeouts were warnings, not errors. They only recently started turning into errors and marking the drive as failed. Your drive, in the screenshot, has reported 1 command timeout in 65,616 operations.
  2. Your drive is failing, and you should replace it. Those values do not need additional parsing to be accurate; that's only needed for attributes 1, 7, 188, and 195.
kaysond commented 11 months ago

To answer your two questions:

1. No, this was not the behavior before. The Command Timeouts were warnings, not errors. They only recently started turning into errors and marking the drive as failed. Your drive, in the screenshot, has reported 1 command timeout in 65,616 operations.

2. Your drive is failing, and you should replace it. Those values do not need additional parsing to be accurate; that's only needed for attributes 1, 7, 188, and 195.

Thanks. For point 1, the Command Timeout was giving me an error with a raw value of ~8 before I submitted the threshold change. So the behavior must've been changed in between. Regardless, with the decoding corrected, this should go away.

kaysond commented 11 months ago

Also - where are you seeing the total number of operations?

firasdib commented 11 months ago

I used this: https://www.disktuna.com/big-scary-raw-s-m-a-r-t-values-arent-always-bad-news/#21475164165

goproslowyo commented 11 months ago

You can also use the -v flag to tell smartctl to parse the value as three raw 16-bit values to get an accurate result:

sudo smartctl -xv 188,raw16 /path/to/disk

itsthejb commented 11 months ago

@goproslowyo Good tip there! I used this to customise the metrics_smart_args command for my Seagate drive:

 - device: /dev/sde
   type: 'sat'
   commands:
      metrics_smart_args: '-xv 188,raw16 --xall --json -T permissive'

Which produces JSON output like:

        "raw": {
          "value": 8590065666,
          "string": "2 2 2"
        }

However, Scrutiny still seems to parse the raw numeric value.

goproslowyo commented 11 months ago

[screenshot]

Yep, I tried to do exactly the same thing, @itsthejb, but it seems something interprets the value incorrectly anyway.

kaysond commented 11 months ago

@AnalogJ - I think this would be a pretty quick and easy fix for someone who actually knows Go... can you take a look? I'm also happy to write the code and test it if you tell me what to do.

zuavra commented 11 months ago

In the meantime you can set "Device status - thresholds" to just "SMART" instead of "Scrutiny" or "both", to ignore Scrutiny's interpretation. Note that this will ignore it for all attributes...

firasdib commented 11 months ago

@zuavra This is what I have done so far, but I would love to revert back to "both" when this is fixed.

kaysond commented 11 months ago

I took a quick look at the code, and this should be a super easy fix. We just need to add a Transform() function to the ATA attribute here that looks at the string value: if it has three parts, then you just grab the last one. smartctl itself already sets -v 188,raw16 for many Seagate drives.

https://github.com/AnalogJ/scrutiny/blob/4b1d9dc2d3f5388440a6746c3bdce2b8e2bee91e/webapp/backend/pkg/thresholds/ata_attribute_metadata.go#L662-L669

https://github.com/smartmontools/smartmontools/blob/6b9ed03b9e7c448e41755d484acaabe5db685254/smartmontools/drivedb.h#L4261
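
For anyone curious, here is a rough, hypothetical sketch in Go of that idea (the function name and signature are mine, not Scrutiny's actual Transform API): if smartctl's raw string has three space-separated parts, keep only the last one, otherwise fall back to the numeric raw value.

package main

import (
    "fmt"
    "strconv"
    "strings"
)

// transformCommandTimeout is a hypothetical sketch, not Scrutiny's real
// Transform signature: when smartctl reports attribute 188 as three
// space-separated words (e.g. "1 1 80"), keep only the last word (the total
// number of command timeouts); otherwise return the numeric raw value.
func transformCommandTimeout(rawValue int64, rawString string) int64 {
    parts := strings.Fields(rawString)
    if len(parts) == 3 {
        if v, err := strconv.ParseInt(parts[2], 10, 64); err == nil {
            return v
        }
    }
    return rawValue
}

func main() {
    fmt.Println(transformCommandTimeout(4295032912, "1 1 80")) // 80
    fmt.Println(transformCommandTimeout(8, "8"))               // 8
}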

itsthejb commented 11 months ago

I'd love to see it too! Since Scrutiny's output has convinced me that one of my other drives is definitely expiring, it would be nice to see green on the Seagate, which is in fact still OK.

kaysond commented 11 months ago

@itsthejb @firasdib @tonyzzz321 - I've fixed this in my fork: https://github.com/kaysond/scrutiny/tree/master

Can you please help test this?

itsthejb commented 11 months ago

@kaysond Looking good here! Good job

[screenshot]
AnalogJ commented 11 months ago

Hey everyone, thanks for collaborating and figuring this one out. I don't have any Seagate drives affected by this issue, so I was depending on the community to help figure out what's going on -- and you delivered!

I'll be merging the PR momentarily

Brandoskey commented 11 months ago

I've updated Scrutiny web as well as the manual collector install I have on TrueNAS (which has some drives affected by this issue), and Scrutiny is still reporting the raw value and showing 188 as failed. The raw values on two of my drives are 8590065666 and 16975535224977369891572488351986.

Or are these drives still failing?

zuavra commented 11 months ago

Same here, still getting the error for docker image omnibus
sha256:d45a226d02eb38f82574a552299eb3440c3f398674e92d596e0051e85b2bab48

[screenshot]

zuavra commented 11 months ago

Or are these drives still failing?

There doesn't seem to be any transformation done between the raw value and the value marked "Scrutiny".

zuavra commented 11 months ago

With the latest version of the beta:omnibus image, the attribute shows as a warning rather than an error, but there still doesn't seem to be any transformation from the raw value, and it still causes the overall drive status to be "failed".

[screenshot]

SaraDark commented 11 months ago

I also have several Seagate drives with a high value for attribute 188. This was caused by a problem with the HBA card when the server was built; the card was replaced and the problem was eliminated. The drives are functional, but the value is still high on the drives that were launched and tested while the HBA card problem was occurring. Example value: 17180131333.

It seems that you should simply monitor the rate of increase of this value; if it remains stagnant, the problem should be considered solved.

AnalogJ commented 11 months ago

Re-opening this since there still seems to be a problem.

One thing to note: beta-omnibus is 12 commits behind main. The "fix" for this issue should already be in main; I'll be updating beta momentarily to alleviate any confusion.

AnalogJ commented 11 months ago

just to confirm, @Brandoskey @zuavra are you running the scrutiny collector with a config file containing:

 - device: /dev/sd[X]
   type: 'sat'
   commands:
      metrics_smart_args: '-xv 188,raw16 --xall --json -T permissive'
kaysond commented 11 months ago

If you just run smartctl manually on the drive, what does it show for attribute 188? If it shows something like 1 2 30 then it should get parsed correctly. If not, you'll need to add the config file as above.

AnalogJ commented 11 months ago

keep in mind there's a raw value and a raw string value, and they may not be the same 😵‍💫

kaysond commented 11 months ago

keep in mind there's a raw value and a raw string value, and they may not be the same 😵‍💫

Yes. If you just run the command, it shows the string value: [screenshot]

And the JSON gives both:

      {
        "id": 188,
        "name": "Command_Timeout",
        "value": 100,
        "worst": 43,
        "thresh": 0,
        "when_failed": "",
        "flags": {
          "value": 50,
          "string": "-O--CK ",
          "prefailure": false,
          "updated_online": true,
          "performance": false,
          "error_rate": false,
          "event_count": true,
          "auto_keep": true
        },
        "raw": {
          "value": 4295032912,
          "string": "1 1 80"
        }
      },
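
(For reference, 4295032912 is hex 0x100010050, i.e. the three 16-bit words 1, 1, and 80, which matches the "1 1 80" shown in the string field.)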
Brandoskey commented 11 months ago

just to confirm, @Brandoskey @zuavra are you running the scrutiny collector with a config file containing:

 - device: /dev/sd[X]
   type: 'sat'
   commands:
      metrics_smart_args: '-xv 188,raw16 --xall --json -T permissive'

I do not run with a config file. I just run the command with host-id and api-endpoint params set. This is on TrueNAS so I was attempting to change as little of the underlying system as possible.

I take it a config and the arguments you added are required for the fix to work?

zuavra commented 11 months ago

just to confirm, @Brandoskey @zuavra are you running the scrutiny collector with a config file containing:

 - device: /dev/sd[X]
   type: 'sat'
   commands:
      metrics_smart_args: '-xv 188,raw16 --xall --json -T permissive'

Thank you, that's what was missing. I can now see the transformed values.

By the way, since device letters change on reboot... should I add these commands to all drives regardless of make, and rely on the fact that non-Seagate drives don't have attribute 188?

Also, I notice that the attribute is still marked as "warning" (for Seagate drives) even when the raw value is zero. [screenshot]

Brandoskey commented 11 months ago

After adding a collector.yaml with metrics_smart_args: '-xv 188,raw16 --xall --json -T permissive' it does indeed work. Is there no way to bake this in, though? I have a lot of Seagate drives on a lot of machines; this could get very tedious.

I did find that adding the arguments globally to all drives didn't seem to break anything on my non-Seagate drives.

Edit: after setting alerts back to "both", I also see the drives are still marked as failed; basically everything @zuavra reported.

kaysond commented 11 months ago

@zuavra - you're seeing it as a warning because you're right on the edge of the arbitrary threshold: https://github.com/AnalogJ/scrutiny/blob/c3a0fb7fb526d3e74218a5b04e9c6685007411bf/webapp/backend/pkg/thresholds/ata_attribute_metadata.go#L696-L708

The problem is that the backblaze data shows a 2% failure rate for drives with 0 command timeouts, but >10% for anything between 1 and 13 billion. I picked 100 arbitrarily because 10% or higher causes scrutiny to consider it as an error:

https://github.com/AnalogJ/scrutiny/blob/c3a0fb7fb526d3e74218a5b04e9c6685007411bf/webapp/backend/pkg/models/measurements/smart_ata_attribute.go#L139-L143

I think backblaze has their data divided the way they do because the high end is so high. The correct thing to do here is either analyze the raw data, or email them to find out exactly how many command timeouts corresponds to 10% failure rate. With that number we can easily update the thresholds.
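
As a rough illustration of that logic, here is a sketch in Go; the real definitions live in the two files linked above, and the bucket boundaries and rates below are illustrative assumptions, not Scrutiny's actual numbers.

package main

import "fmt"

// observedBucket is a simplified stand-in for an observed-threshold entry:
// a range of decoded command-timeout counts and the annual failure rate the
// Backblaze data shows for that range.
type observedBucket struct {
    low, high         int64
    annualFailureRate float64
}

// Illustrative values only: ~2% for the low bucket and >10% for the huge
// 1-to-13-billion bucket, with 100 as the arbitrary cut-off described above.
var commandTimeoutBuckets = []observedBucket{
    {low: 0, high: 99, annualFailureRate: 0.02},
    {low: 100, high: 13_000_000_000, annualFailureRate: 0.12},
}

// markedAsFailed sketches the rule described above: a value whose bucket has
// an observed annual failure rate of 10% or more is treated as an error.
func markedAsFailed(decoded int64) bool {
    for _, b := range commandTimeoutBuckets {
        if decoded >= b.low && decoded <= b.high {
            return b.annualFailureRate >= 0.10
        }
    }
    return false
}

func main() {
    fmt.Println(markedAsFailed(2))   // false
    fmt.Println(markedAsFailed(500)) // true
}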

kaysond commented 11 months ago

After adding a collector.yaml with metrics_smart_args: '-xv 188,raw16 --xall --json -T permissive' it does indeed work. Is there no way to bake this in, though? I have a lot of Seagate drives on a lot of machines; this could get very tedious.

I did find that adding the arguments globally to all drives didn't seem to break anything on my non-Seagate drives.

Edit: after setting alerts back to "both", I also see the drives are still marked as failed; basically everything @zuavra reported.

@Brandoskey - I think the right thing to do is raise this as an issue for smartmontools

They already have some code to enable this automatically for Seagate drives, but for some reason I guess it's not catching your drives: https://github.com/smartmontools/smartmontools/blob/a03301953f292c54116642c683cd8e4fcd43f0b6/smartmontools/drivedb.h#L4256

If you tell them your drive model and firmware, they should be able to add it to the list; then Scrutiny can use the latest release of smartmontools once it's added.

zuavra commented 11 months ago

@kaysond Yes, it seems that the 1-13G interval is statistically significant and distinct enough from 13G-26G and the other slices.

It's all for the best though, I wouldn't have paid attention to 188 if it weren't for Scrutiny!

One more question, can Scrutiny plot 188 on the graph or does it only plot temperature?

pyrodex commented 11 months ago

So I just deployed this via docker across my systems, and sure enough 3 of my drives on one system are showing these errors. Are these all failing, or is it still part of the bug? I have a few systems with Seagates and the others are fine, so I'm wondering if it is my 3 drives.

Pastebin dumps of each below:

https://paste.debian.net/hidden/020d1648/ https://paste.debian.net/hidden/441bee49/ https://paste.debian.net/hidden/e70718aa/

Thanks for the help!

kaysond commented 11 months ago

So I just deployed this via docker across my systems, and sure enough 3 of my drives on one system are showing these errors. Are these all failing, or is it still part of the bug? I have a few systems with Seagates and the others are fine, so I'm wondering if it is my 3 drives.

Pastebin dumps of each below:

https://paste.debian.net/hidden/020d1648/ https://paste.debian.net/hidden/441bee49/ https://paste.debian.net/hidden/e70718aa/

Thanks for the help!

Can you share a screenshot? Are you using the latest image?

kaysond commented 11 months ago

@pyrodex - it's not showing up. I think you can't attach images via email.

pyrodex commented 11 months ago

So I just deployed this via docker across my systems, and sure enough 3 of my drives on one system are showing these errors. Are these all failing, or is it still part of the bug? I have a few systems with Seagates and the others are fine, so I'm wondering if it is my 3 drives. Pastebin dumps of each below: https://paste.debian.net/hidden/020d1648/ https://paste.debian.net/hidden/441bee49/ https://paste.debian.net/hidden/e70718aa/ Thanks for the help!

Can you share a screenshot? Are you using the latest image?

I am using the master collector image, and here is the screenshot:

[screenshot]
pyrodex commented 11 months ago

I was able to fix my drives with the collector.yaml passed through to the container with the settings suggested above.

thimplicity commented 11 months ago

I was able to fix my drives with the collector.yaml passed through to the container with the settings suggested above.

Could you share how you did that? I am using Scrutiny in the hub-and-spoke model where the hub is in docker and the spokes are on different servers. What do I need to do to get this reported properly? I do not have any files on the spokes except the script that runs from the crontab.

jerry-yuan commented 11 months ago

I got the same issue on a Seagate ST4000VN008, which really scared me. 😂 [screenshot]

I got three red alerts 😂

[screenshot]

Maybe I will try to read the source code this weekend.

kaysond commented 11 months ago

I got the same issue on a Seagate ST4000VN008, which really scared me. 😂 [screenshot]

I got three red alerts 😂

[screenshot]

Maybe I will try to read the source code this weekend.

You just have 1 timeout error that was >5s. If you use the collector YAML mentioned above, it will parse correctly.

pyrodex commented 11 months ago

I was able to fix my drives with the collector.yaml passed through to the container with the settings suggested above.

Could you share how you did that? I am using Scrutiny in the hub-and-spoke model where the hub is in docker and the spokes are on different servers. What do I need to do to get this reported properly? I do not have any files on the spokes except the script that runs from the crontab.

So I run the collectors as containers too, since most of my hosts that aren't the hub are physical. Here is that compose file:

version: "3.0"
services:
  collector:
    container_name: scrutiny-collector
    image: 'ghcr.io/analogj/scrutiny:master-collector'
    cap_add:
      - SYS_RAWIO
      - SYS_ADMIN
    volumes:
      - '/run/udev:/run/udev:ro'
      - '/apps/scrutiny-collector:/opt/scrutiny/config'
    environment:
      COLLECTOR_API_ENDPOINT: http://<ip>:8080
      COLLECTOR_RUN_STARTUP: true
    restart: unless-stopped

Then I also created the mapped volume at /apps/scrutiny-collector, where I placed the collector.yaml, since I couldn't find a way to do the customization via the container's environment variables.

Here is what my collector.yaml looks like:

version: 1
host:
  id: "hostname"
devices:
   - device: /dev/sda
     type: 'sat'
     commands:
        metrics_smart_args: '-xv 188,raw16 --xall --json -T permissive'
   - device: /dev/sdb
     type: 'sat'
     commands:
        metrics_smart_args: '-xv 188,raw16 --xall --json -T permissive'
   - device: /dev/sdc
     type: 'sat'
     commands:
        metrics_smart_args: '-xv 188,raw16 --xall --json -T permissive'
api:
  endpoint: 'http://<ip>:8080'
thimplicity commented 11 months ago

Works! Thanks a lot. I had to remove the disk once from Scrutiny, but now it shows it correctly. Appreciate your help

Brandoskey commented 10 months ago

Since the manual install doesn't persist through upgrades on TrueNAS, I've switched to the TrueCharts install. I have Scrutiny reporting to my main docker instance; however, I can't find a way to pass the arguments to report 188 correctly.

I tried mounting the directory my collector.yaml is in to /opt/scrutiny/config, but the container refuses to start. Also, the container is not technically a collector to begin with, so I doubt this would work.

Is there a way to get these arguments implemented in the TrueCharts version that I can't figure out?

Barmagler commented 10 months ago

Same here with WDC WUH721414ALE6L4. There are 4 HDDs with two different firmware versions and different results. collector.yaml changed:

devices:
  - device: /dev/sda
    type: 'sat'
    commands:
       metrics_smart_args: '-xv 188,raw16 --xall --json -T permissive'    
  - device: /dev/sdb
    type: 'sat'
    commands:
       metrics_smart_args: '-xv 188,raw16 --xall --json -T permissive'    

[screenshots]

I have an HDSentinel report:

[screenshot]

unai-ndz commented 9 months ago

I don't think this should be fixed in Scrutiny but in smartmontools. If you have a similar issue, it should be reported and eventually fixed inside drivedb.h. This can take a while, though.

So here is how to change the default configuration for your drive in smartmontools and apply it to Scrutiny as well. This means both smartctl -a /dev/sda on your system and Scrutiny inside the container will give you correct results.

/etc/smartmontools/smart_drivedb.h

// Get the default path for this file on your system: smartctl -h | grep 'default is'
// Get the device model: smartctl -i /dev/sda (also shows whether the device is in the database, which can be used to check if your custom config is working)
// Find your drive's entry in the database, if any: rg -C 6 '$DRIVE_MODEL' /usr/share/smartmontools/drivedb.h

// If you doubt whether a value from a Seagate drive is correct, compare it with: openSeaChest_SMART -d /dev/sda --smartAttributes analyzed

// First line is the family, second line the model
// In this case it uses a regex to match several models, but you could replace it with your specific model
// I'm only adding -v 1,7,188; the other configs were already in the built-in drivedb
  { "Seagate Exos X16", // tested with ST10000NM001G-2MW103/SN02
      // ST14000NM001G-2KJ103/SN02, ST16000NM001G-2KK103/SN02, ST16000NM001G-2KK103/SN03
    "ST1[0246]000NM00[13][GJ]-.*",
    "", "",
    "-v 1,raw24/raw32,Raw_Read_Error_Rate "
    "-v 7,raw24/raw32,Seek_Error_Rate "
    "-v 18,raw48,Head_Health "
    "-v 188,raw16,Command_Timeout "
    "-v 200,raw48,Pressure_Limit "
    "-v 240,msec24hour32"
  },

Now smartctl -a /dev/sda | grep 188 should interpret the value correctly, in my case:

188 Command_Timeout         0x0032   100   099   000    Old_age   Always       -       1 1 4

If you have a Seagate drive you should see three values at the end, indicating the different types of command timeouts.

Now let's fix it in Scrutiny (I'm using a podman quadlet file, but it should be easy to translate to docker)

# Make scrutiny use your custom drivedb.h
# This can be used to fix drive configs system wide instead of drive by drive
# Get the default path used inside the container
# podman exec -it scrutiny /bin/sh # AFAIK it's the same command on docker (my container is named scrutiny, change it if necessary)
# smartctl -h | grep 'default is'
# /etc/smart_drivedb.h
# Mount the path from your system on the container default path
# Remove the `z,` at the end if you don't use selinux
Volume=/etc/smartmontools/smart_drivedb.h:/etc/smart_drivedb.h:z,ro

If you restart the container now, the current value will be correct, but it will still remember the old incorrect values and detect the drive as failed. If you don't want to lose the entire history for your drive, it may be possible to edit the container's InfluxDB; I didn't bother. Just remove the drive using the web UI and restart the container.

ForsakenRei commented 8 months ago

@zuavra - you're seeing it as a warning because you're right on the edge of the arbitrary threshold:

https://github.com/AnalogJ/scrutiny/blob/c3a0fb7fb526d3e74218a5b04e9c6685007411bf/webapp/backend/pkg/thresholds/ata_attribute_metadata.go#L696-L708

The problem is that the backblaze data shows a 2% failure rate for drives with 0 command timeouts, but >10% for anything between 1 and 13 billion. I picked 100 arbitrarily because 10% or higher causes scrutiny to consider it as an error:

https://github.com/AnalogJ/scrutiny/blob/c3a0fb7fb526d3e74218a5b04e9c6685007411bf/webapp/backend/pkg/models/measurements/smart_ata_attribute.go#L139-L143

I think backblaze has their data divided the way they do because the high end is so high. The correct thing to do here is either analyze the raw data, or email them to find out exactly how many command timeouts corresponds to 10% failure rate. With that number we can easily update the thresholds.

So in this case, as long as my raw value didn't change and stays at 0 (or lower than 10), it should be fine? I have 7 different drives which all show the same warning.

OddMagnet commented 7 months ago

I got the same problem:

[screenshot]

Based on this it's just 2 errors.

Additionally, smartctl -xv 188,raw16 /dev/sdc | grep 188 gives me: 188 Command_Timeout -O--CK 100 099 000 - 2 2 2

Are there any updates to this, or should I go with @unai-ndz's solution?

kaysond commented 7 months ago

I got the same problem: [screenshot]

Based on this it's just 2 errors.

Additionally, smartctl -xv 188,raw16 /dev/sdc | grep 188 gives me: 188 Command_Timeout -O--CK 100 099 000 - 2 2 2

Are there any updates to this, or should I go with @unai-ndz's solution?

I'd say this is working as intended. If you want to have the formatting applied automatically, you should submit an issue/pull request to the smartmontools repo.