dak180 / FreeNAS-Report

SMART & ZPool Status Report for FreeNAS/TrueNAS
GNU General Public License v3.0

Cleaning up Seagate weirdness #31

Open · Inevitable opened this issue 7 months ago

Inevitable commented 7 months ago

So, I've been trying to figure out why the report consistently shows some rather alarming numbers for Seek Error Health. The drives I'm testing against do not actually list any errors at all, and yet both the FARM stats from Seagate and the normalized values reported for attribute 7 come back with extreme values. As an example, this is a section of the report for one of the drives in question:

| Device | Model | Serial Number | RPM | Capacity | SMART Status | Temp | Power-On Time (ymdh) | Start Stop Count | Spin Retry Count | Realloc Sectors | Realloc Events | Current Pending Sectors | Offline Uncorrectable Sectors | CRC Errors | Seek Error Health | Last Test Age (days) | Last Test Type |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| /dev/sdg | ST18000NM000J-2TV103 | ZR5AR716 | 7200 | 18.0 TB | PASSED | 34°C | 1y 0m 14d 5h | 7 | 0 | 0 | 0 | 0 | 0 | | 85% | 0 | Extended offline |

So what the heck is going on there? Digging in manually leads to some interesting conclusions. First, here's the raw SMART output on that drive, both the standard ATA SMART attributes and the still somewhat experimental --log farm output from smartmontools 7.4 (yes, this annoyed me enough that I backported it to SCALE).
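For anyone wanting to reproduce this, something along these lines should pull the same two pieces of JSON (this assumes smartmontools >= 7.4 for the FARM log, plus jq; /dev/sdg is the drive from the report above):

    # Attribute 7 (Seek_Error_Rate) from the standard SMART attribute table
    smartctl -jA /dev/sdg | jq '.ata_smart_attributes.table[] | select(.id == 7)'

    # FARM reliability statistics page (Seagate drives, smartmontools 7.4+)
    smartctl -j --log=farm /dev/sdg | jq '.seagate_farm_log.page_5_reliability_statistics'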

.ata_smart_attributes.table[]

{
  "id": 7,
  "name": "Seek_Error_Rate",
  "value": 85,
  "worst": 60,
  "thresh": 45,
  "when_failed": "",
  "flags": {
    "value": 15,
    "string": "POSR-- ",
    "prefailure": true,
    "updated_online": true,
    "performance": true,
    "error_rate": true,
    "event_count": false,
    "auto_keep": false
  },
  "raw": {
    "value": 343129453,
    "string": "343129453"
  }
}

and

.seagate_farm_log.page_5_reliability_statistics (truncated)

"page_5_reliability_statistics": {
  "attr_error_rate_raw": 12821624,
  "error_rate_normalized": 71,
  "error_rate_worst": 64,
  "attr_seek_error_rate_raw": 343129453,
  "seek_error_rate_normalized": 85,
  "seek_error_rate_worst": 60,
  "high_priority_unload_events": 1,
  "helium_presure_trip": 0,
  ...
}

The interesting part is when we actually decode the RAW value. Seagate's Seek Error Rate attribute consists of two parts: a 16-bit count of seek errors in the uppermost 4 nibbles (bits 47-32), and a 32-bit count of total seeks in the lowermost 8 nibbles (bits 31-0). So in order to get usable data, we need to:

1. Convert to hex: 0000 1473 BD6D
2. Split the hex into the 16-bit and 32-bit parts: 0000 and 1473BD6D

So, right away we can see that in reality, this drive has had literally zero seek errors recorded! This is important, because as you can see in the output above, the normalized values are being reported as 85 in both possible outputs.
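A minimal sketch of that decode in shell arithmetic (the raw value is the one from the smartctl output above):

    raw=343129453                          # .raw.value of attribute 7 above

    printf 'hex:         %012X\n' "$raw"           # 00001473BD6D
    echo   "seek errors: $(( raw >> 32 ))"         # upper 16 bits -> 0
    echo   "total seeks: $(( raw & 0xFFFFFFFF ))"  # lower 32 bits -> 343129453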

The question then becomes: okay, but why? It turns out that the normalized value represents a logarithmic error rate. Doing the math:

What I think many assume is that the calculation should be something like: $\frac{\mathrm{0000}}{\mathrm{1473BD6D}} = 0$

The issue is that the normalization used is, as mentioned, logarithmic, and as such the "real" equation is undefined (you cannot take the log of zero): $-10\log\frac{\mathrm{0000}}{\mathrm{1473BD6D}}$

So what they apparently do instead is substitute a minimum of one error: $-10\log\frac{\mathrm{0001}}{\mathrm{1473BD6D}} \approx 85$
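A quick sanity check of that arithmetic (awk used here purely to confirm the number):

    # -10 * log10(1 / 0x1473BD6D) = 10 * log10(343129453)
    awk 'BEGIN { printf "%.2f\n", -10 * log(1 / 343129453) / log(10) }'
    # prints 85.35, which lines up with the normalized value of 85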

And so we get the (mildly panic-inducing, if you're not aware of this) report; it even highlights it in red! Woo! I'm not 100% sure of the best way to handle this, to be honest; I might just have to deal with self-maintaining an edited version of the script specifically for the RAW values that use this weird encoding, or something.

dak180 commented 7 months ago

See #292 and, more recently, #1471. In the meantime I do not have any Seagate drives to test with, so if you (or anyone else for that matter) can come up with a test to reliably decide when to add the appropriate -v flags, I would take a patch (or create one from an appropriate test).
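Not a tested patch, just one possible shape for such a test: key off the reported model/family string and only then append the override. The -v value here is the one proposed in #1471; the jq lookup and the "ST*" match are my own assumptions.

    drive="sdg"
    model="$(smartctl -ji "/dev/${drive}" | jq -r '.model_family // .model_name // empty')"

    extraArgs=()
    case "${model}" in
        Seagate*|ST*)
            # Looks like a Seagate; add the attribute-7 override.
            extraArgs+=(-v '7,raw24/raw32:543210')
            ;;
    esac

    smartctl -AxHij "${extraArgs[@]}" "/dev/${drive}"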

dak180 commented 4 months ago

@Inevitable were you ever able to test the modifications proposed in #1471?

JoeSchmuck commented 4 months ago

I cannot answer the question you have above, but to put your mind at ease, here is the simple way to find out whether that Seagate value means anything is wrong: divide the raw number by 4,294,967,295, which is hex 'FFFFFFFF'. The whole-number part of the result is the count of actual read errors or seek errors. You will note, if you are paying attention, that the raw values go up and down; one day it is 34262581 and the next day it is 12. There is more to this, but when looking for real, tangible data, this is how it's done. This is Seagate-unique; I do not know of any other drive that does this, but I don't know everything.
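Put in shell terms, that rule of thumb looks like this (raw value from the drive above; integer-dividing by 0xFFFFFFFF gives, for all practical purposes, the same answer as reading the upper 16 bits as in the decode earlier):

    raw=343129453
    echo $(( raw / 0xFFFFFFFF ))   # whole-number part is 0, i.e. no actual seek errors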

dak180 commented 1 month ago

@Inevitable Were you ever able to try

    smartctl -AxHij --log="xerror,error" --log="xselftest,selftest" --log="devstat" --log="farm" --log="envrep" --log="defects" --log="zdevstat" --log="genstats" --log="ssd" --log="background" -v '7,raw24/raw32:543210' "/dev/${drive}"

to see if that got you sensible output for Seek_Error_Rate?

Inevitable commented 1 month ago

Ah yeah, I moved to a different method of tracking my HDD health, but I did go ahead and run this just to test, and no, it still outputs the RAW values. In the case of the test I ran:

       "id": 7,
        "name": "Seek_Error_Rate",
        "value": 88,
        "worst": 60,
        "thresh": 45,
        "when_failed": "",
        "flags": {
          "value": 15,
          "string": "POSR-- ",
          "prefailure": true,
          "updated_online": true,
          "performance": true,
          "error_rate": true,
          "event_count": false,
          "auto_keep": false
        },