dak180 / FreeNAS-Report

SMART & ZPool Status Report for FreeNAS/TrueNAS
GNU General Public License v3.0
38 stars 8 forks source link

Script Ignores SAS Drives #1

Closed aardvarkl closed 2 years ago

aardvarkl commented 2 years ago

I just noticed that this script, whilst looking at NVMe drives completely ignores SAS drives.

Its not that the script is faulty - its that it does not recognise / is not coded for SAS drives If it helps I have a single SAS drive in my FreeNAS which responds as below

########## SMART status for SAS drive /dev/da4 Z1Z5JDWE0000C507EFJZ (ST3000NM0023) ##########

SMART Health Status: OK

Current Drive Temperature:     40 C
Drive Trip Temperature:        60 C

Accumulated power on time, hours:minutes 50320:03
Manufactured in week 36 of year 2014
Specified cycle count over device lifetime:  10000
Accumulated start-stop cycles:  322
Specified load-unload count over device lifetime:  300000
Accumulated load-unload cycles:  2374
Elements in grown defect list: 0

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:   3402958289        0         0  3402958289          0      34903.312           0
write:         0        0         0         0          0      22596.708           0
verify:     4981        0         0      4981          0          0.000           0

Non-medium error count:        6

[GLTSD (Global Logging Target Save Disable) set. Enable Save with '-S on']
Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
Background short  Completed                   -   50313                 - [-   -    -]
dak180 commented 2 years ago

I do not have any to test with so I will need some info from you: smartctl -AHijl selftest --log="devstat" and smartctl -AHil selftest --log="devstat" (one set per model of drive), feel to replace serials with nonsense. We can start with that.

dak180 commented 2 years ago

In order to support this I will need more examples of different models of drives.

Please upload as separate files: smartctl -AHijl selftest --log="devstat" «Drive Id» > «filename» smartctl -AHil selftest --log="devstat" «Drive Id» > «filename»

dak180 commented 2 years ago

@tamahau30494 and @aardvarkl across all these drives the common info set is rather sparse, it looks like I could get the temp, is smart is passing, time in use and if the defect list is growing. That is it though; is that enough info to actually be useful to you? I ask because none of those things are on the list of things I have used to identify a failing drive early.

aardvarkl commented 2 years ago

Sorry I haven't responded - turns out my account is linked to an email I very rarely use nowadays (now corrected). Do you still need anything from me?

aardvarkl commented 2 years ago

sda1.txt sda2.txt sdf1.txt sdf2.txt

Hope I got that right

mth309 commented 2 years ago

Hey @dak180 thanks for incorporating my parsing logic suggestions for SAS drives! I just tested out the first pass you sent on both my servers with mixed SATA/SAS and it's almost perfect!

There's one major issue that prevents it from detecting the SAS drives as having SMART enabled, because while smartctl prints "support is: Enabled" for ATA drives it prints "support is: Enabled" for SCSI. So I modified the code at line 1509 as follows:

readarray -t "drives" <<< "$(for drive in $(sysctl -n kern.disks | sed -e 's:nvd:nvme:g'); do
        if smartctl --json=u -i "/dev/${drive}" | grep -q "SMART support is: Enabled"; then
                printf "%s " "${drive}"
        elif smartctl --json=u -i "/dev/${drive}" | grep -q "SMART support is:     Enabled"; then  #added this
                printf "%s " "${drive}"                                                            #and this
        elif echo "${drive}" | grep -q "nvme"; then
                printf "%s " "${drive}"
        fi
done | awk '{for (i=NF; i!=0 ; i--) print $i }')"

You could alternatively trim the whitespace prior to grepping if that's more elegant, this was the quick and dirty fix. After that it detected the SAS drives and printed a nice summary table with almost all of the data correct! A couple other minor fixes:

  1. On line 1343 it's adding a '%' after the verify errors, when this should be just an integer like the read and write error counts.
  2. The code you have for parsing json to find self test records will not work, due to a bug in smartctl not outputting this data in json format. I filed a bug report (feature request?) yesterday as we discussed on the TrueNAS forum https://www.smartmontools.org/ticket/1606 so if they decide to fix that your script may have to be updated to match. For example you're currently looking for a json field called 'ata_smart_self_test_log', but the smartmontools folks might name that something different like 'scsi_smart_self_test_log' if they add it to the json. We won't know until they add it. Alternatively, you can parse the non-json data today from 'smartctl -l selftest', but that's up to you if you want to add that or wait for the json fix.
  3. You're original 'SATA' summary table is titled "HDD SMART Status...", as compared to the other table 'SSD Smart Status...'. I would rename it 'SATA' instead of 'HDD' now that there's a table for SAS. Or else maybe 'SATA HDD', 'SATA SSD', and 'SAS HDD'? Up to you, it just seemed a bit odd as-is one named HDD and one named SAS :) Of note, if someone is using SAS SSDs they would parse identically to the HDDs, so you could keep all SAS drives, HDD and SSD, in a single table. Maybe that should influence the table title after all...

That's all I could find to correct with the current implementation. It looks great! Thanks!

If you're interested in future enhancements, there are a couple other values that could be parsed if you like, that I didn't mention on the TrueNAS forum thread because they're currently not output in the json format. These attributes are similar to others you parse for SATA, but they're not as critical as what you're now grabbing, and if smartmontools doesn't add them to the json you'll have to grab them using the standard output, so again up to you if you think it's worth the effort.

  1. Non-medium errors is found in the error log, same place you got read, write and verify uncorrected: 'smartctl -l error'. It's the last line of the output 'Non-medium error count: 0' This is similar to UDMA or CRC errors on SATA drives, and can indicate something wrong with the controller or cables or internal cache, rather than the magnetic platters. If smartmontools adds this to the json output I would definitely recommend grabbing it, otherwise it's no big loss.
  2. Start/stop count and load/unload count are found in the output of 'smartctl -A' I'll paste what the full block looks like below, but currently only the first and last lines (power on time and grown defects) are output in the json format. We'd have to wait for smartmontools to add the others
Accumulated power on time, hours:minutes 12553:00
Manufactured in week 50 of year 2012
Specified cycle count over device lifetime:  50000
Accumulated start-stop cycles:  73
Specified load-unload count over device lifetime:  600000
Accumulated load-unload cycles:  3038
Elements in grown defect list: 0

There used to be a bug (using the term loosely, more like an interaction) in the Linux kernel that would start/stop or load/unload some drives (WD greens were notoriously the worst) every 60-90 seconds. That could blow through the whole lifetime ratings in a few months, and it was killing drives. I haven't seen this problem in probably 10 years, and on a well behaved system these values are pretty meaningless, but in my own tools (which I started well more than 10 years ago :)) I display this data, just in case. If they add it to the json you might as well add it to the summary table, but otherwise it's skippable.

Thanks again for working with me to add SAS parsing to the script!

dak180 commented 2 years ago

There's one major issue that prevents it from detecting the SAS drives as having SMART enabled, because while smartctl prints "support is: Enabled" for ATA drives it prints "support is: Enabled" for SCSI. So I modified the code at line 1509 as follows:

@mth309 Does e43e83c7df152e22e9ac9400c123849e4ca4da06 fix this for you?

dak180 commented 2 years ago

If you're interested in future enhancements, there are a couple other values that could be parsed if you like, that I didn't mention on the TrueNAS forum thread because they're currently not output in the json format. These attributes are similar to others you parse for SATA, but they're not as critical as what you're now grabbing, and if smartmontools doesn't add them to the json you'll have to grab them using the standard output, so again up to you if you think it's worth the effort.

@mth309 those would be things I would like to parse but it would be much easier in json format

mth309 commented 2 years ago

I added a comment with a tiny tweak to that change you asked me to look at.

I also agree things would be easier if you can keep everything in the world of json. Hopefully smartmontools will be updated to make that possible, but it could be a while before that happens and a new version gets into FreeBSD/TrueNAS. At least we've captured all the most important stuff as-is!

I'm going to be on a business trip for the next week without access to a PC, but when I get back I'll see how much of a pain it is to pull the non-json parsing out of my remake of that older script and shoehorn it into your SASSummary function. Primarily because I want to switch from the outdated script over to yours on my servers, but I also want to keep those outputs if possible :) If you like the code when I'm done you can have it, but if you'd rather wait on a json fix that's great too. Thanks again!

mth309 commented 2 years ago

Ah! I apologize I missed one other minor issue. The exact same "is SMART enabled" test occurs on line 1645, to determine which drives to print detailed SMART data for after the summary tables. Right now it skips the SAS drives as it used to do for the tables, but if you update the grep expression to match the new line 1509 (grep without -q piped into grep with -q) it works perfectly.

dak180 commented 2 years ago

I'm going to be on a business trip for the next week without access to a PC, but when I get back I'll see how much of a pain it is to pull the non-json parsing out of my remake of that older script and shoehorn it into your SASSummary function.

@mth309 Let me know if there is anything else you can see before you leave as an easy fix I can do: https://github.com/dak180/FreeNAS-Report/compare/topic/refactor...dak180:topic/sas?expand=1

mth309 commented 2 years ago

OK so I know this is an absolutely terrible way to convey information to you, but I took your latest version with all the changes to date (from the link in the previous comment) and pulled in the parsing from my non-json script to grab non-medium errors, accumulated start/stops and load/unloads and also last test type and hours. I also brought in a workaround for some Hitachi drives I have with janky firmware that don't support self-testing but report a mucked up selftest log table in smartctl anyways.

Before I get to any of that though, I found another issue in your HDDStatus function where it only checks rotation rate to determine that a device is an HDD, thus it ends up duplicating all the SAS HDD devices from the SAS list. I assume that wasn't intentional so I added another condition there (basically copied the one in your SASSummary function and NOT'd it) to ignore SCSI drives in the HDDSummary function.

That clued me in to one other weird issue. Of the 23 drives in my server, I had 5 (all SATA HDDs) that weren't showing up. But some of my other SATA HDD and SSD were showing fine. This was masked before by all the duplicate SAS drives that at a quick glance made me think the output was correct. All 5 missing drives were very old (~2008-2012) Western Digital and on closer inspection I found that none of them report a rotation rate, so they fail the '!=0' test at the beginning of the HDDSummary function. I added a catch at the beginning of that function where if rotation rate is null it sets it to "N/A", so at least it will pass the "not 0" test.

All in all it was ~25 lines of code to do everything above and didn't take too long to code and debug since I've done most of this before (I just usually work in Python). I'm sure some of the changes are not done the way you'd prefer (sorry, I used awk for the non-json :)) but it works. I tried to stay as close to your coding/naming conventions as I could everywhere else, besides the awks.

In order to show the changes I added a '#_NEW' comment tag to every line I touched, which I can grep for and show line numbers. I also showed one line above and below each change for context. So below is the output of changes if you'd like to tweak or incorporate any of it. If you don't want to incorporate the non-json stuff in your official release that's fine, but like I said I wanted that extra info for my own servers so I figured I'd share it and give you the choice. The other couple bug fixes for duplicate SAS and no-rpm-reporting I would recommend you pulling in, since it's like 2-3 lines of code. Thanks again for all your work incorporating my previous suggestions!

961-                local rotTst="$(echo "${hddInfoSmrt}" | jq -Mre '.rotation_rate | values')"
962:                local scsiTst="$(echo "${hddInfoSmrt}" | jq -Mre '.device.type | values')"    #_NEW
963:                if [ -z "${rotTst}" ]; then   #_NEW  FIxes issue with drives that don't report rpm
964:                   rotTst="N/A"               #_NEW
965:                fi                            #_NEW
966:                if [ ! "${rotTst:="0"}" = "0" ] && [ ! "${scsiTst}" = "scsi" ]; then          #_NEW fixes issue with duplicate SAS devices
967-                        # For each drive detected, run "smartctl -AHijl selftest" and parse its output.
--
1171-                echo '<th style="text-align:center; width:120px; height:60px; border:1px solid black; border-collapse:collapse; font-family:courier;">Power-On<br>Time<br>('"${powerTimeFormat}"')</th>'
1172:                echo '<th style="text-align:center; width:80px; height:60px; border:1px solid black; border-collapse:collapse; font-family:courier;">Start<br>Stop<br>Cycles</th>'        #_NEW
1173:                echo '<th style="text-align:center; width:80px; height:60px; border:1px solid black; border-collapse:collapse; font-family:courier;">Load<br>Unload<br>Cycles</th>'       #_NEW
1174-                echo '<th style="text-align:center; width:80px; height:60px; border:1px solid black; border-collapse:collapse; font-family:courier;">Grown<br>Defect<br>List</th>'
--
1177-                echo '<th style="text-align:center; width:80px; height:60px; border:1px solid black; border-collapse:collapse; font-family:courier;">Uncorrected<br>Verify<br>Errors</th>'
1178:                echo '<th style="text-align:center; width:80px; height:60px; border:1px solid black; border-collapse:collapse; font-family:courier;">Non-medium<br>Errors</th>'           #_NEW
1179-                echo '<th style="text-align:center; width:100px; height:60px; border:1px solid black; border-collapse:collapse; font-family:courier;">Last Test<br>Age (days)</th>'
--
1187-                local sasInfoSmrt="$(smartctl -AHijl selftest "/dev/${drive}")"
1188:                local nonJsonSasInfoSmrt="$(smartctl -Al error -l selftest "/dev/${drive}")"     #_NEW
1189-                local rotTst="$(echo "${sasInfoSmrt}" | jq -Mre '.device.type | values')"
--
1200-                        local lastTestType="$(echo "${sasInfoSmrt}" | jq -Mre '.ata_smart_self_test_log.standard.table[0].type.string | values')"
1201:                        lastTestHours="$(echo "${nonJsonSasInfoSmrt}" | grep "# 1" | awk '{print $7}')"        #_NEW
1202:                        lastTestType="$(echo "${nonJsonSasInfoSmrt}" | grep "# 1" | awk '{print $3" "$4}')"    #_NEW
1203:                        # Workaround for some drives that do not support self testing but still report a garbage self test log      #_NEW
1204:                        # Set last test type to 'N/A' and last test hours to null "" in this case                                   #_NEW
1205:                        if [ "${lastTestType}" == "Default Self" ]; then       #_NEW
1206:                            lastTestType="N/A"                                 #_NEW
1207:                            lastTestHours=""                                   #_NEW
1208:                        fi                                                     #_NEW
1209-
--
1234-                        local uncorrectedVerifyErrors="$(echo "${sasInfoSmrt}" | jq -Mre '.verify.total_uncorrected_errors | values')"
1235:                        local nonMediumErrors="$(echo "${nonJsonSasInfoSmrt}" | grep "Non-medium" | awk '{print $4}')"                      #_NEW
1236:                        local accumStartStopCycles="$(echo "${nonJsonSasInfoSmrt}" | grep "Accumulated start-stop" | awk '{print $4}')"     #_NEW
1237:                        local accumLoadUnloadCycles="$(echo "${nonJsonSasInfoSmrt}" | grep "Accumulated load-unload" | awk '{print $4}')"   #_NEW
1238-
1239-                        # Get more useful times from hours
1240:                        local testAge=""                #_NEW (had to reset the variable to null between disks or it repeated the last good value)
1241-                        if [ ! -z "${lastTestHours}" ]; then
--
1353-                                echo '<td style="text-align:center; height:25px; border:1px solid black; border-collapse:collapse; font-family:courier;">'"${onTime}"'</td>'
1354:                                echo '<td style="text-align:center; height:25px; border:1px solid black; border-collapse:collapse; font-family:courier;">'"${accumStartStopCycles}"'</td>'   #_NEW
1355:                                echo '<td style="text-align:center; height:25px; border:1px solid black; border-collapse:collapse; font-family:courier;">'"${accumLoadUnloadCycles}"'</td>'  #_NEW
1356-                                echo '<td style="text-align:center; background-color:'"${scsiGrownDefectListColor}"'; height:25px; border:1px solid black; border-collapse:collapse; font-family:courier;">'"${scsiGrownDefectList}"'</td>'
--
1359-                                echo '<td style="text-align:center; background-color:'"${uncorrectedVerifyErrorsColor}"'; height:25px; border:1px solid black; border-collapse:collapse; font-family:courier;">'"${uncorrectedVerifyErrors}"'</td>'
1360:                                echo '<td style="text-align:center; height:25px; border:1px solid black; border-collapse:collapse; font-family:courier;">'"${nonMediumErrors}"'</td>'        #_NEW
1361-                                echo '<td style="text-align:center; background-color:'"${testAgeColor}"'; height:25px; border:1px solid black; border-collapse:collapse; font-family:courier;">'"${testAge:-"N/A"}"'</td>'
dak180 commented 2 years ago

OK so I know this is an absolutely terrible way to convey information to you

@mth309 Why not make a fork of the repo and then you can put in your edits on the website if you are not comfortable using git locally.

dak180 commented 2 years ago

All 5 missing drives were very old (~2008-2012) Western Digital and on closer inspection I found that none of them report a rotation rate, so they fail the '!=0' test at the beginning of the HDDSummary function. I added a catch at the beginning of that function where if rotation rate is null it sets it to "N/A", so at least it will pass the "not 0" test.

@mth309 could you post the relevant json and non json outputs for these drives in a separate (new) ticket?

mth309 commented 2 years ago

Yeah I apologize I knew what I sent last night wouldn't be ideal. I've never used GitHub (forked a repo, etc) before, so I'd have to Google how to set that up and push you changes, etc. I just didn't have time last night to look into any of it. I'm on my way to the airport now, down to just a phone for a week, and that only at night once I get where I'm going. I might be able to vpn into my server while I'm away and get those old WD outputs, but most likely that won't happen until Friday or Saturday when I'm back. I'll look into how to use git/GitHub too so I can contribute code properly in the future, I just wanted to send you what I had before I fell off the earth for a week. Thanks!