daviswr / ZenPacks.daviswr.SMART

Storage device health monitoring for Zenoss
MIT License

Failure to parse limited output format #7

Closed sempervictus closed 2 years ago

sempervictus commented 3 years ago

Some drives don't report full data for -iAH or similar options, which causes the ZenPack to fail when acquiring output from something like:

# smartctl -iAH /dev/sdc
smartctl 5.43 2016-09-28 r4347 [x86_64-linux-2.6.32-754.35.4.el6.x86_64] (local build)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net

Vendor:               SEAGATE 
Product:              ST3000NM0025    
Revision:             N004
User Capacity:        3,000,592,982,016 bytes [3.00 TB]
Logical block size:   512 bytes
Logical Unit id:      0x5000c500a7xxxxxx
Serial number:        ZC19069A0000XXXXXXXX
Device type:          disk
Transport protocol:   SAS
Local Time is:        Sat Nov 13 01:10:10 2021 EST
Device supports SMART and is Enabled
Temperature Warning Enabled
SMART Health Status: OK

Current Drive Temperature:     31 C
Drive Trip Temperature:        60 C
Manufactured in week 44 of year 2018
Specified cycle count over device lifetime:  10000
Accumulated start-stop cycles:  70
Specified load-unload count over device lifetime:  300000
Accumulated load-unload cycles:  148969
Elements in grown defect list: 0
Vendor (Seagate) cache information
  Blocks sent to initiator = 1609679870
  Blocks received from initiator = 2639123499
  Blocks read from cache and sent to initiator = 2071329652
  Number of read and write commands whose size <= segment size = 104335400
  Number of read and write commands whose size > segment size = 4590099
Vendor (Seagate/Hitachi) factory information
  number of hours powered up = 24382.08
  number of minutes until next internal SMART test = 52

sempervictus commented 3 years ago

The HBA here is a very common 3008:

03:00.0 Serial Attached SCSI controller: LSI Logic / Symbios Logic SAS3008 PCI-Express Fusion-MPT SAS-3 (rev 02)

in a Supermicro chassis. So I think it's the drive firmware producing that output, not the controller curtailing it.

daviswr commented 3 years ago

Hm, there might not be a ton to do about that if smartctl doesn't provide much, but those are a few more colon-delimited values I can have the parser look for and at least have it collect enabled status, health status, and temperature correctly. Seems like a rated lifetime calculation could be done, too.
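
A minimal sketch of the colon-delimited parsing described above, not the ZenPack's actual parser. The function name `parse_scsi_output` and the result keys `smart_enabled`, `health_ok`, and `temperature_c` are hypothetical, and the sketch assumes SAS output shaped like the Seagate example in this thread:

```python
import re

# "Key:   value" lines; the colon must be followed by whitespace, which
# keeps the URL in the copyright banner from being read as a key.
PAIR = re.compile(r'^([^:]+):\s+(\S.*)$')


def parse_scsi_output(text):
    """Collect key/value pairs plus enabled, health, and temperature."""
    result = {}
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith('Device supports SMART'):
            # This status line has no colon, so handle it separately.
            result['smart_enabled'] = line.endswith('is Enabled')
            continue
        match = PAIR.match(line)
        if match:
            result[match.group(1).strip()] = match.group(2).strip()

    health = result.get('SMART Health Status')
    if health is not None:
        result['health_ok'] = (health == 'OK')

    temp = re.match(r'(\d+)\s*C$', result.get('Current Drive Temperature', ''))
    if temp:
        result['temperature_c'] = int(temp.group(1))
    return result
```

Lines without a "Key: value" shape (the vendor cache counters, the manufacture date) are simply skipped here; they would need their own patterns.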

I wonder if "Elements in grown defect list" is akin to reallocated sectors?

Didn't realize the version of smartmontools on EL6 was so long in the tooth.
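
One possible reading of the rated lifetime calculation mentioned above is the share of a specified cycle count that has already been consumed. This is only a sketch under that assumption; `rated_life_used` is a hypothetical helper, and the numbers come from the Seagate output earlier in the thread:

```python
def rated_life_used(accumulated, specified):
    """Return the percentage of a rated cycle count already consumed.

    Returns None when the drive reports no (or a zero) specified count.
    """
    if not specified:
        return None
    return 100.0 * accumulated / specified


# Accumulated start-stop cycles vs. specified cycle count
start_stop_pct = rated_life_used(70, 10000)

# Accumulated load-unload cycles vs. specified load-unload count
load_unload_pct = rated_life_used(148969, 300000)
```

For that drive, start-stop usage is under 1% while load-unload usage is close to half of the rated count, so the two figures are probably worth graphing separately.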

sempervictus commented 3 years ago

I don't think it's the version of the tools doing that; I've seen similar on some Arch Linux systems too (and we run tip for most things).

sempervictus commented 3 years ago

Confirmed it's not the tool, it's the disks. Here's one with good output and one with bad output in the same chassis: [image]

sempervictus commented 3 years ago

Additionally, QEMU disks produce limited output, which we probably want to "handle quietly" since it's a logical absurdity at this point:

# smartctl -iAH /dev/sdb
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.10.76] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               QEMU
Product:              QEMU HARDDISK
Revision:             2.5+
Compliance:           SPC-3
User Capacity:        1,099,511,627,776 bytes [1.09 TB]
Logical block size:   512 bytes
LU is thin provisioned, LBPRZ=0
Serial number:        4d7e0d8b-ee66-4aef-ac60-bc52cd560403
Device type:          disk
Local Time is:        Sun Nov 14 06:57:22 2021 UTC
SMART support is:     Unavailable - device lacks SMART capability.

=== START OF READ SMART DATA SECTION ===
Current Drive Temperature:     0 C
Drive Trip Temperature:        0 C
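
A rough sketch of what "handle quietly" could look like, assuming the "SMART support is:" line shown above is a reliable signal. `smart_available` is a hypothetical helper, not part of the ZenPack; a caller could use it to skip such devices without logging a parse failure:

```python
def smart_available(text):
    """Return False when smartctl says the device lacks SMART capability."""
    for line in text.splitlines():
        if line.startswith('SMART support is:'):
            status = line.split(':', 1)[1].strip()
            if status.startswith('Unavailable'):
                return False
    # No explicit "Unavailable" line: treat the device as worth parsing.
    return True
```

Drives that never print a "SMART support is:" line, like the Seagate SAS disk earlier in the thread, fall through to True and are parsed as usual.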

daviswr commented 3 years ago

Pull the latest and see if anything's better. Your smartctl examples have been really helpful since I don't have any SCSI/SAS gear around. Also, a friend has sent me a bunch of output from his various servers with SAS and NVMe that I'll be looking over.

sempervictus commented 3 years ago

Pulled and modeling - seems that everything is reading single-digit overall-health values now, but then again a lot of these systems are heavily used and not exactly brand-new.

sempervictus commented 3 years ago

Also this is now happening: [image]

daviswr commented 3 years ago

The name change is expected; including the -d field reported by --scan seemed to be the best way to keep indexed devices unique.
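
To illustrate why the -d field helps with uniqueness, here is a sketch that splits one line of `smartctl --scan` output into a path and a device type. `parse_scan_line` and `device_id` are hypothetical names, the ID format is an assumption rather than what the ZenPack actually stores, and the sample lines in the usage note are illustrative only:

```python
def parse_scan_line(line):
    """Split one scan line into (device path, -d type or None)."""
    # Everything after '#' is a human-readable comment.
    fields = line.split('#', 1)[0].split()
    if not fields:
        return None
    path = fields[0]
    dev_type = None
    if '-d' in fields[1:]:
        idx = fields.index('-d', 1)
        if idx + 1 < len(fields):
            dev_type = fields[idx + 1]
    return path, dev_type


def device_id(path, dev_type):
    """Build an identifier that stays unique when paths are shared."""
    if dev_type:
        return '{0} -d {1}'.format(path, dev_type)
    return path
```

Behind a RAID controller, several disks can share one path (for example `/dev/bus/0 -d megaraid,0` and `/dev/bus/0 -d megaraid,1`), so the path alone would collide while the combined identifier does not.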

If the health score values being graphed don't appear to match what smartctl's saying, I'd definitely like to know.

sempervictus commented 3 years ago

Will keep an eye on those as well. I think for the /dev/XdY disks, the -d auto can be erased. For the more complex ones, definitely want the param in there.

daviswr commented 3 years ago

-d auto will now be hidden from the device title if present - 8d35ae409c564d4da5de3c28b852f4d04b68c235
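
The referenced commit is not reproduced here; as a sketch of the behavior described, assuming the device identifier is a string of the form "path -d type", a title helper (hypothetical name `device_title`) might look like:

```python
def device_title(identifier):
    """Hide a trailing ' -d auto' from the title; keep other -d types."""
    suffix = ' -d auto'
    if identifier.endswith(suffix):
        return identifier[:-len(suffix)]
    return identifier
```

With this, `/dev/sdc -d auto` would display as `/dev/sdc`, while something like `/dev/bus/0 -d megaraid,0` keeps its parameter.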

sempervictus commented 2 years ago

I think we're good here - output is stable, and the remaining output issues can be tracked in #6