Feature Request - Storage Monitoring

Napsty / check_esxi_hardware

Monitoring Plugin to check the hardware of VMware ESXi servers.

https://www.claudiokuenzler.com/monitoring-plugins/check_esxi_hardware.php


Feature Request - Storage Monitoring #50

Closed: Optimaximal closed this issue 3 years ago.

Optimaximal commented 3 years ago

Is it possible to add storage monitoring to the plugin (including both physical & virtual disks and/or data stores) and also allow the exclusion of storage monitoring (with something like -nodisk)?

The plugin currently reports actual/predicted drive failures in the warning message. However, I use the plugin with Icinga2 and have all the commands configured as separate services in a set, so every one of those services is now failing because of a single drive failure in the RAID.

Edit - for reference, this is on Dell PowerEdge hardware.

Napsty commented 3 years ago

> add storage monitoring to the plugin

Do you mean VMFS datastores? No, that is not possible: a datastore is not a physical element and therefore won't appear in the CIM elements. Physical drives, however, are usually in the CIM element list, as long as the server hardware supports it, as you can already see from your output.

> exclusion of storage monitoring (with something like -nodisk)

Yes, you can use the -i/--ignore parameter together with the -r/--regex parameter to ignore all drives. Something like this should do it:

./check_esxi_hardware.py -H esxiserver -U root -P pass -i "Drive,Disk" -r
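To illustrate what the -i/-r combination does, here is a minimal Python sketch of comma-separated ignore-list matching. It is a hypothetical re-implementation, not the plugin's actual code: the helper name build_ignore_matcher and the element names are made up, and it assumes that with -r each pattern is searched anywhere in the element name, while without -r an entry must match the element name exactly. The real plugin may behave differently.

```python
import re


def build_ignore_matcher(ignore_list: str, use_regex: bool):
    """Return a predicate telling whether a CIM element name should be ignored.

    Hypothetical sketch for illustration only; the real plugin's matching
    logic may differ.
    """
    # The ignore list is comma-separated, so patterns cannot contain commas.
    patterns = [p.strip() for p in ignore_list.split(",") if p.strip()]
    if use_regex:
        compiled = [re.compile(p) for p in patterns]
        # Assumed: with regex matching, "Drive" matches anywhere in the name.
        return lambda name: any(rx.search(name) for rx in compiled)
    # Assumed: without regex matching, only exact element names are ignored.
    exact = set(patterns)
    return lambda name: name in exact


# Made-up element names, roughly as they might appear in verbose output.
elements = [
    "Disk 0 on RAID Controller",
    "Drive Bay 1 Status",
    "System Board 1 Fan 1",
    "Power Supply 1",
]

ignored = build_ignore_matcher("Drive,Disk", use_regex=True)
checked = [e for e in elements if not ignored(e)]
print(checked)  # ['System Board 1 Fan 1', 'Power Supply 1']
```

Without -r in this sketch, "Drive,Disk" would ignore nothing, because no element is named exactly "Drive" or "Disk"; that is why the regex switch matters for ignoring a whole class of elements.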

Optimaximal commented 3 years ago

Sorry, mentioning datastores was probably a red herring. I was referring to the physical disks visible in the iDRAC and the associated virtual disk(s) created on the onboard PERC, which are mounted as local datastores in ESXi.

Your plugin does not seem to have an option to query the storage exclusively (excluding all other elements); there is only the general alarm when there's a warning. For the same reason, there also doesn't seem to be any way of collecting perf history for the storage.

I will use your ignore parameters to suppress the warnings on the individual cloned services, but unless I'm missing something obvious (or you're suggesting using -i to ignore everything except Drive and Disk), there's currently no way I can set up a service clone that is geared only towards monitoring the storage elements.

Napsty commented 3 years ago

> I was referring to the Physical Disks

Yes, physical disks/drives are monitored, as long as they appear in the elements sent by the CIM server (use verbose mode to see the list of CIM elements). If the physical drives don't show up in the list, then you might need to install additional VIBs from the hardware vendor (Dell OpenManage Offline Bundle and VIB for ESXi).

> query the storage exclusively

That's right: the plugin checks all CIM elements, and with the -i list you can define which elements to exclude from the check.

> set up a service clone that is just geared towards monitoring the storage elements

Although this could be done using the ignore list, I don't understand what you are trying to achieve with it. Why not simply check all the CIM elements (hardware parts) and get alerted if one element fails? The plugin output tells you which kind of element/hardware failed. Maybe I just haven't seen a practical use case for this before...
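Optimaximal's idea of using -i to ignore everything except Drive and Disk could, in principle, be expressed as a single regex with a negative lookahead. The Python sketch below only demonstrates the pattern with the standard re module; the element names are made up, and whether the plugin applies -r patterns to element names in a way that makes this work is an assumption that has not been tested against the plugin.

```python
import re

# Hypothetical ignore pattern: it matches (and therefore ignores) every element
# whose name does NOT contain "Drive" or "Disk". The leading ^ anchors the
# negative lookahead at the start of the name, so re.match and re.search agree.
# The pattern contains no commas, which matters for a comma-separated list.
STORAGE_ONLY = re.compile(r"^(?!.*(?:Drive|Disk))")

# Made-up element names for illustration.
elements = [
    "Disk 0 on RAID Controller",
    "Drive Bay 1 Status",
    "System Board 1 Fan 1",
    "Power Supply 1",
]

# Keep only the elements the pattern does not ignore.
checked = [e for e in elements if not STORAGE_ONLY.search(e)]
print(checked)  # ['Disk 0 on RAID Controller', 'Drive Bay 1 Status']
```

If the plugin does search element names with these patterns, the corresponding call would presumably look like this (single quotes stop an interactive bash from treating ! as history expansion):

./check_esxi_hardware.py -H esxiserver -U root -P pass -i '^(?!.*(?:Drive|Disk))' -r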

Optimaximal commented 3 years ago

It could be force of habit: I like to have granular information, and I'm also learning Icinga2 and pushing it as hard as possible.

I've checked the verbose output and, yes, the drive information is clearly there. What was the reasoning behind not producing perf data for the drives, which would justify a separate exclusion option etc.?

Obviously it's your plugin = your choice. Maybe I need to learn Python and fork it 😄

Napsty commented 3 years ago

> what was the reasoning behind not producing perf data for the drive

Because the drives don't have any perf data on a CIM level. They only show their current status. You could only get performance data such as I/O from the OS (ESXi).

So if I understand you correctly, your feature request would be an exclusive parameter to monitor only specific elements (the opposite of the ignore parameter)? Is that right? (Even though I still don't see what there is to gain from defining multiple service checks.)

Optimaximal commented 3 years ago

Yes, I've just reviewed the output and see that the only data available is what is displayed, which is annoying.

You'd think it would be sensible for Dell to expose something like capacity metrics and more SMART information, but c'est la vie... I'm not sure what installing additional VIBs would involve: are they installed onto ESXi, or on the server executing the script, like SNMP MIBs (in this case, Icinga2 running on Ubuntu 18.04)?

I guess the feature request would be an optional --no-disk parameter that behaves the same way as the other options, effectively ignoring disk-related items. The perf data would then be an OK, WARN (for predicted failures) or CRITICAL result for each disk.
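The proposed per-disk perf data could look roughly like the following Python sketch. Everything in it is hypothetical: --no-disk is only the option proposed in this thread, and the function disk_perfdata, the health strings and the drive labels are illustrative assumptions. Because CIM only exposes each drive's current status, the "metric" here is simply the state code (0 = OK, 1 = WARNING, 2 = CRITICAL), emitted in the usual Monitoring Plugins perfdata form 'label'=value;warn;crit;min;max.

```python
from typing import Dict, Tuple

# Standard monitoring plugin return codes.
OK, WARNING, CRITICAL = 0, 1, 2

# Hypothetical mapping from drive health text to a state code.
STATE_BY_HEALTH = {
    "OK": OK,
    "Predictive Failure": WARNING,
    "Failed": CRITICAL,
}


def disk_perfdata(disks: Dict[str, str], no_disk: bool = False) -> Tuple[int, str]:
    """Return (overall state, perfdata string) for a set of drives.

    disks maps a drive label to its health text. no_disk mimics the proposed
    (hypothetical) --no-disk option by skipping drives entirely.
    """
    if no_disk:
        return OK, ""
    worst = OK
    parts = []
    for label, health in sorted(disks.items()):
        # Unknown health text is treated as CRITICAL here (an arbitrary choice).
        state = STATE_BY_HEALTH.get(health, CRITICAL)
        worst = max(worst, state)
        # warn=0 and crit=1 follow the threshold-range convention: a value
        # above 0 raises a warning and a value above 1 is critical.
        parts.append(f"'{label}'={state};0;1;0;2")
    return worst, " ".join(parts)


state, perf = disk_perfdata({"Disk 0": "OK", "Disk 1": "Predictive Failure"})
print(state, perf)  # 1 'Disk 0'=0;0;1;0;2 'Disk 1'=1;0;1;0;2
```

Taking the worst per-disk state as the overall result mirrors how a single failed element already turns the whole check non-OK.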

Optimaximal commented 3 years ago

I'm going to close this issue as I've realised that the iDRAC monitoring plugin can grab all the information from the server via a more direct means.

Thanks for your work anyway 😄