bb-Ricardo / check_redfish

A monitoring/inventory plugin to check components and health status of systems which support Redfish. It will also create an inventory of all components of a system.
MIT License

First output line order is random #115

Closed Kegeruneku closed 6 months ago

Kegeruneku commented 1 year ago

Hello,

As far as I understand, the plugin output order is grouped by status (CRITICAL/WARNING/UNKNOWN/OK) but in the "same group", output order is random.

The problem is, for quite a lot of Nagios-compatible interfaces (Nagios, Icinga web 1, Thruk...), the "main" output is the first line and visible as plugin output, and you may select the individual service (or hover on it for a full output tooltip)

I suppose, especially compared with other plugins, that some status query outputs should be prioritized in the output.

I feel that "--info" output should be enabled by default and be the first output line, which would be an easy fix for this (unless overridden by a != "OK" status with higher priority).

What do you think ? :-)

Have a nice day !

bb-Ricardo commented 1 year ago

Hi,

I'm aware of this behaviour. Do you have an example?

The plugin orders the output lines by their severity. CRITICAL alarms will always be at the top, WARNING lines come next, and OK lines come below the WARNING and CRITICAL lines.

Can you describe your use case, please?
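The severity ordering described above could be sketched as follows. This is only an illustration, not the plugin's actual code: the `SEVERITY_RANK` table, the `order_by_severity` helper and the sample messages are all hypothetical, and the position of UNKNOWN is assumed from the grouping mentioned in the first comment.

```python
# Hypothetical severity ranking (assumption): a lower rank is printed first.
SEVERITY_RANK = {"CRITICAL": 0, "WARNING": 1, "UNKNOWN": 2, "OK": 3}


def order_by_severity(lines):
    """Sort (status, text) pairs by severity.

    sorted() is stable, so lines with the same status keep the order
    in which they were collected.
    """
    return sorted(lines, key=lambda line: SEVERITY_RANK[line[0]])


# Made-up sample messages in the order they were collected.
collected = [
    ("OK", "All storage controllers are in good condition"),
    ("CRITICAL", "General chassis intrusion"),
    ("OK", "INFO: Supermicro SSG-620P-E1CR24H"),
    ("WARNING", "Memory module DIMM A3 status is: WARNING"),
]

for status, text in order_by_severity(collected):
    print(f"[{status}]: {text}")
```

With a sort like this, the order inside one severity group is simply the collection order, so which OK line ends up first depends on how the messages were gathered, not on the sort itself.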

Kegeruneku commented 12 months ago

Hello :-)

I have no issue with the output lines being ordered by their "importance" level. My problem is that when there is no issue with the server, depending on the selected output specifiers, you get a random status on the first output line (which is often the one that is displayed in monitoring software unless you ask for the service status detail).

Example:

# ./check_redfish.py -H XXXXXX -u ADMIN -p XXXXXX --info --storage
[OK]: All storage controllers (4), volumes (25) and disk drives (34) are in good condition
[OK]: INFO: Supermicro SSG-620P-E1CR24H (CPU: 2, MEM: 256GB) - BIOS: 1.4 - Serial: XXXXXX - Power: On - Name: NOT SET

In this case, I would expect the INFO line to be the first one, as in the monitoring this is the one I want as status overview, like:

[image]

and not something like:

[image]

I feel that for most users, having a "standard" output if everything's OK makes having a quick visual glance at your monitoring dashboard much easier since outliers are visible immediately.

To me, a simple solution would be that "--info" is enabled by default unless specified otherwise and the first output line, so if everything looks OK you always get as first output something like:

[OK]: INFO: Dell Inc. PowerEdge R740xd2 (CPU: 2, MEM: 256GB) - BIOS: 2.5.4 - Serial: XXXXXXXXXX - ServiceTag: XXXXXXXX - Power: On - Name: NOT SET - 47 health sensors are in 'OK' state

What do you think ?

bb-Ricardo commented 12 months ago

Hi,

First of all, I would recommend separating each request into a different service. You are combining more than one request into a single service.

What you could try is moving the info query up in the plugin. This way it should always show up first if you combine requests like --info --storage.

https://github.com/bb-Ricardo/check_redfish/blob/d5c4ee96d24266c6b172c31758682c187a2be8ad/check_redfish.py#L167-L178

Try to move the info above all the other ones and try to run the script again.

Kegeruneku commented 11 months ago

Hmmm...

--- a/check_redfish.py
+++ b/check_redfish.py
@@ -162,6 +162,7 @@ if __name__ == "__main__":
     # get basic information
     plugin.rf.determine_vendor()

+    if any(x in args.requested_query for x in ['info', 'all']):     get_system_info()
     if any(x in args.requested_query for x in ['power', 'all']):    get_chassi_data(PowerSupply)
     if any(x in args.requested_query for x in ['temp', 'all']):     get_chassi_data(Temperature)
     if any(x in args.requested_query for x in ['fan', 'all']):      get_chassi_data(Fan)
@@ -170,7 +171,6 @@ if __name__ == "__main__":
     if any(x in args.requested_query for x in ['nic', 'all']):      get_network_interfaces()
     if any(x in args.requested_query for x in ['storage', 'all']):  get_storage()
     if any(x in args.requested_query for x in ['bmc', 'all']):      get_bmc_info()
-    if any(x in args.requested_query for x in ['info', 'all']):     get_system_info()
     if any(x in args.requested_query for x in ['firmware', 'all']): get_firmware_info()
     if any(x in args.requested_query for x in ['mel', 'all']):      get_event_log("Manager")
     if any(x in args.requested_query for x in ['sel', 'all']):      get_event_log("System")

Seems not to work (same output ordering), I suspect the output order depends more on the stacking of output messages in the PluginData class, right ?

bb-Ricardo commented 11 months ago

Will have to check once I'm back.

bb-Ricardo commented 6 months ago

Hi,

Sorry for the long wait. I finally got around to it and added a fixed order for the command output.

Would you be able to check out the next-release branch and see if this fixes the issue with the order?

Thank you
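One way a fixed command order can be combined with severity ordering is a two-part sort key. The sketch below is a guess at the general idea and not the code from the next-release branch: `COMMAND_ORDER`, `sort_key`, `render` and the shortened messages are hypothetical, and the real command order in the plugin may differ.

```python
# Hypothetical severity ranking (assumption): a lower rank is printed first.
SEVERITY_RANK = {"CRITICAL": 0, "WARNING": 1, "UNKNOWN": 2, "OK": 3}

# Hypothetical fixed command order (assumption): "info" always comes first.
COMMAND_ORDER = ["info", "mem", "storage", "bmc", "sel"]


def sort_key(entry):
    """Order by severity first, then by the fixed position of the command."""
    status, command, _text = entry
    return (SEVERITY_RANK[status], COMMAND_ORDER.index(command))


def render(entries):
    """Return the formatted output lines in their final order."""
    ordered = sorted(entries, key=sort_key)
    return [f"[{status}]: {text}" for status, _command, text in ordered]


# Made-up entries, deliberately collected with storage before info.
entries = [
    ("OK", "storage", "All storage controllers are in good condition"),
    ("OK", "info", "INFO: Supermicro SSG-620P-E1CR24H"),
]

for line in render(entries):
    print(line)
```

Because the command rank is part of the key, the result no longer depends on the order in which the queries ran or the messages were stored, and the INFO line stays first within its severity group.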

Kegeruneku commented 6 months ago

Hello @bb-Ricardo ! Yep, that seems to do the trick, you're awesome !

Thank you :-)

Kegeruneku commented 6 months ago

I'll let you close the issue if you do not need me to test further, else just tell me ! (tested good on both Supermicro and Dell "recent" boxes, works as expected in OK, WARNING and CRITICAL cases)

Dell, faulty RAM:

# ./check_redfish.py -H XXXXXX -u root -p XXXXXX --info --storage --mem
[WARNING]: INFO: Dell Inc. PowerEdge R440 (CPU: 2, MEM: 512GB) - BIOS: 2.10.2 - Serial: XXXXXXXXXX - ServiceTag: XXXXXXXX - Power: On - Name: xxxxxxxxx - 1 health sensor in 'WARNING' state, 50 health sensors are in 'OK' state
[WARNING]: Sensor "DIMM SLOT A3": Degraded/Warning (Enabled/Non-Critical error)
[WARNING]: Memory module DIMM A3 (32.0GB) status is: WARNING
[OK]: All storage controllers (3), volumes (1) and disk drives (2) are in good condition

Supermicro, tripped intrusion sensor:

# ./check_redfish.py -H XXXXXX -u ADMIN -p XXXXXX --info --storage --sel
[CRITICAL]: INFO: Supermicro SSG-620P-E1CR24H (CPU: 2, MEM: 256GB) - BIOS: 1.4 - Serial: XXXXXXXXXX - Power: On - Name: NOT SET
[CRITICAL]: 2023-09-21T19:27:20Z: [SEC-0000] General chassis intrusion
[OK]: BMC: ASPEED (Firmware: 01.01.34) and all nics are in 'OK' state.
[OK]: All storage controllers (5), volumes (25) and disk drives (30) are in good condition

Supermicro, all good:

# ./check_redfish.py -H XXXXXX -u ADMIN -p XXXXXX --info --storage
[OK]: INFO: Supermicro SSG-620P-E1CR24H (CPU: 2, MEM: 256GB) - BIOS: 1.4 - Serial: XXXXXXXXXX - Power: On - Name: NOT SET
[OK]: All storage controllers (5), volumes (25) and disk drives (30) are in good condition

bb-Ricardo commented 6 months ago

That sounds great, thank you for the feedback.