bb-Ricardo / check_redfish

A monitoring/inventory plugin to check components and health status of systems which support Redfish. It will also create an inventory of all components of a system.
MIT License

Event plugin: some events are filtered by default: Power supply redundancy is lost #108

Closed weeboo closed 10 months ago

weeboo commented 1 year ago

With the event plugin, some events do not appear, for example: "Critical Redundancy Lost Power supply redundancy is lost." and "Critical Assert The power input for power supply 2 is lost."

It seems that they are filtered by default.

Is this normal? And how can I unfilter them?

bb-Ricardo commented 1 year ago

Hi,

Not sure where this issue comes from.

Can you post the command and the output here?

Also, which type of server are you querying and which version of this plugin are you using?

That would help a lot to narrow down the issue.

Thank you.

weeboo commented 1 year ago

Hi,

Version: 1.5.0 (2023-02-24)

This is a Dell PowerEdge R740. The command and the output are in the attached file sel-verbose.txt

bb-Ricardo commented 1 year ago

Hi,

Yes, they are omitted because they have been resolved. You should be able to see them if you add the --detailed command line option.

weeboo commented 1 year ago

OK, with the --detailed option I see:

    [OK]: 2023-02-16T18:44:59+01:00: The power supplies are redundant.
    [OK]: 2023-02-16T18:44:56+01:00: The input power for power supply 2 has been restored.
    [OK]: 2023-02-16T07:54:24+01:00: Power supply redundancy is lost. (severity 'CRITICAL' cleared)
    [OK]: 2023-02-16T07:54:16+01:00: The power input for power supply 2 is lost. (severity 'CRITICAL' cleared)

But it would be useful to have an argument that reports such cleared events as a warning, for example to catch power flapping.

bb-Ricardo commented 1 year ago

good point,

I will try to come up with a solution.

bb-Ricardo commented 1 year ago

hey @weeboo,

I just added a flap detection for this case:

    # if a log entry has been auto cleared this amount of times within the alert level time range
    # then issue an additional WARNING message
    flapping_threshold_critical = 2
    flapping_threshold_warning = 5

would you be able to test this?

weeboo commented 1 year ago

Hello, thanks, I will test it next week

weeboo commented 1 year ago

Hello, I was unable to reproduce the problem. Can you clarify in which case this should match? I'm not sure I understand the logic.

bb-Ricardo commented 1 year ago

> OK, with the --detailed option I see:
>
> [OK]: 2023-02-16T18:44:59+01:00: The power supplies are redundant.
> [OK]: 2023-02-16T18:44:56+01:00: The input power for power supply 2 has been restored.
> [OK]: 2023-02-16T07:54:24+01:00: Power supply redundancy is lost. (severity 'CRITICAL' cleared)
> [OK]: 2023-02-16T07:54:16+01:00: The power input for power supply 2 is lost. (severity 'CRITICAL' cleared)
>
> But it would be useful to have an argument that reports such cleared events as a warning, for example to catch power flapping.

If you run the plugin with --critical 1 --warning 3, unplug one power supply for a minute and plug it back in, then no alarm should appear. If you do it a second time, a WARNING alarm about a possibly flapping alarm state should show up.

Would you be able to test this scenario?
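
For readers trying to follow the logic, the scenario above can be sketched roughly as follows. This is only an illustration, not the plugin's actual code: the `LogEntry` type and the `flapping_warnings` function are hypothetical names, and the sketch assumes that the alert level time range is given in days and that auto cleared entries are counted per message text.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta

# Thresholds as quoted earlier in this thread.
FLAPPING_THRESHOLDS = {"CRITICAL": 2, "WARNING": 5}


@dataclass
class LogEntry:
    """Hypothetical, simplified view of one event log entry."""
    timestamp: datetime
    message: str
    severity: str   # original severity, e.g. "CRITICAL"
    cleared: bool   # True if the entry has been auto cleared


def flapping_warnings(entries, now, max_age_days, thresholds=FLAPPING_THRESHOLDS):
    """Return WARNING messages for entries that were auto cleared too often.

    max_age_days maps a severity to the alert time range in days
    (assumption: this is what --critical / --warning express here).
    """
    counts = Counter()
    for entry in entries:
        if not entry.cleared or entry.severity not in thresholds:
            continue
        cutoff = now - timedelta(days=max_age_days[entry.severity])
        if entry.timestamp >= cutoff:
            # Assumption: count repeats of the same message, so a single
            # outage that logs two different messages does not trigger.
            counts[(entry.severity, entry.message)] += 1
    return [
        f"[WARNING]: '{message}' ({severity}) was auto cleared {count} times: possible flapping"
        for (severity, message), count in counts.items()
        if count >= thresholds[severity]
    ]
```

Under these assumptions, a single unplug and replug stays below the CRITICAL threshold of 2 for each message, while a second one within the time range reaches it and produces the additional WARNING.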

bb-Ricardo commented 10 months ago

published with version 1.6.0