Icinga / icinga2

The core of our monitoring platform with a powerful configuration language and REST API.
https://icinga.com/docs/icinga2/latest
GNU General Public License v2.0

Problem notification with OK State #9720

Open BT-Danny opened 1 year ago

BT-Danny commented 1 year ago

Describe the bug

We encountered a weird behaviour several times:

Sometimes, when the state changes too often, Icinga seems to get confused by all the state changes. You can see in the given example how often the state changes, but the criticals are always in soft states. When it changes to hard state OK, it seems to be confused by the check output, which is critical, and it sends a problem notification with state OK.

The Service should be in State Critical, according to the Check

To Reproduce

  1. Service is flapping
  2. Service exits the flapping state with state OK
  3. Check is critical but state is OK
  4. Icinga sends a Problem notification with state OK

Expected behavior

Icinga shouldn't send a Problem notification when the service state is "OK". There should be an internal evaluation, since a Problem notification doesn't correspond to an OK state.

Screenshots

image

image

Your Environment

julianbrost commented 1 year ago

Do you know what the check actually returned at the time in question? So can you tell for sure whether the state or the output is wrong?

Can you please share available logs from that time?

BT-Danny commented 1 year ago

Well, actually the state is wrong, since the check seems to return the correct exit code, but Icinga seems to misinterpret it. Maybe because of the frequent state changes?

icinga2.log:

[2023-03-09 06:45:27 +0100] information/Checkable: Checkable 'hostXY!powershell-cpu' has 3 notification(s). Checking filters for type 'Problem', sends will be logged.
[2023-03-09 06:45:27 +0100] information/Notification: Sending 'Problem' notification 'hostXY!powershell-cpu!SMS Service Notification 7x24' for user 'SMS'
[2023-03-09 06:45:27 +0100] information/Notification: Sending 'Problem' notification 'hostXY!powershell-cpu!JIRA Notification 7x24 - Without Certificate Services' for user 'userXY'
[2023-03-09 06:45:27 +0100] information/Notification: Completed sending 'Problem' notification 'hostXY!powershell-cpu!SMS Service Notification 7x24' for checkable 'hostXY!powershell-cpu' and user 'SMS' using command 'sms-service-notification'.
[2023-03-09 06:45:27 +0100] information/Notification: Completed sending 'Problem' notification 'hostXY!powershell-cpu!JIRA Notification 7x24 - Without Certificate Services' for checkable 'hostXY!powershell-cpu' and user 'userXY' using command 'jira-service-notification'.
[2023-03-09 06:45:28 +0100] warning/PluginNotificationTask: Notification command for object 'hostXY!powershell-cpu' (PID: 14004, arguments: '/opt/icinga2/notifications/jira_notifications/monitoring_jira_notification.py' '-I' '12595' '-P' 'Blocker' '-S' 'OK' '-a' '' '-cf' 'customfield_13840,customfield_14040' '-cv' '1460,' '-h' 'hostXY' '-ipv4' 'some_ip' '-n' 'PROBLEM' '-o' '[CRITICAL] CPU Load [CRITICAL] Core Total (100%)

The last log entry comes from our custom notification script, which looks for tickets in the ticket system and creates a new ticket if it doesn't find one. The interesting parts are the arguments '-S', which indicates the state of the service/host, and '-n', which indicates the issue type. The state OK doesn't fit the issue type PROBLEM.

We implemented a check in the script that compares the state and the issue type. If they don't match, the script exits without sending any information to the ticketing system:

2023-03-09 06:45:28,036|INFO|monitoring_jira_notification.py:50|Function:check_arguments|[14004] Notification type PROBLEM and state OK do not match. Script will exit
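For readers who want to add a similar guard to their own notification scripts, here is a minimal sketch of such a consistency check. The real monitoring_jira_notification.py is not shown in this thread, so the function name `type_matches_state` and the sets of state names are assumptions for illustration only, not the script's actual code.

```python
# Hypothetical sketch: not the actual check_arguments function from the thread.
# Assumed state names for services (OK/WARNING/CRITICAL/UNKNOWN) and hosts (UP/DOWN).
PROBLEM_STATES = {"WARNING", "CRITICAL", "UNKNOWN", "DOWN"}
RECOVERY_STATES = {"OK", "UP"}


def type_matches_state(notification_type: str, state: str) -> bool:
    """Return True if the notification type is plausible for the given state."""
    ntype = notification_type.strip().upper()
    st = state.strip().upper()
    if ntype == "PROBLEM":
        return st in PROBLEM_STATES
    if ntype == "RECOVERY":
        return st in RECOVERY_STATES
    # Other types (e.g. custom or acknowledgement notifications) are not
    # tied to a particular state in this sketch, so they are let through.
    return True
```

A script could call this with the values passed via '-n' and '-S' and exit early when it returns False, as the log line above describes.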

Unfortunately, we haven't implemented this in the sms-notification script yet, which is how we became aware of this bug.

We are very confused about the actual trigger of the notification. There doesn't seem to be any reason for the service to trigger it.

Greetings: Daniel

BTMichel commented 1 year ago

Update: We faced the same error again today. A Recovery notification was sent out even though no problem hard state had been reached before: image

In this screenshot you can clearly see that the state was already OK at 13/03 04:04:18, but a Recovery notification was still sent out on the next check at 09:54:35: image

Could this be a sync issue between our two masters? We are facing incorrect notifications more and more often now.

BT-Danny commented 1 year ago

Hello @julianbrost

Did you find a solution for the problem? As BTMichel mentioned, we are facing this problem pretty often right now.

Greetings: Daniel

BT-Danny commented 1 year ago

Hello,

we are facing this problem more often than ever now. We have gathered some more information:

This morning we received several notifications regarding a "problem" whose state was reported as "OK": image

image

Regarding the second screenshot, we checked the logs and found this: [2023-04-28 13:32:48 +0200] information/Notification: Sending reminder 'Problem' notification 'hostname!servicename!Notification 7x24' for user 'Contact'

As you can see, it evaluates it as Problem, but it's shown as "OK" in icingaweb.

We saw in "inspect" that the state was set to "2", which should be "critical". I then used "process check result" to reset it manually, but at first it was ignored and the state stayed at 2. After that, I submitted "process check result" with "OK" again, and it reset the state to 0, which should be "OK".
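For anyone trying to reproduce the manual reset outside of Icinga Web, the same kind of result can be submitted through the Icinga 2 REST API action /v1/actions/process-check-result. The sketch below only builds the JSON payload for that action and sends nothing. The helper name `build_process_check_result` is hypothetical, and the filter assumes plain host and service names without quotes in them.

```python
import json

# Plugin exit codes used for service states.
STATE_CODES = {"OK": 0, "WARNING": 1, "CRITICAL": 2, "UNKNOWN": 3}


def build_process_check_result(host: str, service: str, state: str, output: str) -> dict:
    """Build the request body for POST /v1/actions/process-check-result.

    Hypothetical helper for illustration; it does not contact any server.
    """
    return {
        "type": "Service",
        "filter": f'host.name=="{host}" && service.name=="{service}"',
        "exit_status": STATE_CODES[state],
        "plugin_output": output,
    }


if __name__ == "__main__":
    body = build_process_check_result("hostXY", "powershell-cpu", "OK", "manual reset")
    print(json.dumps(body, indent=2))
```

The printed body could then be posted to the API with any HTTP client, using credentials and a port that match the local setup.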

We also checked the database, and it showed the same results as icingaweb history.

Greetings: Daniel

BT-Danny commented 1 year ago

Hello everybody,

have you found anything regarding the problem yet?

Greetings: Daniel

BT-Danny commented 1 year ago

Hello everybody, friendly reminder :)

Greetings: Daniel

BT-Danny commented 1 year ago

Hello everybody,

we recently encountered the same problem again: image

As you can see, there are some weird messages regarding a problem and an OK notification.

Much regards: Daniel

BT-Basit commented 4 months ago

Hi there,

we just experienced the problem again. Icingaweb shows that there was a notification but it wasn't sent out to any contact: image

icinga2.log shows that the notification was sent out to a specific contact and produced a ticket in our system:

[2024-06-13 14:41:59 +0200] information/Checkable: Checkable '<host>!<service>' has 1 notification(s). Checking filters for type 'Problem', sends will be logged.
[2024-06-13 14:43:21 +0200] information/Notification: Sending reminder 'Problem' notification '<host>!<service>!<notification>' for user '<my contact>'
[2024-06-13 14:43:21 +0200] information/Notification: Completed sending 'Problem' notification '<host>!<service>!<notification>' for checkable '<host>!<service>' and user '<my contact>' using command '<notification command>'.
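To compare what icinga2.log says against the state history, it can help to pull the "Sending ... notification" entries out of the log in a structured form. The following is a rough sketch that assumes the log line format quoted in this thread; the function name `parse_sent_notifications` is made up for illustration, and the pattern would need adjusting if object or user names contain single quotes.

```python
import re

# Matches lines such as:
# [ts] information/Notification: Sending reminder 'Problem' notification 'h!s!n' for user 'u'
LINE_RE = re.compile(
    r"\[(?P<ts>[^\]]+)\] information/Notification: "
    r"Sending (?P<reminder>reminder )?'(?P<type>[^']+)' notification "
    r"'(?P<name>[^']+)' for user '(?P<user>[^']+)'"
)


def parse_sent_notifications(log_text: str) -> list:
    """Extract timestamp, type, notification object and user for each send."""
    events = []
    for match in LINE_RE.finditer(log_text):
        events.append(
            {
                "timestamp": match.group("ts"),
                "type": match.group("type"),
                "reminder": match.group("reminder") is not None,
                "notification": match.group("name"),
                "user": match.group("user"),
            }
        )
    return events
```

The resulting timestamps can then be checked by hand against the state_time values in icinga_statehistory for the same object.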

On top of that, the database table icinga_statehistory shows no state change after 12 June:

MySQL [icinga]> select * from icinga_statehistory where object_id = 77877 and state_time >= '2024-06-12 00:00:00' order by state_time desc \G
*************************** 1. row ***************************
statehistory_id: 24350989
instance_id: 1
state_time: 2024-06-12 07:04:14
state_time_usec: 601166
object_id: 77877
state_change: 1
state: 0
state_type: 1
current_check_attempt: 1
max_check_attempts: 3
last_state: 2
last_hard_state: 0
output: PING OK - Packet loss = 0%, RTA = 18.29 ms
long_output:
check_source: icinga1
endpoint_object_id: 2643
*************************** 2. row ***************************

I don't understand what is happening here and why Icingaweb, icinga2.log and the database all show different behaviour.