flapjack / flapjack

Monitoring notification routing + event processing system. For issues with the Flapjack packages, please see https://github.com/flapjack/omnibus-flapjack/
http://flapjack.io
MIT License

Random missing notification #916

Open phoebuspf opened 8 years ago

phoebuspf commented 8 years ago

Randomly, I only receive the recovery notification without ever receiving the critical notification.

2015-11-04T17:24:25.392297-08:00 [INFO] :: flapjack-processor :: Generating notification for event xxx.xxx.com:API-Peeps-URLCheck, service, critical, HTTP CRITICAL - Invalid HTTP response received from host: HTTP/1.1 502 Proxy Error, 2015-11-04 17:24:25 -0800
2015-11-04T17:26:25.551119-08:00 [INFO] :: flapjack-processor :: Generating notification for event xxx.xxx.com:API-Peeps-URLCheck, service, ok, HTTP OK: Status line output matched "200" - 18892 bytes in 0.179 second response time, 2015-11-04 17:26:25 -0800
2015-11-04T17:26:25.589330-08:00 [INFO] :: flapjack-notifier :: Enqueueing email alert for xxx.xxx.com:API-Peeps-URLCheck to xxxx@xxxx.com type: recovery rollup: -
2015-11-04T17:26:25.600752-08:00 [INFO] :: flapjack-notifier :: Enqueueing sms_twilio alert for xxx.xxx.com:API-Peeps-URLCheck to 6502186026 type: recovery rollup: -
2015-11-04T17:26:25.615102-08:00 [INFO] :: flapjack-notifier :: Enqueueing slack alert for xxx.xxx.com:API-Peeps-URLCheck to my-channel type: recovery rollup: -
2015-11-04T17:26:25.620278-08:00 [DEBUG] :: flapjack-slack :: payload: "{\"channel\":\"#my-channel\",\"username\":\"Flapjack\",\"text\":\"\n\n\n\n\n\n\nRecovery: 'API-Peeps-URLCheck' on xxx.xxx.comis OKat 4 Nov 17:26, HTTP OK: Status line output matched \\\"200\\\" - 18892 bytes in 0.179 second response time\n\n\",\"icon_emoji\":\":bawl:\"}"
2015-11-04T17:26:25.761432-08:00 [INFO] :: flapjack-email :: Sent alert successfully: Alert via email:XXXXXXXX to contact 152e914b-ca48-40ce-c8e3-5010b7faee2b (john walker): Recovery - 'API-Peeps-URLCheck' on xxx.xxx.com is OK - HTTP OK: Status line output matched "200" - 18892 bytes in 0.179 second response time
2015-11-04T17:26:26.151929-08:00 [DEBUG] :: flapjack-slack :: server response: ok
2015-11-04T17:26:26.152391-08:00 [INFO] :: flapjack-slack :: Sent alert successfully: Alert via slack:my-channel to contact 1528e14b-ca48-40ce-c8e3-5010b7faee2b (john walker): Recovery - 'API-Peeps-URLCheck' on xxx.xxx.com is OK - HTTP OK: Status line output matched "200" - 18892 bytes in 0.179 second response time
2015-11-04T17:26:26.152535-08:00 [DEBUG] :: flapjack-slack :: Sent message via Slack, response status is 200, notification_id: 30641960-1446686785.614842
2015-11-04T17:26:26.152728-08:00 [DEBUG] :: flapjack-slack :: slack gateway is going into blpop mode on slack_notifications
2015-11-04T17:26:26.201918-08:00 [INFO] :: flapjack-sms_twilio :: Sent alert successfully: Alert via sms_twilio:6502186026 to contact 1528e14b-ca48-40ce-c8e3-5010b7faee2b (john walker): Recovery - 'API-Peeps-URLCheck' on xxx.xxx.com is OK - HTTP OK: Status line output matched "200" - 18892 bytes in 0.179 second response time

phoebuspf commented 8 years ago

The first flapjack-processor notification (the critical) didn't produce an alert, but the second one (the recovery) did, so I only received the recovery without the critical. This is random; it happened twice today, out of around 30 events.

ghost commented 8 years ago

This is likely due to faulty logic on the recovery, which fires even when the initial notification delay hasn't passed for the originating problem event. I haven't had time to debug it for Flapjack v1 (master changes the approach a little and records the last notification state, so it avoids this error state).
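
For what it's worth, here is a rough sketch (in Ruby, not Flapjack's actual source) of the guard being described: record whether a problem alert was actually sent for a check, and suppress the recovery alert if it was not (for example, because the initial notification delay had not yet passed). The class and method names are purely illustrative.

```ruby
# Illustrative sketch only, not Flapjack's code.
class NotificationState
  def initialize
    @last_problem_alert = {}  # check id => time the problem alert was sent
  end

  # Call this when a problem (critical/warning) alert is actually delivered.
  def problem_alert_sent(check_id, time = Time.now)
    @last_problem_alert[check_id] = time
  end

  # Only alert on recovery if a problem alert was actually delivered first.
  def notify_recovery?(check_id)
    !@last_problem_alert.delete(check_id).nil?
  end
end

state = NotificationState.new
# A critical arrived, but recovery came back before any problem alert was sent,
# so the recovery is suppressed rather than arriving on its own:
puts state.notify_recovery?('xxx.xxx.com:API-Peeps-URLCheck')  # => false
```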

phoebuspf commented 8 years ago

@ali-graham thank you for the reply. Looking at the timestamps, there are two minutes before the event recovers; is it a bug that flapjack-notifier was not triggered for the critical?

ghost commented 8 years ago

On 05/11/15 13:03, Franky wrote:

> @ali-graham https://github.com/ali-graham thank you for the reply, looking at the timestamp, there are 2 minutes before the event recover, is it a bug that the flapjack-notifier not triggered?

I may be describing a separate bug, sorry. :/

jessereynolds commented 8 years ago

I guess the place to start here would be a cucumber test that replicates the timing of the check results arriving to be processed and seeing if you can reproduce the problem that way.
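
For anyone wanting to replay that timing outside of cucumber, here is a minimal sketch that pushes a critical event followed two minutes later by an ok event, matching the gap in the logs above. It assumes Flapjack v1's documented JSON event format and the default 'events' Redis queue; the entity/check names and Redis connection details are placeholders to adjust for your setup.

```ruby
# Minimal reproduction sketch, assuming Flapjack v1's event format and default queue.
require 'redis'
require 'json'

redis = Redis.new(host: '127.0.0.1', port: 6379)

def push_event(redis, state, summary, time)
  # Flapjack v1 reads JSON events from the 'events' Redis list.
  redis.lpush('events', JSON.generate(
    'entity'  => 'xxx.xxx.com',
    'check'   => 'API-Peeps-URLCheck',
    'type'    => 'service',
    'state'   => state,
    'summary' => summary,
    'time'    => time.to_i
  ))
end

push_event(redis, 'critical', 'HTTP CRITICAL - 502 Proxy Error', Time.now)
sleep 120  # the two-minute gap seen in the logs above
push_event(redis, 'ok', 'HTTP OK: Status line output matched "200"', Time.now)
```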

phoebuspf commented 8 years ago

> I guess the place to start here would be a cucumber test that replicates the timing of the check results arriving to be processed and seeing if you can reproduce the problem that way.

Cannot reproduce...

jessereynolds commented 8 years ago

@phoebuspf if you cannot provide steps to reproduce then there's not much we can do.

phoebuspf commented 8 years ago

Thank you for the reply. The thing is that I was not able to reproduce it... I wish I could, but it has not shown up again and I have no clue. What would happen if the current entities are in an error state, and then those entities are replaced (removed) and Nagios is restarted with new entities that are likely different from the previous ones?

phoebuspf commented 8 years ago

Hi, could you simulate this on your machine?

#!/bin/bash
for i in `seq 1 10`; do
  echo $i
  nohup flapjack simulate fail --entity apis-test$i --check bacon --interval 1 --time 0.6 &
done

phoebuspf commented 8 years ago

The above script on my machine fired only 3 slack notifications (where flapjack-notifier was triggered) instead of 10, and what is worse, one of the 3 failed with an empty text:

2015-11-25T12:03:30.915883-08:00 [DEBUG] :: flapjack-slack :: payload: "{\"channel\":\"#XXX-flapjack\",\"username\":\"Flapjack\",\"text\":\"\",\"icon_emoji\":\":bawl:\"}"

which results in the same missing-notification problem I described earlier. Could you provide your result?

jessereynolds commented 8 years ago

I wonder if this is the rollup threshold taking effect. What is your contact's rollup threshold set to for slack?
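
For context, rollup means that once a contact has at least rollup_threshold failing checks on a given medium, Flapjack sends rolled-up summary notifications instead of one alert per check. The sketch below is purely illustrative (not Flapjack's code, and the threshold value is hypothetical), but it shows why a low threshold on the slack medium could account for seeing only a few individual slack alerts out of ten simulated failures.

```ruby
# Illustrative only: how a rollup threshold changes alerting once several
# checks are failing at once. The threshold value here is hypothetical.
ROLLUP_THRESHOLD = 3

def alert_mode(failing_checks, threshold = ROLLUP_THRESHOLD)
  failing_checks.length >= threshold ? :rollup_summary : :individual_alerts
end

failing = (1..10).map { |i| "apis-test#{i}:bacon" }
puts alert_mode(failing.first(2))  # => individual_alerts
puts alert_mode(failing)           # => rollup_summary (one combined notification)
```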