phoebuspf opened 8 years ago
The first flapjack-processor didn't notify, but the second one did, so we only received the recovery without receiving the critical. This is random; it happened twice today, out of around 30 events.
This is likely due to faulty logic on the recovery, firing when the initial notification delay hadn't passed for the starting problem event, but I haven't had time to debug it for Flapjack v1 (master changes the approach a little and records the last notification state, so it avoids the error state).
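If you want to try to reproduce the timing, something like this might do it (a sketch; I'm writing the `fail_and_recover` mode name from memory, so check `flapjack simulate --help` for the exact modes and flags):

```bash
# Mirror the reported timing: critical results for roughly two minutes,
# then a recovery, against a check whose initial notification delay is
# longer than that gap. Entity/check names are taken from the logs below.
flapjack simulate fail_and_recover \
  --entity xxx.xxx.com --check API-Peeps-URLCheck \
  --interval 1 --time 2
```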
@ali-graham thank you for the reply. Looking at the timestamps, there were two minutes before the event recovered; is it a bug that flapjack-notifier was not triggered?
I may be describing a separate bug, sorry. :/
I guess the place to start here would be a cucumber test that replicates the timing of the check results arriving to be processed, to see if you can reproduce the problem that way.
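Something along these lines might do it, borrowing the step phrasing from the existing features/events.feature (a sketch; the exact step names and timings may need adjusting):

```gherkin
Scenario: recovery arrives two minutes after the first critical (sketch)
  Given the check is check 'API-Peeps-URLCheck' on entity 'xxx.xxx.com'
  And   the check is in an ok state
  When  a critical event is received
  And   2 minutes passes
  And   an ok event is received
  Then  a notification should not be generated
```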
Cannot reproduce...
@phoebuspf if you cannot provide steps to reproduce then there's not much we can do.
Thank you for the reply. The thing is that I was not able to reproduce it... I wish I could, but it does not show up again and I have no clue. What will happen if the current entities have errors, and then those entities are replaced (removed) and Nagios is restarted (with new entities that are likely different from the previous ones)?
Hi, could you simulate this on your machine?
```bash
#!/bin/bash
for i in `seq 1 10`; do
  echo $i
  nohup flapjack simulate fail --entity apis-test$i --check bacon --interval 1 --time 0.6 &
done
```
The above script on my machine fired just 3 Slack notifications (where flapjack-notifier was triggered) instead of 10, and what is worse, one of the 3 failed with this:
```
2015-11-25T12:03:30.915883-08:00 [DEBUG] :: flapjack-slack :: payload: "{\"channel\":\"#XXX-flapjack\",\"username\":\"Flapjack\",\"text\":\"\",\"icon_emoji\":\":bawl:\"}"
```
which results in the same not-receiving problem I described earlier (the payload's \"text\" is empty).
Could you provide your result?
I wonder if this is the rollup threshold taking effect. What is your contact's rollup threshold set to for slack?
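You can check it from the API; a sketch, assuming the JSONAPI gateway on Flapjack 1.x's default port 3081 and the contact id from your logs:

```bash
# Fetch the contact and inspect its linked media entries; each medium
# should carry a rollup_threshold field. Port, route, and the
# python-based pretty-printing are assumptions, not confirmed specifics.
curl -s http://localhost:3081/contacts/1528e14b-ca48-40ce-c8e3-5010b7faee2b \
  | python -m json.tool
```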
Randomly only receive recovery notification without receiving critical notification.

```
2015-11-04T17:24:25.392297-08:00 [INFO] :: flapjack-processor :: Generating notification for event xxx.xxx.com:API-Peeps-URLCheck, service, critical, HTTP CRITICAL - Invalid HTTP response received from host: HTTP/1.1 502 Proxy Error, 2015-11-04 17:24:25 -0800
2015-11-04T17:26:25.551119-08:00 [INFO] :: flapjack-processor :: Generating notification for event xxx.xxx.com:API-Peeps-URLCheck, service, ok, HTTP OK: Status line output matched "200" - 18892 bytes in 0.179 second response time, 2015-11-04 17:26:25 -0800
2015-11-04T17:26:25.589330-08:00 [INFO] :: flapjack-notifier :: Enqueueing email alert for xxx.xxx.com:API-Peeps-URLCheck to xxxx@xxxx.com type: recovery rollup: -
2015-11-04T17:26:25.600752-08:00 [INFO] :: flapjack-notifier :: Enqueueing sms_twilio alert for xxx.xxx.com:API-Peeps-URLCheck to 6502186026 type: recovery rollup: -
2015-11-04T17:26:25.615102-08:00 [INFO] :: flapjack-notifier :: Enqueueing slack alert for xxx.xxx.com:API-Peeps-URLCheck to my-channel type: recovery rollup: -
2015-11-04T17:26:25.620278-08:00 [DEBUG] :: flapjack-slack :: payload: "{\"channel\":\"#my-channel\",\"username\":\"Flapjack\",\"text\":\"\n\n\n\n\n\n\nRecovery: 'API-Peeps-URLCheck' on xxx.xxx.com is OK at 4 Nov 17:26, HTTP OK: Status line output matched \\\"200\\\" - 18892 bytes in 0.179 second response time\n\n\",\"icon_emoji\":\":bawl:\"}"
2015-11-04T17:26:25.761432-08:00 [INFO] :: flapjack-email :: Sent alert successfully: Alert via email:XXXXXXXX to contact 152e914b-ca48-40ce-c8e3-5010b7faee2b (john walker): Recovery - 'API-Peeps-URLCheck' on xxx.xxx.com is OK - HTTP OK: Status line output matched "200" - 18892 bytes in 0.179 second response time
2015-11-04T17:26:26.151929-08:00 [DEBUG] :: flapjack-slack :: server response: ok
2015-11-04T17:26:26.152391-08:00 [INFO] :: flapjack-slack :: Sent alert successfully: Alert via slack:my-channel to contact 1528e14b-ca48-40ce-c8e3-5010b7faee2b (john walker): Recovery - 'API-Peeps-URLCheck' on xxx.xxx.com is OK - HTTP OK: Status line output matched "200" - 18892 bytes in 0.179 second response time
2015-11-04T17:26:26.152535-08:00 [DEBUG] :: flapjack-slack :: Sent message via Slack, response status is 200, notification_id: 30641960-1446686785.614842
2015-11-04T17:26:26.152728-08:00 [DEBUG] :: flapjack-slack :: slack gateway is going into blpop mode on slack_notifications
2015-11-04T17:26:26.201918-08:00 [INFO] :: flapjack-sms_twilio :: Sent alert successfully: Alert via sms_twilio:6502186026 to contact 1528e14b-ca48-40ce-c8e3-5010b7faee2b (john walker): Recovery - 'API-Peeps-URLCheck' on xxx.xxx.com is OK - HTTP OK: Status line output matched "200" - 18892 bytes in 0.179 second response time
```