flapjack / flapjack

Monitoring notification routing + event processing system. For issues with the Flapjack packages, please see https://github.com/flapjack/omnibus-flapjack/
http://flapjack.io
MIT License

Flood of notifications skips rollup #534

Open blalor opened 10 years ago

blalor commented 10 years ago

Not sure I'll be able to describe this well (and I sure as hell don't want to reproduce it!), but here goes.

Flapjack v0.9.0. One instance running with all components (gateways, notifications, etc.) and another instance with just the processor component running.

We're still in PoC mode with Flapjack: it's adding notifications on top of our existing Nagios infrastructure. We just had an outage that resulted in a couple of thousand check failures triggering -- and then recovering -- all at once. My notification rule is rolling up alerts above a threshold of 5, at a half-hour interval for email and 5 minutes for Jabber. When the outage started I was getting individual alerts delivered to both media (below the rollup threshold). Then the alert flood came and I got hundreds of individual notifications to both Jabber and email all at once.
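For readers unfamiliar with rollup, the behaviour expected from a rule like this can be sketched roughly as follows. This is an illustration only, not Flapjack's code: `MediumRule`, `decide`, and the field names are hypothetical, and it assumes the threshold is inclusive and that the interval acts as a minimum gap between messages on a medium.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MediumRule:
    """Hypothetical per-medium settings; names are not Flapjack's."""
    name: str
    rollup_threshold: int   # number of alerting checks that triggers rollup
    interval_s: int         # assumed minimum gap between messages, in seconds


def decide(rule: MediumRule, alerting_count: int,
           last_sent_at: Optional[float], now: float) -> str:
    """Return 'suppress', 'rollup', or 'individual' for one medium."""
    # Assumption: nothing is sent on this medium until the interval has elapsed.
    if last_sent_at is not None and now - last_sent_at < rule.interval_s:
        return "suppress"
    # Assumption: the threshold is inclusive (>=).
    if alerting_count >= rule.rollup_threshold:
        return "rollup"
    return "individual"


email = MediumRule("email", rollup_threshold=5, interval_s=30 * 60)
jabber = MediumRule("jabber", rollup_threshold=5, interval_s=5 * 60)
```

Under these assumptions, a flood of a couple of thousand failing checks should produce one summary per medium per interval, not one message per check.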

At another point, I also got a flood of rolled-up alerts:

This has happened a couple of other times (to a lesser degree, fortunately).

I've also noticed that acknowledgements aren't coalesced into fewer messages. While I do want timely notification that stuff's recovered, I don't want a thousand messages. :-)

auxesis commented 10 years ago

@blalor sorry to hear this happened :frowning:

Did the summary threshold eventually kick in on the WARNING/CRITICAL alerts?

blalor commented 10 years ago

No. The last time I ran into this I ended up culling a couple thousand emails from my inbox.

I'm sorry to say that this is one of the reasons I've had to shelve my Flapjack PoC. :-(

auxesis commented 10 years ago

Ah, that's a bummer.

Are there any other blockers on your PoC?

blalor commented 10 years ago

Yes.

Flapjack has a lot of promise, but even without the issues above I felt it wasn't what we were looking for. I'm working with Riemann, now, and I think it's going to be a better fit all-around, although at the expense of the clear UI and email notifications. Also the interactivity with the system via Jabber; I loved that.

auxesis commented 10 years ago

Great to hear that you're finding Riemann a better fit for your problem domain.

There's overlap between Flapjack and Riemann, and both have slightly different views of the world that work for different people solving slightly different problems.

None of the issues you've reported in Flapjack are particularly hard to solve; they just take time. Riemann has certainly got a lot of great momentum behind it, and hopefully any rough edges you find in Riemann are smoothed off faster than they have been in Flapjack. :rocket:

I'll ping you again once we've closed the issues you mentioned above.

ghost commented 10 years ago

I have a suspicion that the behaviour in this issue may be related to the lack of transactionality in the rollup code -- i.e. multiple events are being processed at the same time and each believes it should be the one to send the rollup, because the shared data hasn't been updated in time. This will be much easier to fix in v2.0 -- I may work on it in that branch, as it's easier to experiment with different approaches, and then backport the required logic.
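The suspected race is a check-then-act problem: reading the rollup state and updating it are separate steps, so concurrent events can all see stale data. A minimal sketch of the transactional alternative follows, using an in-memory lock as a stand-in for whatever atomic primitive the datastore offers. All names here (`RollupState`, `record_failure`, `simulate_flood`) are hypothetical, and this is not Flapjack's rollup code or the fix planned for v2.0.

```python
import threading
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


class RollupState:
    """In-memory stand-in for shared rollup state (hypothetical model)."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.alerting = set()
        self.rollup_active = False
        self._lock = threading.Lock()

    def record_failure(self, check):
        """Record a failing check and decide what this event may send.

        The update and the decision happen under one lock, so exactly one
        event can claim the rollup. Returns 'individual', 'rollup', or None.
        """
        with self._lock:
            self.alerting.add(check)
            if len(self.alerting) < self.threshold:
                return "individual"
            if not self.rollup_active:
                self.rollup_active = True   # claim the rollup atomically
                return "rollup"
            return None                     # rollup already claimed: stay quiet


def simulate_flood(n_checks, threshold, workers=16):
    """Process n_checks distinct failures concurrently and tally the outcomes."""
    state = RollupState(threshold)
    checks = [f"check-{i}" for i in range(n_checks)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        outcomes = list(pool.map(state.record_failure, checks))
    return Counter(outcomes)


if __name__ == "__main__":
    print(simulate_flood(2000, 5))
```

With 2000 distinct checks and a threshold of 5, the tally is always 4 individual alerts, 1 rollup, and 1995 suppressed events, regardless of thread scheduling. If the lock is removed and the read is separated from the write, several events can observe `rollup_active` as false and each send a rollup. In a Redis-backed system the same effect would presumably come from a transaction or a server-side script, though that is an assumption about the approach.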