Yelp / elastalert

Easy & Flexible Alerting With ElasticSearch
https://elastalert.readthedocs.org
Apache License 2.0

Send recovery alerts #288

Open jasonrhaas opened 9 years ago

jasonrhaas commented 9 years ago

It is common with server monitoring tools to send a "resolve" message when the problem that triggered the alert has recovered. It would be nice if there was something in ElastAlert that would send a message out if the next query did not yield the same alert.

For example, if I get a "flatline" alert, I fix the problem, and ElastAlert no longer alerts on that issue, it should send a "recovery" message out to tell whoever is listening that the issue is resolved.

More specifically, I'm using the PagerDuty API to track ElastAlert alerts, and would like to make use of the "resolve events" API.

https://developer.pagerduty.com/documentation/integration/events/resolve

zetsub0u commented 9 years ago

Hi, I talked with the guys on IRC about something similar that I wanted to implement: basically extending all the alerts that occur over a period of time (frequency, flatline, spike, etc.) with something like the EventWindow, but maybe an AlertWindow, which starts on the first alert and tracks it until the alert "expires" (i.e., no more alerts after x time). I don't know if I'm going to be able to work on this anytime soon, but I just wanted to comment and give a :+1:

bolshoy commented 9 years ago

@jasonrhaas I have a simple implementation for this, running it for two weeks already and it looks fine. The logic is simple: if there is no alert for the rule in the current run, check in the writeback index if we have an alert stored for the previous run of this rule. If there was an alert, try to resolve it. Resolver is implemented by the corresponding alerter. See https://github.com/rounds/elastalert/commit/efc449295636bddee913f4bd3c61d3a857e1d339

jasonrhaas commented 9 years ago

Thanks @bolshoy for sharing. Something like this would be a really nice addition to ElastAlert. It could be another YAML option that is set per alert, like resolve_alert: true or something to that effect.
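
For example (purely hypothetical; resolve_alert is not an existing option, just what I have in mind):

```yaml
# Hypothetical rule config: resolve_alert does not exist in ElastAlert today,
# it is only an illustration of the proposed per-rule switch.
name: api-error-flatline
type: flatline
index: logstash-*
threshold: 1
timeframe:
  minutes: 10
alert:
  - pagerduty
resolve_alert: true   # proposed: send a resolve/recovery event once the condition clears
```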

bolshoy commented 9 years ago

@jasonrhaas right, this switch might be needed. In our case, alerts are sent to Sensu, and it always needs resolving. Hopefully I'll have some spare time soon and will try to create a proper PR.

fiunchinho commented 8 years ago

Any news on this?

tomwganem commented 8 years ago

Would really like this to be implemented.

eravion commented 8 years ago

+1 :)

iainlbc commented 8 years ago

+1

Mormaii commented 8 years ago

+1

bean5 commented 8 years ago

For flatline, I agree that this is entirely possible, and it is something I was looking to do. I would foresee, though, that for other rule types doing this becomes more complex, specifically when query_key is set.

I think one way of doing this is to allow any rule to specify a companion flatline rule whose time bounds must start after the first rule fires. But I don't understand all the internals of ElastAlert and am certainly still learning about ES. If anyone has a better way to do this, just chime in here.

bean5 commented 8 years ago

Another way to do this would be to allow rules to be dependent on, or blocked by, other rules. Or simply to let a rule auto-resolve any other rule upon firing.

bobbyhubbard commented 7 years ago

@bolshoy Did you make any progress on a pull? Maybe one of us could work on a pull based on your latest?

bolshoy commented 7 years ago

@bobbyhubbard I stopped working on this altogether; we're using Prometheus instead.

JC1738 commented 7 years ago

This would definitely be a nice feature. I thought about using a second rule that would act as the clearing alert, though that would be difficult to maintain.

tkuther commented 7 years ago

I would love to see this too. Currently I'm doing a flatline/frequency ping-pong with the command alerter sending to Alerta.

supernomad commented 7 years ago

So I just slammed into this brick wall myself, and had a thought about how this could be possible.

Essentially, ElastAlert triggers only when a match is made on the query. Could it also be told to trigger on the reverse, when no match is made? This would require being able to set fields to different values, for instance a status field in the POST data of an HTTP alerter or victorops_message_type in VictorOps. It would also obviously require a switch to turn the functionality on or off.
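
For example, just to sketch what I mean (victorops_message_type is already a real option; alert_on_no_match and no_match_victorops_message_type are invented here purely for illustration):

```yaml
# Sketch only: "trigger when no match is made" is not an existing ElastAlert feature.
name: service-error-rate
type: frequency
index: logstash-*
num_events: 50
timeframe:
  minutes: 5
alert:
  - victorops
victorops_api_key: "<api key>"
victorops_routing_key: everyone
victorops_message_type: CRITICAL
# Hypothetical reverse trigger and the alternate field values it would send:
alert_on_no_match: true
no_match_victorops_message_type: RECOVERY
```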

Any thoughts on the above?

pblasquez commented 7 years ago

If the status were configurable for compatible outputs, e.g. for PagerDuty a new variable 'pagerduty_status' set to one of ['trigger', 'resolve', 'acknowledge'] with a default of 'trigger', it would at least cover the cases where this could be set explicitly, by query, with a separate rule.

I know it's not a global solution, but it would be welcome for the outputs where it is possible. It is already possible to do things this way with the JIRA output.
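
For example, a separate "clear" rule might look roughly like this (pagerduty_status is the option name as proposed here; whatever finally lands may be named differently):

```yaml
# Sketch of the proposed approach: a second rule whose query matches the
# recovered condition and sends an explicit resolve instead of a trigger.
name: service-recovered
type: frequency
index: heartbeats-*
num_events: 1
timeframe:
  minutes: 5
alert:
  - pagerduty
pagerduty_service_key: "<integration key>"
pagerduty_incident_key: "service-down"   # must match the incident key of the triggering rule
pagerduty_status: resolve                # proposed option; default would remain trigger
```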

bean5 commented 7 years ago

@pblasquez: That would be a simple way to make it work. I agree that we should default to trigger for backwards compatibility. This is a low-cost way to achieve what this issue asks for, although to some it may be a workaround rather than an actual feature.

The only catch would be making sure that the correct PagerDuty API version is used, as I assume version 1 does not allow resolving. That is just an assumption, though.

Opening #1304 to do this.

bean5 commented 7 years ago

@jasonrhaas: Do you think your idea should only apply to the flatline rule?

jasonrhaas commented 7 years ago

@bean5 My idea was to have it apply to any rule that triggers an alert. If you have used DataDog before, it is similar to that.

bean5 commented 7 years ago

@pblasquez https://github.com/Yelp/elastalert/pull/1304 was accepted, so you should be able to do what you proposed. Although I wrote a test case for the code, I did not actually use it against PagerDuty, so it may be buggy. The PagerDuty API seemed to indicate that both their API versions should accept it. Let me know if it doesn't work for you.

@jasonrhaas I implemented what @pblasquez proposed. It works for PagerDuty use cases, which is what you mentioned in your first post here. That said, you had the other idea of resolve_alert: true. I can definitely understand how your idea applies to flatline, since resolving/triggering makes sense in that case. The same goes for "any"-type rules. But for my typical use cases, with rules like whitelist/blacklist, I'd easily run into cases where there is an offending document that occurs just once. In those cases, the alert no longer firing does not mean the issue is resolved; it just means it only occurred once, and it still requires RCA. I suppose there could be cases where no longer occurring does imply resolution, but I would use flatline for those (i.e., make flatline send the resolve, rather than the heartbeat case where flatline sends the trigger). Note: I have not used DataDog before.

Perhaps the best way for me to understand is to ask:

Qmando commented 7 years ago

I've had a couple of thoughts about this for a while; here's what I imagine:

bean5 commented 7 years ago

So you have already put a decent amount of thought into auto-resolving in general. Given its scope, I think it merits its own issue. This one can be closed (we have the PagerDuty workaround in place), right?

I'm not quite sure what you mean by "differentiate alerts from the same rule" because in this project a rule triggers an alert, a 1-1 mapping.

Qmando commented 7 years ago

By that, I basically meant if you are using query_key, just like how silence stashes are created per query_key value. For example, a flatline alert with query_key on hostname: if host1 goes flat, then host2 goes flat, but only host2 comes back, you don't want to resolve the host1 flatline.
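
Something like this, for example (all standard options):

```yaml
# Flatline tracked per hostname: each host that goes quiet gets its own
# alert, and would therefore need its own resolution.
name: host-heartbeat
type: flatline
index: heartbeats-*
query_key: hostname
threshold: 1
timeframe:
  minutes: 10
alert:
  - email
email:
  - "oncall@example.com"
```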

bean5 commented 7 years ago

Oh, I agree on that, definitely. For non-PagerDuty alerters that may be sufficient, and for a first version it should be enough.

When it comes to PagerDuty, unless you differentiate the incidents by their title, they are considered the same incident (at least I think it goes by title). So for PagerDuty, even if EA considers them two separate events, resolving one incident will actually resolve the other unless precautions are taken and the titles are differentiated. Perhaps we could append query_key to the title of PagerDuty events automatically, or when a rule asks for it? I think this project already has a way to do that explicitly via arguments, but it would be nice for it to happen automatically as a matter of course. This is a PagerDuty-specific consideration only, so leaving it for follow-up work seems appropriate.

Qmando commented 7 years ago

As of https://github.com/Yelp/elastalert/commit/b301f2aa385aa1ef1c4859c3659870fa90183a12, you can set a custom PagerDuty incident title, possibly using query_key. The default doesn't use query_key, though, as the email and JIRA subjects do; it probably should.

pblasquez commented 7 years ago

Yes, you set 'pagerduty_incident_key' using 'pagerduty_incident_key_args'.

It is up to the user to keep things sufficiently specific so they can target the same incident key for resolution.
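
For example (using the existing options; the placeholder in the key is formatted from the match):

```yaml
# Both the triggering rule and the resolving rule format the same incident
# key from the match's hostname, so the resolve targets the right incident.
alert:
  - pagerduty
pagerduty_service_key: "<integration key>"
pagerduty_incident_key: "flatline-{0}"
pagerduty_incident_key_args:
  - hostname
```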

bean5 commented 7 years ago

Awesome, so the support is there for PagerDuty; I knew that at one time. What remains, then, if anything, is to add support for 'auto_resolve', right? Close this ticket and open one for that, or leave this open? I can go either way.

meltingrobot commented 6 years ago

Would still love to get alerts closed automagically with VictorOps.

Atem18 commented 5 years ago

Hi, any news about this? Should we create the recovery alert manually?

Qmando commented 5 years ago

@Atem18

I probably wouldn't get your hopes up too much for this. It's a fair amount of work to implement in a generic way, and we unfortunately aren't really doing work on new features right now.

For some alert types, you can implement this by creating a second alert which is an inverse of the original. For example, with the jira alerter, you can transition issues to closed. Other types might not be so easy.
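
As a rough, untested sketch, the "inverse" pair for a simple error-rate alert might look like this (two separate rule files in practice):

```yaml
# Rule 1: alert when errors spike.
name: errors-spike
type: frequency
index: logstash-*
num_events: 100
timeframe:
  minutes: 5
filter:
  - term:
      level: ERROR
alert:
  - email
email:
  - "oncall@example.com"
```

```yaml
# Rule 2: "recovery" rule, fires when the same query goes quiet again.
# Note it will also fire if errors simply never happened, so it only
# makes sense alongside rule 1.
name: errors-recovered
type: flatline
index: logstash-*
threshold: 1
timeframe:
  minutes: 5
filter:
  - term:
      level: ERROR
alert:
  - email
email:
  - "oncall@example.com"
```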

If you give some specifics I might be able to help guide you.

ahbrown1 commented 5 years ago

If this feature (generic alert recovery) is still dead in the water, I may have to implement something super hacky and ugly and stuff all the logic into an enhancement module, with the ElastAlert rule config doing just the upfront stuff, like handling the index and query_key.
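
i.e., something along these lines, where the rule config stays minimal and all the resolve logic lives in a custom enhancement (the module path below is just a placeholder, not real code):

```yaml
# match_enhancements is an existing option; the module named here is hypothetical.
name: per-host-errors
type: frequency
index: logstash-*
query_key: hostname
num_events: 10
timeframe:
  minutes: 5
match_enhancements:
  - "elastalert_modules.resolve_enhancement.ResolveEnhancement"
alert:
  - pagerduty
pagerduty_service_key: "<integration key>"
```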

Qmando commented 5 years ago

Someone implemented this for a few alerters: https://github.com/Yelp/elastalert/pull/2446/files

I haven't looked through all of it, but maybe it's a good place to start.

nsano-rururu commented 3 years ago

I'm going through the open issues. I think this problem has been solved. If it has been resolved, please close it.

meltingrobot commented 3 years ago

@nsano-rururu I read through the changelog and I do not see anywhere that a resolve/recovery alert feature was added. I do not think this was ever fixed.

diogokiss commented 3 years ago

Any news on this issue? It would be really helpful to have this feature implemented. :-/

aclowkey commented 3 years ago

Perhaps this issue should be moved to https://github.com/jertel/elastalert2, since this repo is no longer maintained?