[Stack Monitoring] Alerting Phase -1

cachedout commented 5 years ago

This ticket tracks the work which needs to be completed to achieve Phase -1 which is outlined in the proposal document.

To complete this phase, we need to build out the plumbing to connect the Stack Monitoring application to the Kibana Alerting Framework.

All watches need to be present and functional using the new framework:

[x] ~~elasticsearch_cluster_status~~ https://github.com/elastic/kibana/pull/61685
[x] elasticsearch_nodes
[x] elasticsearch_version_mismatch
[x] kibana_version_mismatch
[x] logstash_version_mismatch
[x] ~~xpack_license_expiration.json~~ https://github.com/elastic/kibana/pull/54306
[ ] Get resolution on when we can create alerts: https://github.com/elastic/kibana/issues/59813
[x] ~~Prevent access to UI unless gold+ since that is required to make email action~~ No longer the case, since the merge of https://github.com/elastic/kibana/pull/87377
[ ] Incorporate default notifications for out of the box alerting: https://github.com/elastic/kibana/issues/51547
[x] Blocked until ES adds an api to disable watcher-base cluster alerts: https://github.com/elastic/elasticsearch/issues/50032

elasticmachine commented 5 years ago

Pinging @elastic/stack-monitoring

chrisronline commented 5 years ago

Update here.

I found a couple of blockers while taking a first stab at this and raised them here: https://github.com/elastic/kibana/issues/45571

chrisronline commented 5 years ago

The effort is going well here. I don't have a PR ready yet, but I hope to have it this week. (Update: Draft PR available)

Some updated notes on this effort:

We need to figure out how we handle the state of alerts firing - with Watcher, we write to the .monitoring-alerts-* index, but I think we can avoid an additional index by leveraging the persisted state for actions. We are blocked on this because we need a way to access this state, see https://github.com/elastic/kibana/issues/48442
We need to figure out the right way to disable cluster alerts (watches). I've outlined some thoughts on this issue
I'm thinking we'll want to progressively add these into master (instead of one big merge) and if so, we should think about if we want to disable these until they are all in, or do we want to enable at least one from the start and have it co-exist with the other watches?
With watcher, we require users to specify an email address to receive alerts in their kibana.yml - we can continue this trend, or we can allow them to specify it in the UI when they enable Kibana alerts, and then we store it in a saved object or something.
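To make the first note concrete, here is a minimal sketch of how "fired" and "resolved" could be tracked purely in persisted per-alert state rather than in a separate index. Everything in it is an assumption for illustration: the `AlertUiState` shape and the `nextAlertState` helper are hypothetical names, not part of the Kibana Alerting Framework's API.

```typescript
// Hypothetical per-alert state; NOT the alerting framework's actual types.
interface AlertUiState {
  isFiring: boolean;
  firedAtMs?: number; // when the alert most recently started firing
  resolvedAtMs?: number; // when it most recently resolved
}

// Derive the next persisted state from the previous state and the
// result of the current check.
function nextAlertState(
  prev: AlertUiState | undefined,
  conditionMet: boolean,
  nowMs: number
): AlertUiState {
  if (prev !== undefined && prev.isFiring) {
    // Still firing: keep the original firedAtMs. Otherwise mark as resolved.
    return conditionMet ? prev : { ...prev, isFiring: false, resolvedAtMs: nowMs };
  }
  // Not firing before: start firing now, or stay idle.
  return conditionMet ? { isFiring: true, firedAtMs: nowMs } : (prev ?? { isFiring: false });
}
```

With state like this, the UI would only need the latest state object to show when an alert fired and whether it has been resolved.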

igoristic commented 5 years ago

Nice work @chrisronline 💪 Can't wait to see it!

> We need to figure out how we handle the state of alerts firing - with Watcher, we write to the .monitoring-alerts-* index

Once "Kibana Alerting" is live are we completely deprecating/removing the current/old Alerting?

I think we might still want a new index, just in case some setups still have the old .monitoring-alerts-* with legacy documents (or for some reason we need to support both ES and Kibana alerting). We can abbreviate it with something like -kb like we do -mb for Metricbeat.

> I'm thinking we'll want to progressively add these into master (instead of one big merge)

💯

> With watcher, we require users to specify an email address to receive alerts in their kibana.yml

I prefer in the Kibana UI, just because it's more UI friendly, and they can modify the info without restarting, but I don't mind continuing the yml trend.

chrisronline commented 5 years ago

Thanks for the thoughts @igoristic!

> Once "Kibana Alerting" is live are we completely deprecating/removing the current/old Alerting?

I guess it depends on if we want a slow rollout of these migrations. If so, we will be living in a world where both are running and exist at the same time (not for the same alert check, but we'll have some watcher based cluster alerts and some kibana alerts)

> I think we might still want a new index, just in case some setups still have the old .monitoring-alerts-* with legacy documents (or for some reason we need to support both ES and Kibana alerting). We can abbreviate it with something like -kb like we do -mb for Metricbeat.

You don't think we can accomplish the same UI from just using the state provided by the alerting framework? I think that's really all we need since we'll store data in there that tells us when the alert fired and if it's been resolved yet.

> I prefer in the Kibana UI, just because it's more UI friendly, and they can modify the info without restarting, but I don't mind continuing the yml trend.

Yea I agree the UI route is better, but if we do a slow rollout, it might be confusing for folks who already have the kibana.yml config set - I think we need to make a call on the slow rollout and that will help inform us of how to handle these other issues.

igoristic commented 5 years ago

> You don't think we can accomplish the same UI from just using the state provided by the alerting framework? I think that's really all we need since we'll store data in there that tells us when the alert fired and if it's been resolved yet.

I guess I don't really know the current implementation well enough to validate my concern. My worry is that if an ES Alert is triggered it'll be added to the index which will then be picked up by both ES Alerts and KB Alerts which might duplicate some actions like sending two emails etc...

I just think a new index can help avoid any of these issues we might not yet foresee (maybe for the same reason Metricbeat has its own -mb indices?)

This is all based on speculation though

chrisronline commented 5 years ago

> I guess I don't really know the current implementation well enough to validate my concern. My worry is that if an ES Alert is triggered it'll be added to the index which will then be picked up by both ES Alerts and KB Alerts which might duplicate some actions like sending two emails etc...

Ah, I see the confusion here.

Part of this work involves disabling (or blacklisting per @cachedout's idea) the cluster alert when we enable the Kibana alert. We'd never have a situation (intentionally) where both the cluster alert for xpack license expiration, and the Kibana alert for xpack license expiration are running at the same time.
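As a rough illustration of that rule (one check is never evaluated by both systems), the sketch below splits the watches listed in this ticket into those to blacklist and those that keep running under Watcher. The `partitionWatches` helper and the ID list are assumptions made for the example; the actual mechanism for disabling cluster alerts depends on the Elasticsearch API tracked in the linked issue.

```typescript
// Cluster alerts (watches) named in this ticket. The exact IDs are assumed
// here for illustration.
const CLUSTER_ALERT_IDS = [
  'elasticsearch_cluster_status',
  'elasticsearch_nodes',
  'elasticsearch_version_mismatch',
  'kibana_version_mismatch',
  'logstash_version_mismatch',
  'xpack_license_expiration',
] as const;

type ClusterAlertId = (typeof CLUSTER_ALERT_IDS)[number];

// Hypothetical helper: given the checks already enabled as Kibana alerts,
// decide which watches must be blacklisted so that no check runs twice.
function partitionWatches(enabledKibanaAlerts: ReadonlySet<ClusterAlertId>) {
  const blacklist = CLUSTER_ALERT_IDS.filter((id) => enabledKibanaAlerts.has(id));
  const stillWatcher = CLUSTER_ALERT_IDS.filter((id) => !enabledKibanaAlerts.has(id));
  return { blacklist, stillWatcher };
}

// Example: only the license expiration check has been migrated so far.
const migrated = new Set<ClusterAlertId>(['xpack_license_expiration']);
const plan = partitionWatches(migrated);
```

In this example, `plan.blacklist` contains only `xpack_license_expiration`, and the other five checks remain watcher-based cluster alerts.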

cachedout commented 5 years ago

> I'm thinking we'll want to progressively add these into master (instead of one big merge) and if so, we should think about if we want to disable these until they are all in, or do we want to enable at least one from the start and have it co-exist with the other watches?

I think that gradually merging these and leaving them disabled until we are ready to switch the new alerting on in the application is the right thing to do. It gives us time to develop and test the alerts while minimizing the disruption for the user.

ypid-geberit commented 3 years ago

I was forwarded to this issue from https://github.com/elastic/elasticsearch/issues/34814#issuecomment-655538854. The "Phase -1 which is outlined in the proposal document." is not linked, so I have not been able to read it. Excuse me if this is beyond the scope of "Phase 1".

As an Elastic Stack admin, I feel the "Stack Monitoring" falls short compared to other Monitoring systems. For example, there is no concept of Hard and Soft States. And I am not convinced that it would be a good idea to replicate this using Elastic watcher (I tried for my own use and failed). See https://github.com/elastic/elasticsearch/issues/34814#issuecomment-655359733 for more details.

igoristic commented 3 years ago

Thank you @ypid-geberit for your feedback

> As an Elastic Stack admin, I feel the "Stack Monitoring" falls short compared to other Monitoring systems. For example, there is no concept of Hard and Soft States

I think this is a good feature request, but perhaps out of scope within the context of this ticket.

@ravikesarwani Maybe this is something we can add a ticket for in the SM feature request roadmap

ravikesarwani commented 3 years ago

Many of the out-of-the-box stack monitoring alerts give users full flexibility to control the notifications (including which method to get notified with, based on license level) and when they are generated. For example, "CPU Usage" defaults to alerting when CPU is over 85%, looking at the average over the last 5 minutes. Both the 85% threshold and the 5-minute duration can easily be adjusted by users.

Also with https://github.com/elastic/kibana/issues/91145 we will allow users to create multiple alerts and be able to handle features similar to soft and hard states. For example: "Say a user wants to alert when CPU is 75% for the last 5 minutes and send an email. When it's 85% for the last 10 minutes, they want to send a PagerDuty alert."
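A minimal sketch of that two-tier setup is shown below, assuming each tier compares the average CPU over its own window against its own threshold. The `Tier` and `CpuSample` types and the `actionsToTrigger` helper are hypothetical and only illustrate the idea; they are not the Stack Monitoring alert implementation.

```typescript
interface CpuSample {
  timestampMs: number;
  cpuPercent: number;
}

// Hypothetical tier definition: a threshold, a look-back window, and an action name.
interface Tier {
  thresholdPercent: number;
  windowMinutes: number;
  action: string;
}

// The two tiers from the example above.
const TIERS: Tier[] = [
  { thresholdPercent: 75, windowMinutes: 5, action: 'email' },
  { thresholdPercent: 85, windowMinutes: 10, action: 'pagerduty' },
];

// Average CPU over the last `windowMinutes`, or undefined if there are no samples.
function averageOverWindow(
  samples: CpuSample[],
  windowMinutes: number,
  nowMs: number
): number | undefined {
  const cutoffMs = nowMs - windowMinutes * 60 * 1000;
  const inWindow = samples.filter((s) => s.timestampMs > cutoffMs && s.timestampMs <= nowMs);
  if (inWindow.length === 0) return undefined;
  return inWindow.reduce((sum, s) => sum + s.cpuPercent, 0) / inWindow.length;
}

// Return the actions whose tier condition is currently met.
function actionsToTrigger(samples: CpuSample[], nowMs: number, tiers: Tier[] = TIERS): string[] {
  return tiers
    .filter((t) => {
      const avg = averageOverWindow(samples, t.windowMinutes, nowMs);
      return avg !== undefined && avg > t.thresholdPercent;
    })
    .map((t) => t.action);
}
```

The lower tier behaves like a "soft" state (email only), and the higher, longer-window tier behaves like a "hard" state that escalates to paging.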

ypid-geberit commented 3 years ago

Sounds like what @ravikesarwani wrote addresses it. I am looking forward to it :)

elastic / kibana

[Stack Monitoring] Alerting Phase -1 #42960