elastic / kibana

Your window into the Elastic Stack
https://www.elastic.co/products/kibana

Should the Kibana Alerting plugin report a degraded or error health plugin status when FAAD fails to install resources? #159035

Open mikecote opened 1 year ago

mikecote commented 1 year ago

Now that the alerting framework is setting up the alerts-as-data assets in Elasticsearch, we can provide a better experience whenever the assets have failed to install or upgrade. In this issue, we should report a degraded or error status (up for discussion) whenever an issue is encountered.

Things to consider:

elasticmachine commented 1 year ago

Pinging @elastic/response-ops (Team:ResponseOps)

pmuellr commented 1 year ago

degraded: yes; error: no - we've had problems in the past where, when we set a status of error, Kibana really went into a downspin, so we backed off from doing that. Never report a status error!

Not all users use alerting; do we want to report degraded status for those users too? Not all alerting rules use alerts-as-data; do we want to report degraded status for non-FAAD users too? (or wait until all rules use FAAD?)

Is the suggestion we should see if there ARE any rules, and not report a degraded status in that case? Then we would expect to go degraded, once they created the rules? Getting complex :-). I think just always having the plugin report degraded would be fine.

pmuellr commented 1 year ago

I mentioned in the triage session that I had seen a case where Kibana never leaving a degraded state on startup was causing issues. It was this one: https://github.com/elastic/kibana/issues/156615 . AFAIK that's a bug, and I believe that's the only functional error I've seen in such cases.

heespi commented 1 year ago

IMO, not having AAD-based alerting available after an upgrade should be called out with a fatal/critical status to be addressed urgently by the cluster administrator, so let's please consider this as part of the approach taken here.

mikecote commented 1 year ago

Adding this to To-Do in case we get capacity to complete this in 8.11. I believe it's important to mark our service as degraded when we fail to update our AAD assets in Elasticsearch. Failures will be less evident to our users after @ymao1's improvements where we skip updating the mappings and allow alerts to write (deferring any issues to a later time). The system admins should still be made aware in case they don't look at the server logs, and we should make sure, if possible, that we provide the end user directions to fix the issue so their system is no longer degraded. Otherwise, runbooks / support articles.

mikecote commented 1 year ago

We should make sure a reason is provided, if possible.
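
To make the shape of this proposal concrete, here is a minimal sketch of mapping an asset-install outcome to a plugin status that is at worst degraded and always carries a reason. All names here (`StatusLevel`, `PluginStatus`, `InstallResult`, `statusFromInstall`) are hypothetical and defined locally for illustration; this is not Kibana's status API, and the real wiring into the plugin would differ.

```typescript
// Hypothetical, self-contained types for illustration only.
// "error"/"critical" levels are deliberately left out, following the
// suggestion in this thread to never report an error status.
type StatusLevel = "available" | "degraded";

interface PluginStatus {
  level: StatusLevel;
  summary: string;
  // Human-readable cause, so admins do not have to dig through server logs.
  reason?: string;
}

// Assumed outcome of installing or upgrading the alerts-as-data assets.
interface InstallResult {
  ok: boolean;
  error?: string;
}

function statusFromInstall(result: InstallResult): PluginStatus {
  if (result.ok) {
    return { level: "available", summary: "Alerting is available" };
  }
  return {
    level: "degraded",
    summary: "Alerting failed to install alerts-as-data resources",
    // Fall back to a generic reason when the failure carries no message.
    reason: result.error ?? "unknown error",
  };
}
```

Keeping the level type restricted to two values means a failed install cannot accidentally escalate beyond degraded; whether a stronger level is ever appropriate is still the open question in this issue.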

mikecote commented 1 year ago

Question from @ymao1: Should we also / instead leverage the alerting health API and display a banner in the UI?