Open mikecote opened 1 year ago
Pinging @elastic/response-ops (Team:ResponseOps)
degraded: yes; error: no - we've had problems in the past where when we set a status of error, Kibana really goes into a downspin, so we backed off from doing that. Never report a status error!
Not all users use alerting, do we want to report degraded status for those users too? Not all alerting rules use alerts-as-data, do we want to report degraded status for non FAAD users too? (or wait until all rules use FAAD?)
Is the suggestion we should see if there ARE any rules, and not report a degraded status if there are none? Then we would expect to go degraded once they created the rules? Getting complex :-). I think just always having the plugin report degraded would be fine.
I mentioned in the triage session that I had seen a case where Kibana never leaving a degraded state on startup was causing issues. It was this one: https://github.com/elastic/kibana/issues/156615 . AFAIK that's a bug, and I believe that's the only functional error I've seen in such cases.
IMO, not having AAD-based alerting available after an upgrade should be called out with a fatal/critical status to be addressed urgently by the cluster administrator, so let's please consider this as part of the approach taken here.
Adding this to To-Do in case we get capacity to complete this in 8.11. I believe it's important to mark our service as degraded when we fail to update our AAD assets in Elasticsearch. It'll be less evident to our users after @ymao1's improvements, where we skip updating the mappings and still allow alerts to be written (deferring any issues to a later time). System admins should still be made aware in case they don't look at the server logs, and we should, if possible, give the end user directions to fix the issue so their system is no longer degraded. Otherwise, runbooks / support articles.
We should make sure a reason is provided, if possible.
Question from @ymao1: Should we also / instead leverage the alerting health API and display a banner in the UI?
Now that the alerting framework is setting up the alerts-as-data assets in Elasticsearch, we can provide a better experience whenever the assets have failed to install or upgrade. In this issue, we should report a degraded or error status (up for discussion) whenever an issue is encountered.
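As a rough sketch of what this could look like (not the actual implementation), the asset install result could be mapped to a plugin status roughly as below. The `PluginStatus` and `AssetInstallResult` types and the `statusFromAssetInstall` helper are hypothetical names made up for illustration; they are not Kibana's status API. The sketch assumes we go with degraded rather than error.

```typescript
// Hypothetical types for illustration only; not Kibana's real status API.
type StatusLevel = 'available' | 'degraded';

interface PluginStatus {
  level: StatusLevel;
  summary: string;
  // Optional human-readable explanation shown to the system admin.
  reason?: string;
}

// Hypothetical outcome of installing or upgrading the alerts-as-data assets.
interface AssetInstallResult {
  installed: boolean;
  error?: string;
}

// Map the install outcome to a status. Failures are reported as degraded,
// never as an error level, and always carry a reason when one is known.
function statusFromAssetInstall(result: AssetInstallResult): PluginStatus {
  if (result.installed) {
    return { level: 'available', summary: 'Alerts-as-data assets are installed' };
  }
  return {
    level: 'degraded',
    summary: 'Alerts-as-data assets failed to install or upgrade',
    reason: result.error ?? 'Unknown error; check the Kibana server logs',
  };
}
```

The real change would presumably feed a value like this into the plugin's status reporting, but how that wiring looks is left open here.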
Things to consider: