elastic / kibana

Your window into the Elastic Stack
https://www.elastic.co/products/kibana

Automatic SLO Cleanup Mechanism #198776

Open framsouza opened 1 week ago

framsouza commented 1 week ago

Description

Currently, there is no automated cleanup feature for SLOs, and as a result, our existing SLOs may not accurately reflect the true reliability of our services. We propose a solution to introduce an automated cleanup mechanism for SLOs to ensure that only relevant and up-to-date SLOs are maintained in the production environment.

Today, to clean up SLOs, we run an update_by_query against the SLO indices. However, users and customers need a more straightforward way to clean up their SLOs without that extra manual work.
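For readers unfamiliar with the manual approach, here is a minimal sketch of what a request body for such an update_by_query could look like. It is an illustration under stated assumptions, not the exact query used by the issue author: the field names `slo.id` and `slo.instanceId`, the helper name `build_cleanup_request`, and the choice to delete matching documents via `ctx.op = 'delete'` are all assumptions. The sketch only builds the JSON body and does not contact a cluster.

```python
import json


def build_cleanup_request(slo_id, stale_instance_ids):
    """Build a body for Elasticsearch's _update_by_query API.

    Assumptions (not confirmed by the issue): documents carry the fields
    `slo.id` and `slo.instanceId`, and cleanup means deleting the matching
    documents by setting ctx.op to 'delete' in a Painless script.
    """
    if not stale_instance_ids:
        # An empty terms filter would match nothing, but failing loudly makes
        # a caller mistake visible instead of silently doing no work.
        raise ValueError("no instance ids given")
    return {
        "query": {
            "bool": {
                "filter": [
                    {"term": {"slo.id": slo_id}},
                    {"terms": {"slo.instanceId": sorted(stale_instance_ids)}},
                ]
            }
        },
        "script": {"lang": "painless", "source": "ctx.op = 'delete'"},
    }


if __name__ == "__main__":
    # Hypothetical SLO id and instance ids, for illustration only.
    body = build_cleanup_request("my-slo-id", {"project-a", "project-b"})
    print(json.dumps(body, indent=2))
```

The body would be sent to `POST <slo-index-pattern>/_update_by_query`, where the index pattern depends on the deployment. Running the same query as a plain search first is a cheap way to check which documents would be affected.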

Problem Statement:

Ideas/Solutions:

Benefits

This feature would help maintain a cleaner, more accurate set of SLOs, containing only the SLOs that actually matter and still work. By reducing the need for manual cleanup, it would also let engineers focus on other critical tasks, improving overall productivity.

elasticmachine commented 1 week ago

Pinging @elastic/obs-ux-management-team (Team:obs-ux-management)

drewpost commented 1 week ago

Thanks for writing this up. In your use case, what is the scale we're talking about here? How often do you have an SLO that needs deleting vs updating?

neiljbrookes commented 1 week ago

From Slack (https://elastic.slack.com/archives/C044PV8EJ4X/p1730729974044599?thread_ts=1730725339.130429&cid=C044PV8EJ4X)

I'd just like to clarify that it's not the SLO that needs removing, it's the instance of an SLO that needs to be cleaned up. When using group_by aggs, an instance of the SLO is created for every unique value in the selected group_by field. We use it a lot for project_id, which is a high-cardinality field, and it is perfectly possible for a value to be removed (on project deletion).
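To make the instance problem concrete, here is a small sketch of the idea in plain Python. It is only a model of the behavior described above, not Kibana code: the function names and the sample documents are hypothetical. Each unique value of the group_by field yields one instance, and an instance counts as stale once its value no longer appears in recent source data.

```python
def slo_instance_ids(documents, group_by_field):
    """Return one instance id per unique value of the group_by field."""
    return {doc[group_by_field] for doc in documents if group_by_field in doc}


def stale_instances(known_instances, recent_documents, group_by_field):
    """Return known instances whose group value no longer appears in recent data."""
    active = slo_instance_ids(recent_documents, group_by_field)
    return set(known_instances) - active


if __name__ == "__main__":
    # Hypothetical source documents grouped by project_id.
    history = [
        {"project_id": "p1", "ok": True},
        {"project_id": "p2", "ok": False},
        {"project_id": "p3", "ok": True},
    ]
    # After project p2 is deleted, it stops producing documents.
    recent = [
        {"project_id": "p1", "ok": True},
        {"project_id": "p3", "ok": True},
    ]
    known = slo_instance_ids(history, "project_id")
    print(sorted(stale_instances(known, recent, "project_id")))
```

With a high-cardinality field such as project_id, the set of known instances only grows unless something removes the stale ones, which is the cleanup gap this issue describes.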

framsouza commented 1 week ago

Thanks for following up, @drewpost! In our case, the scale is quite large: we’re managing thousands of SLOs, and over time quite a few become outdated or irrelevant. We usually find that deletions are more common than updates, especially as services evolve or get deprecated. It’s not uncommon for large batches of SLOs to need periodic cleanup.

jasonrhodes commented 5 hours ago

Related: https://github.com/elastic/kibana/issues/195266