Open framsouza opened 1 week ago
Pinging @elastic/obs-ux-management-team (Team:obs-ux-management)
Thanks for writing this up. In your use case, what is the scale we're talking about here? How often do you have an SLO that needs deleting vs updating?
From slack (https://elastic.slack.com/archives/C044PV8EJ4X/p1730729974044599?thread_ts=1730725339.130429&cid=C044PV8EJ4X)
I'd just like to clarify that it's not the SLO that needs removing, it's the instance of an SLO that needs to be cleaned up. When using group_by aggs, an instance of the SLO is made for every unique value in the selected group_by field. We use it a lot for project_id, which is a field with high cardinality, and it is perfectly possible for a value to be removed (on project deletion).
Thanks for following up, @drewpost! In our case, the scale is quite large: we’re managing thousands of SLOs, and over time quite a few become outdated or irrelevant. We usually find that deletions are more common than updates, especially as services evolve or get deprecated. It’s not uncommon for large batches of SLOs to need periodic cleanup.
Description
Currently, there is no automated cleanup feature for SLOs, and as a result, our existing SLOs may not accurately reflect the true reliability of our services. We propose a solution to introduce an automated cleanup mechanism for SLOs to ensure that only relevant and up-to-date SLOs are maintained in the production environment.
Currently, to clean up SLOs, we run an update_by_query against the SLO indices. However, we need a more straightforward method for users and customers to clean up their SLOs without added hassle.
Problem Statement:
- SLOs can reference non-existent group_by fields, resulting in inaccurate reliability metrics.
- SLOs can remain in no_data for extended periods. Automatic removal of SLOs with a no_data status for more than X hours would help maintain only meaningful and actionable SLOs.
Ideas/Solutions:
- no_data Status: Allow SLOs with a no_data status to be automatically removed if this condition persists for more than a configurable duration (e.g., X hours).
- group_by Fields: Implement checks to ensure that SLOs referencing non-existent group_by fields are either flagged for review or automatically removed, depending on the configuration.
Benefits
This feature would help maintain a cleaner, more accurate set of SLOs that reflects only the SLOs that actually matter and work. By reducing the need for manual cleanup, it would also let engineers focus on other critical tasks, improving overall productivity.
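To make the two ideas above more concrete, here is a rough sketch of how the selection logic could look. It is not how Kibana implements SLOs: the `SloInstance` shape, the `status_since` field, the status strings, and the `plan_cleanup` helper are all hypothetical names chosen for illustration. It assumes we can obtain, for each SLO instance, the time it entered its current status, and that we have a set of group_by values that still exist (e.g. live project_id values).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class SloInstance:
    """Hypothetical, simplified view of one SLO instance (not the Kibana schema)."""
    slo_id: str
    instance_id: str        # value of the group_by field, e.g. a project_id
    status: str             # illustrative values: "healthy", "no_data"
    status_since: datetime  # assumed: when the instance entered this status


def stale_no_data(instances, max_age, now):
    """Return instances that have been in no_data for longer than max_age."""
    return [
        i for i in instances
        if i.status == "no_data" and now - i.status_since > max_age
    ]


def orphaned(instances, live_values):
    """Return instances whose group_by value is no longer present."""
    return [i for i in instances if i.instance_id not in live_values]


def plan_cleanup(instances, live_values, max_age, now, auto_remove_orphans=False):
    """Split instances into (to_remove, to_flag_for_review)."""
    remove = stale_no_data(instances, max_age, now)
    flag = []
    for inst in orphaned(instances, live_values):
        if inst in remove:
            continue  # already scheduled for removal by the no_data rule
        # Orphans are flagged by default; removal is opt-in via configuration.
        (remove if auto_remove_orphans else flag).append(inst)
    return remove, flag


now = datetime(2024, 11, 5, 12, 0, tzinfo=timezone.utc)
instances = [
    SloInstance("latency", "project-a", "healthy", now - timedelta(days=3)),
    SloInstance("latency", "project-b", "no_data", now - timedelta(hours=72)),
    SloInstance("latency", "project-c", "no_data", now - timedelta(hours=2)),
    SloInstance("latency", "project-d", "healthy", now - timedelta(days=1)),
]
live = {"project-a", "project-b", "project-c"}  # project-d was deleted

remove, flag = plan_cleanup(instances, live, timedelta(hours=48), now)
print([i.instance_id for i in remove])  # ['project-b']
print([i.instance_id for i in flag])    # ['project-d']
```

Keeping "flag for review" as the default for orphaned instances and making removal opt-in mirrors the "depending on the configuration" wording above. The 48-hour threshold stands in for the configurable X hours.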