fix: Add back the helm suspended metric

fluxcd / helm-controller

The GitOps Toolkit Helm reconciler, for declarative Helming

https://fluxcd.io

Apache License 2.0


fix: Add back the helm suspended metric #1111

Closed jmleddy closed 3 days ago

jmleddy commented 4 days ago

At some point we had this metric and then we lost it. We discovered this after we started suspending a number of releases and could not get the metric to appear, meaning we currently have releases suspended across all our clusters that we don't know about.

stefanprodan commented 3 days ago

The suspend label has been included in the gotk_resource_info metric, which is provided by kube-state-metrics. I recommend you migrate your alerts and dashboards to the new metric, as the old ones were deprecated long ago. Docs here: https://fluxcd.io/flux/monitoring/metrics/#resource-metrics
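As a rough illustration of what the migrated check could look like, here is a minimal Python sketch that builds a PromQL selector for suspended HelmReleases and reads the result from the standard Prometheus HTTP query API. The label names `customresource_kind`, `suspended`, `exported_namespace` and `name` are assumptions based on the example kube-state-metrics configuration for Flux and may differ in your setup; the function names and the Prometheus URL are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def suspended_query(kind="HelmRelease"):
    """Build a PromQL selector for suspended Flux resources of one kind.

    Assumption: gotk_resource_info carries `customresource_kind` and
    `suspended` labels, as in the example kube-state-metrics config.
    """
    return f'gotk_resource_info{{customresource_kind="{kind}", suspended="true"}}'


def suspended_resources(response):
    """Extract sorted (namespace, name) pairs from an instant-query response."""
    if response.get("status") != "success":
        raise ValueError(f"query failed: {response.get('error', 'unknown error')}")
    pairs = []
    for sample in response["data"]["result"]:
        labels = sample["metric"]
        # Assumption: namespace and name are exposed under these label names.
        pairs.append((labels.get("exported_namespace", ""), labels.get("name", "")))
    return sorted(pairs)


def fetch_suspended(prometheus_url, kind="HelmRelease"):
    """Query a Prometheus server (hypothetical URL) for suspended resources."""
    params = urlencode({"query": suspended_query(kind)})
    with urlopen(f"{prometheus_url}/api/v1/query?{params}", timeout=10) as resp:
        return suspended_resources(json.load(resp))
```

For example, `fetch_suspended("http://prometheus.monitoring:9090")` would return the suspended HelmReleases as a list of (namespace, name) tuples, assuming a Prometheus instance is reachable at that address and scrapes kube-state-metrics.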

jmleddy commented 3 days ago

This is extra toil for us to get back a metric that we were alerting on and have completely lost. We have no idea how many helm releases are paused and not applying resource request updates or whatever. And it's inconsistently applied. Why do our kustomizations still report when they are stalled but our helm releases don't? I realize that there are different maintainers that have different opinions about what metrics should be exposed, but to the end user this all just looks like "flux", since all the controllers come with flux.

For anyone that might find this PR and wonder what the kube-state-metrics config is, seems to be here

stefanprodan commented 3 days ago

> I realize that there are different maintainers that have different opinions

The core maintainers make the decisions for the common behaviour of all Flux controllers and the metrics fall into this category. We made the decision to drop the resource specific metrics from the controllers exporters and rely on kube-prometheus-stack. The deprecation notice can be found here: https://fluxcd.io/flux/monitoring/metrics/#warning-deprecated-resource-metrics

This controller was the last one promoted to GA, so we removed the deprecated metrics from it, but we should've done that in all controllers. We'll make sure the old metrics are removed in the next release across all Flux components.

jmleddy commented 3 days ago

Thank you, though I would prefer you add back this metric everywhere to avoid requiring everyone to add 275 lines of yaml to their kube-prometheus-stack helm chart. The inconsistency is even worse than the original decision, as it feels uneven. It probably also led to slower detection of the issue on our side, as we were still finding suspends "sometimes". So looking forward to having a consistent view from the Flux controller maintainers here :)

jmleddy commented 3 days ago

Also, is this you? https://github.com/fluxcd/flux2-monitoring-example/issues/35#issuecomment-2226364651

stefanprodan commented 3 days ago

@jmleddy you can read our motivation in this issue: https://github.com/fluxcd/flux2/issues/4128

If you don't like the kube-state-metrics approach, feel free to use the Flux Operator; the tradeoff is that you can't customise those metrics in any way.

jmleddy commented 3 days ago

Okay, I thought the controller was running as part of the operator; I must not have my kube config right. I'll run it as part of the ksm, thanks!