Closed tdemin closed 6 months ago
Minimal sample manifest set for testing:
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
name: gitrepository-that-always-fails
spec:
url: https://this-repo-doesnt-exist.example.com/repo.git
interval: 1m
ref:
branch: master
---
apiVersion: notification.toolkit.fluxcd.io/v1beta3
kind: Alert
metadata:
name: gitrepository-that-always-fails
spec:
summary: A GitRepository is failing
eventSeverity: error
eventSources:
- kind: GitRepository
name: gitrepository-that-always-fails
providerRef:
name: alertmanager
---
apiVersion: notification.toolkit.fluxcd.io/v1beta3
kind: Provider
metadata:
name: alertmanager
spec:
type: alertmanager
secretRef:
name: alertmanager-config
---
apiVersion: v1
kind: Secret
type: Opaque
metadata:
name: alertmanager-config
stringData:
address: http://alertmanager:9093/api/v2/alerts
---
apiVersion: monitoring.coreos.com/v1
kind: Alertmanager
metadata:
name: alertmanager
---
apiVersion: v1
kind: Service
metadata:
name: alertmanager
spec:
type: ClusterIP
selector:
alertmanager: alertmanager
ports:
- name: http-web
port: 9093
protocol: TCP
targetPort: 9093
I propose removing this label altogether. Alertmanager already provides us with .StartsAt / .EndsAt, posting a new alert would just set the new "startsAt". This would technically be a breaking change, as existing users might rely on this label in their alert templates.
I agree that for Alertmanager removing the timestamp would be better as it's recored in startsAt
. We can ship this breaking change in the next minor release Flux v2.3 if you can open a PR.
Description
notification-controller sets
timestamp
label on outgoing alerts, preventing Alertmanager from recognizing subsequent alerts with the same label set as a single alert (Alertmanager considers newly-posted alerts with an existing label set to be the same updated alert). This means Alertmanager will dispatch multiple alerts for what is essentially a single outage (e.g. failingGitRepository
with a short reconciliation interval, e.g.1m
, which is the default forflux bootstrap
-generatedGitRepository
)This also requires us to add alert receivers with
send_resolved: false
separate from already-configured Prometheus receivers, as Flux alerts would be expiring at the default Alertmanagerresolve_timeout
of5m
, with Flux posting new ones over time.Alertmanager (as of v0.27.0) seems to lack the ability to drop incoming alert labels.
Proposal
While
group_by
combined with a long enoughgroup_interval
might make subsequent alerts less annoying, it still means we end up with multiple outgoing Alertmanager alerts in a single message, possibly grouped with the templates.I propose removing this label altogether. Alertmanager already provides us with
.StartsAt
/.EndsAt
, posting a new alert would just set the new"startsAt"
. This would technically be a breaking change, as existing users might rely on this label in their alert templates.I'm willing to work on a PR for this if needed.
System details
Flux v2.2.3, notification-controller v1.2.4, Alertmanager v0.27.0