fluxcd / notification-controller

The GitOps Toolkit event forwarder and notification dispatcher
https://fluxcd.io
Apache License 2.0
151 stars 133 forks source link

Alerts sent to Alertmanager contain timestamps as labels, preventing alert grouping #793

Closed tdemin closed 6 months ago

tdemin commented 6 months ago

Description

notification-controller sets timestamp label on outgoing alerts, preventing Alertmanager from recognizing subsequent alerts with the same label set as a single alert (Alertmanager considers newly-posted alerts with an existing label set to be the same updated alert). This means Alertmanager will dispatch multiple alerts for what is essentially a single outage (e.g. failing GitRepository with a short reconciliation interval, e.g. 1m, which is the default for flux bootstrap-generated GitRepository)

This also requires us to add alert receivers with send_resolved: false separate from already-configured Prometheus receivers, as Flux alerts would be expiring at the default Alertmanager resolve_timeout of 5m, with Flux posting new ones over time.

Alertmanager (as of v0.27.0) seems to lack the ability to drop incoming alert labels.

Multiple alerts in Alertmanager

Proposal

While group_by combined with a long enough group_interval might make subsequent alerts less annoying, it still means we end up with multiple outgoing Alertmanager alerts in a single message, possibly grouped with the templates.

I propose removing this label altogether. Alertmanager already provides us with .StartsAt / .EndsAt, posting a new alert would just set the new "startsAt". This would technically be a breaking change, as existing users might rely on this label in their alert templates.

I'm willing to work on a PR for this if needed.

System details

Flux v2.2.3, notification-controller v1.2.4, Alertmanager v0.27.0

tdemin commented 6 months ago

Minimal sample manifest set for testing:

apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: gitrepository-that-always-fails
spec:
  url: https://this-repo-doesnt-exist.example.com/repo.git
  interval: 1m
  ref:
    branch: master
---
apiVersion: notification.toolkit.fluxcd.io/v1beta3
kind: Alert
metadata:
  name: gitrepository-that-always-fails
spec:
  summary: A GitRepository is failing
  eventSeverity: error
  eventSources:
    - kind: GitRepository
      name: gitrepository-that-always-fails
  providerRef:
    name: alertmanager
---
apiVersion: notification.toolkit.fluxcd.io/v1beta3
kind: Provider
metadata:
  name: alertmanager
spec:
  type: alertmanager
  secretRef:
    name: alertmanager-config
---
apiVersion: v1
kind: Secret
type: Opaque
metadata:
  name: alertmanager-config
stringData:
  address: http://alertmanager:9093/api/v2/alerts
---
apiVersion: monitoring.coreos.com/v1
kind: Alertmanager
metadata:
  name: alertmanager
---
apiVersion: v1
kind: Service
metadata:
  name: alertmanager
spec:
  type: ClusterIP
  selector:
    alertmanager: alertmanager
  ports:
    - name: http-web
      port: 9093
      protocol: TCP
      targetPort: 9093
stefanprodan commented 6 months ago

I propose removing this label altogether. Alertmanager already provides us with .StartsAt / .EndsAt, posting a new alert would just set the new "startsAt". This would technically be a breaking change, as existing users might rely on this label in their alert templates.

I agree that for Alertmanager removing the timestamp would be better as it's recored in startsAt. We can ship this breaking change in the next minor release Flux v2.3 if you can open a PR.