fluxcd / notification-controller

The GitOps Toolkit event forwarder and notification dispatcher
https://fluxcd.io
Apache License 2.0

Opsgenie Alias for deduplication of alerts #460

Open hhollenstain opened 1 year ago

hhollenstain commented 1 year ago

We currently use Opsgenie for paging and have found that the integration with Flux works pretty well. The main point of contention is the missing alias for deduplication. Currently, when an alert is triggered, it continuously fires and creates new pages. Ideally we could set an alias, fire once, and let Opsgenie handle additional notifications/triggers.

https://github.com/fluxcd/notification-controller/blob/505345cb3ecc006ed71fae02d5988de7acd65ac9/internal/notifier/opsgenie.go#L68

Opsgenie API docs
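
To make the request concrete, here is a minimal sketch of what an alias-carrying payload could look like. The Opsgenie Alert API documents `alias` as the client-defined identifier used for deduplication; everything else here is an assumption for illustration: `alertPayload` is a hypothetical, trimmed-down stand-in for the controller's `OpsgenieAlert` struct, and `dedupAlias` is a made-up helper, not something the controller provides.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// alertPayload is a hypothetical stand-in for the controller's OpsgenieAlert
// struct, extended with the Alias field this issue asks for.
type alertPayload struct {
	Message     string            `json:"message"`
	Alias       string            `json:"alias,omitempty"`
	Description string            `json:"description,omitempty"`
	Details     map[string]string `json:"details,omitempty"`
}

// dedupAlias (hypothetical helper) derives a stable alias from the identity of
// the involved object. It deliberately ignores the event message, which may
// contain timestamps or other text that changes between retries.
func dedupAlias(kind, namespace, name string) string {
	sum := sha256.Sum256([]byte(kind + "/" + namespace + "/" + name))
	return hex.EncodeToString(sum[:])
}

func main() {
	p := alertPayload{
		Message:     "Kustomization/somecomponent",
		Alias:       dedupAlias("Kustomization", "somenamespace", "somecomponent"),
		Description: "health check failed",
	}
	out, err := json.Marshal(p)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

Hashing the object identity keeps the alias short and stable across repeated events for the same object. Whether the revision should also be part of the alias is a design choice this sketch leaves open.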

al-lac commented 1 year ago

This would be nice to have.

It would also be great if alerts resolved themselves automatically once the problem is no longer occurring.

PatrickZeier-SAG commented 3 months ago

It would generally be nice to be able to customize the contents of the fields. If you have a large OpsGenie/JSM instance where alerts from multiple systems are processed, you want more information in the alert title than e.g. "Kustomization/somecomponent", as this is not very specific.

stefanprodan commented 3 months ago

event.Metadata gets injected in the payload sent to OpsGenie so you can add cluster name, region, etc. Can't this be used for deduplication?

PatrickZeier-SAG commented 3 months ago

In JSM I see this in the created alert: image (summary and testField were added by me in spec.eventMetadata of the Flux alert)

which according to the Jira API documentation matches this field: image

And event.Metadata seems to refer to spec.eventMetadata and is also part of the payload:

payload := OpsgenieAlert{
        Message:     event.InvolvedObject.Kind + "/" + event.InvolvedObject.Name,
        Description: event.Message,
        Details:     event.Metadata,

With some Jira automation rules, deduplication and title manipulation could work (I need to check with an admin on our side). Customizing the fields on the Flux side would be a bit easier in my eyes, but I see that as a feature request 😃. Many thanks @stefanprodan for the hint!

al-lac commented 3 months ago

For Opsgenie I just set the alias to the description. This works most of the time, as long as the description does not contain a time string that is always different.

What would be great however would be a message that could also close the alert. So like a "recovery" message.

stefanprodan commented 3 months ago

Flux is stateless; there is no way to send recovery messages, as notification-controller doesn't know it has sent a previous error alert.

PatrickZeier-SAG commented 3 months ago

For Opsgenie I just set the alias to the description. This works most of the time, as long as the description does not contain a time string that is always different.

What would be great however would be a message that could also close the alert. So like a "recovery" message.

That's also possible. What I use for the alias: Message title (which I enriched with some more text like the cluster name) plus the revision that comes as metadata field from Flux by default.

So, for me the alias looks like this: [FluxCD] ({{extraProperties.cluster}}) {{message}} {{extraProperties.revision}}

As the message also contains the Kustomization name and I am only sending alerts about Kustomizations, this should be enough. Of course, someone could leave the alert open, and in the meantime another issue could occur in the cluster for this Kustomization that no longer matches the alert description. But as you said, @al-lac: when the description contains a time string, the deduplication won't work.
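
As a rough illustration of how that alias composes, the sketch below reproduces the template in plain Go. The real expansion happens inside JSM/Opsgenie, not in Flux; `composeAlias` is a hypothetical helper, and the revision value is made up for the example.

```go
package main

import "fmt"

// composeAlias (hypothetical) mimics the JSM template
// "[FluxCD] ({{extraProperties.cluster}}) {{message}} {{extraProperties.revision}}".
// details stands in for the extraProperties that arrive with the alert.
func composeAlias(message string, details map[string]string) string {
	return fmt.Sprintf("[FluxCD] (%s) %s %s", details["cluster"], message, details["revision"])
}

func main() {
	details := map[string]string{
		"cluster":  "mycluster",
		"revision": "main@sha1:abc123", // made-up revision for illustration
	}
	first := composeAlias("Kustomization/somecomponent", details)
	second := composeAlias("Kustomization/somecomponent", details)

	fmt.Println(first)
	// Two events for the same object and revision yield the same alias,
	// so the second one would be deduplicated instead of paging again.
	fmt.Println(first == second)
}
```

Because the alias is built only from the cluster, the message title, and the revision, it stays identical across retries. A description containing a changing time string would break that property if it were part of the alias.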

al-lac commented 3 months ago

@stefanprodan True, I guess the only way that would work would be to send a message for every run that was OK, which would be a little noisy.

stefanprodan commented 3 months ago

@al-lac we do send a success event only once, when the recovery happens, see https://github.com/fluxcd/kustomize-controller/blob/e9f5628eccbfbc722a7637ecbf7f66580e2e4416/internal/controller/kustomization_controller.go#L910-L914

al-lac commented 3 months ago

@PatrickZeier-SAG how did you manage to enrich it with the cluster name? Did you just add more information to spec.eventMetadata?

@stefanprodan I guess I would need to set the eventSeverity to info, right? And I would then need to filter on this when creating / resolving alerts.

PatrickZeier-SAG commented 3 months ago

@al-lac Exactly. That's the alert:

apiVersion: notification.toolkit.fluxcd.io/v1beta2
kind: Alert
metadata: 
  name: jsm
  namespace: somenamespace
spec: 
  providerRef: 
    name: jsm
  eventSeverity: error
  eventSources:     
    - kind: Kustomization
      name: '*'
      namespace: somenamespace
  eventMetadata: 
    cluster: "mycluster"

I can then access this as described above with {{extraProperties.cluster}} in JSM (probably in OpsGenie as well, but I never tested that).

al-lac commented 3 months ago

@PatrickZeier-SAG thanks that works perfectly!

Now I just need to find a way to differentiate between errors, infos and recovery messages 😁

PatrickZeier-SAG commented 3 months ago

@al-lac I would be happy to read about your solution if you find something 😃 . Especially the recovery message (I did not yet get that out of the code Stefan linked).

Idea for differentiating between severity types: you could add one Flux alert per severity, each with a different value in the eventMetadata, e.g. severity: error. Then you can parse this field in JSM/OpsGenie and set the alert priority or do whatever else you want with that info.

al-lac commented 3 months ago

@PatrickZeier-SAG ah yeah that is one way of handling this. Thanks for the tip!

Yeah, me neither. I don't see how a recovery message differs from the rest. Maybe @stefanprodan can elaborate further.

stefanprodan commented 3 months ago

For the same revision, Flux will emit a single info event and not spam. If, say, the health check fails for some new Git commit and then recovers, you get two events: error and info.

al-lac commented 3 months ago

OK, I thought of doing it the way @PatrickZeier-SAG suggested, i.e. having two alerts, one for info and one for error. But as the info alert also covers the error events, I cannot use it to close the alert, because the two would always get in each other's way.

@stefanprodan OK, that is good to know. But how can I differentiate between error and info if this information does not get sent to the provider? If I had the severity level (info / error), I could just match on the revision and resolve the alert once a new info message comes in with the same resource ID.

So I would set the following as an alias on OpsGenie: <Resource>-main@<commit-id>

But without the information if it is an error or info i cannot do the closing :-(
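
The matching idea described here can be sketched as a small decision function. This is only an illustration under loud assumptions: it presumes the severity is available to whatever processes the alert, and `event`, `aliasFor`, and `decide` are hypothetical names, not part of Flux or Opsgenie.

```go
package main

import "fmt"

// action is what the receiving side should do with an incoming event.
type action string

const (
	actionCreate action = "create"
	actionClose  action = "close"
)

// event is a hypothetical, minimal view of a Flux event. Severity is assumed
// to be present, which is exactly the piece this comment says is missing.
type event struct {
	Resource string // e.g. "Kustomization/somecomponent"
	Revision string // e.g. "main@<commit-id>"
	Severity string // "error" or "info"
}

// aliasFor builds the alias proposed above: <Resource>-<revision>.
func aliasFor(e event) string {
	return e.Resource + "-" + e.Revision
}

// decide opens (or deduplicates) an alert on error and closes the alert with
// the same alias when an info event for the same resource and revision arrives.
func decide(e event) (action, string) {
	if e.Severity == "error" {
		return actionCreate, aliasFor(e)
	}
	return actionClose, aliasFor(e)
}

func main() {
	failed := event{Resource: "Kustomization/somecomponent", Revision: "main@abc123", Severity: "error"}
	recovered := event{Resource: "Kustomization/somecomponent", Revision: "main@abc123", Severity: "info"}

	a1, alias1 := decide(failed)
	a2, alias2 := decide(recovered)
	fmt.Println(a1, alias1)
	fmt.Println(a2, alias2)
}
```

Since both events map to the same alias, the info event can target the alert that the error event opened. Closing an alias that has no open alert would have to be tolerated by the receiving side.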

stefanprodan commented 3 months ago

But without the information if it is an error or info i cannot do the closing

Feel free to open a PR; all you need is to add event.Severity to the payload.
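
A minimal sketch of what such a change might look like, assuming the severity is carried in the alert details next to the event metadata. The key name "severity" and the helper `detailsWithSeverity` are assumptions for illustration and may not match what an actual PR does.

```go
package main

import "fmt"

// detailsWithSeverity (hypothetical helper) returns a copy of the event
// metadata with the event severity added, so the provider can filter or
// route on it. The input map is left untouched.
func detailsWithSeverity(metadata map[string]string, severity string) map[string]string {
	out := make(map[string]string, len(metadata)+1)
	for k, v := range metadata {
		out[k] = v
	}
	out["severity"] = severity // key name is an assumption
	return out
}

func main() {
	metadata := map[string]string{"cluster": "mycluster"}
	details := detailsWithSeverity(metadata, "error")

	fmt.Println(details["cluster"], details["severity"])
	fmt.Println(len(metadata)) // the original metadata still has one entry
}
```

Copying the map rather than mutating it avoids leaking the extra key into other notifiers that may receive the same event.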

al-lac commented 3 months ago

So with the changes from #796 I managed to set the eventSeverity to info and filter on the Opsgenie side so that only errors are turned into alerts.

However, I don't seem to get all the info alerts I expect with the following configuration:

---
apiVersion: notification.toolkit.fluxcd.io/v1beta3
kind: Alert
metadata:
    name: gitops-notifications-opsgenie
    namespace: flux-system
spec:
    summary: Alert from flux for cluster a
    providerRef:
        name: opsgenie
    eventSeverity: info
    eventSources:
      - kind: GitRepository
        name: '*'
        namespace: cluster-a
      - kind: Kustomization
        name: '*'
        namespace: cluster-a
      - kind: HelmRelease
        name: '*'
        namespace: cluster-a

Shouldn't I then also get an alert for every finished reconciliation?

I let one Kustomization fail and repaired it again, but I never got any recovery or info message in Opsgenie...

The only thing I see in the Opsgenie log is this alert coming in every time a sync runs: CustomResourceDefinition/clustersecretstores.external-secrets.io configured