Open hhollenstain opened 1 year ago
This would be nice to have.
It would also be great to have the alerts resolve themselves automatically once the condition is no longer occurring.
It would generally be nice to be able to customize the field contents. If you have a large OpsGenie/JSM instance where alerts from multiple systems are processed, you want more info than e.g. "Kustomization/somecomponent" in the title of the alert, as this is not very specific.
event.Metadata gets injected in the payload sent to OpsGenie so you can add cluster name, region, etc. Can't this be used for deduplication?
In JSM I see this in the created alert (summary and testField were added by me in spec.eventMetadata of the Flux alert), which according to the Jira API documentation matches this field:
And event.Metadata seems to refer to spec.eventMetadata and is also part of the payload:
payload := OpsgenieAlert{
	Message:     event.InvolvedObject.Kind + "/" + event.InvolvedObject.Name,
	Description: event.Message,
	Details:     event.Metadata,
}
With some Jira automation rules deduplication and title manipulation could work (need to check with some admin there on our side). Customization of the fields on Flux side would be a bit easier in my eyes, but see it as a feature request 😃. Many thanks @stefanprodan for the hint!
For Opsgenie I just set the alias to the description, which works most of the time, as long as the description does not contain a time string that is always different.
What would be great however would be a message that could also close the alert. So like a "recovery" message.
Flux is stateless; there is no way to send recovery messages, as notification-controller doesn't know it has sent a previous error alert.
> For Opsgenie I just set the alias to the description, which works most of the time, as long as the description does not contain a time string that is always different.
> What would be great however would be a message that could also close the alert. So like a "recovery" message.
That's also possible. What I use for the alias: the message title (which I enriched with some more text, like the cluster name) plus the revision that comes as a metadata field from Flux by default.
So, for me the alias looks like this: [FluxCD] ({{extraProperties.cluster}}) {{message}} {{extraProperties.revision}}
As the message also contains the Kustomization name and I am only sending alerts about Kustomizations, this should be enough. Of course, someone could leave the alert open while another issue appears in the cluster for the same Kustomization, one that no longer matches the alert description. But as you said, @al-lac: when the description contains a time string, the deduplication won't work.
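For readers who want to see what that alias template expands to, here is a minimal Go sketch. It only imitates the string the JSM template would produce; it is not JSM's template engine, and the function name buildAlias and the cluster/revision keys are assumptions made for illustration.

```go
package main

import "strings"

// buildAlias imitates the JSM alias template
//   [FluxCD] ({{extraProperties.cluster}}) {{message}} {{extraProperties.revision}}
// The function name and the map keys are hypothetical; "extra" stands in
// for the extraProperties that JSM fills from the Flux event metadata.
func buildAlias(message string, extra map[string]string) string {
	parts := []string{
		"[FluxCD]",
		"(" + extra["cluster"] + ")",
		message,
		extra["revision"],
	}
	return strings.Join(parts, " ")
}
```

Two events for the same Kustomization and revision yield the same alias and would be deduplicated; a new revision yields a new alias.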
@stefanprodan True, guess the only way that would work would be to send messages for every run that was ok, which would be a little noisy.
@al-lac we do send a success event only once, when the recovery happens, see https://github.com/fluxcd/kustomize-controller/blob/e9f5628eccbfbc722a7637ecbf7f66580e2e4416/internal/controller/kustomization_controller.go#L910-L914
@PatrickZeier-SAG how did you manage to enrich it with the cluster name? Did you just add more information to spec.eventMetadata?
@stefanprodan I guess I would need to set the eventSeverity to info, right? I guess I would then need to filter on this when creating / resolving alerts.
@al-lac Exactly. That's the alert:
apiVersion: notification.toolkit.fluxcd.io/v1beta2
kind: Alert
metadata:
  name: jsm
  namespace: somenamespace
spec:
  providerRef:
    name: jsm
  eventSeverity: error
  eventSources:
    - kind: Kustomization
      name: '*'
      namespace: somenamespace
  eventMetadata:
    cluster: "mycluster"
And this I can then access as described above with {{extraProperties.cluster}} in JSM (probably OpsGenie as well, never tested).
@PatrickZeier-SAG thanks that works perfectly!
Now i just need to find a way to differentiate between errors, infos and recovery messages 😁
@al-lac I would be happy to read about your solution if you find something 😃 . Especially the recovery message (I did not yet get that out of the code Stefan linked).
Idea for differentiating between severity types: you could add one Flux alert per severity, each with a different value in eventMetadata, e.g. severity: error. Then you can parse this field in JSM/OpsGenie and set the alert priority or whatever you want to do with that info.
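A rough Go sketch of the parsing side of that idea, assuming the receiving system exposes the metadata as a string map. The severity key comes from the suggestion above; the function name and the particular priority mapping are made up for illustration (P1 to P5 are Opsgenie's priority levels, but which severity maps to which level is entirely a local choice).

```go
package main

// priorityFor picks an alert priority from the "severity" value that a
// Flux Alert injected through eventMetadata. The mapping is a hypothetical
// example, not something Flux, JSM, or Opsgenie prescribes.
func priorityFor(details map[string]string) string {
	switch details["severity"] {
	case "error":
		return "P2"
	case "info":
		return "P5"
	default:
		// Fall back to a middle priority when the field is absent or unknown.
		return "P3"
	}
}
```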
@PatrickZeier-SAG ah yeah that is one way of handling this. Thanks for the tip!
Yeah me neither, i don't see a way on how a recovery message is different from the rest. Maybe @stefanprodan can elaborate further.
For the same revision, Flux will emit a single info event and not spam. If, say, the health check fails for some new Git commit and then recovers, you get two events: error and info.
OK, I thought of doing it the way @PatrickZeier-SAG suggested, i.e. having two alerts, one for info and one for error. But as the info alert also covers the error events, I cannot use it to close the alert, as they would always get in each other's way.
@stefanprodan ok that is good to know. But how will i be able to differentiate between error and info if this info does not get sent to the provider? If i would have the error level (info / error), i could just match on the revision and resolve the alert once a new info message comes in with the same resource id.
So I would set the following as an alias on OpsGenie: <Resource>-main@<commit-id>
But without the information if it is an error or info i cannot do the closing :-(
> But without the information if it is an error or info i cannot do the closing
Feel free to open a PR, all you need is adding event.Severity to the payload.
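A minimal sketch of what such a change might look like, using pared-down stand-in types rather than the real notification-controller ones. Putting the severity under a "severity" key in the details map is an assumption made here for illustration; the actual PR may place it elsewhere.

```go
package main

// fluxEvent is a hypothetical, simplified stand-in for the Flux event type.
type fluxEvent struct {
	Kind     string
	Name     string
	Message  string
	Severity string
	Metadata map[string]string
}

// alertPayload mirrors the three fields shown in the excerpt above.
type alertPayload struct {
	Message     string            `json:"message"`
	Description string            `json:"description"`
	Details     map[string]string `json:"details"`
}

// newAlertPayload copies the event metadata and adds the severity to it,
// so the receiving side can tell error events from info events.
func newAlertPayload(e fluxEvent) alertPayload {
	details := make(map[string]string, len(e.Metadata)+1)
	for k, v := range e.Metadata {
		details[k] = v
	}
	details["severity"] = e.Severity // assumed key name

	return alertPayload{
		Message:     e.Kind + "/" + e.Name,
		Description: e.Message,
		Details:     details,
	}
}
```

Copying the map keeps the original event metadata untouched, which matters if the same event is dispatched to several providers.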
So with the changes from #796 I managed to set the eventSeverity to info and filter on Opsgenie so that only errors are made into an alert.
However, I don't seem to get enough info alerts with the following configuration:
---
apiVersion: notification.toolkit.fluxcd.io/v1beta3
kind: Alert
metadata:
  name: gitops-notifications-opsgenie
  namespace: flux-system
spec:
  summary: Alert from flux for cluster a
  providerRef:
    name: opsgenie
  eventSeverity: info
  eventSources:
    - kind: GitRepository
      name: '*'
      namespace: cluster-a
    - kind: Kustomization
      name: '*'
      namespace: cluster-a
    - kind: HelmRelease
      name: '*'
      namespace: cluster-a
Should I not also get an alert then for every "Reconciliation finished"?
I let one kustomization fail and repaired it again, but i never got any recovery message or info message in Opsgenie...
The only thing i see in the Opsgenie log is this alert coming in every time a sync runs:
CustomResourceDefinition/clustersecretstores.external-secrets.io configured
We currently utilize Opsgenie for paging and found the integration with Flux works pretty well. The main point of contention is the missing alias for deduplication. Currently, when an alert is triggered, it will continuously fire and create new pages. Ideally we can set an alias, fire once, and let Opsgenie handle additional notifications/triggers.
https://github.com/fluxcd/notification-controller/blob/505345cb3ecc006ed71fae02d5988de7acd65ac9/internal/notifier/opsgenie.go#L68
Opsgenie API docs
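One way the requested alias could be derived is sketched below in Go. This is not the controller's implementation: the function name and the choice of inputs are assumptions. The idea is to build the alias only from values that stay stable across repeated failures (cluster, kind, namespace, name) and leave out the message text, which may contain timestamps.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"strings"
)

// dedupAlias returns a stable identifier for the object an event refers to.
// Hashing keeps the alias short and free of special characters; the inputs
// and their order are a hypothetical choice for this sketch.
func dedupAlias(cluster, kind, namespace, name string) string {
	key := strings.Join([]string{cluster, kind, namespace, name}, "/")
	sum := sha256.Sum256([]byte(key))
	return hex.EncodeToString(sum[:])
}
```

Repeated events for the same object then share one alias, so Opsgenie can count them against a single alert instead of paging again.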