GoogleCloudPlatform / prometheus-engine

Google Cloud Managed Service for Prometheus libraries and manifests.
https://g.co/cloud/managedprometheus
Apache License 2.0

Persisting silences in alertmanager #685

Open marwanad opened 1 year ago

marwanad commented 1 year ago

In the managed alertmanager, the alertmanager-data volume is an emptyDir, which means that configured silences and notification state won't persist across pod restarts. Is there a way to configure a PVC for the data directory of the managed alertmanager?

bwplotka commented 1 year ago

That's correct, thanks for raising this.

Alertmanager is deployed as a StatefulSet, but with a best-effort emptyDir volume, which does not guarantee any persistence. In a self-deployed setup that's solvable, since you can modify the Alertmanager resources yourself, but not in managed GMP.
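
For illustration, the relevant part of the managed Alertmanager StatefulSet looks roughly like this (an approximate excerpt, not the exact manifest; names and mount paths may differ between prometheus-engine versions):

```yaml
# Approximate excerpt of the managed Alertmanager StatefulSet, not the exact
# manifest. The key point: the data directory is an emptyDir, so silences and
# notification state are lost whenever the pod is rescheduled.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: alertmanager
  namespace: gmp-system
spec:
  serviceName: alertmanager
  template:
    spec:
      containers:
      - name: alertmanager
        volumeMounts:
        - name: alertmanager-data
          mountPath: /data        # mount path may differ in your version
      volumes:
      - name: alertmanager-data
        emptyDir: {}
```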

We could discuss this feature as a team if you want; it feels like something we could consider, but at a lower priority. Help from contributors is also wanted here and might get it done faster.

Just curious, what's your use case for the managed alertmanager? Would our recent cloud feature, currently in preview, PromQL for Cloud Monitoring Alerting, help?

marwanad commented 12 months ago

@bwplotka thanks for the response! At the time there was no way to disable the deployment of the managed alertmanager through the GMP operator, so we ended up using it rather than running duplicate deployments.

So it's basically the same use case as for an unmanaged alertmanager: at the time we couldn't define PromQL rules in Cloud Monitoring, and we needed more control over the Slack, PagerDuty, and other notification channel configs. The preview feature looks interesting and covers a subset of our use case, but we'll still need alertmanager for generic webhook channels.

lyanco commented 12 months ago

Note that Cloud Alerting PromQL does support generic webhook channels: https://cloud.google.com/monitoring/support/notification-options#webhooks

taldejoh commented 11 months ago

We are facing the same problem. All of our silences are gone after a pod restart and we need to recreate them manually; in the last two weeks this happened twice. This improvement would be very helpful for us too!

bwplotka commented 8 months ago

Sorry for the lag, it's on our radar again; we are brainstorming how to enable persistent volumes here.

Interestingly, there is a very nasty "persistent" workaround for silences in the meantime: https://github.com/prometheus/alertmanager/issues/1673#issuecomment-819421068 (thanks @TheSpiritXIII for finding it!)

bwplotka commented 7 months ago

Just a quick question to users who care about this feature: which managed collection (this operator) deployment model do you use?

1️⃣ The one available on GKE (fully managed). If that's the case, how do you submit the silences?
2️⃣ Self-deployed operator (via kubectl). If that's the case, what stops you from manually adjusting the Alertmanager StatefulSet YAML for your needs and re-applying it (see the sketch below)? The operator will manage that one just fine, as long as you keep the labels, namespace, and name the same.
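
For the self-deployed case (option 2), a rough sketch of what such an adjustment could look like, swapping the emptyDir for a volumeClaimTemplate (size, storage class, and mount path below are placeholders; the exact names should match your installed manifests):

```yaml
# Rough sketch only: replace the best-effort emptyDir with a PVC per replica.
# Keep the StatefulSet name, namespace, and labels unchanged so the operator
# continues to manage it, and remove the old emptyDir entry from
# .spec.template.spec.volumes. Size and storage class are placeholders.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: alertmanager
  namespace: gmp-system
spec:
  template:
    spec:
      containers:
      - name: alertmanager
        volumeMounts:
        - name: alertmanager-data
          mountPath: /data            # keep whatever path the original manifest uses
  volumeClaimTemplates:
  - metadata:
      name: alertmanager-data         # same name as the former emptyDir volume
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 1Gi
```

Note that Kubernetes treats volumeClaimTemplates as immutable on an existing StatefulSet, so applying a change like this typically means deleting and recreating the StatefulSet (e.g. kubectl delete --cascade=orphan first).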

cc @m3adow @marwanad @taldejoh

marwanad commented 7 months ago

@bwplotka appreciate the updates on this :)

We were using option 1, creating silences by port-forwarding to the running alertmanager instance and adding them through the UI, or by using amtool to submit them.

We've since switched to a self-deployed alertmanager instance to get more control over this, setting the alertmanagers field in the OperatorConfig to point to our self-managed instance.
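
For reference, a minimal sketch of that OperatorConfig change (field names as I understand the current OperatorConfig CRD; the Alertmanager service name, namespace, and port below are placeholders for the self-managed instance):

```yaml
# Minimal sketch: point managed rule evaluation at a self-deployed
# Alertmanager instead of the managed one. The service name, namespace, and
# port are placeholders; check the OperatorConfig CRD for the full schema.
apiVersion: monitoring.googleapis.com/v1
kind: OperatorConfig
metadata:
  namespace: gmp-public
  name: config
rules:
  alerting:
    alertmanagers:
    - name: my-alertmanager      # Service fronting the self-deployed Alertmanager
      namespace: monitoring
      port: 9093
```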

m3adow commented 7 months ago

We're using option 1 as well. We're currently in the process of migrating from kube-prometheus-stack to GMP, and we want to have as much of the "GM" as possible. 😄
Right now, we're also using port-forwarding and the UI to silence alerts. Since the alerts are sent to Teams channels, we don't have a way to silence them later in the alerting chain.

bwplotka commented 7 months ago

Epic, thanks for the clarifications!