GoogleCloudPlatform / prometheus-engine

Google Cloud Managed Service for Prometheus libraries and manifests.
https://g.co/cloud/managedprometheus
Apache License 2.0
189 stars 86 forks source link

Fix/issue 685 persisting silences in alertmanager #1069

Open GorginZ opened 1 month ago

GorginZ commented 1 month ago
GorginZ commented 1 month ago

This looked like very low-hanging fruit and the issue has some recent activity so thought I'd pick this up.

I haven't exercised this in a cluster or written tests yet, just quickly put this in and looked at resulting templates for different values files which looked satisfactory - but will do proper testing when I have time this week.

TheSpiritXIII commented 1 month ago

Thanks so much! This is a great change. This doesn't fully solve #685 since default-on managed users cannot edit and apply these manifests, but it works for those who self-install GMP. :)

I'm just curious -- have you tried Kustomize as an alternative? You could even use it with Helm: https://trstringer.com/helm-kustomize/

The team did not expect users to use our Helm chart (yet). We're only using it for templating at the moment. I wonder what other flags users might want. For example, maybe someone would want a static ConfigMap as opposed to a PVC 🤔

Another idea the team had which we really liked was to introduce a new custom resource for silences (and so we could rebuild any temp dirs, or stuff it into a ConfigMap).

Otherwise, thanks for the PR. It's great to see interest.

GorginZ commented 1 month ago

Heya, I'm working with a client at the moment who uses GMP, but sadly we don't leverage the alertManager component so we aren't actually impacted by silences not being persisted ourselves - I would like us to use alertManager, but the appetite was to just make alerts with the native monitoring options that already exist and just leverage the metrics.

I just noted the open issue because we had a funny slip the other day where a coworker accidentally deleted every CRD and I noticed prom operator does log errors (can't get it's configs), I was wondering if it should complain a little louder and was looking to see what issues were open (I did not open an issue and don't necessarily need to, like it's pretty easy to deduce what's wrong, but that's why I was browsing issues and saw this one and thought I'd start looking at it)

The team did not expect users to use our Helm chart (yet). We're only using it for templating at the moment.

Yeah I did note the notices throughout the repo.

Thanks so much! This is a great change. This doesn't fully solve https://github.com/GoogleCloudPlatform/prometheus-engine/issues/685 since default-on managed users cannot edit and apply these manifests, but it works for those who self-install GMP. :)

And did realize this shortly after picking it up and when I brought up a cluster last night to apply manifests and realized really all I can do is show that the compiled manifests are "fine" for the three different configurations people would do for self-installs...

Another idea the team had which we really liked was to introduce a new custom resource for silences (and so we could rebuild any temp dirs, or stuff it into a ConfigMap).

Considering the relationship the manifests have with the installation on GKE clusters I can see why this would be an attractive approach. hmm.

Happy to polish for at least the self-install case at least.

Looks like you guys are fairly proactive here so I'd be keep on helping with other stuff too where I can.

GorginZ commented 3 weeks ago

Wanted to apply templated manifests in a cluster this arvo to do some testing evidence, but running off main first for sanity ofc and hitting this https://github.com/GoogleCloudPlatform/prometheus-engine/issues/1013

I'll just like revert the readonly stuff introduced in https://github.com/GoogleCloudPlatform/prometheus-engine/pull/944 while I fiddle locally :)

also

maybe someone would want a static ConfigMap as opposed to a PVC 🤔

Another idea the team had which we really liked was to introduce a new custom resource for silences (and so we could rebuild any temp dirs, or stuff it into a ConfigMap).

gotta get it up and running first, but I'll think further about these ideas. I initially like configMap idea most - probably more than having a PVC. And I may prefer it to CRDs just bc less objects/simplicity.