Investigate Alerting: Implement Alerting for ElasticSearch in Dev

Jose-Matsuda commented 2 years ago

Current Status 20/09/2022

[ ] Needs a bit of elaboration on the other side: align things architecturally so the CNS alerting and this alerting can work side by side. Maybe by the end of this week (Friday the 23rd).

Taken off of Pat's comment

Steps

[x] Create a PrometheusRule in argocd that monitors the ElasticSearch PVCs. I would hope that this just gets picked up (just try it); also this
[ ] Configure alert manager in dev to consume said rule
- [ ] Will need to get a URL for a Slack channel to send alerts to
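
For the first step, a minimal sketch of what a PrometheusRule watching the ElasticSearch PVCs might look like. The rule name, namespace, `release` label value, PVC name pattern and the 15% threshold are all assumptions for illustration. The label in particular has to match whatever `ruleSelector` the Prometheus instance in dev actually uses, otherwise the rule will not be picked up.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: elasticsearch-pvc-alerts          # hypothetical name
  namespace: monitoring                   # assumption: namespace Prometheus watches
  labels:
    release: kube-prometheus-stack        # assumption: must match the ruleSelector
spec:
  groups:
    - name: elasticsearch-pvc.rules
      rules:
        - alert: ElasticsearchPVCLowSpace
          # Ratio of free bytes to capacity per PVC, as reported by the kubelet.
          # The PVC name pattern is a guess; adjust it to the real claim names.
          expr: |
            kubelet_volume_stats_available_bytes{persistentvolumeclaim=~"elasticsearch-data-.*"}
              /
            kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~"elasticsearch-data-.*"}
              < 0.15
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "PVC {{ $labels.persistentvolumeclaim }} in {{ $labels.namespace }} has less than 15% free space"
```

Once applied, the rule should show up under Status > Rules in the Prometheus UI if the selector labels line up.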

Need to make changes in the values here, at this alertmanager probably

Possible other issues to read through: https://github.com/prometheus-community/helm-charts/issues/393

Important

Do not make any changes or PRs even to the cloud repos

Jose-Matsuda commented 2 years ago

Port forwarding alertmanager and navigating to http://localhost:9093/#/status shows me the current config. I need to find out where this is set and how I can change it, say via argocd.

Though note that it does say

Can maybe use a configmap?

Could possibly do it

https://github.com/prometheus-community/helm-charts/blob/dd70d54ea0cef913140a78b918afac88d7c8ef2e/charts/kube-prometheus-stack/values.yaml#L485

Note that it is mounted here (in the alertmanager pod)

is controlled by

and is populated with the values

which is the default from the chart

Jose-Matsuda commented 2 years ago

Prometheus Rules

Note that some already exist in the volume

In the PR below I was able to get our own test Prometheus Rules recognized and present in that volume

Jose-Matsuda commented 2 years ago

Pat has graciously directed me to the following repos, which show how we end up with Prometheus etc. in the cluster.

We have the specific terraform-kubernetes-kube-prometheus-stack to install the stack, which is referenced by the generic terraform-statcan-kubernetes-core-platform, which in turn is referenced by our specific repo terraform-statcan-aaw-platform for our own clusters.

Jose-Matsuda commented 2 years ago

Passing along our AlertManager configuration.

Pat also gave more insight, saying that, similar to how we 'custom' set the disk space for Prometheus, we will probably need to make a variable to pass down the chain. First, at the statcan-aaw repo, we just need to declare the variable and then create the TF_VAR in the git secrets.

This then has to get passed down to the core-platform.

Having said that, if I take a look at the chart I can see this. Maybe we can get away with configuring this, and then on the argocd side we can make changes to the configmap as we see fit? Though I am unsure what happens if we change the configmap while it's running.

^ This almost made sense going by what I knew earlier, but a little further down you see configSecret, which matches the pictures I have above in terms of location. The "regular" configuration seems to be taken from here, which populates the secret. The problem with this is that it really only uses .Values.alertmanager.config and nothing else.

`configSecret`

I think we can use this. If that is the case, we do not need to do much variable passing (we just need to enable the option), as we ourselves can control the config via argocd with a secret. We would just need to restart alertmanager when it is updated(?). TODO:

[ ] Try doing a DRY-RUN helm install and messing with this setting
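
For the dry run, a rough sketch of the values override that the `configSecret` idea would need. This assumes the chart version pinned by core-platform exposes both `useExistingSecret` and `configSecret` under `alertmanager.alertmanagerSpec` (worth confirming against that version's values.yaml). The secret name is hypothetical.

```yaml
alertmanager:
  alertmanagerSpec:
    # Tell the chart not to generate its own config secret from
    # .Values.alertmanager.config; we supply the secret ourselves.
    useExistingSecret: true
    # Name of a Secret in the same namespace as the Alertmanager object.
    # Hypothetical name; it must contain a key called alertmanager.yaml.
    configSecret: aaw-alertmanager-config
```

Rendering the chart with this override (for example with `helm template`) and diffing the Alertmanager object against the default output should show whether the setting takes effect.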

Jose-Matsuda commented 2 years ago

Relevant Information Regarding Versions

(as of 12/09/2022)

DEV references terraform-statcan-aaw-platform v3.7.0, which references terraform-statcan-kubernetes-core-platform v1.7.0, which in turn references terraform-kubernetes-kube-prometheus-stack v2.0.0 (not much there; just focus on the k8s-core-platform as that contains the actual values).

Checking this was important to make sure that we were not missing any key upgrades.

Jose-Matsuda commented 2 years ago

If we go the `secret` route

We will likely want to create the secret in our various TF files, like in

Where data would be something like

We would need to base64 encode whatever configuration we want and then put that as a secret in the repo. Remember that the configuration needs to match what they specify in the docs (a la that alertmanager.yaml in /etc/alertmanager/config).
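
To make the shape of that secret concrete, here is a sketch written as a plain Kubernetes manifest rather than Terraform. The secret name, namespace, receiver name, channel and webhook URL are all placeholders, not our real values. It uses `stringData`, which accepts plain text and is stored base64-encoded by the API server; with `data` instead, the value would have to be base64 encoded by hand as described above.

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: aaw-alertmanager-config     # hypothetical name
  namespace: monitoring             # assumption: same namespace as Alertmanager
type: Opaque
stringData:
  # The key must be alertmanager.yaml so it lands in /etc/alertmanager/config
  alertmanager.yaml: |
    route:
      receiver: slack-dev
      group_by: ["alertname", "namespace"]
    receivers:
      - name: slack-dev
        slack_configs:
          # Placeholder webhook; the real URL should not be committed in plain text
          - api_url: https://hooks.slack.com/services/REPLACE/ME
            channel: "#placeholder-channel"
            send_resolved: true
```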

Jose-Matsuda commented 2 years ago

Actual Configuration to Use

In this gist. The format / look of this may change: if we go with modifying the secret, we will need to encode it, etc.

Jose-Matsuda commented 2 years ago

Closing, will create a new thing to track CNS

StatCan / aaw