cortexproject / cortex

A horizontally scalable, highly available, multi-tenant, long term Prometheus.
https://cortexmetrics.io/
Apache License 2.0

Alertmanager template changes are not fully reloaded #5807

Open · locmai opened 6 months ago

locmai commented 6 months ago

Describe the bug

We run the Cortex Helm chart in our Kubernetes cluster, with a sidecar in the alertmanager pod that continuously watches a ConfigMap for changes and synchronizes the templates into our /data/fake/templates directory. The template files on disk are updated, but the changes are not fully reflected in the notification messages.

To Reproduce

Steps to reproduce the behavior (note: fake is the dummy tenant name):

  1. Start a minimal Cortex with alertmanager and sidecar
  2. Define a template file example.gotmpl (example in additional context)
  3. Let the alertmanager reload the configuration (log message: https://github.com/cortexproject/cortex/blob/9bc04ce3930b045480d72ab9712d3271c70c02ee/pkg/alertmanager/multitenant.go#L689)
  4. Then change the `changeme` part of the template in the configmap
  5. Let the alertmanager reload the configuration again (same log as step 3)
  6. Check the file `/data/fake/templates/example.gotmpl` - the file will contain the changes
  7. Use amtool to simulate an alert (see the example command after this list)
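
For step 7, something like the command below works for us. The host, port and path prefix are assumptions based on the /api/prom/alertmanager external URL used in this setup; with auth disabled, the alert is attributed to the fake tenant.

# Fire a test alert at the Cortex Alertmanager (adjust URL to your deployment)
amtool alert add alertname=my_alert severity=warning \
  --alertmanager.url=http://cortex-alertmanager:8080/api/prom/alertmanager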

Expected behavior

The changes would be reflected in the simulated alert sent by amtool

Actual behavior

The old template is still being used

Environment:

Additional Context

Alertmanager configuration:

receivers:
  - name: 'team-1'
    slack_configs:
      - channel: '#team1'
        send_resolved: true
        title: '{{ template "__alert_title" . }}'
        text: |-
          Title :{{ template "__alert_title" . }}
templates:
  - 'example.gotmpl'

example.gotmpl:

{{ define "__alert_title" -}}
   {{ .CommonLabels.alertname }} - changeme
{{- end }}

We tried calling the /api/v1/alerts endpoint, which returns the updated template, and the log message indicates that loadAndSyncConfigs actually ran.
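
For cross-checking, the per-tenant configuration (including template bodies) can be fetched from that endpoint with the tenant header; the host and port below are placeholders for our setup.

# Fetch the fake tenant's Alertmanager config from the Cortex config API
curl -H 'X-Scope-OrgID: fake' http://cortex-alertmanager:8080/api/v1/alerts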

I've traced through the code from loadAndSyncConfigs -> setConfig to this line: https://github.com/cortexproject/cortex/blob/9bc04ce3930b045480d72ab9712d3271c70c02ee/pkg/alertmanager/multitenant.go#L861C3-L861C68

This line seems to compare the templates from the loaded/updated cfg with the template file at templateFilePath, which has already been updated by the sidecar's sync mechanism, so no change is detected.
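
To illustrate what I think is happening, here is a rough sketch. This is not the actual Cortex code, just a simplified model of the suspected compare-and-skip behaviour, with made-up names:

package sketch

import "os"

// syncTemplate is a simplified model of the suspected behaviour (not the
// actual Cortex code). The sidecar has already written the new template body
// to templateFilePath, so the comparison below sees identical content,
// reports "no change", and the caller never rebuilds the tenant's
// Alertmanager - which keeps serving the previously compiled template.
func syncTemplate(templateFilePath, newBody string) (bool, error) {
	current, err := os.ReadFile(templateFilePath)
	if err == nil && string(current) == newBody {
		return false, nil // contents already match: treated as "nothing to reload"
	}
	if err := os.WriteFile(templateFilePath, []byte(newBody), 0o644); err != nil {
		return false, err
	}
	return true, nil
}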

yeya24 commented 6 months ago

Need to take a look and see if we can reproduce this issue.

dpericaxon commented 3 months ago

@yeya24 were you able to look into this issue? We have a few users trying to make template changes, and right now the only remediation is to restart the alertmanager pods. Since we have multiple environments, that can be a bit burdensome, especially if a template has an issue and needs to be rolled back.
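
The restart we do today is just a rolling restart of the alertmanager workload, roughly as below; the namespace and resource name are assumptions for a typical chart install, so adjust them to your release.

# Roll the alertmanager pods so the templates are recompiled
kubectl -n cortex rollout restart statefulset cortex-alertmanager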

yeya24 commented 3 months ago

Hey @rapphil, @rajagopalanand, can you guys maybe help take a look at the issue?

dpericaxon commented 3 months ago

Hey @rapphil, @rajagopalanand, did you get a chance to take a look at this? Since we have multiple environments, we currently have to do a rolling restart in every environment for every template change.

rajagopalanand commented 3 months ago

Not yet. I will try to find some time to look at it this week.

rapphil commented 2 months ago

I'm taking a look at this issue. Right now I'm trying to reproduce it using the Helm charts and a local cluster.

rapphil commented 2 months ago

Hi, I was not able to reproduce your issue:

Having said that, here are a couple of questions:

This is the payload that was passed to the echo server:

{"name":"echo-server","hostname":"echo-server-5fb75ccd64-bqkz5","pid":1,"level":30,"host":{"hostname":"echo-server.cortex","ip":"::ffff:10.244.0.116","ips":[]},"http":{"method":"POST","baseUrl":"","originalUrl":"/","protocol":"http"},"request":{"params":{},"query":{},"cookies":[],"body":{"channel":"#channel1","username":"Alertmanager","attachments":[{"title":"my_alert_confmap - changeme","title_link":"/api/prom/alertmanager/#/alerts?receiver=slack-config","text":"Title :my_alert_confmap - changeme","fallback":"[FIRING:1]  (my_alert_confmap my_instance my_cron_job warning) | /api/prom/alertmanager/#/alerts?receiver=slack-config","callback_id":"","footer":"","color":"danger","mrkdwn_in":["fallback","pretext","text"]}]},"headers":{"host":"echo-server.cortex","user-agent":"Alertmanager/","content-length":"439","content-type":"application/json"}},"msg":"Fri, 19 Jul 2024 20:30:58 GMT | [POST] - http://echo-server.cortex/","time":"2024-07-19T20:30:58.213Z","v":0}

Here is the full configmap that I used for alertmanager:

apiVersion: v1
kind: ConfigMap
metadata:
  annotations:
    k8s-sidecar-target-directory: /data/
  labels:
    cortex_alertmanager: "1"
  name: alertmanager-example-config
  namespace: cortex
data:
  fake.yaml: |-
    route:
      group_wait: 30s
      group_interval: 10s
      receiver: slack-config
    receivers:
    - name: 'slack-config'
      slack_configs:
        - send_resolved: true
          api_url: 'http://echo-server.cortex'
          channel: "#channel1"
          title: '{{ template "__alert_title" . }}'
          text: 'Title :{{ template "__alert_title" . }}'
    templates:
    - 'template.gotmpl'

For the templates:

apiVersion: v1
kind: ConfigMap
metadata:
  annotations:
    k8s-sidecar-target-directory: /data/fake/templates/
  labels:
    cortex_alertmanager: "1"
  name: alertmanager-example-template
  namespace: cortex
data:
  template.gotmpl: |-
   {{ define "__alert_title" -}}
     {{ .CommonLabels.alertname }} - changeme
   {{- end }}

Also note that the sidecar is only functional if you use local storage.

locmai commented 2 months ago

Hey @rapphil , thanks for taking a look at this issue.

Also note that the sidecar is only functional if you use local storage.

Yes, we are using the local storage backend for alertmanager. Here is our alertmanager configuration; it could be quite outdated, since we have kept it from the initial setup until now:

alertmanager:
  external_url: /api/prom/alertmanager
  enable_api: true
  data_dir: /data/
alertmanager_storage:
  backend: local
  local:
    path: /data

When you try to access the alertmanager endpoint /multitenant_alertmanager/configs, do you see a correct configuration? Is the configuration what you are expecting?

I tested that previously and it returned the unchanged/non-updated configuration. But your test is much simpler; I will try to reproduce it the same way and update the results here.
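
For reference, this is roughly how I inspect that endpoint; the namespace, service name and port are specific to my setup.

# Port-forward the alertmanager and dump the per-tenant configs it has loaded
kubectl -n cortex port-forward svc/cortex-alertmanager 8080:8080
curl -s http://localhost:8080/multitenant_alertmanager/configs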