locmai opened this issue 6 months ago
Need to take a look and see if we can reproduce this issue.
@yeya24 were you able to see this issue? We have a few users trying to make template changes, and right now to remediate we have to restart the alertmanager pods. But we have multiple environments, so that can be a bit burdensome, especially if the template has an issue and needs to be rolled back.
Hey @rapphil, @rajagopalanand, can you guys maybe help take a look at the issue?
Hey @rapphil, @rajagopalanand, did you get a chance to take a look at this? Since we have multiple environments, we currently have to do a rolling restart in every environment for every template change.
Not yet. I will try and find some time to look this week.
I'm taking a look at this issue. Right now I'm trying to reproduce it using the helm charts and a local cluster.
Hi, I was not able to reproduce your issue:
Having said that, here are a couple of questions:
When you try to access the alertmanager endpoint /multitenant_alertmanager/configs, do you see a correct configuration? Is the configuration what you are expecting?
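(For anyone following along: that endpoint can be checked with a plain HTTP GET. A minimal Go sketch follows; the localhost:8080 address is an assumption and should be adjusted to however your Cortex alertmanager component is exposed, e.g. via a kubectl port-forward.)

```go
package main

import (
	"fmt"
	"io"
	"net/http"
)

func main() {
	// Assumed address: adjust to wherever the Cortex alertmanager
	// component is reachable in your deployment.
	resp, err := http.Get("http://localhost:8080/multitenant_alertmanager/configs")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// The response lists the per-tenant Alertmanager configs and
	// template files as currently loaded by Cortex.
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body))
}
```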
This is what I'm getting when running my tests:
fake:
  template_files:
    template.gotmpl: |-
      {{ define "__alert_title" -}}
      {{ .CommonLabels.alertname }} - changeme
      {{- end }}
  alertmanager_config: |-
    route:
      group_wait: 30s
      group_interval: 10s
      receiver: slack-config
    receivers:
      - name: 'slack-config'
        slack_configs:
          - send_resolved: true
            api_url: 'http://echo-server.cortex'
            channel: "#channel1"
            title: '{{ template "__alert_title" . }}'
            text: 'Title :{{ template "__alert_title" . }}'
    templates:
      - 'template.gotmpl'
This is the payload that was passed to the echo server:
{"name":"echo-server","hostname":"echo-server-5fb75ccd64-bqkz5","pid":1,"level":30,"host":{"hostname":"echo-server.cortex","ip":"::ffff:10.244.0.116","ips":[]},"http":{"method":"POST","baseUrl":"","originalUrl":"/","protocol":"http"},"request":{"params":{},"query":{},"cookies":[],"body":{"channel":"#channel1","username":"Alertmanager","attachments":[{"title":"my_alert_confmap - changeme","title_link":"/api/prom/alertmanager/#/alerts?receiver=slack-config","text":"Title :my_alert_confmap - changeme","fallback":"[FIRING:1] (my_alert_confmap my_instance my_cron_job warning) | /api/prom/alertmanager/#/alerts?receiver=slack-config","callback_id":"","footer":"","color":"danger","mrkdwn_in":["fallback","pretext","text"]}]},"headers":{"host":"echo-server.cortex","user-agent":"Alertmanager/","content-length":"439","content-type":"application/json"}},"msg":"Fri, 19 Jul 2024 20:30:58 GMT | [POST] - http://echo-server.cortex/","time":"2024-07-19T20:30:58.213Z","v":0}
Here is the full configmap that I used for alertmanager:
apiVersion: v1
kind: ConfigMap
metadata:
  annotations:
    k8s-sidecar-target-directory: /data/
  labels:
    cortex_alertmanager: "1"
  name: alertmanager-example-config
  namespace: cortex
data:
  fake.yaml: |-
    route:
      group_wait: 30s
      group_interval: 10s
      receiver: slack-config
    receivers:
      - name: 'slack-config'
        slack_configs:
          - send_resolved: true
            api_url: 'http://echo-server.cortex'
            channel: "#channel1"
            title: '{{ template "__alert_title" . }}'
            text: 'Title :{{ template "__alert_title" . }}'
    templates:
      - 'template.gotmpl'
And for the templates:
apiVersion: v1
kind: ConfigMap
metadata:
  annotations:
    k8s-sidecar-target-directory: /data/fake/templates/
  labels:
    cortex_alertmanager: "1"
  name: alertmanager-example-template
  namespace: cortex
data:
  template.gotmpl: |-
    {{ define "__alert_title" -}}
    {{ .CommonLabels.alertname }} - changeme
    {{- end }}
Also note that the sidecar is only functional if you use local storage.
Hey @rapphil, thanks for taking a look at this issue.
> Also note that the sidecar is only functional if you use local storage.
Yes, we are using the local storage backend for alertmanager. Here is our alertmanager configuration; it may be somewhat outdated, since we have kept it unchanged from our initial setup until now:
alertmanager:
  external_url: /api/prom/alertmanager
  enable_api: true
  data_dir: /data/
alertmanager_storage:
  backend: local
  local:
    path: /data
> When you try to access the alertmanager endpoint /multitenant_alertmanager/configs, do you see a correct configuration? Is the configuration what you are expecting?
I tested that previously and it returned the unchanged (non-updated) configuration. But your test is much simpler; I will try to reproduce it the same way and update the results here.
Describe the bug
With a Cortex helm chart in our Kubernetes cluster - and a sidecar in the alertmanager pod that continuously checks for changes to a configmap and synchronizes the templates into our /data/fake/templates directory - the template files are updated, but the changes are not fully reflected in the messages.

To Reproduce
Steps to reproduce the behavior (note: fake is the dummy tenant name):
1. Update example.gotmpl (example in additional context).
2. Use amtool to simulate an alert.

Expected behavior
The changes would be reflected in the simulated alert sent by amtool.

Actual behavior
The old template is still being used.

Environment:

Additional Context
Alertmanager configuration:
example.gotmpl:
We tried calling the /api/v1/alerts endpoint, which gives us the updated template, and the log message indicates that loadAndSyncConfigs actually ran. I've traced through the function from loadAndSyncConfigs -> setConfig, where this line: https://github.com/cortexproject/cortex/blob/9bc04ce3930b045480d72ab9712d3271c70c02ee/pkg/alertmanager/multitenant.go#L861C3-L861C68 seems to compare the templates from the loaded/updated cfg with the template in the store (at templateFilePath), which has already been updated via the sidecar's mechanism.
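To make that hypothesis concrete: if the change detection writes template files from the tenant config and treats an already-matching file on disk as "unchanged", then a sidecar that writes the updated template content directly to the same path makes the diff come up empty, and the per-tenant Alertmanager is never recreated with the new template. Here is a hedged Go sketch of that kind of pattern (an illustration of the hypothesis only, not the actual Cortex code; syncTemplate is a hypothetical helper):

```go
package main

import (
	"fmt"
	"os"
)

// syncTemplate is a hypothetical helper illustrating the suspected
// pattern: the desired template content is compared against what is
// already on disk, and `changed` is true only when they differ.
func syncTemplate(path, content string) (changed bool, err error) {
	current, err := os.ReadFile(path)
	if err == nil && string(current) == content {
		// File already matches the desired content -> "no change".
		return false, nil
	}
	if err != nil && !os.IsNotExist(err) {
		return false, err
	}
	return true, os.WriteFile(path, []byte(content), 0o644)
}

func main() {
	// The sidecar has already written the updated template to disk...
	_ = os.WriteFile("/tmp/template.gotmpl", []byte("new content"), 0o644)

	// ...so when the sync loop later compares the loaded template
	// against the file, it sees no difference, reports no change, and
	// the in-memory Alertmanager keeps its old compiled templates.
	changed, err := syncTemplate("/tmp/template.gotmpl", "new content")
	fmt.Println(changed, err) // false <nil>
}
```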