cortexproject / cortex

A horizontally scalable, highly available, multi-tenant, long term Prometheus.
https://cortexmetrics.io/
Apache License 2.0

Alertmanager template file updated by alertmanager API is removed periodically #5463

Open yk-zheng-zz opened 1 year ago

yk-zheng-zz commented 1 year ago

Describe the bug

We configured Alertmanager with a template, then updated the templates and the Alertmanager configuration through the API. The upload succeeded and I can find the template files in the container, but they are removed and recreated about every 15 seconds, which causes a "template not defined" error:

[alertmanager-1] level=error ts=2023-07-18T06:43:14.625788498Z caller=dispatch.go:352 component=MultiTenantAlertmanager user=test component=dispatcher component=dispatcher msg="Notify for alerts failed" num_alerts=2 err="alert-warning/slack[0]: notify retry canceled due to unrecoverable error after 1 attempts: template: :1:12: executing \"\" at <{{template \"slack.test.title\" .}}>: template \"slack.test.title\" not defined"

Repeatedly listing the tenant's template directory shows the files appearing and disappearing:

/data/test/templates # date; ls -lp
Tue Jul 18 07:05:50 UTC 2023
total 0
/data/test/templates # date; ls -lp
Tue Jul 18 07:05:51 UTC 2023
total 0
/data/test/templates # date; ls -lp
Tue Jul 18 07:05:53 UTC 2023
total 0
/data/test/templates # date; ls -lp
Tue Jul 18 07:05:54 UTC 2023
total 8
-rw-r--r--    1 root     root           862 Jul 18 07:05 test_slack.tmpl
-rw-r--r--    1 root     root           634 Jul 18 07:05 test_email.tmpl
/data/test/templates # date; ls -lp
Tue Jul 18 07:05:58 UTC 2023
total 8
-rw-r--r--    1 root     root           862 Jul 18 07:05 test_slack.tmpl
-rw-r--r--    1 root     root           634 Jul 18 07:05 test_email.tmpl
/data/test/templates # date; ls -lp
Tue Jul 18 07:05:59 UTC 2023
total 8
-rw-r--r--    1 root     root           862 Jul 18 07:05 test_slack.tmpl
-rw-r--r--    1 root     root           634 Jul 18 07:05 test_email.tmpl
/data/test/templates # date; ls -lp
Tue Jul 18 07:06:02 UTC 2023
total 8
-rw-r--r--    1 root     root           862 Jul 18 07:05 test_slack.tmpl
-rw-r--r--    1 root     root           634 Jul 18 07:05 test_email.tmpl
/data/test/templates # date; ls -lp
Tue Jul 18 07:06:08 UTC 2023
total 8
-rw-r--r--    1 root     root           862 Jul 18 07:05 test_slack.tmpl
-rw-r--r--    1 root     root           634 Jul 18 07:05 test_email.tmpl
/data/test/templates # date; ls -lp
Tue Jul 18 07:06:12 UTC 2023
total 0
/data/test/templates # date; ls -lp
Tue Jul 18 07:06:17 UTC 2023
total 0
/data/test/templates # date; ls -lp
Tue Jul 18 07:12:44 UTC 2023
total 0
/data/test/templates #

To Reproduce

Steps to reproduce the behavior:

  1. Start Cortex v1.15.3
  2. Perform operations (read/write/others): N/A

Expected behavior

The template files should persist in the container.

Environment:

Additional Context

Configuration related to Alertmanager in cortex.yaml:

alertmanager:
  enable_api: true
  external_url: /alertmanager
  cluster:
    peers: alertmanager-0.alertmanager:9094,alertmanager-1.alertmanager:9094,alertmanager-2.alertmanager:9094
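
For readers trying to reproduce this, the following is a rough sketch of the kind of payload the report describes uploading through the Alertmanager configuration API (a POST to /api/v1/alerts with the X-Scope-OrgID header set to the tenant, here test). The file name, template name, receiver name, and tenant are taken from the log and directory listing above. The template body, Slack channel, and webhook URL are placeholders, not the reporter's actual configuration.

```yaml
# Sketch only: request body for POST /api/v1/alerts (tenant "test").
# The template body, channel, and api_url below are assumed placeholders.
template_files:
  test_slack.tmpl: |
    {{ define "slack.test.title" }}[{{ .Status }}] {{ .CommonLabels.alertname }}{{ end }}
alertmanager_config: |
  route:
    receiver: alert-warning
  receivers:
    - name: alert-warning
      slack_configs:
        - api_url: https://hooks.slack.example/placeholder
          channel: '#alerts'
          # Fails with "template not defined" when test_slack.tmpl is missing on disk.
          title: '{{ template "slack.test.title" . }}'
  templates:
    - test_slack.tmpl
```

Each name listed under templates has to match a key in template_files. The title field is the reference that fails in the log above whenever the file has been removed from the tenant's templates directory.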

friedrichg commented 1 year ago

Was this working before, on v1.14.1? Or is this the first time you have tried it?

yk-zheng-zz commented 1 year ago

It worked fine on v1.11.0. I didn't try this on other versions before v1.14.1.

ShunjiTakano commented 1 year ago

Hi @friedrichg, I am working with @yk-zheng-zz.

For more context, we were able to resolve the issue by adding the following to the alertmanager config (seen in this issue: https://github.com/cortexproject/cortex-helm-chart/issues/463):

configs:
  alertmanager:
    data_dir: /data
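
For anyone configuring cortex.yaml directly rather than through the Helm chart, the same workaround would presumably look like the sketch below. This assumes the snippet above maps onto the alertmanager.data_dir setting in cortex.yaml; the other values are copied from the configuration in the original report.

```yaml
# Sketch only: the reporter's alertmanager block with an explicit, absolute data_dir.
# Assumption: the Helm setting above corresponds to alertmanager.data_dir here.
alertmanager:
  enable_api: true
  external_url: /alertmanager
  data_dir: /data
  cluster:
    peers: alertmanager-0.alertmanager:9094,alertmanager-1.alertmanager:9094,alertmanager-2.alertmanager:9094
```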