cortexproject / cortex

A horizontally scalable, highly available, multi-tenant, long term Prometheus.
https://cortexmetrics.io/
Apache License 2.0

Alertmanager configuration load with error #4080

Closed: mizunos closed this issue 3 years ago

mizunos commented 3 years ago

Describe the bug
I managed to load an Alertmanager configuration YAML into Cortex running on GKE via a curl command, but I ran into 2 issues. Any help is greatly appreciated.

1 - I found the following errors in the log, which I don't understand:

level=warn ts=2021-04-15T01:50:06.693667893Z caller=multitenant.go:490 component=MultiTenantAlertmanager msg="error while synchronizing alertmanager configs" err="proto: wrong wireType = 7 for field Templates"
level=debug ts=2021-04-15T01:50:09.168065745Z caller=logging.go:66 traceID=741f72798c13fec9 msg="GET /metrics (200) 5.92689ms"
level=info ts=2021-04-15T01:50:16.510737959Z caller=multitenant.go:508 component=MultiTenantAlertmanager msg="synchronizing alertmanager configs for users"
level=debug ts=2021-04-15T01:50:16.575067049Z caller=nflog.go:336 user="0 Content-Type:application/yaml" component=nflog msg="Running maintenance"
level=debug ts=2021-04-15T01:50:16.575166228Z caller=nflog.go:338 user="0 Content-Type:application/yaml" component=nflog msg="Maintenance done" duration=109.645µs size=0
level=error ts=2021-04-15T01:50:16.575191109Z caller=nflog.go:365 user="0 Content-Type:application/yaml" component=nflog msg="Running maintenance failed" err="open data/nflog:0 Content-Type:application/yaml.1c3c21ac7b03829: no such file or directory"

2 - When I try to get the config back with GET /api/v1/alerts, nothing comes back even though the logs show the configuration is there:

 http http://alertmanager.monitoring.svc.cluster.local/api/v1/alerts X-Scope-OrgID:0 Content-Type:application/yaml
HTTP/1.1 404 Not Found
Content-Length: 30
Content-Type: text/plain; charset=utf-8
Date: Thu, 15 Apr 2021 01:57:01 GMT
X-Content-Type-Options: nosniff

alertmanager config not found

The above was run from a pod in the same cluster where the alertmanager resides, using the httpie tool (similar to curl).
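
For reference, the equivalent curl request passes each header with its own -H flag (a sketch, reusing the URL and tenant ID from the command above):

curl -v \
  -H "X-Scope-OrgID: 0" \
  -H "Content-Type: application/yaml" \
  http://alertmanager.monitoring.svc.cluster.local/api/v1/alerts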

Environment:

Storage Engine

Additional info

Alertmanager config file in YAML:

template_files:
  default_template: |
    {{ define "__alertmanager" }}AlertManager{{ end }}
    {{ define "__alertmanagerURL" }}{{ .ExternalURL }}/#/alerts?receiver={{ .Receiver | urlquery }}{{ end }}
alertmanager_config: |
  global:
    resolve_timeout: 10m
  templates:
    - 'default_template'
  route:
    group_wait: 10s
    group_interval: 1m 
    repeat_interval: 10m
    receiver: slack_receiver
    routes:
      - receiver: "slack_receiver"
        match_re:
          severity: critical|warning
        continue: true

  receivers:
  - name: slack_receiver
    slack_configs:
    - send_resolved: true
      api_url: '<>'
      channel: '#test-gw-alert'
      text: "{{ range .Alerts }}<!channel> {{ .Annotations.summary }}\n{{ .Annotations.description }}\n{{ end }}"

K8s alertmanager deployment

        args:
        - -target=alertmanager
        - -log.level=debug
        - -server.http-listen-port=80
        - -alertmanager.storage.retention=240h
        - -alertmanager.configs.poll-interval=10s
        - -alertmanager.configs.url=http://configs.monitoring.svc.cluster.local:80
        - -alertmanager.web.external-url=/api/prom/alertmanager
        - -experimental.alertmanager.enable-api=true
        - -alertmanager.sharding-ring.store=consul
        - -alertmanager.sharding-ring.consul.hostname=consul.monitoring.svc.cluster.local:8500
        - -alertmanager.sharding-enabled=true
        - -alertmanager.sharding-ring.prefix=alertmanagers/
        - -alertmanager.storage.type=gcs
        - -alertmanager.storage.gcs.bucketname=edge-monitor-block-storage
        - -alertmanager.storage.path=data/
        - -experimental.alertmanager.enable-api=true
pstibrany commented 3 years ago

level=error ts=2021-04-15T01:50:16.575191109Z caller=nflog.go:365 user="0 Content-Type:application/yaml" component=nflog msg="Running maintenance failed" err="open data/nflog:0 Content-Type:application/yaml.1c3c21ac7b03829: no such file or directory"

Is this user correct? It shows 0 Content-Type:application/yaml.

mizunos commented 3 years ago

We have only 1 user on the system right now, representing everybody we are working with, so user 0 is the first one. We did not assign a user ID ourselves; that is what Cortex reported on all services, so I assumed that would be it.

mizunos commented 3 years ago

I deleted all configuration in storage and re-deployed alertmanager, this time loading the configuration with cortextool instead of curl; something must have gone wrong when I was using curl to load it. I was able to get it to take the config for the correct user ID 0, etc., but then after a while the log started to show another error:

alertmanager-55d8564b6f-qp4nf:level=debug ts=2021-04-16T20:10:22.735592697Z caller=logging.go:66 traceID=61340d26e9f2ff16 msg="GET /metrics (200) 6.244629ms" 

alertmanager-55d8564b6f-6wfsh:level=info ts=2021-04-16T20:10:28.245752724Z caller=multitenant.go:508 component=MultiTenantAlertmanager msg="synchronizing alertmanager configs for users"

alertmanager-55d8564b6f-6wfsh:level=warn ts=2021-04-16T20:10:28.366973317Z caller=multitenant.go:490 component=MultiTenantAlertmanager msg="error while synchronizing alertmanager configs" err="proto: wrong wireType = 7 for field Templates" 

alertmanager-55d8564b6f-b8wcz:level=info ts=2021-04-16T20:10:29.112359299Z caller=multitenant.go:508 component=MultiTenantAlertmanager msg="synchronizing alertmanager configs for users"                                                                                           

alertmanager-55d8564b6f-b8wcz:level=warn ts=2021-04-16T20:10:29.181200533Z caller=multitenant.go:490 component=MultiTenantAlertmanager msg="error while synchronizing alertmanager configs" err="proto: wrong wireType = 7 for field Templates"

I deployed 3 alertmanagers with Consul as the KV store. I'm not sure what the issue is, or whether it is an issue at all, since this is logged as a warning.
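
For reference, the cortextool load was along these lines (a sketch; cortextool alertmanager load takes a plain Alertmanager config plus optional template files, and the file names here are placeholders):

cortextool alertmanager load ./alertmanager.yaml ./default_template.tmpl \
  --address=http://alertmanager.monitoring.svc.cluster.local \
  --id=0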

pracucci commented 3 years ago

We have only 1 user on the system right now, representing everybody we are working with, so user 0 is the first one. We did not assign a user ID ourselves; that is what Cortex reported on all services, so I assumed that would be it.

The user ID we see in the logs is 0 Content-Type:application/yaml. It looks like Content-Type:application/yaml somehow slipped into the X-Scope-OrgID HTTP request header.
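
That kind of value typically results from passing both pieces inside a single curl -H flag when the config was loaded; a sketch of the likely mistake versus the intended form (the original load command is not shown in this issue):

# everything after "X-Scope-OrgID:" becomes part of the tenant ID
curl -H "X-Scope-OrgID: 0 Content-Type:application/yaml" ...

# one -H flag per header
curl -H "X-Scope-OrgID: 0" -H "Content-Type: application/yaml" ...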

pracucci commented 3 years ago

then after a while the log started to show another error

As discussed in #4093, please make sure to store blocks in a bucket different from the one used for the rules and alertmanager configs. It would be great if you could open a PR to improve the docs!
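
For example (a sketch; bucket names are placeholders), the blocks-storage flags on the components that handle blocks and the alertmanager storage flags should point at different buckets:

        # e.g. on ingesters / store-gateways / queriers
        - -blocks-storage.backend=gcs
        - -blocks-storage.gcs.bucket-name=example-blocks-bucket

        # on the alertmanager (matching the deployment args above)
        - -alertmanager.storage.type=gcs
        - -alertmanager.storage.gcs.bucketname=example-alertmanager-config-bucket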

mizunos commented 3 years ago

OK, I will update the docs about the Alertmanager; glad to help. I also have a set of K8s deployment files for a block storage configuration that works for me (at least so far) on GCP. I'm open to sharing it if it helps everybody else. You may be able to spot where my deployment causes issues.

pracucci commented 3 years ago

OK, I will update the docs about the Alertmanager; glad to help

Thanks!