cortexproject / cortex

A horizontally scalable, highly available, multi-tenant, long term Prometheus.
https://cortexmetrics.io/
Apache License 2.0

Error sending alert: bad response status 422 Unprocessable Entity #6053

Open mousimin opened 2 months ago

mousimin commented 2 months ago

Describe the bug We are running Cortex in microservices mode. We previously used the v1 Alertmanager API by setting the flag -ruler.alertmanager-use-v2=false (on Cortex v1.16.0). After upgrading Cortex to v1.17.1, the logs show that the v2 Alertmanager API is now used. When I create alert rules, the alerts fire, but we never receive any email notification, and the ruler logs errors like:

caller=notifier.go:544 level=error user=Test alertmanager=https://cortex-alertmanager.org/alertmanager/api/v2/alerts count=1 msg="Error sending alert" err="bad response status 422 Unprocessable Entity"

To Reproduce Steps to reproduce the behavior:

  1. Start Cortex (SHA or version): start Cortex v1.17.1 in microservices mode
  2. Perform Operations (Read/Write/Others): create an alert rule and observe the ruler logs

Expected behavior Notifications should be delivered and no error should be logged.

Environment:

Additional Context Configuration file for the Cortex ruler:

ExecStart=/usr/sbin/cortex-1.17.1 \
  -auth.enabled=true \
  -log.level=info \
  -config.file=/etc/cortex-ruler/cortex-ruler.yaml \
  -runtime-config.file=/etc/cortex-shared/cortex-runtime.yaml \
  -server.http-listen-port=8061 \
  -server.grpc-listen-port=9061 \
  -server.grpc-max-recv-msg-size-bytes=104857600 \
  -server.grpc-max-send-msg-size-bytes=104857600 \
  -server.grpc-max-concurrent-streams=1000 \
  \
  -distributor.sharding-strategy=shuffle-sharding \
  -distributor.ingestion-tenant-shard-size=12 \
  -distributor.replication-factor=2 \
  -distributor.shard-by-all-labels=true \
  -distributor.zone-awareness-enabled=true \
  \
  -store.engine=blocks \
  -blocks-storage.backend=s3 \
  -blocks-storage.s3.endpoint=s3.org:10444 \
  -blocks-storage.s3.bucket-name=staging-metrics \
  -blocks-storage.s3.insecure=false \
  \
  -blocks-storage.bucket-store.sync-dir=/local/cortex-ruler/tsdb-sync \
  -blocks-storage.bucket-store.metadata-cache.backend=memcached \
  -blocks-storage.bucket-store.metadata-cache.memcached.addresses=100.76.51.1:11211,100.76.51.2:11211,100.76.51.3:11211 \
  \
  -querier.active-query-tracker-dir=/local/cortex-ruler/active-query-tracker \
  -querier.ingester-streaming=true \
  -querier.query-store-after=23h \
  -querier.query-ingesters-within=24h \
  -querier.shuffle-sharding-ingesters-lookback-period=25h \
  \
  -store-gateway.sharding-enabled=true \
  -store-gateway.sharding-strategy=shuffle-sharding \
  -store-gateway.tenant-shard-size=6 \
  -store-gateway.sharding-ring.store=etcd \
  -store-gateway.sharding-ring.etcd.endpoints=10.120.121.1:2379 \
  -store-gateway.sharding-ring.etcd.endpoints=10.120.121.2:2379 \
  -store-gateway.sharding-ring.etcd.endpoints=10.120.121.3:2379 \
  -store-gateway.sharding-ring.etcd.endpoints=10.120.121.4:2379 \
  -store-gateway.sharding-ring.etcd.endpoints=10.120.121.5:2379 \
  -store-gateway.sharding-ring.prefix=cortex-store-gateways/ \
  -store-gateway.sharding-ring.replication-factor=2 \
  -store-gateway.sharding-ring.zone-awareness-enabled=true \
  -store-gateway.sharding-ring.instance-availability-zone=t1 \
  -store-gateway.sharding-ring.wait-stability-min-duration=1m \
  -store-gateway.sharding-ring.wait-stability-max-duration=5m \
  -store-gateway.sharding-ring.instance-addr=100.76.75.1 \
  -store-gateway.sharding-ring.instance-id=s_8061 \
  -store-gateway.sharding-ring.heartbeat-period=15s \
  -store-gateway.sharding-ring.heartbeat-timeout=1m \
  \
  -ring.store=etcd \
  -ring.prefix=cortex-ingesters/ \
  -ring.heartbeat-timeout=1m \
  -etcd.endpoints=10.120.119.1:2379 \
  -etcd.endpoints=10.120.119.2:2379 \
  -etcd.endpoints=10.120.119.3:2379 \
  -etcd.endpoints=10.120.119.4:2379 \
  -etcd.endpoints=10.120.119.5:2379 \
  \
  -ruler.enable-sharding=true \
  -ruler.sharding-strategy=shuffle-sharding \
  -ruler.tenant-shard-size=2 \
  -ruler.ring.store=etcd \
  -ruler.ring.prefix=cortex-rulers/ \
  -ruler.ring.num-tokens=32 \
  -ruler.ring.heartbeat-period=15s \
  -ruler.ring.heartbeat-timeout=1m \
  -ruler.ring.etcd.endpoints=10.120.119.1:2379 \
  -ruler.ring.etcd.endpoints=10.120.119.2:2379 \
  -ruler.ring.etcd.endpoints=10.120.119.3:2379 \
  -ruler.ring.etcd.endpoints=10.120.119.4:2379 \
  -ruler.ring.etcd.endpoints=10.120.119.5:2379 \
  -ruler.ring.instance-id=s_8061 \
  -ruler.ring.instance-interface-names=e1 \
  \
  -ruler.max-rules-per-rule-group=500 \
  -ruler.max-rule-groups-per-tenant=5000 \
  \
  -ruler.external.url=staging-cortex-ruler.org \
  -ruler.client.grpc-max-recv-msg-size=104857600 \
  -ruler.client.grpc-max-send-msg-size=16777216 \
  -ruler.client.grpc-compression= \
  -ruler.client.grpc-client-rate-limit=0 \
  -ruler.client.grpc-client-rate-limit-burst=0 \
  -ruler.client.backoff-on-ratelimits=false \
  -ruler.client.backoff-min-period=500ms \
  -ruler.client.backoff-max-period=10s \
  -ruler.client.backoff-retries=5 \
  -ruler.evaluation-interval=15s \
  -ruler.poll-interval=15s \
  -ruler.rule-path=/local/cortex-ruler/rules \
  -ruler.alertmanager-url=https://staging-cortex-alertmanager.org/alertmanager \
  -ruler.alertmanager-discovery=false \
  -ruler.alertmanager-refresh-interval=1m \
  -ruler.notification-queue-capacity=10000 \
  -ruler.notification-timeout=10s \
  -ruler.flush-period=1m \
  -experimental.ruler.enable-api=true \
  \
  -ruler-storage.backend=s3 \
  -ruler-storage.s3.endpoint=s3.org:10444 \
  -ruler-storage.s3.bucket-name=staging-rules \
  -ruler-storage.s3.insecure=false \
  \
  -target=ruler

Configuration file for the Cortex Alertmanager:

ExecStart=/usr/sbin/cortex-1.17.1 \
  -auth.enabled=true \
  -log.level=info \
  -config.file=/etc/cortex-alertmanager-8071/cortex-alertmanager.yaml \
  -runtime-config.file=/etc/cortex-shared/cortex-runtime.yaml \
  -server.http-listen-port=8071 \
  -server.grpc-listen-port=9071 \
  -server.grpc-max-recv-msg-size-bytes=104857600 \
  -server.grpc-max-send-msg-size-bytes=104857600 \
  -server.grpc-max-concurrent-streams=1000 \
  \
  -alertmanager.storage.path=/local/cortex-alertmanager-8071/data \
  -alertmanager.storage.retention=120h \
  -alertmanager.web.external-url=https://staging-cortex-alertmanager.org/alertmanager \
  -alertmanager.configs.poll-interval=1m \
  -experimental.alertmanager.enable-api=true \
  \
  -alertmanager.sharding-enabled=true \
  -alertmanager.sharding-ring.store=etcd \
  -alertmanager.sharding-ring.prefix=cortex-alertmanagers/ \
  -alertmanager.sharding-ring.heartbeat-period=15s \
  -alertmanager.sharding-ring.heartbeat-timeout=1m \
  -alertmanager.sharding-ring.etcd.endpoints=10.120.121.1:2379 \
  -alertmanager.sharding-ring.etcd.endpoints=10.120.121.2:2379 \
  -alertmanager.sharding-ring.etcd.endpoints=10.120.121.3:2379 \
  -alertmanager.sharding-ring.etcd.endpoints=10.120.121.4:2379 \
  -alertmanager.sharding-ring.etcd.endpoints=10.120.121.5:2379 \
  -alertmanager.sharding-ring.instance-id=b_8071 \
  -alertmanager.sharding-ring.instance-interface-names=e1 \
  -alertmanager.sharding-ring.replication-factor=2 \
  -alertmanager.sharding-ring.zone-awareness-enabled=true \
  -alertmanager.sharding-ring.instance-availability-zone=t1 \
  \
  -alertmanager-storage.backend=s3 \
  -alertmanager-storage.s3.endpoint=s3.org:10444 \
  -alertmanager-storage.s3.bucket-name=staging-alerts \
  -alertmanager-storage.s3.insecure=false \
  \
  -alertmanager.receivers-firewall-block-cidr-networks=10.163.131.164/28,10.163.131.180/28 \
  -alertmanager.receivers-firewall-block-private-addresses=true \
  -alertmanager.notification-rate-limit=0 \
  -alertmanager.max-config-size-bytes=0 \
  -alertmanager.max-templates-count=0 \
  -alertmanager.max-template-size-bytes=0 \
  \
  -target=alertmanager

The Alertmanager configuration:

template_files:
  default_template: |
    {{ define "__alertmanager" }}AlertManager{{ end }}
    {{ define "__alertmanagerURL" }}{{ .ExternalURL }}/#/alerts?receiver={{ .Receiver | urlquery }}{{ end }}
alertmanager_config: |
  global:
    smtp_smarthost: 'yourmailhost'
    smtp_from: 'youraddress'
    smtp_require_tls: false
  templates:
    - 'default_template'
  route:
    receiver: example-email
  receivers:
    - name: example-email
      email_configs:
      - to: 'youraddress'
mousimin commented 1 month ago

Hi @friedrichg & @yeya24, I guess the error message "bad response status 422 Unprocessable Entity" comes from the Alertmanager, right? But I couldn't find any error log in the Alertmanager even with the debug log level; any suggestions would be appreciated!

mousimin commented 1 month ago

I want to answer my own question so that others can refer to it. I manually sent the HTTP request with curl and got the detailed response from the Alertmanager:

maxFailure (quorum) on a given error family, rpc error: code = Code(422) desc = addr=10.120.131.81:9071 state=ACTIVE zone=z1, rpc error: code = Code(422) desc = {"code":601,"message":"0.generatorURL in body must be of type uri: \"staging-cortex-ruler.org/graph?g0.expr=up%7Bapp%3D%22cert-manager%22%7D+%3E+0\u0026g0.tab=1\""}

The generatorURL is rejected because it has no scheme, so I added "https://" at the beginning of the -ruler.external.url value and then it worked.
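For anyone hitting the same issue, a curl request along these lines (the payload and tenant name are illustrative, and the hostnames are the ones from the config above, not the exact request) should surface the same detailed validation error from the Alertmanager API:

curl -s -X POST 'https://staging-cortex-alertmanager.org/alertmanager/api/v2/alerts' \
  -H 'Content-Type: application/json' \
  -H 'X-Scope-OrgID: Test' \
  -d '[{"labels":{"alertname":"test"},"generatorURL":"staging-cortex-ruler.org/graph?g0.tab=1"}]'

And the fix is just adding the scheme to the ruler flag:

  -ruler.external.url=https://staging-cortex-ruler.org \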

Mapping this to the code (sendOne in notifier.go):

func (n *Manager) sendOne(ctx context.Context, c *http.Client, url string, b []byte) error {
    req, err := http.NewRequest(http.MethodPost, url, bytes.NewReader(b))
    if err != nil {
        return err
    }
    req.Header.Set("User-Agent", userAgent)
    req.Header.Set("Content-Type", contentTypeJSON)
    resp, err := n.opts.Do(ctx, c, req)
    if err != nil {
        return err
    }
    defer func() {
        io.Copy(io.Discard, resp.Body)
        resp.Body.Close()
    }()

    // Any HTTP status 2xx is OK.
    //nolint:usestdlibvars
    if resp.StatusCode/100 != 2 {
        return fmt.Errorf("bad response status %s", resp.Status)
    }

    return nil
}

Maybe we should include the response body in the error message as well? Currently we only include the status, which makes debugging difficult.
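A rough sketch of what that could look like (the helper name and the 1 KiB limit are only illustrative, not the actual implementation; it assumes "fmt", "io", "net/http", and "strings" are imported):

// errWithBody is a hypothetical helper: it folds a truncated copy of the
// response body into the returned error, so the Alertmanager's validation
// message (e.g. the 422 details above) would show up in the ruler logs.
func errWithBody(resp *http.Response) error {
    const maxErrBody = 1024 // illustrative truncation limit
    body, err := io.ReadAll(io.LimitReader(resp.Body, maxErrBody))
    if err != nil || len(body) == 0 {
        return fmt.Errorf("bad response status %s", resp.Status)
    }
    return fmt.Errorf("bad response status %s: %s", resp.Status, strings.TrimSpace(string(body)))
}

sendOne could then return errWithBody(resp) instead of the current fmt.Errorf when the status is not 2xx; the deferred drain and close of resp.Body would still run as before.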

rapphil commented 1 month ago

@friedrichg @yeya24 should we go ahead and start logging the body of the response? It makes sense IMHO.

yeya24 commented 1 month ago

@rapphil Agree. Would you like to work on it? I just want to make sure the AM doesn't send something crazy in the response body. Maybe we can truncate the message with a limit.