canonical / alertmanager-k8s-operator

https://charmhub.io/alertmanager-k8s
Apache License 2.0

Datadog CA cert not available #245

Closed slapcat closed 5 months ago

slapcat commented 5 months ago

Bug Description

When using the datadog receiver, it returns an error about unrecognized certificate authority:

2024-04-26T08:06:37.170Z [alertmanager] ts=2024-04-26T08:06:37.170Z caller=notify.go:732 level=warn component=dispatcher receiver=datadog integration=webhook[0] msg="Notify attempt failed, will retry later" attempts=1 err="Post \"https://app.datadoghq.eu/intake/webhook/prometheus?api_key=<API_KEY>\": x509: certificate signed by unknown authority"

To Reproduce

  1. juju deploy cos-lite --trust
  2. Create config file:
    cat > /home/ubuntu/alertmanager.yml <<EOF
    receivers:
    - name: datadog
      webhook_configs:
      - send_resolved: true
        url: https://app.datadoghq.eu/intake/webhook/prometheus?api_key=<API_KEY>
    route:
      group_by: ['alertname']
      group_wait: 10s
      group_interval: 5m
      receiver: datadog
      repeat_interval: 3h
    EOF
  3. juju config alertmanager config_file="@/home/ubuntu/alertmanager.yml"
  4. kubectl logs -n cos alertmanager-0 -c alertmanager
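To pick the failing receiver out of a long log, the output of step 4 can be filtered for the x509 error. This is a minimal sketch; `x509_failures` is a hypothetical helper name, not part of the charm or of Alertmanager, and it assumes the log line format shown in the bug description.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the charm): reads Alertmanager log lines
# on stdin and prints "<receiver> <integration>" for every notification that
# failed because the server certificate chain was not trusted.
x509_failures() {
  sed -n '/x509: certificate signed by unknown authority/s/.*receiver=\([^ ]*\) integration=\([^ ]*\) .*/\1 \2/p'
}
```

Usage, with the namespace and pod name from step 4: `kubectl logs -n cos alertmanager-0 -c alertmanager | x509_failures`.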

Environment

Model  Controller           Cloud/Region                      Version  SLA          Timestamp
cos    microk8s-controller  snapped-microk8s_cloud/localhost  3.3.4    unsupported  12:26:11Z

App                              Version  Status  Scale  Charm                         Channel        Rev  Address         Exposed  Message
alertmanager                     0.25.0   active      1  alertmanager-k8s              latest/stable   96  10.152.183.148  no
ca                                        active      1  self-signed-certificates      latest/edge     80  10.152.183.230  no
catalogue                                 active      1  catalogue-k8s                 latest/edge     34  10.152.183.20   no
external-ca                               active      1  self-signed-certificates      latest/edge     80  10.152.183.74   no
grafana                          9.5.3    active      1  grafana-k8s                   latest/stable  105  10.152.183.226  no
loki                             2.9.4    active      1  loki-k8s                      latest/edge    123  10.152.183.232  no
prometheus                       2.49.1   active      1  prometheus-k8s                latest/stable  170  10.152.183.37   no
scrape-interval-config-metrics   n/a      active      1  prometheus-scrape-config-k8s  latest/stable   44  10.152.183.251  no
scrape-interval-config-monitors  n/a      active      1  prometheus-scrape-config-k8s  latest/stable   44  10.152.183.146  no
traefik                          2.10.5   active      1  traefik-k8s                   latest/stable  170  10.4.26.228     no

Unit                                Workload  Agent      Address      Ports  Message
alertmanager/0*                     active    idle       10.1.35.180
ca/0*                               active    idle       10.1.151.77
catalogue/0*                        active    idle       10.1.35.179
external-ca/0*                      active    idle       10.1.151.80
grafana/0*                          active    idle       10.1.35.183
loki/0*                             active    idle       10.1.151.82
prometheus/0*                       active    executing  10.1.151.84
scrape-interval-config-metrics/0*   active    idle       10.1.35.181
scrape-interval-config-monitors/0*  active    idle       10.1.35.182
traefik/0*                          active    idle       10.1.35.184

Offer                            Application                      Charm                         Rev  Connected  Endpoint                  Interface                Role
alertmanager                     alertmanager                     alertmanager-k8s              96   0/0        karma-dashboard           karma_dashboard          provider
grafana                          grafana                          grafana-k8s                   105  6/6        grafana-dashboard         grafana_dashboard        requirer
loki                             loki                             loki-k8s                      123  5/5        logging                   loki_push_api            provider
prometheus                       prometheus                       prometheus-k8s                170  6/6        metrics-endpoint          prometheus_scrape        requirer
                                                                                                                receive-remote-write      prometheus_remote_write  provider
scrape-interval-config-metrics   scrape-interval-config-metrics   prometheus-scrape-config-k8s  44   1/1        configurable-scrape-jobs  prometheus_scrape        requirer
scrape-interval-config-monitors  scrape-interval-config-monitors  prometheus-scrape-config-k8s  44   1/1        configurable-scrape-jobs  prometheus_scrape        requirer

Relevant log output

2024-04-26T08:06:37.170Z [alertmanager] ts=2024-04-26T08:06:37.170Z caller=notify.go:732 level=warn component=dispatcher receiver=datadog integration=webhook[0] msg="Notify attempt failed, will retry later" attempts=1 err="Post \"https://app.datadoghq.eu/intake/webhook/prometheus?api_key=<API_KEY>\": x509: certificate signed by unknown authority"

Additional context

No response

sed-i commented 5 months ago

At first glance this is odd, because the alertmanager rock has root certs.

After installing curl in the alertmanager workload container, curl https://charmhub.io (an HTTPS endpoint) works fine without --insecure.

Also, both of the following pass verification too from within the workload container:

echo | openssl s_client -strict -verify_return_error -connect charmhub.io:443 || echo "failed"
echo | openssl s_client -strict -verify_return_error -connect app.datadoghq.eu:443 || echo "failed"

According to user accounts (1, 2), alertmanager should be able to talk over TLS.

@slapcat, would you be able to confirm that:

  1. The image in use indeed has certs in place?

    $ juju ssh --container alertmanager am/0 ls -1 /etc/ssl/certs/ | wc -l
    275
  2. Cert validation works from within the workload container?

    $ juju ssh --container alertmanager am/0 bash -c "echo | openssl s_client -strict -verify_return_error -connect app.datadoghq.eu:443" | grep -i verif
    verify return:1
    verify return:1
    verify return:1
    Verification: OK
    Verify return code: 0 (ok)
  3. Which revision of alertmanager is in use? juju status --format=json | jq '.applications.am."charm-rev"'
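If it helps, check 2 can be turned into a pass/fail result instead of reading the grep output by eye. A minimal sketch, assuming the openssl output format shown above; `chain_verified` is a hypothetical helper name, not something shipped in the rock.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `openssl s_client` output on stdin and succeeds
# only if openssl reported a fully verified certificate chain.
chain_verified() {
  grep -q 'Verify return code: 0 (ok)'
}

# Example (run manually; substitute your own unit name, e.g. alertmanager/0):
#   juju ssh --container alertmanager am/0 bash -c \
#     "echo | openssl s_client -strict -verify_return_error -connect app.datadoghq.eu:443" \
#     | chain_verified && echo "trusted" || echo "untrusted"
```

A non-zero exit here would point at the trust store inside the container rather than at the Alertmanager config.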

sed-i commented 5 months ago

OK, from your env I see alertmanager 0.25, charm-rev 96.

@slapcat would you be able to try with a newer revision? The current stable is rev106 and should include the certs fix. @lucabello will soon start the charm promotion train so there should be an even newer stable soon.
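A rough way to check whether a deployment predates that revision, as a sketch only: `needs_refresh` and `FIXED_REV` are hypothetical names, 106 is taken from the comment above, and the jq path assumes the application is named alertmanager as in the status output.

```shell
#!/usr/bin/env bash
# Assumption: rev 106 is the first stable revision expected to carry the
# certs fix, per the comment above.
FIXED_REV=106

# Hypothetical helper: succeeds if the given charm revision is older than
# FIXED_REV, i.e. a refresh is worth trying.
needs_refresh() {
  [ "$1" -lt "$FIXED_REV" ]
}

# Example (run manually):
#   rev=$(juju status --format=json | jq '.applications.alertmanager."charm-rev"')
#   needs_refresh "$rev" && juju refresh alertmanager
```

With the rev 96 from the reported environment, `needs_refresh 96` succeeds, so a `juju refresh` would be the next step.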

sed-i commented 5 months ago

Closing for now. Feel free to reopen if this shows up in rev106 or newer!

slapcat commented 5 months ago

That fixed it, thanks!