canonical / cos-lite-bundle

https://charmhub.io/cos-lite
Apache License 2.0

Integration tests keep failing because prometheus fails to scrape grafana #91

Open sed-i opened 9 months ago

sed-i commented 9 months ago

Description

Integration tests keep failing on

https://github.com/canonical/cos-lite-bundle/blob/83a5062d0b87aa8a301d990a34acf1055b60a21d/tests/integration/test_bundle.py#L248-L250

AssertionError: assert {'up', 'down'} == {'up'}
  Extra items in the left set:
  'down'
  Full diff:
  - {'up'}
  + {'up', 'down'}

because prometheus fails to scrape grafana:

      {
        "discoveredLabels": {
          "__address__": "grafana-0.grafana-endpoints.test-bundle-8n1r.svc.cluster.local:3000",
          "__metrics_path__": "/metrics",
          "__scheme__": "https",
          "__scrape_interval__": "1m",
          "__scrape_timeout__": "10s",
          "job": "juju_test-bundle-8n1r_e252fd59_grafana_prometheus_scrape",
          "juju_application": "grafana",
          "juju_charm": "grafana-k8s",
          "juju_model": "test-bundle-8n1r",
          "juju_model_uuid": "e252fd59-3737-4887-80af-f8a9c426125a"
        },
        "labels": {
          "instance": "test-bundle-8n1r_e252fd59-3737-4887-80af-f8a9c426125a_grafana",
          "job": "juju_test-bundle-8n1r_e252fd59_grafana_prometheus_scrape",
          "juju_application": "grafana",
          "juju_charm": "grafana-k8s",
          "juju_model": "test-bundle-8n1r",
          "juju_model_uuid": "e252fd59-3737-4887-80af-f8a9c426125a"
        },
        "scrapePool": "juju_test-bundle-8n1r_e252fd59_grafana_prometheus_scrape",
        "scrapeUrl": "https://grafana-0.grafana-endpoints.test-bundle-8n1r.svc.cluster.local:3000/metrics",
        "globalUrl": "https://grafana-0.grafana-endpoints.test-bundle-8n1r.svc.cluster.local:3000/metrics",
        "lastError": "Get \"https://10.43.8.206/test-bundle-8n1r-grafana/metrics\": tls: failed to verify certificate: x509: certificate signed by unknown authority",
        "lastScrape": "2024-01-16T18:37:16.119998201Z",
        "lastScrapeDuration": 0.005085648,
        "health": "down",
        "scrapeInterval": "1m",
        "scrapeTimeout": "10s"
      },

Potential issue

There's a TLS error: "tls: failed to verify certificate: x509: certificate signed by unknown authority". All scrape targets in the test are behind TLS, but only grafana fails:

$ curl -sk https://10.1.166.115:9090/api/v1/targets | jq | grep https
          "__scheme__": "https",
        "scrapeUrl": "https://alertmanager-0.alertmanager-endpoints.test-bundle-8n1r.svc.cluster.local:9093/metrics",
        "globalUrl": "https://alertmanager-0.alertmanager-endpoints.test-bundle-8n1r.svc.cluster.local:9093/metrics",
          "__scheme__": "https",
        "scrapeUrl": "https://grafana-0.grafana-endpoints.test-bundle-8n1r.svc.cluster.local:3000/metrics",
        "globalUrl": "https://grafana-0.grafana-endpoints.test-bundle-8n1r.svc.cluster.local:3000/metrics",
        "lastError": "Get \"https://10.43.8.206/test-bundle-8n1r-grafana/metrics\": tls: failed to verify certificate: x509: certificate signed by unknown authority",
          "__scheme__": "https",
        "scrapeUrl": "https://loki-0.loki-endpoints.test-bundle-8n1r.svc.cluster.local:3100/metrics",
        "globalUrl": "https://loki-0.loki-endpoints.test-bundle-8n1r.svc.cluster.local:3100/metrics",
          "__scheme__": "https",
        "scrapeUrl": "https://prometheus-0.prometheus-endpoints.test-bundle-8n1r.svc.cluster.local:9090/metrics",
        "globalUrl": "https://prometheus-0.prometheus-endpoints.test-bundle-8n1r.svc.cluster.local:9090/metrics",
$ curl -sk https://10.1.166.115:9090/api/v1/targets | jq | grep health
        "health": "up",
        "health": "up",
        "health": "up",
        "health": "down",
        "health": "up",
        "health": "up",
        "health": "up",
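
The two grep outputs above lose the association between a target and its health. A minimal Python sketch that pairs them is shown here; it assumes the standard Prometheus /api/v1/targets response shape (data.activeTargets), and the helper names (fetch_targets, summarize_targets, health_states) are hypothetical, not part of the test suite. The sample data is trimmed from the output above.

```python
import json
import ssl
import urllib.request


def fetch_targets(base_url):
    """Fetch /api/v1/targets without verifying the server cert (like curl -k)."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    with urllib.request.urlopen(f"{base_url}/api/v1/targets", context=ctx) as resp:
        return json.load(resp)


def summarize_targets(payload):
    """Return (scrapeUrl, health, lastError) for each active target."""
    targets = payload.get("data", {}).get("activeTargets", [])
    return [
        (t.get("scrapeUrl", ""), t.get("health", "unknown"), t.get("lastError", ""))
        for t in targets
    ]


def health_states(payload):
    """Return the set of distinct health values, as compared in the failing assertion."""
    return {health for _, health, _ in summarize_targets(payload)}


# Trimmed sample shaped like the API output quoted in this issue.
sample = {
    "status": "success",
    "data": {
        "activeTargets": [
            {
                "scrapeUrl": "https://loki-0.loki-endpoints.test-bundle-8n1r.svc.cluster.local:3100/metrics",
                "health": "up",
                "lastError": "",
            },
            {
                "scrapeUrl": "https://grafana-0.grafana-endpoints.test-bundle-8n1r.svc.cluster.local:3000/metrics",
                "health": "down",
                "lastError": 'Get "https://10.43.8.206/test-bundle-8n1r-grafana/metrics": '
                "tls: failed to verify certificate: x509: certificate signed by unknown authority",
            },
        ]
    },
}

# Print only the unhealthy targets together with their error.
for url, health, error in summarize_targets(sample):
    if health != "up":
        print(f"{health}: {url}\n  {error}")
```

Against a live deployment this could be called as summarize_targets(fetch_targets("https://10.1.166.115:9090")), using the address from the curl commands above.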

Perhaps this is related to the grafana 9 vs grafana 10 ingress+redirect issue. We could retry after the grafana 9.5.3 rock is published by oci-factory and the grafana metadata is updated to point to it.
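
One observation that fits the redirect hypothesis: the grafana target's lastError names https://10.43.8.206/test-bundle-8n1r-grafana/metrics, not the scrapeUrl on port 3000. A rough Python sketch for flagging such mismatches follows. It assumes lastError starts with the Get "URL" prefix seen above, and the function names are hypothetical. A mismatch only suggests that the request was redirected; it does not prove it.

```python
import re
from urllib.parse import urlsplit

# Matches the leading 'Get "<url>"' part of the error string quoted in this issue.
_GET_ERROR = re.compile(r'^Get "([^"]+)"')


def failed_request_url(last_error):
    """Extract the URL named in a lastError string, or None if there is none."""
    match = _GET_ERROR.match(last_error)
    return match.group(1) if match else None


def scrape_was_redirected(target):
    """Return True if lastError names a different host, port or path than scrapeUrl."""
    failed = failed_request_url(target.get("lastError", ""))
    if failed is None:
        return False
    wanted, actual = urlsplit(target["scrapeUrl"]), urlsplit(failed)
    return (wanted.hostname, wanted.port, wanted.path) != (
        actual.hostname,
        actual.port,
        actual.path,
    )


# Values copied from the failing grafana target in this issue.
grafana_target = {
    "scrapeUrl": "https://grafana-0.grafana-endpoints.test-bundle-8n1r.svc.cluster.local:3000/metrics",
    "lastError": 'Get "https://10.43.8.206/test-bundle-8n1r-grafana/metrics": '
    "tls: failed to verify certificate: x509: certificate signed by unknown authority",
}

print(failed_request_url(grafana_target["lastError"]))
print(scrape_was_redirected(grafana_target))
```

If the mismatch is indeed caused by a redirect to the ingress address, the certificate being verified would be the one served there rather than the one served by the grafana unit, which might explain why only this target fails.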

lucabello commented 1 week ago

We're not sure whether this still happens; we should examine the latest runs to verify.