canonical / prometheus-k8s-operator

https://charmhub.io/prometheus-k8s
Apache License 2.0

Scrape targets disappear after reprovisioning #414

Closed (sed-i closed this issue 1 year ago)

sed-i commented 1 year ago

Bug Description

I had a load test running for a while when something went wrong and all the charms went through start -> pebble-ready again.

unit-grafana-0: 2022-12-08 01:05:14 ERROR juju.worker.dependency "api-caller" manifold worker returned unexpected error: [40652e] "unit-grafana-0" cannot open api: cannot resolve "controller-service.controller-uk8s.svc.cluster.local": lookup controller-service.controller-uk8s.svc.cluster.local: i/o timeout
unit-grafana-0: 2022-12-08 01:05:32 WARNING juju.worker.proxyupdater unable to set snap core settings [proxy.http= proxy.https= proxy.store=]: exec: "snap": executable file not found in $PATH, output: ""
unit-grafana-0: 2022-12-08 01:05:46 WARNING juju.worker.uniter.context could not retrieve the cloud API version: unable to determine legacy status for namespace cos-lite-load-test: Get "https://10.152.183.1:443/api/v1/namespaces/cos-lite-load-test": net/http: TLS handshake timeout
unit-grafana-0: 2022-12-08 01:05:55 DEBUG unit.grafana/0.juju-log Operator Framework 1.5.4+2.g320e7e0 up and running.
unit-grafana-0: 2022-12-08 01:05:59 INFO unit.grafana/0.juju-log Running legacy hooks/start.
unit-grafana-0: 2022-12-08 01:06:05 DEBUG unit.grafana/0.juju-log Operator Framework 1.5.4+2.g320e7e0 up and running.
unit-grafana-0: 2022-12-08 01:06:05 DEBUG unit.grafana/0.juju-log Charm called itself via hooks/start.
unit-grafana-0: 2022-12-08 01:06:06 DEBUG unit.grafana/0.juju-log Legacy hooks/start exited with status 0.
unit-grafana-0: 2022-12-08 01:06:06 DEBUG unit.grafana/0.juju-log Reading alert rule from /var/lib/juju/agents/unit-grafana-0/charm/src/prometheus_alert_rules/unit_unavailable.rule
unit-grafana-0: 2022-12-08 01:06:06 DEBUG unit.grafana/0.juju-log Emitting Juju event start.

At this point, the relation data still contained the scrape targets:

$ juju show-unit scrape-config/0
scrape-config/0:
  workload-version: n/a
  opened-ports: []
  charm: ch:amd64/focal/prometheus-scrape-config-k8s-38
  leader: true
  life: alive
  relation-info:
  - relation-id: 24
    endpoint: configurable-scrape-jobs
    related-endpoint: metrics-endpoint
    application-data:
      scrape_jobs: '[{"job_name": "juju_cos-lite-load-test_40652ec_scrape-target_external_jobs",
        "static_configs": [{"targets": ["avalanche.us-central1-a.c.lma-light-load-testing.internal:9001",
        "avalanche.us-central1-a.c.lma-light-load-testing.internal:9002", "avalanche.us-central1-a.c.lma-light-load-testing.internal:9003", ...

But Prometheus does not list them among its active targets:

$ curl -s 10.1.79.219/cos-lite-load-test-prometheus-0/api/v1/targets | jq
{
  "status": "success",
  "data": {
    "activeTargets": [
      {
        "discoveredLabels": {
          "__address__": "localhost:9090",
          "__metrics_path__": "/metrics",
          "__scheme__": "http",
          "__scrape_interval__": "15s",
          "__scrape_timeout__": "10s",
          "job": "prometheus"
        },
        "labels": {
          "instance": "localhost:9090",
          "job": "prometheus"
        },
        "scrapePool": "prometheus",
        "scrapeUrl": "http://localhost:9090/metrics",
        "globalUrl": "http://pd-ssd-4cpu-8gb.us-central1-a.c.lma-light-load-testing.internal:80/metrics",
        "lastError": "server returned HTTP status 404 Not Found",
        "lastScrape": "2022-12-08T18:59:12.275395831Z",
        "lastScrapeDuration": 0.000530596,
        "health": "down",
        "scrapeInterval": "15s",
        "scrapeTimeout": "10s"
      }
    ],
    "droppedTargets": []
  }
}
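
The comparison between the relation data and the targets API can be automated. The following is a minimal sketch, not part of the charm: the function name `missing_targets` is hypothetical, and it assumes the `scrape_jobs` value from `juju show-unit` and the JSON body of `/api/v1/targets` have already been fetched and are passed in as a string and a dict respectively.

```python
import json


def missing_targets(scrape_jobs_json, targets_response):
    """Return targets present in relation data but absent from Prometheus.

    scrape_jobs_json: the JSON string stored under `scrape_jobs` in the
        application data bag (as printed by `juju show-unit`).
    targets_response: the decoded JSON body of GET /api/v1/targets.
    """
    # Collect every static target advertised over the relation.
    expected = set()
    for job in json.loads(scrape_jobs_json):
        for static_config in job.get("static_configs", []):
            expected.update(static_config.get("targets", []))

    # Collect the addresses Prometheus is actually scraping.
    active = {
        target["discoveredLabels"]["__address__"]
        for target in targets_response["data"]["activeTargets"]
    }

    return sorted(expected - active)


# Example with values shortened from the output above.
relation_value = json.dumps([
    {
        "job_name": "juju_cos-lite-load-test_40652ec_scrape-target_external_jobs",
        "static_configs": [
            {
                "targets": [
                    "avalanche.us-central1-a.c.lma-light-load-testing.internal:9001",
                    "avalanche.us-central1-a.c.lma-light-load-testing.internal:9002",
                ]
            }
        ],
    }
])
api_response = {
    "status": "success",
    "data": {
        "activeTargets": [
            {"discoveredLabels": {"__address__": "localhost:9090"}}
        ],
        "droppedTargets": [],
    },
}
print(missing_targets(relation_value, api_response))
```

With the data from this report, every avalanche target would be reported as missing, since the only active target is `localhost:9090`.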

And the config file in the workload container is the default one baked into the image, not the one generated by the charm:

$ juju ssh --container prometheus prometheus/0 cat /etc/prometheus/prometheus.yml
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
      - targets: ["localhost:9090"]
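
A quick way to tell the two config files apart is to look at the job names. The sketch below is a heuristic only: it assumes that jobs rendered by the charm keep the `juju_` prefix seen in the relation data above, which this report does not confirm, and the helper names `job_names` and `looks_like_image_default` are hypothetical. It uses a regular expression rather than a YAML parser so that it runs with the standard library alone.

```python
import re

# Matches uncommented `job_name:` keys, with or without a leading list dash
# and with or without quotes around the value.
JOB_NAME_RE = re.compile(
    r"""^\s*-?\s*job_name:\s*["']?([^"'\s#]+)""",
    re.MULTILINE,
)


def job_names(config_text):
    """Return the job names found in a prometheus.yml text."""
    return JOB_NAME_RE.findall(config_text)


def looks_like_image_default(config_text):
    """Heuristic: True if no job name carries the assumed `juju_` prefix."""
    return not any(name.startswith("juju_") for name in job_names(config_text))


# Excerpt of the file shown above.
default_config = '''
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]
'''
print(job_names(default_config))
print(looks_like_image_default(default_config))
```

The text passed in could be the output of the `juju ssh ... cat /etc/prometheus/prometheus.yml` command shown above.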

To Reproduce

Not sure. Perhaps kubectl delete?

Environment

Charms from the edge channel. Juju 2.9.34.

Relevant log output

See above.

Additional context

Related to #398.