canonical / prometheus-k8s-operator

This charmed operator automates the operational procedures of running Prometheus, an open-source metrics backend.
https://charmhub.io/prometheus-k8s
Apache License 2.0

Errors in scrape job definitions pass silently #633

Open Abuelodelanada opened 1 month ago

Abuelodelanada commented 1 month ago

Bug Description

Prometheus is unable to unmarshal the params sent through the Prometheus scrape target charm, so the scrape job is not added to the configuration. Despite the error, Prometheus remains in the Active state.

To Reproduce

  1. Deploy Prometheus: juju deploy prometheus-k8s prom --channel edge --trust
  2. Deploy Prometheus scrape target: juju deploy prometheus-scrape-target-k8s scrape --channel edge
  3. Config Prometheus scrape target:
    • juju config scrape targets=192.168.0.248:9116
    • juju config scrape labels="job:cumulus"
    • juju config scrape metrics_path="/snmp"
    • juju config scrape params='{"auth": "snmp_v3", "module": "if_mib_if_name", "target": "192.168.100.200"}'
  4. Relate Prometheus to Prometheus scrape target: juju relate prom scrape
  5. Verify this scrape job is not included in Prometheus:
    $ juju ssh --container prometheus prom/0 cat /etc/prometheus/prometheus.yml
    global:
      evaluation_interval: 1m
      scrape_interval: 1m
      scrape_timeout: 10s
    rule_files:
    - /etc/prometheus/rules/juju_*.rules
    scrape_configs:
    - honor_timestamps: true
      job_name: prometheus
      metrics_path: /metrics
      relabel_configs:
      - regex: (.*)
        separator: _
        source_labels:
        - juju_model
        - juju_model_uuid
        - juju_application
        - juju_unit
        target_label: instance
      scheme: http
      scrape_interval: 5s
      scrape_timeout: 5s
      static_configs:
      - labels:
          host: localhost
          juju_application: prom
          juju_charm: prometheus-k8s
          juju_model: dmytro
          juju_model_uuid: 67755b6d-9410-46d9-8617-ee7c87d285c2
          juju_unit: prom/0
        targets:
        - prom-0.prom-endpoints.dmytro.svc.cluster.local:9090

Alternatively, the issue can be reproduced with this bundle:

bundle: kubernetes
applications:
  prom:
    charm: prometheus-k8s
    channel: latest/edge
    revision: 210
    resources:
      prometheus-image: 149
    scale: 1
    constraints: arch=amd64
    storage:
      database: kubernetes,1,1024M
    trust: true
  scrape:
    charm: prometheus-scrape-target-k8s
    channel: latest/edge
    revision: 34
    scale: 1
    options:
      labels: job:cumulus
      params: '{"auth": "snmp_v3", "module": "if_mib_if_name", "target": "192.168.100.200"}'
      targets: 192.168.0.248:9116
    constraints: arch=amd64
relations:
- - prom:metrics-endpoint
  - scrape:metrics-endpoint

Environment

Model   Controller  Cloud/Region        Version  SLA          Timestamp
dmytro  microk8s    microk8s/localhost  3.5.2    unsupported  16:23:20-03:00

App     Version  Status  Scale  Charm                         Channel      Rev  Address        Exposed  Message
prom    2.52.0   active      1  prometheus-k8s                latest/edge  210  10.152.183.22  no       
scrape  n/a      active      1  prometheus-scrape-target-k8s  latest/edge   34  10.152.183.36  no       

Unit       Workload  Agent  Address     Ports  Message
prom/0*    active    idle   10.1.9.252         
scrape/0*  active    idle   10.1.9.217         

Integration provider     Requirer               Interface          Type     Message
prom:prometheus-peers    prom:prometheus-peers  prometheus_peers   peer     
scrape:metrics-endpoint  prom:metrics-endpoint  prometheus_scrape  regular 

Relevant log output

unit-prom-0: 16:07:08.634 INFO unit.prom/0.juju-log metrics-endpoint:3: reqs=ResourceRequirements(claims=None, limits={}, requests={'cpu': '0.25', 'memory': '200Mi'}), templated=ResourceRequirements(claims=None, limits=None, requests={'cpu': '250m', 'memory': '200Mi'}), actual=ResourceRequirements(claims=None, limits=None, requests={'cpu': '250m', 'memory': '200Mi'})
unit-prom-0: 16:07:08.672 DEBUG unit.prom/0.juju-log metrics-endpoint:3: No alertmanagers available
unit-prom-0: 16:07:08.704 ERROR unit.prom/0.juju-log metrics-endpoint:3: Validating scrape jobs failed: b'time="2024-07-17T19:07:08Z" level=fatal msg="parsing YAML file /tmp/tmpe9yyw2pz: yaml: unmarshal errors:\\n  line 4: cannot unmarshal !!str `snmp_v3` into []string\\n  line 5: cannot unmarshal !!str `if_mib_...` into []string\\n  line 6: cannot unmarshal !!str `192.168...` into []string"\n'
unit-prom-0: 16:07:08.757 INFO unit.prom/0.juju-log metrics-endpoint:3: Pushed new configuration
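
The unmarshal errors above show that Prometheus expects every value under params to be a list of strings, while the reported config supplies plain strings. The following is a minimal sketch of how that shape could be checked and normalized before the job is written out. The function names find_param_errors and normalize_params are hypothetical and are not part of either charm; whether normalizing or rejecting the input is the right behaviour is left open here.

```python
from typing import Any, Dict, List


def find_param_errors(params: Dict[str, Any]) -> List[str]:
    """Return one message per value that is not a list of strings.

    Hypothetical helper: mirrors the shape Prometheus complains about
    in the log above (a string where []string is expected).
    """
    errors = []
    for key, value in params.items():
        is_string_list = isinstance(value, list) and all(
            isinstance(item, str) for item in value
        )
        if not is_string_list:
            errors.append(
                f"params[{key!r}] must be a list of strings, "
                f"got {type(value).__name__}"
            )
    return errors


def normalize_params(params: Dict[str, Any]) -> Dict[str, List[str]]:
    """Wrap scalar values in a single-element list and stringify items."""
    normalized = {}
    for key, value in params.items():
        items = value if isinstance(value, list) else [value]
        normalized[key] = [str(item) for item in items]
    return normalized


# The params value from the reproduction steps in this issue.
reported = {
    "auth": "snmp_v3",
    "module": "if_mib_if_name",
    "target": "192.168.100.200",
}
errors = find_param_errors(reported)
fixed = normalize_params(reported)
```

With the reported value, errors holds one entry per key, and fixed maps each key to a single-element list such as ["snmp_v3"], which passes the same check.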


lucabello commented 1 week ago

We should also surface validation errors for alert rules. Currently, if you relate to cos-config and make a typo in one alert rule, all of the rules disappear from Prometheus, and everything stays in active/idle.

We should either validate in cos-config and set that charm to blocked, or validate in Prometheus.
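
As a rough illustration of the behaviour suggested here, the sketch below keeps the valid rules, collects the invalid ones, and derives a status from the result, so a single typo neither drops every rule nor passes unnoticed. It only checks that each rule has a name and an expr key; it is an assumption-laden stand-in, not the charm's validation, and a real implementation would more likely rely on a tool such as promtool check rules. The names split_rules and status_for are hypothetical.

```python
from typing import Any, Dict, List, Tuple


def split_rules(rules: List[Any]) -> Tuple[List[Dict[str, Any]], List[str]]:
    """Separate structurally valid rules from invalid ones.

    Hypothetical check: a rule needs an 'alert' (or 'record') name
    and a non-empty 'expr' string. PromQL itself is not parsed.
    """
    valid, problems = [], []
    for index, rule in enumerate(rules):
        if not isinstance(rule, dict):
            problems.append(f"rule #{index}: not a mapping")
            continue
        name = rule.get("alert") or rule.get("record")
        expr = rule.get("expr")
        if not isinstance(name, str) or not name:
            problems.append(f"rule #{index}: missing 'alert' or 'record' name")
        elif not isinstance(expr, str) or not expr.strip():
            problems.append(f"rule {name!r}: missing 'expr'")
        else:
            valid.append(rule)
    return valid, problems


def status_for(problems: List[str]) -> Tuple[str, str]:
    """Map validation problems to a (status, message) pair."""
    if problems:
        return "blocked", f"{len(problems)} invalid rule(s): {problems[0]}"
    return "active", ""


rules = [
    {"alert": "HostDown", "expr": "up == 0"},
    {"alert": "Typo", "exp": "up == 0"},  # misspelled 'expr' key
]
valid, problems = split_rules(rules)
status, message = status_for(problems)
```

Here the HostDown rule is kept, the rule with the misspelled key is reported, and the resulting status is blocked rather than active.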