canonical / prometheus-k8s-operator

https://charmhub.io/prometheus-k8s
Apache License 2.0

Broken scrape jobs can get past our checks #554

Open dstathis opened 11 months ago

dstathis commented 11 months ago

Bug Description

When a certain set of scrape jobs is deployed, our scrape job validation is "fooled": the scrape jobs are written to disk, causing Prometheus to fail.

To Reproduce

Deploy the attached bundle and relate it to COS (adjust the saas section as needed): machine_model_bundle.txt

Here is the charm used in the bundle, in case the branch goes away (remove the .txt file extension): grafana-agent_ubuntu-22.04-amd64.charm.txt

Environment

                           juju info v0.1                            
┌──────────────┬────────────────────────────────────────────────────┐
│ jhack        │ 0.3.18.3                                           │
│ python       │ 3.10.12 (/home/dylan/repos/jhack/venv/bin/python3) │
│ juju-* snaps │  juju      │ 3.3.0 - 25355 (latest/stable)         │
│              │  juju-wait │ 2.8.4~2.8.4 - 96 (stable)             │
│ microk8s     │ MicroK8s v1.28.3 revision 6089                     │
│ lxd          │ 5.19                                               │
│ multipass    │ 1.12.2                                             │
│ multipassd   │ 1.12.2                                             │
│ os           │ Ubuntu 22.04.3 LTS                                 │
│ kernel       │ Linux 5.15.0-89-generic x86_64                     │
└──────────────┴────────────────────────────────────────────────────┘

Relevant log output

unit-prometheus-0: 14:43:14 ERROR unit.prometheus/0.juju-log Invalid prometheus configuration. Stdout: Checking /etc/prometheus/prometheus.yml
  SUCCESS: 6 rule files found
 SUCCESS: /etc/prometheus/prometheus.yml is valid prometheus config file syntax

Checking /etc/prometheus/rules/juju_lma_0c334414_alertmanager_metrics-endpoint_17.rules
  SUCCESS: 4 rules found

Checking /etc/prometheus/rules/juju_lma_0c334414_grafana_metrics-endpoint_19.rules
  SUCCESS: 2 rules found

Checking /etc/prometheus/rules/juju_lma_0c334414_loki_metrics-endpoint_18.rules
  SUCCESS: 4 rules found

Checking /etc/prometheus/rules/juju_lma_0c334414_traefik_metrics-endpoint_16.rules
  SUCCESS: 2 rules found

Checking /etc/prometheus/rules/juju_machine_006cfa02_zookeeper.rules

Checking /etc/prometheus/rules/juju_stuff_6e55ee64_agent.rules
  SUCCESS: 35 rules found

 Stderr:   FAILED:
lint error 39 duplicate rule(s) found.
Metric: CollectorFailed
Label(s):
    juju_application: hob
    juju_charm: hardware-observer
    juju_model: machine
    juju_model_uuid: 006cfa02-a74e-4ba7-863f-b3c423a4ebd1
    severity: error
Metric: IPMICurrentStateNotOk
Label(s):
    juju_application: hob
    juju_charm: hardware-observer
    juju_model: machine
    juju_model_uuid: 006cfa02-a74e-4ba7-863f-b3c423a4ebd1
    severity: {{ toLower $labels.state }}
Metric: IPMIDCMICommandFailed
Label(s):
    juju_application: hob
    juju_charm: hardware-observer
    juju_model: machine
    juju_model_uuid: 006cfa02-a74e-4ba7-863f-b3c423a4ebd1
    severity: critical
Metric: IPMIDCMIPowerConsumptionPercentageOutstanding
Label(s):
    juju_application: hob
    juju_charm: hardware-observer
    juju_model: machine
    juju_model_uuid: 006cfa02-a74e-4ba7-863f-b3c423a4ebd1
    severity: warning
Metric: IPMIFanSpeedStateNotOk
Label(s):
    juju_application: hob
    juju_charm: hardware-observer
    juju_model: machine
    juju_model_uuid: 006cfa02-a74e-4ba7-863f-b3c423a4ebd1
    severity: {{ toLower $labels.state }}
Metric: IPMIMonitoringCommandFailed
Label(s):
    juju_application: hob
    juju_charm: hardware-observer
    juju_model: machine
    juju_model_uuid: 006cfa02-a74e-4ba7-863f-b3c423a4ebd1
    severity: critical
Metric: IPMIPowerStateNotOk
Label(s):
    juju_application: hob
    juju_charm: hardware-observer
    juju_model: machine
    juju_model_uuid: 006cfa02-a74e-4ba7-863f-b3c423a4ebd1
    severity: {{ toLower $labels.state }}
Metric: IPMISELCommandFailed
Label(s):
    juju_application: hob
    juju_charm: hardware-observer
    juju_model: machine
    juju_model_uuid: 006cfa02-a74e-4ba7-863f-b3c423a4ebd1
    severity: critical
Metric: IPMISELStateCritical
Label(s):
    juju_application: hob
    juju_charm: hardware-observer
    juju_model: machine
    juju_model_uuid: 006cfa02-a74e-4ba7-863f-b3c423a4ebd1
    severity: critical
Metric: IPMISELStateWarning
Label(s):
    juju_application: hob
    juju_charm: hardware-observer
    juju_model: machine
    juju_model_uuid: 006cfa02-a74e-4ba7-863f-b3c423a4ebd1
    severity: warning
Metric: IPMISensorStateNotOk
Label(s):
    juju_application: hob
    juju_charm: hardware-observer
    juju_model: machine
    juju_model_uuid: 006cfa02-a74e-4ba7-863f-b3c423a4ebd1
    severity: {{ toLower $labels.state }}
Metric: IPMITemperatureStateNotOk
Label(s):
    juju_application: hob
    juju_charm: hardware-observer
    juju_model: machine
    juju_model_uuid: 006cfa02-a74e-4ba7-863f-b3c423a4ebd1
    severity: {{ toLower $labels.state }}
Metric: IPMIVoltageStateNotOk
Label(s):
    juju_application: hob
    juju_charm: hardware-observer
    juju_model: machine
    juju_model_uuid: 006cfa02-a74e-4ba7-863f-b3c423a4ebd1
    severity: {{ toLower $labels.state }}
Metric: LSISASControllerNotFound
Label(s):
    juju_application: hob
    juju_charm: hardware-observer
    juju_model: machine
    juju_model_uuid: 006cfa02-a74e-4ba7-863f-b3c423a4ebd1
    severity: warning
Metric: LSISASIRVolumeNotFound
Label(s):
    juju_application: hob
    juju_charm: hardware-observer
    juju_model: machine
    juju_model_uuid: 006cfa02-a74e-4ba7-863f-b3c423a4ebd1
    severity: warning
Metric: LSISASIRVolumeUnready
Label(s):
    juju_application: hob
    juju_charm: hardware-observer
    juju_model: machine
    juju_model_uuid: 006cfa02-a74e-4ba7-863f-b3c423a4ebd1
    severity: critical
Metric: LSISASPhysicalDiskUnready
Label(s):
    juju_application: hob
    juju_charm: hardware-observer
    juju_model: machine
    juju_model_uuid: 006cfa02-a74e-4ba7-863f-b3c423a4ebd1
    severity: critical
Metric: MegaRAIDControllerNotFound
Label(s):
    juju_application: hob
    juju_charm: hardware-observer
    juju_model: machine
    juju_model_uuid: 006cfa02-a74e-4ba7-863f-b3c423a4ebd1
    severity: warning
Metric: MegaRAIDVirtualDriveNotOptimal
Label(s):
    juju_application: hob
    juju_charm: hardware-observer
    juju_model: machine
    juju_model_uuid: 006cfa02-a74e-4ba7-863f-b3c423a4ebd1
    severity: warning
Metric: PerccliCommandFailed
Label(s):
    juju_application: hob
    juju_charm: hardware-observer
    juju_model: machine
    juju_model_uuid: 006cfa02-a74e-4ba7-863f-b3c423a4ebd1
    severity: critical
Metric: PowerEdgeRAIDControllerNotFound
Label(s):
    juju_application: hob
    juju_charm: hardware-observer
    juju_model: machine
    juju_model_uuid: 006cfa02-a74e-4ba7-863f-b3c423a4ebd1
    severity: warning
Metric: PowerEdgeRAIDControllerSuccess
Label(s):
    juju_application: hob
    juju_charm: hardware-observer
    juju_model: machine
    juju_model_uuid: 006cfa02-a74e-4ba7-863f-b3c423a4ebd1
    severity: critical
Metric: PowerEdgeRAIDVirtualDriveNotOptimal
Label(s):
    juju_application: hob
    juju_charm: hardware-observer
    juju_model: machine
    juju_model_uuid: 006cfa02-a74e-4ba7-863f-b3c423a4ebd1
    severity: warning
Metric: RedfishCallFailed
Label(s):
    juju_application: hob
    juju_charm: hardware-observer
    juju_model: machine
    juju_model_uuid: 006cfa02-a74e-4ba7-863f-b3c423a4ebd1
    severity: warning
Metric: RedfishChassisHealthNotOk
Label(s):
    juju_application: hob
    juju_charm: hardware-observer
    juju_model: machine
    juju_model_uuid: 006cfa02-a74e-4ba7-863f-b3c423a4ebd1
    severity: critical
Metric: RedfishMemoryDimmHealthNotOk
Label(s):
    juju_application: hob
    juju_charm: hardware-observer
    juju_model: machine
    juju_model_uuid: 006cfa02-a74e-4ba7-863f-b3c423a4ebd1
    severity: critical
Metric: RedfishProcessorHealthNotOk
Label(s):
    juju_application: hob
    juju_charm: hardware-observer
    juju_model: machine
    juju_model_uuid: 006cfa02-a74e-4ba7-863f-b3c423a4ebd1
    severity: critical
Metric: RedfishSensorHealthNotOk
Label(s):
    juju_application: hob
    juju_charm: hardware-observer
    juju_model: machine
    juju_model_uuid: 006cfa02-a74e-4ba7-863f-b3c423a4ebd1
    severity: critical
Metric: RedfishServiceUnavailable
Label(s):
    juju_application: hob
    juju_charm: hardware-observer
    juju_model: machine
    juju_model_uuid: 006cfa02-a74e-4ba7-863f-b3c423a4ebd1
    severity: warning
Metric: RedfishSmartStorageHealthNotOk
Label(s):
    juju_application: hob
    juju_charm: hardware-observer
    juju_model: machine
    juju_model_uuid: 006cfa02-a74e-4ba7-863f-b3c423a4ebd1
    severity: critical
Metric: RedfishStorageControllerHealthNotOk
Label(s):
    juju_application: hob
    juju_charm: hardware-observer
    juju_model: machine
    juju_model_uuid: 006cfa02-a74e-4ba7-863f-b3c423a4ebd1
    severity: critical
Metric: RedfishStorageDriveHealthNotOk
Label(s):
    juju_application: hob
    juju_charm: hardware-observer
    juju_model: machine
    juju_model_uuid: 006cfa02-a74e-4ba7-863f-b3c423a4ebd1
    severity: critical
Metric: SasircuCommandFailed
Label(s):
    juju_application: hob
    juju_charm: hardware-observer
    juju_model: machine
    juju_model_uuid: 006cfa02-a74e-4ba7-863f-b3c423a4ebd1
    severity: critical
Metric: SsaCLICommandFailed
Label(s):
    juju_application: hob
    juju_charm: hardware-observer
    juju_model: machine
    juju_model_uuid: 006cfa02-a74e-4ba7-863f-b3c423a4ebd1
    severity: critical
Metric: SsaCLIControllerNotFound
Label(s):
    juju_application: hob
    juju_charm: hardware-observer
    juju_model: machine
    juju_model_uuid: 006cfa02-a74e-4ba7-863f-b3c423a4ebd1
    severity: warning
Metric: SsaCLIControllerNotOK
Label(s):
    juju_application: hob
    juju_charm: hardware-observer
    juju_model: machine
    juju_model_uuid: 006cfa02-a74e-4ba7-863f-b3c423a4ebd1
    severity: critical
Metric: SsaCLILogicalDriveNotOK
Label(s):
    juju_application: hob
    juju_charm: hardware-observer
    juju_model: machine
    juju_model_uuid: 006cfa02-a74e-4ba7-863f-b3c423a4ebd1
    severity: critical
Metric: SsaCLIPhysicalDriveNotOK
Label(s):
    juju_application: hob
    juju_charm: hardware-observer
    juju_model: machine
    juju_model_uuid: 006cfa02-a74e-4ba7-863f-b3c423a4ebd1
    severity: critical
Metric: StorcliCommandFailed
Label(s):
    juju_application: hob
    juju_charm: hardware-observer
    juju_model: machine
    juju_model_uuid: 006cfa02-a74e-4ba7-863f-b3c423a4ebd1
    severity: critical
Might cause inconsistency while recording expressions
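
The failure in the log above is a duplicate-rule lint error. As a rough illustration of that kind of check, here is a minimal Python sketch that flags rules sharing the same name and the same label set across already-parsed rule groups. The function name `find_duplicate_rules` and the input shape (a list of group dicts, each with a `rules` list) are assumptions made for this example; this is not promtool's implementation nor code from the charm.

```python
from collections import Counter


def find_duplicate_rules(groups):
    """Return (name, labels) keys that appear more than once across groups.

    `groups` is assumed to be a list of dicts shaped like parsed Prometheus
    rule groups: {"name": ..., "rules": [{"alert": ..., "labels": {...}}]}.
    """
    seen = Counter()
    for group in groups:
        for rule in group.get("rules", []):
            # A rule is identified by its alert (or record) name plus labels.
            name = rule.get("alert") or rule.get("record")
            labels = tuple(sorted((rule.get("labels") or {}).items()))
            seen[(name, labels)] += 1
    return [key for key, count in seen.items() if count > 1]


# Example input: the same alert with identical labels lands in two groups.
example_groups = [
    {"name": "a", "rules": [
        {"alert": "CollectorFailed",
         "labels": {"juju_application": "hob", "severity": "error"}},
    ]},
    {"name": "b", "rules": [
        {"alert": "CollectorFailed",
         "labels": {"severity": "error", "juju_application": "hob"}},
        {"alert": "RedfishCallFailed",
         "labels": {"juju_application": "hob", "severity": "warning"}},
    ]},
]

for name, labels in find_duplicate_rules(example_groups):
    print(f"duplicate rule: {name} {dict(labels)}")
```

Running a check like this before writing rule files to disk would let the charm report the offending rules instead of handing Prometheus a configuration it rejects.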

Additional context

No response

lucabello commented 2 months ago

First, we should validate the scrape jobs with promtool; if we find one that's malformed, we should set the charm to Blocked. We don't want to stop Prometheus, because a Blocked status is better than an outage.

If schema validation fails for the scrape jobs coming from one relation, we omit those scrape jobs from the final configuration, and we set the charm to Blocked.

We need to add the same behavior to Grafana Agent, because it can also scrape metrics. We should probably have a helper function in the Prometheus library to handle that.
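
A minimal sketch of the per-relation filtering described above, in Python. Everything here is hypothetical: `is_well_formed` is a deliberately simple stand-in for real validation (which would call promtool), and `filter_scrape_jobs` and its input shape (relation id mapped to a list of job dicts) are assumptions for illustration, not the library's API.

```python
from typing import Callable, Dict, List, Tuple


def is_well_formed(job: object) -> bool:
    """Hypothetical stand-in validator; real validation would use promtool."""
    if not isinstance(job, dict):
        return False
    name = job.get("job_name")
    if not isinstance(name, str) or not name:
        return False
    static_configs = job.get("static_configs")
    if not isinstance(static_configs, list) or not static_configs:
        return False
    return all(
        isinstance(sc, dict) and isinstance(sc.get("targets"), list)
        for sc in static_configs
    )


def filter_scrape_jobs(
    jobs_by_relation: Dict[int, List[dict]],
    validate: Callable[[object], bool] = is_well_formed,
) -> Tuple[List[dict], List[int]]:
    """Keep jobs only from relations whose jobs all validate.

    Returns (accepted_jobs, rejected_relation_ids). A non-empty second
    element is the signal for the charm to go to Blocked.
    """
    accepted: List[dict] = []
    rejected: List[int] = []
    for relation_id, jobs in jobs_by_relation.items():
        if all(validate(job) for job in jobs):
            accepted.extend(jobs)
        else:
            # Omit every job from this relation rather than write it to disk.
            rejected.append(relation_id)
    return accepted, rejected


example = {
    17: [{"job_name": "alertmanager",
          "static_configs": [{"targets": ["10.0.0.1:9093"]}]}],
    42: [{"static_configs": "not-a-list"}],  # malformed: no job_name
}
accepted, rejected = filter_scrape_jobs(example)
print("jobs written to config:", [job["job_name"] for job in accepted])
print("relations causing Blocked:", rejected)
```

Keeping the validator pluggable means the same helper could be shared with Grafana Agent, with each charm passing in whatever check it can run.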