cloud-gov / cg-deploy-prometheus

Other
4 stars 3 forks source link

Help wanted: aide alerting rule #236

Closed JohannesFleischer closed 3 months ago

JohannesFleischer commented 4 months ago

I'm completely new to Prometheus and just starting to use it. My current goal is to set up an alerting rule for aide. Therefore, I searched for examples and found this configuration snipped in your repo:

- type: replace
  path: /instance_groups/name=prometheus/jobs/name=prometheus2/properties/prometheus/custom_rules?/-
  value:
    name: aide
    rules:
    - alert: AideViolations
      expr: aide_violation_count > 0
      labels:
        service: aide
        severity: warning
      annotations:
        summary: 'AIDE found violations for {{$labels.instance}}, with action {{$labels.action}}'
        description: Review AIDE report in logs-platform - if changes are expected, update AIDE database on alerting VM

And I really don't understand how this works and would be very happy if you could explain it to me. Especially where the aide_violation_count value comes from, because I can't find any mention of Prometheus supporting aide natively and don't see where this value gets scraped.

Sorry for any inconveniences.

ChrisMcGowan commented 3 months ago

Hi @JohannesFleischer

Quick explanation. The value of aide_violation_count in the time series DB of Prometheus is being set inside our Aide BOSH Release. The release has a run_report task that generates a file that contains that metric. You can see that being done here.

On all our BOSH deployments, we co-locate the node_exporter release which is configured to read in files placed in it's config folder and push the file contents to Prometheus via the node gateway deployed with Prometheus. The run_report task has logic to find the errors and increment that count or leave it at zero. Any value greater then zero triggers the alert.