docker-flow / docker-flow-monitor

MIT License

@service_mem_limit_nobuff not expanded properly in prometheus alert.rules #64

Closed dviator closed 6 years ago

dviator commented 6 years ago

Hey guys, upon discovering that one of our services was alerting on high memory usage because it was filling up its page cache, I tried to change the memory alert definition from com.df.alertIf=@service_mem_limit to com.df.alertIf=@service_mem_limit_nobuff. I changed this label in my stack file and redeployed the stack, but I am still seeing the old alert definition in Prometheus.
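For reference, this is the shape of the label change in the stack file (the service name, image, and threshold below are illustrative placeholders, not our exact stack):

```yaml
services:
  nexus:
    image: sonatype/nexus3   # placeholder image
    deploy:
      labels:
        - com.df.notify=true
        - com.df.alertName=memlimit
        # previously: com.df.alertIf=@service_mem_limit
        - com.df.alertIf=@service_mem_limit_nobuff
        - com.df.alertFor=30s
```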

Upon further investigation, it appears that the rules file sent to Prometheus was malformed YAML, so Prometheus fell back to the previous alert definitions. Here is what the invalid definitions look like, with a valid one included for reference. It looks like the shortcut was not expanded properly.

  - alert: monitor_monitor_memlimit
    expr: container_memory_usage_bytes{container_label_com_docker_swarm_service_name="monitor_monitor"}/container_spec_memory_limit_bytes{container_label_com_docker_swarm_service_name="monitor_monitor"} > 0.9
    for: 30s
    labels:
      receiver: system
      service: monitor_monitor
    annotations:
      summary: "Memory of the service monitor_monitor is over 0.9"
  - alert: nexus_haproxy_memlimit
    expr: @service_mem_limit_nobuff:0.8
    for: 30s
  - alert: nexus_nexus_memlimit
    expr: @service_mem_limit_nobuff:0.8
    for: 30s
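The unexpanded shortcut leaves a bare `@` at the start of the `expr` value, and `@` is a reserved indicator in YAML, which is exactly why the rule manager reports "found character that cannot start any token" and restores the previous rule set. A minimal reproduction with PyYAML (assuming the `yaml` package is installed):

```python
import yaml

# A rules entry as docker-flow-monitor emitted it, with the shortcut
# left unexpanded. The leading "@" in the expr value is a reserved
# YAML indicator, so parsing the whole file fails.
broken_rule = """\
- alert: nexus_haproxy_memlimit
  expr: @service_mem_limit_nobuff:0.8
  for: 30s
"""

try:
    yaml.safe_load(broken_rule)
except yaml.YAMLError as err:
    print(err)  # mentions: found character that cannot start any token
```

Running the generated file through `promtool check rules alert.rules` before reloading would surface the same parse error up front.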

Here is what the logs for docker flow monitor look like for this.

monitor_monitor.1.fvne2ow692od@ip-172-31-6-89.ec2.internal | level=info ts=2018-09-17T17:00:18.441343037Z caller=head.go:357 component=tsdb msg="WAL truncation completed" duration=92.096158ms
monitor_monitor.1.fvne2ow692od@ip-172-31-6-89.ec2.internal | 2018/09/17 18:55:04 Processing /v1/docker-flow-monitor/reconfigure?alertFor=30s&alertIf=%!s(MISSING)ervice_mem_limit_nobuff%!A(MISSING)0.8&alertName=memlimit&distribute=true&port=8081&redirectWhenHttpProto=true&replicas=1&serviceDomain=nexus.staging-gridpl.us&serviceName=nexus_nexus
monitor_monitor.1.fvne2ow692od@ip-172-31-6-89.ec2.internal | 2018/09/17 18:55:04 Adding alert memlimit for the service nexus_nexus
monitor_monitor.1.fvne2ow692od@ip-172-31-6-89.ec2.internal | %!(EXTRA prometheus.Alert={map[] 30s @service_mem_limit_nobuff:0.8 map[] memlimit false nexus_nexus_memlimit nexus_nexus 1})
monitor_monitor.1.fvne2ow692od@ip-172-31-6-89.ec2.internal | 2018/09/17 18:55:04 Writing to alert.rules
monitor_monitor.1.fvne2ow692od@ip-172-31-6-89.ec2.internal | 2018/09/17 18:55:04 Writing to prometheus.yml
monitor_monitor.1.fvne2ow692od@ip-172-31-6-89.ec2.internal | 2018/09/17 18:55:04 Reloading Prometheus
monitor_monitor.1.fvne2ow692od@ip-172-31-6-89.ec2.internal | 2018/09/17 18:55:04 pkill -HUP prometheus
monitor_monitor.1.fvne2ow692od@ip-172-31-6-89.ec2.internal | level=info ts=2018-09-17T18:55:04.165360131Z caller=main.go:588 msg="Loading configuration file" filename=/etc/prometheus/prometheus.yml
monitor_monitor.1.fvne2ow692od@ip-172-31-6-89.ec2.internal | 2018/09/17 18:55:04 Prometheus was reloaded
monitor_monitor.1.fvne2ow692od@ip-172-31-6-89.ec2.internal | level=error ts=2018-09-17T18:55:04.180208814Z caller=manager.go:479 component="rule manager" msg="loading groups failed" err="yaml: line 108: found character that cannot start any token"
monitor_monitor.1.fvne2ow692od@ip-172-31-6-89.ec2.internal | level=error ts=2018-09-17T18:55:04.180249509Z caller=main.go:607 msg="Failed to apply configuration" err="error loading rules, previous rule set restored"
monitor_monitor.1.fvne2ow692od@ip-172-31-6-89.ec2.internal | level=error ts=2018-09-17T18:55:04.18027167Z caller=main.go:451 msg="Error reloading config" err="one or more errors occurred while applying the new configuration (--config.file=/etc/prometheus/prometheus.yml)"
monitor_monitor.1.fvne2ow692od@ip-172-31-6-89.ec2.internal | 2018/09/17 18:55:11 Processing /v1/docker-flow-monitor/reconfigure?alertFor=30s&alertIf=%!s(MISSING)ervice_mem_limit_nobuff%!A(MISSING)0.8&alertName=memlimit&distribute=true&port=9000&replicas=2&serviceDomain=docker.staging-gridpl.us&serviceName=nexus_haproxy
monitor_monitor.1.fvne2ow692od@ip-172-31-6-89.ec2.internal | 2018/09/17 18:55:11 Adding alert memlimit for the service nexus_haproxy
monitor_monitor.1.fvne2ow692od@ip-172-31-6-89.ec2.internal | %!(EXTRA prometheus.Alert={map[] 30s @service_mem_limit_nobuff:0.8 map[] memlimit false nexus_haproxy_memlimit nexus_haproxy 2})
monitor_monitor.1.fvne2ow692od@ip-172-31-6-89.ec2.internal | 2018/09/17 18:55:11 Writing to alert.rules
monitor_monitor.1.fvne2ow692od@ip-172-31-6-89.ec2.internal | 2018/09/17 18:55:11 Writing to prometheus.yml
monitor_monitor.1.fvne2ow692od@ip-172-31-6-89.ec2.internal | 2018/09/17 18:55:11 Reloading Prometheus
monitor_monitor.1.fvne2ow692od@ip-172-31-6-89.ec2.internal | 2018/09/17 18:55:11 pkill -HUP prometheus
monitor_monitor.1.fvne2ow692od@ip-172-31-6-89.ec2.internal | level=info ts=2018-09-17T18:55:11.249477453Z caller=main.go:588 msg="Loading configuration file" filename=/etc/prometheus/prometheus.yml
monitor_monitor.1.fvne2ow692od@ip-172-31-6-89.ec2.internal | 2018/09/17 18:55:11 Prometheus was reloaded
monitor_monitor.1.fvne2ow692od@ip-172-31-6-89.ec2.internal | level=error ts=2018-09-17T18:55:11.264679519Z caller=manager.go:479 component="rule manager" msg="loading groups failed" err="yaml: line 100: found character that cannot start any token"
monitor_monitor.1.fvne2ow692od@ip-172-31-6-89.ec2.internal | level=error ts=2018-09-17T18:55:11.264715853Z caller=main.go:607 msg="Failed to apply configuration" err="error loading rules, previous rule set restored"
monitor_monitor.1.fvne2ow692od@ip-172-31-6-89.ec2.internal | level=error ts=2018-09-17T18:55:11.26473839Z caller=main.go:451 msg="Error reloading config" err="one or more errors occurred while applying the new configuration (--config.file=/etc/prometheus/prometheus.yml)"

dviator commented 6 years ago

Apologies for this one, guys; the problem was that I was running an older version of docker flow monitor. After updating to the latest Docker image, which defines the @service_mem_limit_nobuff shortcut, and redeploying my service, everything worked fine.
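In case it helps anyone else hitting this: pinning the monitor service to an explicit image tag rather than relying on latest makes the running version visible in the stack file itself. A minimal sketch (the tag shown is a placeholder, not a specific recommended release):

```yaml
services:
  monitor:
    # pin an explicit tag so the deployed version is obvious from the
    # stack file; "x.y.z" below is a placeholder, check Docker Hub for
    # the current tags
    image: dockerflow/docker-flow-monitor:x.y.z
```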

I mistakenly believed we were running the latest version of docker flow monitor because I had checked the releases page here: Releases

It appears that page is no longer being updated with new images. Checking the tags on Docker Hub and looking through the PRs for the shortcut I was interested in made me realize there had been quite a few new image releases since.