docker-flow / docker-flow-monitor

MIT License

@service_mem_limit_nobuff not expanded properly in prometheus alert.rules #64

Closed dviator closed 6 years ago

dviator commented 6 years ago

Hey guys, upon discovering that one of our services was alerting on high memory usage because it was filling up its page cache, I tried to change the memory alert definition from com.df.alertIf=@service_mem_limit to com.df.alertIf=@service_mem_limit_nobuff. I changed this label in my stack file and redeployed the stack, but I am still seeing the old alert definition in Prometheus.
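For reference, this is the shape of the label change in the stack file (the service name, image, and threshold below are illustrative placeholders, not our exact stack):

```yaml
services:
  nexus:
    image: sonatype/nexus3   # placeholder image
    deploy:
      labels:
        - com.df.notify=true
        - com.df.alertName=memlimit
        # previously: com.df.alertIf=@service_mem_limit
        - com.df.alertIf=@service_mem_limit_nobuff
        - com.df.alertFor=30s
```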

Upon further investigation, it appears that the rules file sent to Prometheus was malformed YAML, so Prometheus fell back to the previous alert definitions. Here is what the invalid definitions look like, with a valid one included for reference. It looks like the shortcut was not expanded properly.

  - alert: monitor_monitor_memlimit
    expr: container_memory_usage_bytes{container_label_com_docker_swarm_service_name="monitor_monitor"}/container_spec_memory_limit_bytes{container_label_com_docker_swarm_service_name="monitor_monitor"} > 0.9
    for: 30s
    labels:
      receiver: system
      service: monitor_monitor
    annotations:
      summary: "Memory of the service monitor_monitor is over 0.9"
  - alert: nexus_haproxy_memlimit
    expr: @service_mem_limit_nobuff:0.8
    for: 30s
  - alert: nexus_nexus_memlimit
    expr: @service_mem_limit_nobuff:0.8
    for: 30s
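The unexpanded shortcut leaves a bare `@` at the start of the `expr` value, and `@` is a reserved indicator in YAML, which is exactly why the rule manager reports "found character that cannot start any token" and restores the previous rule set. A minimal reproduction with PyYAML (assuming the `yaml` package is installed):

```python
import yaml

# A rules entry as docker-flow-monitor emitted it, with the shortcut
# left unexpanded. The leading "@" in the expr value is a reserved
# YAML indicator, so parsing the whole file fails.
broken_rule = """\
- alert: nexus_haproxy_memlimit
  expr: @service_mem_limit_nobuff:0.8
  for: 30s
"""

try:
    yaml.safe_load(broken_rule)
except yaml.YAMLError as err:
    print(err)  # mentions: found character that cannot start any token
```

Running the generated file through `promtool check rules alert.rules` before reloading would surface the same parse error up front.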

Here is what the logs for docker flow monitor look like for this.

monitor_monitor.1.fvne2ow692od@ip-172-31-6-89.ec2.internal | level=info ts=2018-09-17T17:00:18.441343037Z caller=head.go:357 component=tsdb msg="WAL truncation completed" duration=92.096158ms
monitor_monitor.1.fvne2ow692od@ip-172-31-6-89.ec2.internal | 2018/09/17 18:55:04 Processing /v1/docker-flow-monitor/reconfigure?alertFor=30s&alertIf=%!s(MISSING)ervice_mem_limit_nobuff%!A(MISSING)0.8&alertName=memlimit&distribute=true&port=8081&redirectWhenHttpProto=true&replicas=1&serviceDomain=nexus.staging-gridpl.us&serviceName=nexus_nexus
monitor_monitor.1.fvne2ow692od@ip-172-31-6-89.ec2.internal | 2018/09/17 18:55:04 Adding alert memlimit for the service nexus_nexus
monitor_monitor.1.fvne2ow692od@ip-172-31-6-89.ec2.internal | %!(EXTRA prometheus.Alert={map[] 30s @service_mem_limit_nobuff:0.8 map[] memlimit false nexus_nexus_memlimit nexus_nexus 1})
monitor_monitor.1.fvne2ow692od@ip-172-31-6-89.ec2.internal | 2018/09/17 18:55:04 Writing to alert.rules
monitor_monitor.1.fvne2ow692od@ip-172-31-6-89.ec2.internal | 2018/09/17 18:55:04 Writing to prometheus.yml
monitor_monitor.1.fvne2ow692od@ip-172-31-6-89.ec2.internal | 2018/09/17 18:55:04 Reloading Prometheus
monitor_monitor.1.fvne2ow692od@ip-172-31-6-89.ec2.internal | 2018/09/17 18:55:04 pkill -HUP prometheus
monitor_monitor.1.fvne2ow692od@ip-172-31-6-89.ec2.internal | level=info ts=2018-09-17T18:55:04.165360131Z caller=main.go:588 msg="Loading configuration file" filename=/etc/prometheus/prometheus.yml
monitor_monitor.1.fvne2ow692od@ip-172-31-6-89.ec2.internal | 2018/09/17 18:55:04 Prometheus was reloaded
monitor_monitor.1.fvne2ow692od@ip-172-31-6-89.ec2.internal | level=error ts=2018-09-17T18:55:04.180208814Z caller=manager.go:479 component="rule manager" msg="loading groups failed" err="yaml: line 108: found character that cannot start any token"
monitor_monitor.1.fvne2ow692od@ip-172-31-6-89.ec2.internal | level=error ts=2018-09-17T18:55:04.180249509Z caller=main.go:607 msg="Failed to apply configuration" err="error loading rules, previous rule set restored"
monitor_monitor.1.fvne2ow692od@ip-172-31-6-89.ec2.internal | level=error ts=2018-09-17T18:55:04.18027167Z caller=main.go:451 msg="Error reloading config" err="one or more errors occurred while applying the new configuration (--config.file=/etc/prometheus/prometheus.yml)"
monitor_monitor.1.fvne2ow692od@ip-172-31-6-89.ec2.internal | 2018/09/17 18:55:11 Processing /v1/docker-flow-monitor/reconfigure?alertFor=30s&alertIf=%!s(MISSING)ervice_mem_limit_nobuff%!A(MISSING)0.8&alertName=memlimit&distribute=true&port=9000&replicas=2&serviceDomain=docker.staging-gridpl.us&serviceName=nexus_haproxy
monitor_monitor.1.fvne2ow692od@ip-172-31-6-89.ec2.internal | 2018/09/17 18:55:11 Adding alert memlimit for the service nexus_haproxy
monitor_monitor.1.fvne2ow692od@ip-172-31-6-89.ec2.internal | %!(EXTRA prometheus.Alert={map[] 30s @service_mem_limit_nobuff:0.8 map[] memlimit false nexus_haproxy_memlimit nexus_haproxy 2})
monitor_monitor.1.fvne2ow692od@ip-172-31-6-89.ec2.internal | 2018/09/17 18:55:11 Writing to alert.rules
monitor_monitor.1.fvne2ow692od@ip-172-31-6-89.ec2.internal | 2018/09/17 18:55:11 Writing to prometheus.yml
monitor_monitor.1.fvne2ow692od@ip-172-31-6-89.ec2.internal | 2018/09/17 18:55:11 Reloading Prometheus
monitor_monitor.1.fvne2ow692od@ip-172-31-6-89.ec2.internal | 2018/09/17 18:55:11 pkill -HUP prometheus
monitor_monitor.1.fvne2ow692od@ip-172-31-6-89.ec2.internal | level=info ts=2018-09-17T18:55:11.249477453Z caller=main.go:588 msg="Loading configuration file" filename=/etc/prometheus/prometheus.yml
monitor_monitor.1.fvne2ow692od@ip-172-31-6-89.ec2.internal | 2018/09/17 18:55:11 Prometheus was reloaded
monitor_monitor.1.fvne2ow692od@ip-172-31-6-89.ec2.internal | level=error ts=2018-09-17T18:55:11.264679519Z caller=manager.go:479 component="rule manager" msg="loading groups failed" err="yaml: line 100: found character that cannot start any token"
monitor_monitor.1.fvne2ow692od@ip-172-31-6-89.ec2.internal | level=error ts=2018-09-17T18:55:11.264715853Z caller=main.go:607 msg="Failed to apply configuration" err="error loading rules, previous rule set restored"
monitor_monitor.1.fvne2ow692od@ip-172-31-6-89.ec2.internal | level=error ts=2018-09-17T18:55:11.26473839Z caller=main.go:451 msg="Error reloading config" err="one or more errors occurred while applying the new configuration (--config.file=/etc/prometheus/prometheus.yml)"

dviator commented 6 years ago

Apologies for this one, guys; the problem was that I was running an older version of docker flow monitor. After updating to the latest Docker image, which defines the @service_mem_limit_nobuff shortcut, and redeploying my service, everything worked fine.
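In case it helps anyone else hitting this: pinning the monitor service to an explicit image tag rather than relying on latest makes the running version visible in the stack file itself. A minimal sketch (the tag shown is a placeholder, not a specific recommended release):

```yaml
services:
  monitor:
    # pin an explicit tag so the deployed version is obvious from the
    # stack file; "x.y.z" below is a placeholder, check Docker Hub for
    # the current tags
    image: dockerflow/docker-flow-monitor:x.y.z
```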

I mistakenly believed we were running the latest version of docker flow monitor because I had checked the releases page here: Releases

It appears that page is no longer being updated with new images. Checking the tags on Docker Hub and looking through the PRs for the shortcut I was interested in made me realize there had been quite a few new image releases since.