Closed dviator closed 6 years ago
Apologies for this one, guys. The problem was that I was running an older version of Docker Flow Monitor. After updating to the latest Docker image, which defines the @service_mem_limit_nobuff shortcut, and redeploying my service, everything worked fine.
I mistakenly believed we were running the latest version of Docker Flow Monitor because I had checked the releases page here: Releases
It appears that page is no longer being updated with new images. Checking the tags on Docker Hub and looking through the PRs for the shortcut I was interested in made me realize that quite a few newer images had been released.
Hey guys, upon discovering that one of our services was alerting on high memory usage because its page cache was filling up, I tried to change the memory alert definition from com.df.alertIf=@service_mem_limit to com.df.alertIf=@service_mem_limit_nobuff. I changed this label in my stack file and redeployed the stack, but I am still seeing the old alert definition in Prometheus.
Upon further investigation, it appears that the alert rules written for Prometheus were malformed YAML, so Prometheus fell back to the previous alert definitions. Here is what the invalid definitions look like, with a valid one included for reference. It looks like the shortcut was not expanded properly.
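For anyone hitting the same error: in YAML, `@` and the backtick are reserved indicators and cannot start an unquoted (plain) scalar, which matches the "found character that cannot start any token" message in the logs. The sketch below is only an illustration of that rule, assuming the unexpanded shortcut was written into alert.rules unquoted. The helper name and the second sample expression are hypothetical and not part of Docker Flow Monitor.

```go
package main

import (
	"fmt"
	"strings"
)

// startsWithReservedYAMLIndicator reports whether value begins with '@' or
// '`', the two indicators YAML reserves, which makes it invalid as an
// unquoted (plain) scalar.
func startsWithReservedYAMLIndicator(value string) bool {
	trimmed := strings.TrimSpace(value)
	return strings.HasPrefix(trimmed, "@") || strings.HasPrefix(trimmed, "`")
}

func main() {
	// The first value is the unexpanded shortcut seen in the logs; the second
	// is a made-up placeholder standing in for an expanded expression.
	for _, expr := range []string{
		"@service_mem_limit_nobuff:0.8",
		"some_metric > 0.8",
	} {
		fmt.Printf("%-32q invalid as plain scalar: %v\n", expr, startsWithReservedYAMLIndicator(expr))
	}
}
```

A check like this only catches the symptom. The actual fix in this thread was running an image that knows the shortcut.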
Here is what the logs for docker flow monitor look like for this.
monitor_monitor.1.fvne2ow692od@ip-172-31-6-89.ec2.internal | level=info ts=2018-09-17T17:00:18.441343037Z caller=head.go:357 component=tsdb msg="WAL truncation completed" duration=92.096158ms
monitor_monitor.1.fvne2ow692od@ip-172-31-6-89.ec2.internal | 2018/09/17 18:55:04 Processing /v1/docker-flow-monitor/reconfigure?alertFor=30s&alertIf=%!s(MISSING)ervice_mem_limit_nobuff%!A(MISSING)0.8&alertName=memlimit&distribute=true&port=8081&redirectWhenHttpProto=true&replicas=1&serviceDomain=nexus.staging-gridpl.us&serviceName=nexus_nexus
monitor_monitor.1.fvne2ow692od@ip-172-31-6-89.ec2.internal | 2018/09/17 18:55:04 Adding alert memlimit for the service nexus_nexus
monitor_monitor.1.fvne2ow692od@ip-172-31-6-89.ec2.internal | %!(EXTRA prometheus.Alert={map[] 30s @service_mem_limit_nobuff:0.8 map[] memlimit false nexus_nexus_memlimit nexus_nexus 1})
monitor_monitor.1.fvne2ow692od@ip-172-31-6-89.ec2.internal | 2018/09/17 18:55:04 Writing to alert.rules
monitor_monitor.1.fvne2ow692od@ip-172-31-6-89.ec2.internal | 2018/09/17 18:55:04 Writing to prometheus.yml
monitor_monitor.1.fvne2ow692od@ip-172-31-6-89.ec2.internal | 2018/09/17 18:55:04 Reloading Prometheus
monitor_monitor.1.fvne2ow692od@ip-172-31-6-89.ec2.internal | 2018/09/17 18:55:04 pkill -HUP prometheus
monitor_monitor.1.fvne2ow692od@ip-172-31-6-89.ec2.internal | level=info ts=2018-09-17T18:55:04.165360131Z caller=main.go:588 msg="Loading configuration file" filename=/etc/prometheus/prometheus.yml
monitor_monitor.1.fvne2ow692od@ip-172-31-6-89.ec2.internal | 2018/09/17 18:55:04 Prometheus was reloaded
monitor_monitor.1.fvne2ow692od@ip-172-31-6-89.ec2.internal | level=error ts=2018-09-17T18:55:04.180208814Z caller=manager.go:479 component="rule manager" msg="loading groups failed" err="yaml: line 108: found character that cannot start any token"
monitor_monitor.1.fvne2ow692od@ip-172-31-6-89.ec2.internal | level=error ts=2018-09-17T18:55:04.180249509Z caller=main.go:607 msg="Failed to apply configuration" err="error loading rules, previous rule set restored"
monitor_monitor.1.fvne2ow692od@ip-172-31-6-89.ec2.internal | level=error ts=2018-09-17T18:55:04.18027167Z caller=main.go:451 msg="Error reloading config" err="one or more errors occurred while applying the new configuration (--config.file=/etc/prometheus/prometheus.yml)"
monitor_monitor.1.fvne2ow692od@ip-172-31-6-89.ec2.internal | 2018/09/17 18:55:11 Processing /v1/docker-flow-monitor/reconfigure?alertFor=30s&alertIf=%!s(MISSING)ervice_mem_limit_nobuff%!A(MISSING)0.8&alertName=memlimit&distribute=true&port=9000&replicas=2&serviceDomain=docker.staging-gridpl.us&serviceName=nexus_haproxy
monitor_monitor.1.fvne2ow692od@ip-172-31-6-89.ec2.internal | 2018/09/17 18:55:11 Adding alert memlimit for the service nexus_haproxy
monitor_monitor.1.fvne2ow692od@ip-172-31-6-89.ec2.internal | %!(EXTRA prometheus.Alert={map[] 30s @service_mem_limit_nobuff:0.8 map[] memlimit false nexus_haproxy_memlimit nexus_haproxy 2})
monitor_monitor.1.fvne2ow692od@ip-172-31-6-89.ec2.internal | 2018/09/17 18:55:11 Writing to alert.rules
monitor_monitor.1.fvne2ow692od@ip-172-31-6-89.ec2.internal | 2018/09/17 18:55:11 Writing to prometheus.yml
monitor_monitor.1.fvne2ow692od@ip-172-31-6-89.ec2.internal | 2018/09/17 18:55:11 Reloading Prometheus
monitor_monitor.1.fvne2ow692od@ip-172-31-6-89.ec2.internal | 2018/09/17 18:55:11 pkill -HUP prometheus
monitor_monitor.1.fvne2ow692od@ip-172-31-6-89.ec2.internal | level=info ts=2018-09-17T18:55:11.249477453Z caller=main.go:588 msg="Loading configuration file" filename=/etc/prometheus/prometheus.yml
monitor_monitor.1.fvne2ow692od@ip-172-31-6-89.ec2.internal | 2018/09/17 18:55:11 Prometheus was reloaded
monitor_monitor.1.fvne2ow692od@ip-172-31-6-89.ec2.internal | level=error ts=2018-09-17T18:55:11.264679519Z caller=manager.go:479 component="rule manager" msg="loading groups failed" err="yaml: line 100: found character that cannot start any token"
monitor_monitor.1.fvne2ow692od@ip-172-31-6-89.ec2.internal | level=error ts=2018-09-17T18:55:11.264715853Z caller=main.go:607 msg="Failed to apply configuration" err="error loading rules, previous rule set restored"
monitor_monitor.1.fvne2ow692od@ip-172-31-6-89.ec2.internal | level=error ts=2018-09-17T18:55:11.26473839Z caller=main.go:451 msg="Error reloading config" err="one or more errors occurred while applying the new configuration (--config.file=/etc/prometheus/prometheus.yml)"