Unable to activate alerts + must manually restart monitor to register new alerts

st3xupery commented 6 years ago

My main problem is that no matter how restrictive I set my memory limit, I cannot get the alert to show as active on the /alerts page in Prometheus. In the example below I have set my service's @service_mem_limit threshold to 10%, whereas at rest the service in question uses at least 60% of its available memory limit, and the alert is set to trigger with no timespan. Yet no matter how long I wait, the alert says (0 active).

      resources:
        limits:
          memory: 1000M
      labels:
        - com.df.notify=true
        - com.df.alertName=memlimit
        - com.df.alertIf=@service_mem_limit:0.1

This is how the alert translates into Prometheus

alert: monitoring_elasticsearch_memlimit
expr: container_memory_usage_bytes{container_label_com_docker_swarm_service_name="monitoring_elasticsearch"}
  / container_spec_memory_limit_bytes{container_label_com_docker_swarm_service_name="monitoring_elasticsearch"}
  > 0.1
labels:
  receiver: system
  service: monitoring_elasticsearch
annotations:
  summary: Memory of the service monitoring_elasticsearch is over 0.1

When I plug the expr into the Prometheus expression browser I get no data. Not even container_memory_usage_bytes{container_label_com_docker_swarm_service_name="monitoring_elasticsearch"} produces a result.

Here are the relevant docker-compose instructions

  swarm-listener:
    image: vfarcic/docker-flow-swarm-listener
    networks:
      - proxy
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    environment:
      - DF_NOTIFY_CREATE_SERVICE_URL=http://proxy:8080/v1/docker-flow-proxy/reconfigure
      - DF_NOTIFY_REMOVE_SERVICE_URL=http://proxy:8080/v1/docker-flow-proxy/remove
    deploy:
      placement:
        constraints: [node.role == manager]
  monitor:
    image: vfarcic/docker-flow-monitor:${TAG:-latest}
    environment:
      - LISTENER_ADDRESS=swarm-listener
      - GLOBAL_SCRAPE_INTERVAL=10s
    networks:
      - proxy
    deploy:
      placement:
        constraints:
          - node.role == manager
    ports:
      - 9090:9090

It may be worth noting that I have not incorporated the alert-manager, as I didn't want to set it up yet and figured I could test my alert settings before moving on to that step. Am I wrong in assuming I can continue with docker-flow-monitor without alert-manager?

It's also worth noting that I am using proxy as the shared network between docker-flow-monitor and docker-flow-swarm-listener because I am also using docker-flow-proxy in this stack.

It may also be worth noting that, after spinning up services other than docker-flow-monitor, I must manually restart the docker-flow-monitor service for new alerts to register in the Prometheus web console. I am not sure if that is intended behavior; perhaps it is a sign of something else being wrong.

Nothing in the monitor logs seems to indicate anything is amiss either:

proxy_monitor.1.b0gt7nw8bbt2@fences-dev-master1    | 2018/03/02 02:50:46 Requesting services from Docker Flow Swarm Listener
proxy_monitor.1.b0gt7nw8bbt2@fences-dev-master1    | 2018/03/02 02:50:46 Processing: [{"alertFor":"30s","alertIf":"@service_mem_limit:0.8","alertName":"memlimit","distribute":"true","replicas":"1","serviceName":"monitoring_kibana"},{"alertIf":"@service_mem_limit:0.1","alertName":"memlimit","distribute":"true","replicas":"1","serviceName":"monitoring_elasticsearch"},{"distribute":"true","port":"80","replicas":"1","serviceName":"proxy_letsencrypt-companion","servicePath":"/.well-known/acme-challenge"}]
proxy_monitor.1.b0gt7nw8bbt2@fences-dev-master1    | 2018/03/02 02:50:46 Writing to alert.rules
proxy_monitor.1.b0gt7nw8bbt2@fences-dev-master1    | 2018/03/02 02:50:46 Writing to prometheus.yml
proxy_monitor.1.b0gt7nw8bbt2@fences-dev-master1    | 2018/03/02 02:50:46 Starting Docker Flow Monitor
proxy_monitor.1.b0gt7nw8bbt2@fences-dev-master1    | 2018/03/02 02:50:46 Starting Prometheus
proxy_monitor.1.b0gt7nw8bbt2@fences-dev-master1    | 2018/03/02 02:50:46 /bin/sh -c prometheus --config.file="/etc/prometheus/prometheus.yml" --storage.tsdb.path="/prometheus" --web.console.libraries="/usr/share/prometheus/console_libraries" --web.console.templates="/usr/share/prometheus/consoles"
proxy_monitor.1.b0gt7nw8bbt2@fences-dev-master1    | level=info ts=2018-03-02T02:50:46.425281311Z caller=main.go:225 msg="Starting Prometheus" version="(version=2.1.0, branch=HEAD, revision=85f23d82a045d103ea7f3c89a91fba4a93e6367a)"
proxy_monitor.1.b0gt7nw8bbt2@fences-dev-master1    | level=info ts=2018-03-02T02:50:46.425401927Z caller=main.go:226 build_context="(go=go1.9.2, user=root@6e784304d3ff, date=20180119-12:01:23)"
proxy_monitor.1.b0gt7nw8bbt2@fences-dev-master1    | level=info ts=2018-03-02T02:50:46.42546681Z caller=main.go:227 host_details="(Linux 4.4.0-1047-aws #56-Ubuntu SMP Sat Jan 6 19:39:06 UTC 2018 x86_64 ba5f63bfc96a (none))"
proxy_monitor.1.b0gt7nw8bbt2@fences-dev-master1    | level=info ts=2018-03-02T02:50:46.425555206Z caller=main.go:228 fd_limits="(soft=1048576, hard=1048576)"
proxy_monitor.1.b0gt7nw8bbt2@fences-dev-master1    | level=info ts=2018-03-02T02:50:46.428645759Z caller=main.go:499 msg="Starting TSDB ..."
proxy_monitor.1.b0gt7nw8bbt2@fences-dev-master1    | level=info ts=2018-03-02T02:50:46.438652055Z caller=web.go:383 component=web msg="Start listening for connections" address=0.0.0.0:9090
proxy_monitor.1.b0gt7nw8bbt2@fences-dev-master1    | level=info ts=2018-03-02T02:50:46.443432951Z caller=main.go:509 msg="TSDB started"
proxy_monitor.1.b0gt7nw8bbt2@fences-dev-master1    | level=info ts=2018-03-02T02:50:46.443526522Z caller=main.go:585 msg="Loading configuration file" filename=/etc/prometheus/prometheus.yml
proxy_monitor.1.b0gt7nw8bbt2@fences-dev-master1    | level=info ts=2018-03-02T02:50:46.444105035Z caller=main.go:486 msg="Server is ready to receive web requests."
proxy_monitor.1.b0gt7nw8bbt2@fences-dev-master1    | level=info ts=2018-03-02T02:50:46.444482222Z caller=manager.go:59 component="scrape manager" msg="Starting scrape manager..."

I am fully at a loss as to how to debug this further. Perhaps I have made some mistake along the way or misunderstand what I should be expecting.

vfarcic commented 6 years ago

Can you execute the container_memory_usage_bytes expression and send the output?

st3xupery commented 6 years ago

When I execute

container_memory_usage_bytes{container_label_com_docker_swarm_service_name="monitoring_elasticsearch"} / container_spec_memory_limit_bytes{container_label_com_docker_swarm_service_name="monitoring_elasticsearch"} > 0.1

or

container_memory_usage_bytes{container_label_com_docker_swarm_service_name="monitoring_elasticsearch"}

or

container_memory_usage_bytes

I get no data. Something about my deployment must be off.

st3xupery commented 6 years ago

Here are some additional specs I can find

prometheus --version

prometheus, version 2.1.0 (branch: HEAD, revision: 85f23d82a045d103ea7f3c89a91fba4a93e6367a)
  build user:       root@6e784304d3ff
  build date:       20180119-12:01:23
  go version:       go1.9.2

cat /etc/prometheus/prometheus.yml

global:
  scrape_interval: 10s
rule_files:
- alert.rules

The content of alert.rules also looks to be in order.
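
Note that the prometheus.yml above contains no scrape_configs section, so Prometheus has no targets to scrape and queries such as container_memory_usage_bytes have no series to return. As a rough sketch, assuming a cAdvisor exporter service named cadvisor that listens on port 8080 (both are assumptions for illustration, and the exact file DFM generates may differ), a configuration with one scrape target could look like this:

```yaml
global:
  scrape_interval: 10s
rule_files:
- alert.rules
scrape_configs:
# Hypothetical job; the name "cadvisor" is an assumption, not taken from this issue.
- job_name: cadvisor
  dns_sd_configs:
  # In Swarm, tasks.<service> resolves to the IPs of all tasks of that service.
  - names: ["tasks.cadvisor"]
    type: A
    port: 8080
```

If the generated file never gains a section like this, the Status > Targets page in Prometheus will stay empty as well, which is a quick way to confirm that nothing is being scraped.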

vfarcic commented 6 years ago

The problem is in DFSL. It has only the proxy as the address in DF_NOTIFY_CREATE_SERVICE_URL and DF_NOTIFY_REMOVE_SERVICE_URL. You need to add (comma separated) the address of DFM, which runs Prometheus, as well. Otherwise, DFM will never receive a notification about exporters.
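
For context, container_memory_usage_bytes and container_spec_memory_limit_bytes are metrics exposed by cAdvisor, so the alert expression can only return data once such an exporter is running and DFM has been told to scrape it. A minimal sketch of an exporter service follows. The service name cadvisor, the image, and the choice of network are assumptions for illustration and are not part of the stack posted in this issue:

```yaml
services:
  # Hypothetical exporter service; adjust names to match your own stack.
  cadvisor:
    image: google/cadvisor
    networks:
      - proxy                     # assumption: any network shared with the monitor service
    volumes:
      # Host paths that cAdvisor reads to collect container statistics.
      - /:/rootfs
      - /var/run:/var/run
      - /sys:/sys
      - /var/lib/docker:/var/lib/docker
    deploy:
      mode: global                # run one cAdvisor task on every node
      labels:
        - com.df.notify=true      # lets DFSL announce this service
        - com.df.scrapePort=8080  # port DFM should scrape (cAdvisor's default)
```

The exporter has to share an overlay network with the monitor service, since Prometheus needs to reach the exporter's tasks directly in order to scrape them.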

st3xupery commented 6 years ago

OH! I do see what you are saying and have modified my environment variables accordingly.

  swarm-listener:
    ...
    environment:
      - 'DF_NOTIFY_CREATE_SERVICE_URL=http://proxy:8080/v1/docker-flow-proxy/reconfigure,http://monitor:8080/v1/docker-flow-proxy/reconfigure'
      - 'DF_NOTIFY_REMOVE_SERVICE_URL=http://proxy:8080/v1/docker-flow-proxy/remove,http://monitor:8080/v1/docker-flow-proxy/remove'

What I had failed to see was that proxy in the URL was a reference to the service and not to the overlay network. Because of that confusion, I thought placing DFP, DFSL and DFM on the proxy network was sufficient and that proxy in the URL would talk to them all. I do know that's not how overlay networks work, but clearly, I needed a second set of eyeballs to help.

That being said, it has been several minutes now and my aforementioned issues don't seem to have changed.

st3xupery commented 6 years ago

I even went so far as to remove DFP from the equation, but none of the queries (e.g. container_memory_usage_bytes) produce any result in the Prometheus dashboard. Even an error would be more insightful to me.

vfarcic commented 6 years ago

The problem is that you changed the name of the service to monitor but you left the rest of the address intact (http://monitor:8080/v1/docker-flow-proxy/reconfigure). The reconfigure address should be http://monitor:8080/v1/docker-flow-monitor/reconfigure. You can find an example in http://monitor.dockerflow.com/tutorial/.

st3xupery commented 6 years ago

Oh wow, I feel rather stupid. Well, I appreciate your patience with assisting me as this certainly resolves my issue. Much thanks again!

st3xupery commented 6 years ago

I found some time to revisit this part of my project, hopeful that resolving my URL mistake would be the key, but I still find myself with unresponsive alerts and queries that return no data.

In the example below I keep swarm-listener on both a proxy network and a monitor network, with DFM sharing the monitor network. I also tried putting them both exclusively on proxy. In both cases nothing changes.

  swarm-listener:
    image: vfarcic/docker-flow-swarm-listener
    networks:
      - proxy
      - monitor
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    environment:
      - 'DF_NOTIFY_CREATE_SERVICE_URL=http://proxy:8080/v1/docker-flow-proxy/reconfigure,http://monitor:8080/v1/docker-flow-monitor/reconfigure'
      - 'DF_NOTIFY_REMOVE_SERVICE_URL=http://proxy:8080/v1/docker-flow-proxy/remove,http://monitor:8080/v1/docker-flow-monitor/remove'
    deploy:
      placement:
        constraints: [node.role == manager]

  monitor:
    image: vfarcic/docker-flow-monitor
    environment:
      - LISTENER_ADDRESS=swarm-listener
      - GLOBAL_SCRAPE_INTERVAL=10s
    networks:
      - monitor
    ports:
      - 9090:9090

I really wish I could provide more substantial info but I have exhausted all possible logs.

Is there an example that uses both DFM and DFP that you know to work that I can experiment with locally?

vfarcic commented 6 years ago

Please send me the current config of your stacks and I'll try to replicate the problem inside one of my clusters.

P.S. Sorry for not responding earlier. I had too much work on my plate.

vfarcic commented 6 years ago

Closing due to inactivity.

docker-flow / docker-flow-monitor #39