acouvreur / sablier

Start your containers on demand, shut them down automatically when there's no activity. Docker, Docker Swarm Mode and Kubernetes compatible.
https://acouvreur.github.io/sablier/
GNU Affero General Public License v3.0

Detection of manually stopped/started services and resetting the spin-down timer #25

Closed foureight84 closed 1 year ago

foureight84 commented 2 years ago

I have a service that is set to spin down after 10 minutes, and I've noticed that if I take the service offline and back online during that time frame via `docker stack rm` and `docker stack deploy`, I won't be able to access the service until the spin-down timeout has elapsed. I am currently running this plugin in local mode instead of using Pilot (not sure whether this issue also applies when deployed using Pilot).

Is there a way to manually reset the spin-down timer?

acouvreur commented 2 years ago

Can you describe your Traefik configuration and how the routing is done?

If the routing is done using labels, then the router is destroyed by the `docker stack rm` command.

Whether you bring the service back up or not, the ondemand service will emit a scale-down request.

So one odd behavior is indeed recreating the same stack/service while it is scaled down: creating the stack manually sets the number of replicas to 1, and as long as you do not access the service, it will eventually be scaled down again.

foureight84 commented 2 years ago

Here are my current docker stacks:

management stack


```yaml
version: "3.7"

services:
  traefik:
    image: traefik:latest
    ports:
      - target: 53
        published: 53
        protocol: tcp
      - target: 53
        published: 53
        protocol: udp
      - target: 80
        published: 80
        protocol: tcp
    environment:
      - TZ=US/Los_Angeles
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - ./traefik.yaml:/etc/traefik/traefik.yaml
      - './plugins-local:/plugins-local/'
    networks:
      - traefik
    deploy:
      labels:
        - traefik.enable=true
        - traefik.http.routers.api.rule=Host(`traefik.hades.home`)
        - traefik.http.routers.api.service=api@internal
        - traefik.http.routers.api.entrypoints=web
        - traefik.http.services.api.loadbalancer.server.port=8080

  ondemand:
    image: ghcr.io/acouvreur/traefik-ondemand-service:1.7
    command:
      - --swarmMode=true
    volumes:
      - '/var/run/docker.sock:/var/run/docker.sock'
    networks:
      - traefik

  portainer:
    image: portainer/portainer-ce:latest
    command: -H unix:///var/run/docker.sock
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      - portainer_data:/data
    networks:
      - traefik
    deploy:
      labels:
        # WebUI
        - traefik.enable=true
        - traefik.http.routers.portainer.rule=Host(`portainer.hades.home`)
        - traefik.http.routers.portainer.entrypoints=web
        - traefik.http.services.portainer.loadbalancer.server.port=9000
        - traefik.http.routers.portainer.service=portainer

networks:
  traefik:
    external: true

volumes:
  portainer_data:
```

Here is my monitoring stack:


```yaml
version: "3.7"

services:
  glances:
    image: nicolargo/glances:latest-alpine
    restart: always
    pid: host
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
    environment:
      - "GLANCES_OPT=-w"
    networks:
      - traefik
    deploy:
      replicas: 0
      labels:
        - traefik.enable=true
        - traefik.http.routers.glances.rule=Host(`glances.hades.home`)
        - traefik.http.routers.glances.entrypoints=web
        - traefik.http.services.glances.loadbalancer.server.port=61208
        - traefik.docker.lbswarm=true
        - traefik.http.middlewares.ondemand_glances.plugin.traefik-ondemand-plugin.name=monitoring_glances
        - traefik.http.middlewares.ondemand_glances.plugin.traefik-ondemand-plugin.serviceurl=http://ondemand:10000
        - traefik.http.middlewares.ondemand_glances.plugin.traefik-ondemand-plugin.timeout=10m
        - traefik.http.routers.glances.middlewares=ondemand_glances@docker

networks:
  traefik:
    external: yes
```

As you mentioned, the route gets removed from Traefik after `docker stack rm monitoring`, and the spin-down timer triggers after 10 minutes. However, when I test again and, instead of waiting for the timer, bring the monitoring stack back up with `docker stack deploy` and wait for Traefik to detect the routes again, browsing to http://glances.hades.home returns a 502 Bad Gateway response.

For the route to work again, I have to wait until the spin-down timeout has elapsed, counted from the timestamp of the last connection attempt.

This is the error log observed from Traefik:


```
time="2022-04-22T02:15:34Z" level=error msg="2022/04/22 02:15:34 Status: started" plugin=plugin-traefik-ondemand-plugin module=github.com/acouvreur/traefik-ondemand-plugin
time="2022-04-22T02:15:34Z" level=error msg="2022/04/22 02:15:34 Sending request: http://ondemand:10000?name=monitoring_glances&timeout=10m0s" module=github.com/acouvreur/traefik-ondemand-plugin plugin=plugin-traefik-ondemand-plugin
time="2022-04-22T02:13:50Z" level=error msg="2022/04/22 02:13:50 Status: started" plugin=plugin-traefik-ondemand-plugin module=github.com/acouvreur/traefik-ondemand-plugin
time="2022-04-22T02:13:50Z" level=error msg="2022/04/22 02:13:50 Sending request: http://ondemand:10000?name=monitoring_glances&timeout=10m0s" module=github.com/acouvreur/traefik-ondemand-plugin plugin=plugin-traefik-ondemand-plugin
time="2022-04-22T02:04:22Z" level=error msg="2022/04/22 02:04:22 Status: started" plugin=plugin-traefik-ondemand-plugin module=github.com/acouvreur/traefik-ondemand-plugin
time="2022-04-22T02:04:22Z" level=error msg="2022/04/22 02:04:22 Sending request: http://ondemand:10000?name=monitoring_glances&timeout=10m0s" plugin=plugin-traefik-ondemand-plugin module=github.com/acouvreur/traefik-ondemand-plugin
time="2022-04-22T02:04:02Z" level=error msg="2022/04/22 02:04:02 Status: started" plugin=plugin-traefik-ondemand-plugin module=github.com/acouvreur/traefik-ondemand-plugin
time="2022-04-22T02:04:02Z" level=error msg="2022/04/22 02:04:02 Sending request: http://ondemand:10000?name=monitoring_glances&timeout=10m0s" plugin=plugin-traefik-ondemand-plugin module=github.com/acouvreur/traefik-ondemand-plugin
```
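For reference, the request the plugin sends to the ondemand service (visible in the logs above) is a plain GET with `name` and `timeout` query parameters, so in principle the timer could be refreshed manually by issuing the same request. A minimal sketch; anything beyond the URL shape shown in the logs is an assumption:

```python
from urllib.parse import urlencode
from urllib.request import urlopen  # only needed if you actually send it

def ondemand_request_url(base: str, name: str, timeout: str) -> str:
    """Build the same request URL the Traefik plugin sends (see logs above)."""
    return f"{base}?{urlencode({'name': name, 'timeout': timeout})}"

url = ondemand_request_url("http://ondemand:10000", "monitoring_glances", "10m0s")
print(url)  # http://ondemand:10000?name=monitoring_glances&timeout=10m0s

# Sending it should refresh the timer, assuming the service treats every such
# request as activity (the plugin re-sends it on each proxied request):
# status = urlopen(url).read().decode()
```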
foureight84 commented 2 years ago

I am also noticing that, even under normal behavior, Traefik seems to treat the plugin's log output as errors.

This is continuously generated while ondemand is detecting activity on the route:

```
time="2022-04-22T02:15:34Z" level=error msg="2022/04/22 02:15:34 Sending request: http://ondemand:10000?name=monitoring_glances&timeout=10m0s" module=github.com/acouvreur/traefik-ondemand-plugin plugin=plugin-traefik-ondemand-plugin
time="2022-04-22T02:13:50Z" level=error msg="2022/04/22 02:13:50 Status: started" plugin=plugin-traefik-ondemand-plugin module=github.com/acouvreur/traefik-ondemand-plugin
time="2022-04-22T02:13:50Z" level=error msg="2022/04/22 02:13:50 Sending request: http://ondemand:10000?name=monitoring_glances&timeout=10m0s" module=github.com/acouvreur/traefik-ondemand-plugin plugin=plugin-traefik-ondemand-plugin
time="2022-04-22T02:04:22Z" level=error msg="2022/04/22 02:04:22 Status: started" plugin=plugin-traefik-ondemand-plugin module=github.com/acouvreur/traefik-ondemand-plugin
time="2022-04-22T02:04:22Z" level=error msg="2022/04/22 02:04:22 Sending request: http://ondemand:10000?name=monitoring_glances&timeout=10m0s" plugin=plugin-traefik-ondemand-plugin module=github.com/acouvreur/traefik-ondemand-plugin
time="2022-04-22T02:04:02Z" level=error msg="2022/04/22 02:04:02 Status: started" plugin=plugin-traefik-ondemand-plugin module=github.com/acouvreur/traefik-ondemand-plugin
time="2022-04-22T02:04:02Z" level=error msg="2022/04/22 02:04:02 Sending request: http://ondemand:10000?name=monitoring_glances&timeout=10m0s" plugin=plugin-traefik-ondemand-plugin module=github.com/acouvreur/traefik-ondemand-plugin
time="2022-04-22T03:16:03Z" level=error msg="2022/04/22 03:16:03 Status: started" plugin=plugin-traefik-ondemand-plugin module=github.com/acouvreur/traefik-ondemand-plugin
time="2022-04-22T03:16:03Z" level=error msg="2022/04/22 03:16:03 Status: started" module=github.com/acouvreur/traefik-ondemand-plugin plugin=plugin-traefik-ondemand-plugin
time="2022-04-22T03:16:03Z" level=error msg="2022/04/22 03:16:03 Sending request: http://ondemand:10000?name=monitoring_glances&timeout=10m0s" module=github.com/acouvreur/traefik-ondemand-plugin plugin=plugin-traefik-ondemand-plugin
time="2022-04-22T03:15:59Z" level=error msg="2022/04/22 03:15:59 Status: started" plugin=plugin-traefik-ondemand-plugin module=github.com/acouvreur/traefik-ondemand-plugin
```
acouvreur commented 2 years ago

The ondemand service would have stored the service as "started".

Which means that as soon as you bring the stack back up, the ondemand middleware forwards the request straight to the service, because its last known state was "started".

And you specified `replicas: 0`, so the redeployed service starts scaled down. That explains the 502 Bad Gateway.

As of now, the ondemand service does not monitor external state changes, so it trusts its internal state.

This could be fixed with two changes:

- monitor external state changes (e.g. Docker events) and update the internal state accordingly
- expose a way to manually reset the spin-down timer and service state

Both could be implemented, what do you think?
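To illustrate why the 502 happens: the middleware only consults its cached state on the request path. A rough sketch, assuming an in-memory state map as described above (all names and structure here are illustrative, not the actual implementation):

```python
class OndemandState:
    """Illustrative in-memory state cache: trusts the last known status."""
    def __init__(self):
        self.status = {}  # service name -> "started" | "stopped"

    def should_forward(self, name: str) -> bool:
        # The middleware forwards directly when the last known state is
        # "started"; no external check happens on the request path.
        return self.status.get(name) == "started"

state = OndemandState()
state.status["monitoring_glances"] = "started"   # service was up

# The stack is removed and redeployed with replicas: 0 -- the real service
# is down, but the internal state was never updated:
actually_running = False

if state.should_forward("monitoring_glances") and not actually_running:
    print("502 Bad Gateway: request forwarded to a scaled-down service")
```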

foureight84 commented 2 years ago

Thanks for elaborating on that.

Would it be possible to check the external state when the middleware plugin is triggered? Upon detecting the service in a down state, it would reset the timer and spin the service up. That would be a blend of the two solutions you mentioned.

If that's not possible, I would say a manual timer reset is probably more in line with the minimal resource usage this project aims for. Moreover, in a normal use case, I don't think there will be frequent manual removal and redeployment outside of testing, during which the ondemand configuration should probably be left out until the final deployment stage anyway.

acouvreur commented 2 years ago

The goal of the internal state is to avoid hammering the API with checks.

Web apps such as Portainer make a lot of requests. If the plugin had to check whether the service is up for every request before forwarding it, there would be a huge performance loss.

It is possible, but I would not recommend going in this direction.

Background polling of the known services might be a better solution, the same way Traefik polls its providers every 5s (by default).
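The polling approach could look roughly like this: a background loop reconciles the internal state against the orchestrator on a fixed interval, keeping the request path cheap. A sketch only; `get_replicas` stands in for a Docker API call, and none of these names come from the actual code:

```python
import threading
import time

def reconcile_loop(state: dict, get_replicas, interval: float = 5.0, stop=None):
    """Periodically sync internal state with the orchestrator, like Traefik's
    5s provider polling, instead of checking on every proxied request."""
    while stop is None or not stop.is_set():
        for name in list(state):
            state[name] = "started" if get_replicas(name) > 0 else "stopped"
        time.sleep(interval)

# Usage sketch: fake replica counts standing in for Docker API responses.
replicas = {"monitoring_glances": 0}
state = {"monitoring_glances": "started"}  # stale internal state

stop = threading.Event()
t = threading.Thread(
    target=reconcile_loop,
    args=(state, lambda n: replicas[n], 0.01, stop),
    daemon=True,
)
t.start()
time.sleep(0.05)   # let at least one reconcile pass run
stop.set()
print(state["monitoring_glances"])  # now "stopped"
```

With this design, a manual `docker stack rm`/`deploy` would be picked up within one polling interval, at a fixed API cost independent of request volume.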

foureight84 commented 2 years ago

Gotcha, that's a really good insight. Thanks for clarifying! Background polling is much nicer than having to manually request a timer reset, and it's great that it's also the more efficient option.

acouvreur commented 1 year ago

I know that Kubernetes provides some kind of mechanism to avoid hammering the API. See https://pkg.go.dev/k8s.io/client-go/informers

acouvreur commented 1 year ago

This feature is now released for Docker, Docker Swarm and Kubernetes!

You can see the details here:

https://github.com/acouvreur/sablier/commit/a62f098d42a3860bfc841e6e008a3eba3da1362e
https://github.com/acouvreur/sablier/commit/1ca1934b1c57f5b45b269d6045dd1dcbe2d608c2
https://github.com/acouvreur/sablier/commit/e11cd858532b7f13e1d653e952440c6445ed3c38

(still in beta because I didn't write the tests)