foureight84 closed this issue 1 year ago
Can you describe your Traefik configuration and how the routing is done? If it is done using labels, then the routing is destroyed by the `docker stack rm` command.
Whether you bring the service up or not, the ondemand service will emit a scale-down request.
So one source of weird behavior is in fact recreating the same stack/service while it is scaled down. Creating the stack manually sets the number of replicas to 1; as long as you do not access the service, it will be scaled down.
Here are my current Docker stacks.

Management stack:

```yaml
version: "3.7"

services:
  traefik:
    image: traefik:latest
    ports:
      - target: 53
        published: 53
        protocol: tcp
      - target: 53
        published: 53
        protocol: udp
      - target: 80
        published: 80
        protocol: tcp
    environment:
      - TZ=US/Los_Angeles
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - ./traefik.yaml:/etc/traefik/traefik.yaml
      - './plugins-local:/plugins-local/'
    networks:
      - traefik
    deploy:
      labels:
        - traefik.enable=true
        - traefik.http.routers.api.rule=Host(`traefik.hades.home`)
        - traefik.http.routers.api.service=api@internal
        - traefik.http.routers.api.entrypoints=web
        - traefik.http.services.api.loadbalancer.server.port=8080

  ondemand:
    image: ghcr.io/acouvreur/traefik-ondemand-service:1.7
    command:
      - --swarmMode=true
    volumes:
      - '/var/run/docker.sock:/var/run/docker.sock'
    networks:
      - traefik

  portainer:
    image: portainer/portainer-ce:latest
    command: -H unix:///var/run/docker.sock
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      - portainer_data:/data
    networks:
      - traefik
    deploy:
      labels:
        # WebUI
        - traefik.enable=true
        - traefik.http.routers.portainer.rule=Host(`portainer.hades.home`)
        - traefik.http.routers.portainer.entrypoints=web
        - traefik.http.services.portainer.loadbalancer.server.port=9000
        - traefik.http.routers.portainer.service=portainer

networks:
  traefik:
    external: true

volumes:
  portainer_data:
```
Here is my monitoring stack:

```yaml
version: "3.7"

services:
  glances:
    image: nicolargo/glances:latest-alpine
    restart: always
    pid: host
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
    environment:
      - "GLANCES_OPT=-w"
    networks:
      - traefik
    deploy:
      replicas: 0
      labels:
        - traefik.enable=true
        - traefik.http.routers.glances.rule=Host(`glances.hades.home`)
        - traefik.http.routers.glances.entrypoints=web
        - traefik.http.services.glances.loadbalancer.server.port=61208
        - traefik.docker.lbswarm=true
        - traefik.http.middlewares.ondemand_glances.plugin.traefik-ondemand-plugin.name=monitoring_glances
        - traefik.http.middlewares.ondemand_glances.plugin.traefik-ondemand-plugin.serviceurl=http://ondemand:10000
        - traefik.http.middlewares.ondemand_glances.plugin.traefik-ondemand-plugin.timeout=10m
        - traefik.http.routers.glances.middlewares=ondemand_glances@docker

networks:
  traefik:
    external: yes
```
As you mentioned, the route gets removed from Traefik after `docker stack rm monitoring`, and the spin-down timer triggers after 10 minutes. However, when testing that again, instead of waiting for the timer to spin the service down, I bring the monitoring stack back up with `docker stack deploy` and watch Traefik until the routes are detected. Once the routes are detected, I browse to http://glances.hades.home and receive a 502 Bad Gateway response.
For the route to work again, I have to wait until the spin-down timeout has elapsed, counted from the timestamp of the last connection attempt.
This is the error log observed from Traefik:

```
time="2022-04-22T02:15:34Z" level=error msg="2022/04/22 02:15:34 Status: started" plugin=plugin-traefik-ondemand-plugin module=github.com/acouvreur/traefik-ondemand-plugin
time="2022-04-22T02:15:34Z" level=error msg="2022/04/22 02:15:34 Sending request: http://ondemand:10000?name=monitoring_glances&timeout=10m0s" module=github.com/acouvreur/traefik-ondemand-plugin plugin=plugin-traefik-ondemand-plugin
time="2022-04-22T02:13:50Z" level=error msg="2022/04/22 02:13:50 Status: started" plugin=plugin-traefik-ondemand-plugin module=github.com/acouvreur/traefik-ondemand-plugin
time="2022-04-22T02:13:50Z" level=error msg="2022/04/22 02:13:50 Sending request: http://ondemand:10000?name=monitoring_glances&timeout=10m0s" module=github.com/acouvreur/traefik-ondemand-plugin plugin=plugin-traefik-ondemand-plugin
time="2022-04-22T02:04:22Z" level=error msg="2022/04/22 02:04:22 Status: started" plugin=plugin-traefik-ondemand-plugin module=github.com/acouvreur/traefik-ondemand-plugin
time="2022-04-22T02:04:22Z" level=error msg="2022/04/22 02:04:22 Sending request: http://ondemand:10000?name=monitoring_glances&timeout=10m0s" plugin=plugin-traefik-ondemand-plugin module=github.com/acouvreur/traefik-ondemand-plugin
time="2022-04-22T02:04:02Z" level=error msg="2022/04/22 02:04:02 Status: started" plugin=plugin-traefik-ondemand-plugin module=github.com/acouvreur/traefik-ondemand-plugin
time="2022-04-22T02:04:02Z" level=error msg="2022/04/22 02:04:02 Sending request: http://ondemand:10000?name=monitoring_glances&timeout=10m0s" plugin=plugin-traefik-ondemand-plugin module=github.com/acouvreur/traefik-ondemand-plugin
```
I am also noticing that, even under normal behavior, Traefik seems to treat the plugin's log output as errors. The following is continuously generated while ondemand is detecting activity on the route:
```
time="2022-04-22T02:15:34Z" level=error msg="2022/04/22 02:15:34 Sending request: http://ondemand:10000?name=monitoring_glances&timeout=10m0s" module=github.com/acouvreur/traefik-ondemand-plugin plugin=plugin-traefik-ondemand-plugin
time="2022-04-22T02:13:50Z" level=error msg="2022/04/22 02:13:50 Status: started" plugin=plugin-traefik-ondemand-plugin module=github.com/acouvreur/traefik-ondemand-plugin
time="2022-04-22T02:13:50Z" level=error msg="2022/04/22 02:13:50 Sending request: http://ondemand:10000?name=monitoring_glances&timeout=10m0s" module=github.com/acouvreur/traefik-ondemand-plugin plugin=plugin-traefik-ondemand-plugin
time="2022-04-22T02:04:22Z" level=error msg="2022/04/22 02:04:22 Status: started" plugin=plugin-traefik-ondemand-plugin module=github.com/acouvreur/traefik-ondemand-plugin
time="2022-04-22T02:04:22Z" level=error msg="2022/04/22 02:04:22 Sending request: http://ondemand:10000?name=monitoring_glances&timeout=10m0s" plugin=plugin-traefik-ondemand-plugin module=github.com/acouvreur/traefik-ondemand-plugin
time="2022-04-22T02:04:02Z" level=error msg="2022/04/22 02:04:02 Status: started" plugin=plugin-traefik-ondemand-plugin module=github.com/acouvreur/traefik-ondemand-plugin
time="2022-04-22T02:04:02Z" level=error msg="2022/04/22 02:04:02 Sending request: http://ondemand:10000?name=monitoring_glances&timeout=10m0s" plugin=plugin-traefik-ondemand-plugin module=github.com/acouvreur/traefik-ondemand-plugin
time="2022-04-22T03:16:03Z" level=error msg="2022/04/22 03:16:03 Status: started" plugin=plugin-traefik-ondemand-plugin module=github.com/acouvreur/traefik-ondemand-plugin
time="2022-04-22T03:16:03Z" level=error msg="2022/04/22 03:16:03 Status: started" module=github.com/acouvreur/traefik-ondemand-plugin plugin=plugin-traefik-ondemand-plugin
time="2022-04-22T03:16:03Z" level=error msg="2022/04/22 03:16:03 Sending request: http://ondemand:10000?name=monitoring_glances&timeout=10m0s" module=github.com/acouvreur/traefik-ondemand-plugin plugin=plugin-traefik-ondemand-plugin
time="2022-04-22T03:15:59Z" level=error msg="2022/04/22 03:15:59 Status: started" plugin=plugin-traefik-ondemand-plugin module=github.com/acouvreur/traefik-ondemand-plugin
```
The OnDemand service would have stored the service as "started", which means that as soon as you bring the stack back up, the ondemand middleware forwards the request to the service because its last known state was "started".
You indeed specified `replicas: 0`, which means the service was scaled down manually. That explains the 502 Bad Gateway.
As of now, the OnDemand service does not monitor for external state changes, so it trusts its internal state.
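To illustrate the failure mode, here is a minimal sketch of the idea (hypothetical code, not the actual plugin implementation): the service keeps an in-memory cache of service name → state, and the middleware consults that cache instead of the Docker API, so a service that was removed and redeployed externally with `replicas: 0` is still reported as "started" and the request gets forwarded to a backend with no replicas.

```go
package main

import "fmt"

// state is what the on-demand service remembers about a managed service.
type state string

const (
	started state = "started"
	stopped state = "stopped"
)

// cache trusts its last-known state and never re-checks the Docker API.
type cache struct {
	services map[string]state
}

// status returns the remembered state, defaulting to stopped for
// services that were never seen.
func (c *cache) status(name string) state {
	if s, ok := c.services[name]; ok {
		return s
	}
	return stopped
}

func main() {
	c := &cache{services: map[string]state{}}
	c.services["monitoring_glances"] = started // service was spun up once

	// The stack is now removed and redeployed with replicas: 0, but the
	// cache still says "started", so the middleware forwards the request
	// and Traefik answers 502 because no task is running.
	fmt.Println(c.status("monitoring_glances")) // started
}
```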
This could be fixed with two changes:
Both could be implemented; what do you think?
Thanks for elaborating on that.
Would it be possible to check the external state when the middleware plugin is triggered? If the service is detected in a down state, the plugin would reset the timer and spin the service up. This is a blend of the two solutions you mentioned.
If that's not possible, I would say a manual timer reset is probably more in line with the minimal resource usage intended by this project. Moreover, in a normal use case, I don't think there will be frequent manual removal and redeployment outside of testing, and the ondemand configuration should probably be left out until the final deployment stage anyway.
The goal of the internal state is to avoid hammering the API for checks.
Web apps such as Portainer make a lot of requests. If the plugin were to check whether the service is up for every request before forwarding it, there would be a huge performance loss.
It is possible, but I would not recommend going in this direction.
Background polling with the stored keys might be a better solution, the same way Traefik does it: by default, it polls the services every 5s.
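The polling approach could look roughly like this (a sketch under assumptions; the real Sablier implementation differs, and `replicaCount` stands in for an actual Docker/Swarm API call): a goroutine ticks on an interval and reconciles the internal state with the replica count reported by the cluster, so an external `docker stack rm` is noticed within one interval instead of being discovered through a failed request.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// replicaCount stands in for a Docker/Swarm API call; in a real
// service this would query the cluster (hypothetical signature).
type replicaCount func(service string) int

// watcher reconciles the cached state with the cluster on each poll,
// similar to how Traefik polls its providers (5s by default).
type watcher struct {
	mu    sync.Mutex
	state map[string]bool // service name -> currently running?
}

// poll refreshes the cached state for every tracked service.
func (w *watcher) poll(services []string, count replicaCount) {
	w.mu.Lock()
	defer w.mu.Unlock()
	for _, s := range services {
		w.state[s] = count(s) > 0
	}
}

// run polls in the background until stop is closed.
func (w *watcher) run(interval time.Duration, services []string, count replicaCount, stop <-chan struct{}) {
	t := time.NewTicker(interval)
	defer t.Stop()
	for {
		select {
		case <-t.C:
			w.poll(services, count)
		case <-stop:
			return
		}
	}
}

func main() {
	w := &watcher{state: map[string]bool{"monitoring_glances": true}}
	// Simulate the stack being removed externally: replicas drop to 0,
	// and the next poll corrects the stale "running" state.
	w.poll([]string{"monitoring_glances"}, func(string) int { return 0 })
	fmt.Println(w.state["monitoring_glances"]) // false
}
```

The cost of this design is at most one API round trip per interval per tracked service, independent of request volume, which addresses the per-request overhead concern above.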
Gotcha, that's a really good insight. Thanks for clarifying! Background polling is much nicer than having to manually request a timer reset, and it's great that it's also more efficient.
I know that Kubernetes provides some kind of mechanism to avoid hammering the API. See https://pkg.go.dev/k8s.io/client-go/informers
This is now a released feature for Docker, Docker Swarm, and Kubernetes!
You can see the details here:

https://github.com/acouvreur/sablier/commit/a62f098d42a3860bfc841e6e008a3eba3da1362e
https://github.com/acouvreur/sablier/commit/1ca1934b1c57f5b45b269d6045dd1dcbe2d608c2
https://github.com/acouvreur/sablier/commit/e11cd858532b7f13e1d653e952440c6445ed3c38

(still in beta because I didn't write the tests)
I have a service that is set to spin down after 10 minutes, and I've noticed that if I take the service offline and back online during that time frame via `docker stack rm` and `docker stack deploy`, I won't be able to access the service until the spin-down timer has elapsed. I am currently running this plugin in local mode instead of using Pilot (not sure if this issue also applies when deployed using Pilot).
Is there a way to manually reset the spin-down timer?