Old instance 50% of the time

bartoszkrawczyk2 commented 3 months ago

Hi

I'm using Traefik, and after running docker rollout [name] every other request is routed to the old instance until it's removed. For fast-starting apps it's probably not an issue, but some apps can take quite some time to start, and 50% of users would see nothing (or a bad gateway error) during deployment.

I'm guessing that this is probably more Traefik's issue than docker-rollout, but maybe someone here knows how to help?

This is my compose file:

services:
  traefik:
    image: traefik:v3.0
    command:
      - "--api.insecure=true"
      - "--providers.docker"
      - "--entrypoints.web.address=:80"
      - "--providers.docker.exposedbydefault=false"
      - "--accesslog=true"
      - "--log.level=DEBUG"
    ports:
      - "80:80"
      - "8080:8080"
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock

  whoami:
    image: traefik/whoami
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.whoami.entrypoints=web"
      - "traefik.http.routers.whoami.rule=Host(`whoami.localhost`)"

  app:
    build:
      context: .
      dockerfile: Dockerfile
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.app.entrypoints=web"
      - "traefik.http.routers.app.rule=Host(`app.localhost`)"
    restart: unless-stopped

pedroterzero commented 3 months ago

As far as I know, Docker itself doesn't have any features to drain connections to (a number of) containers. I think you'd have to do this in your application layer (app) in this case. For example, by somehow telling the app it needs to start draining, returning 503 for all new connections, waiting 60 seconds, then stopping the container.
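
If the app drains on shutdown like this, Compose also has to give it enough time: by default a container only gets 10 seconds between the stop signal and a forced kill. A minimal sketch, assuming (hypothetically) that the app reacts to SIGTERM by returning 503 for new requests and exits on its own after about 60 seconds; neither behavior is part of the compose file above:

```yaml
services:
  app:
    build:
      context: .
      dockerfile: Dockerfile
    # Assumption: the app starts draining when it receives this signal
    # (SIGTERM is already Docker's default stop signal).
    stop_signal: SIGTERM
    # Must be longer than the app's own drain time (assumed ~60s here);
    # otherwise Docker sends SIGKILL after the default 10s.
    stop_grace_period: 70s
```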

wowu commented 3 months ago

Adding a healthcheck should prevent Traefik from routing requests to unhealthy containers. Docker Rollout will also wait for all new containers to be healthy before removing the old ones.

See https://docs.docker.com/compose/compose-file/05-services/#healthcheck
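
For the compose file above, that could look roughly like the following. This is only a sketch: the port (3000), the /health path, and the presence of curl in the image are assumptions, since the Dockerfile isn't shown.

```yaml
services:
  app:
    build:
      context: .
      dockerfile: Dockerfile
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.app.entrypoints=web"
      - "traefik.http.routers.app.rule=Host(`app.localhost`)"
    restart: unless-stopped
    healthcheck:
      # Assumed endpoint and port; replace with whatever the app really serves.
      test: ["CMD", "curl", "-f", "http://localhost:3000/health"]
      interval: 5s
      timeout: 3s
      retries: 3
      # Grace period for slow-starting apps: failed probes in this
      # window don't count toward retries.
      start_period: 30s
```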

pedroterzero commented 3 months ago

> Adding a healthcheck should prevent Traefik from routing requests to unhealthy containers. Docker Rollout will also wait for all new containers to be healthy before removing the old ones.

One still needs a mechanism to tell the 'old' container(s) to stop accepting traffic or become 'unhealthy' when the new ones have been rolled out though, right?

This should be after the new instances have been rolled out but before the old ones are removed.

Sort of related to https://github.com/Wowu/docker-rollout/issues/21, perhaps?
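
One way to sketch such a mechanism is to make the healthcheck itself fail on demand, for example when a marker file exists. Everything specific here is hypothetical: the /tmp/drain path, the port and /health endpoint, and curl being available in the image. Something would still have to create the file in the old containers (e.g. `docker exec <old-container> touch /tmp/drain`) and wait before they are removed; that step isn't covered by anything in this thread.

```yaml
services:
  app:
    healthcheck:
      # Reports unhealthy as soon as /tmp/drain exists, even if the app still responds.
      test: ["CMD-SHELL", "test ! -f /tmp/drain && curl -f http://localhost:3000/health"]
      interval: 2s
      timeout: 2s
      # A single failed probe is enough to mark the container unhealthy.
      retries: 1
```

After the file is created, the old container would need to stay up a little longer than the interval so the proxy has a chance to see the status change before the container stops.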

SponsorAds commented 2 months ago

All these zero-downtime promises with Compose come from uneducated people who do not know how Docker or networking work. It is NOT possible! Traefik will keep seeing the container while the killed processes inside might already prevent it from working correctly -> downtime. This can be especially problematic when stopping takes a bit longer (easily possible when e.g. a streaming service needs to deregister hundreds of ports). You cannot update Traefik to let it know; that is one of the bigger issues of declarative configuration.

A health check directly from Traefik will obviously also result in downtime, as that check does not run every millisecond and also cannot be expected to return reasonable results fast enough to prevent downtime and wrongly routed requests.
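
For reference, the Traefik-side check discussed here is configured per service through labels. A sketch with an assumed /health path and an assumed app port of 3000 (neither appears in the compose file above); the interval and timeout bound how long a dead instance can keep receiving traffic, which is the window this paragraph is about:

```yaml
services:
  app:
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.app.entrypoints=web"
      - "traefik.http.routers.app.rule=Host(`app.localhost`)"
      # Assumed port the app listens on inside the container.
      - "traefik.http.services.app.loadbalancer.server.port=3000"
      # Traefik probes each instance itself and takes failing ones out of rotation.
      - "traefik.http.services.app.loadbalancer.healthcheck.path=/health"
      - "traefik.http.services.app.loadbalancer.healthcheck.interval=2s"
      - "traefik.http.services.app.loadbalancer.healthcheck.timeout=1s"
```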

The solution lies in swarm/k8s. That is your only option. With swarm for example you can dynamically change labels, e.g. change traefik priority of a service - effectively taking it out of load balancing before stopping it.

Rafael4A commented 1 month ago

Just wanted to add that the same behavior happens with nginx-proxy. While both instances are up, requests are split between the old and the new instances instead of switching completely to the new one as soon as it's healthy.

wowu / docker-rollout

Old instance 50% of the time #32