SteveLTN / https-portal

A fully automated HTTPS server powered by Nginx, Let's Encrypt and Docker.

Zero Downtime Deploys #304

cameronjeffords opened this issue 2 years ago (Open)

cameronjeffords commented 2 years ago

Hey There,

I'm starting to use https-portal, and it's great - thanks for all the work and support you put into it.

I'm relatively new to nginx and networking in general, but I wanted to make sure I was understanding some of the functionality https-portal offers.

My goal is to have zero-downtime deployments to our development, beta, and production node servers, as we will have clients with active websocket connections.

Currently, we have about 20 containers that https-portal serves, including development, beta, and production environments:

volumes:
  https-portal-data:

services:
  https-portal:
    image: steveltn/https-portal:1.19.2
    restart: always
    ports:
      - 80:80
      - 443:443
    links:
      - s1dev
      - s1beta
      - s1prod
      - s2dev
      - s2beta
      - s2prod
      - s3dev
      - s3beta
      - s3prod     
      - ...
    volumes:
      - https-portal-data:/var/lib/https-portal
    environment:
      DOMAINS: >
        s1dev.domain.com -> http://s1dev,
        s1beta.domain.com -> http://s1beta,
        s1prod.domain.com -> http://s1prod,
        s2dev.domain.com -> http://s2dev,
        s2beta.domain.com -> http://s2beta,
        s2prod.domain.com -> http://s2prod,
        s3dev.domain.com -> http://s3dev,
        s3beta.domain.com -> http://s3beta,
        s3prod.domain.com -> http://s3prod,
        ...,
      STAGE: production
      WEBSOCKET: 'true'
      RESOLVER: 127.0.0.11 ipv6=off valid=60s
      DYNAMIC_UPSTREAM: 'true'
      WORKER_PROCESSES: 'auto'
      WORKER_CONNECTIONS: '1000000'
      PROXY_CONNECT_TIMEOUT: '600'
      PROXY_SEND_TIMEOUT: '600'
      PROXY_READ_TIMEOUT: '600'
      CLIENT_MAX_BODY_SIZE: 1000000M
  s1dev:
    image: api:1
    logging:
      driver: 'json-file'
      options:
        max-size: '50m'
        max-file: '1'
    build:
      context: .
      dockerfile: ./api/Dockerfile
    restart: always
...

Right now, we restart https-portal on every deploy. Below is an example for a development deployment:

docker-compose --project-name backend up --build --remove-orphans -d s1dev s2dev s3dev
docker-compose --project-name backend restart https-portal
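
A gentler variant of this flow, sketched here as an untested alternative (it assumes the stock image layout, where the nginx binary and its pid file sit in their default locations inside the container), would be to recreate only the backends and then ask nginx for a graceful reload instead of restarting the whole proxy, so old workers drain their existing connections instead of being killed:

docker-compose --project-name backend up --build --remove-orphans -d s1dev s2dev s3dev
# graceful reload: re-reads the config (and re-resolves upstream names), while old
# workers keep serving in-flight requests and open websockets until they finish
docker-compose --project-name backend exec https-portal nginx -s reload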

I was under the impression that by enabling DYNAMIC_UPSTREAM=true we could resolve each container's IP from the container/service name itself, rather than relying on its Docker network IP, so that when a container is re-created and its IP possibly re-assigned, it wouldn't cause any issues. I guess behind the scenes this would be accomplished with an nginx reload? I noticed, however, that when testing deployments without restarting https-portal, the host would occasionally not be found (it's worth noting that other times it worked perfectly). Sometimes this appeared to be a caching issue, manifesting as a 'swap' of service/IPs (i.e., a request to s1dev gets directed to s2dev) and then resolving correctly within ~30s or so. Other times it seemed to be permanent, and every request to s1dev resulted in "s1dev could not be resolved (3: Host not found)". I also tried enabling Automatic Container Discovery, to no avail.

Apologies if I'm missing something obvious, or mis-understanding the capabilities here, but I wanted to be sure my expectations aligned with what is possible.

Thanks in advance.

SteveLTN commented 2 years ago

DYNAMIC_UPSTREAM does not really send any reload signals to Nginx. It merely tells Nginx not to cache DNS results. I'm afraid this is more of a Docker issue than an Nginx one. Unfortunately, I'm not very familiar with Docker's DNS mechanism.
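
For what it's worth, the general Nginx technique behind this looks roughly like the snippet below (a simplified illustration, not the exact config https-portal generates): proxying through a variable, combined with a resolver directive, makes Nginx look the name up at request time and honor the valid= TTL, instead of resolving it once when the config is loaded.

resolver 127.0.0.11 ipv6=off valid=60s;   # Docker's embedded DNS; re-check names every 60s

server {
    listen 80;
    server_name s1dev.domain.com;

    location / {
        set $upstream http://s1dev;   # assigning the target to a variable ...
        proxy_pass $upstream;         # ... forces a runtime DNS lookup via the resolver
    }
}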

I would probably start debugging by shelling into the HTTPS-PORTAL container and finding out whether the restarted services are ping-able.
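
For example (assuming the usual tools are present in the image, which the rest of this thread suggests they are):

docker-compose --project-name backend exec https-portal sh
# then, inside the container:
getent hosts s1dev       # what does Docker's embedded DNS (127.0.0.11) return for the name?
ping -c 1 s1dev          # is the freshly recreated container reachable by name?
curl -sI http://s1dev    # and does the name reach the expected HTTP backend?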

cameronjeffords commented 2 years ago

So overall, should I expect that it's possible to have zero-downtime deploys to our containers? Maybe it'll require a little more legwork on our end though?

  1. For the Host not found issue, the container was curl-able from the https-portal container via its IPv4 address, but not by its service_name.
  2. For the issue where requests to s1dev get temporarily routed to s2dev's server, am I right that this is a DNS caching issue? I guess I could always create a custom network and assign static IPs to each service (a rough sketch of that fallback follows this list), but I would rather not resort to that if possible.
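
For completeness, the static-IP fallback mentioned in point 2 would look roughly like this in the compose file (the subnet and addresses are made up for illustration, and every service would need an entry):

networks:
  backend-net:
    ipam:
      config:
        - subnet: 172.28.0.0/16

services:
  s1dev:
    networks:
      backend-net:
        ipv4_address: 172.28.0.11
  https-portal:
    networks:
      - backend-net
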
SteveLTN commented 2 years ago

I think it is possible. But you do need to do some legwork on your own.

  1. Then there is some issue with Docker's internal DNS resolution. I think this should be solvable, but I don't know how.
  2. I would guess so. Again, if you curl it from HTTPS-PORTAL's container, you can test it and be sure.
MarcelWaldvogel commented 2 years ago

I do run it in a production environment as well.

  1. I use the dynamic-env feature, so I can modify environment variables (most notably DOMAINS) at run-time and there will be a "soft reload".
  2. I use separate Docker networks (i.e., the default "one docker-compose.yml per directory" setup), with those nodes having an exposed/mapped port (e.g. 8910 for s1dev) to which https-portal connects (i.e. s1dev.domain.com -> http://dockerhost:8910; dockerhost is an https-portal magic name pointing at the internal IP address of the Docker host system). A rough sketch of this layout follows the list below.
  3. I manually specify the multiple backends, if there are any, either using s1prod.domain.com -> http://dockerhost:8910|http://server2:8910 or, when the same upstream hosts are used multiple times, using an Nginx upstream entry in CUSTOM_NGINX_GLOBAL_HTTP_CONFIG_BLOCK.
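
As a rough illustration of point 2 (the project layout, ports and image names below are placeholders, not the exact setup): each backend lives in its own compose project and publishes a port on the host, and https-portal only ever talks to dockerhost, so recreating a backend container never changes anything Nginx can see, as long as the port mapping comes back:

# backend project, e.g. ./s1dev/docker-compose.yml
services:
  s1dev:
    image: api:1
    ports:
      - "8910:3000"   # host port 8910 -> app port inside the container (3000 is assumed)

# https-portal project, in a separate directory/docker-compose.yml
services:
  https-portal:
    image: steveltn/https-portal:1.19.2
    ports:
      - 80:80
      - 443:443
    environment:
      DOMAINS: >
        s1dev.domain.com -> http://dockerhost:8910
      STAGE: production
      WEBSOCKET: 'true'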