fluxcd / flagger

Progressive delivery Kubernetes operator (Canary, A/B Testing and Blue/Green deployments)
https://docs.flagger.app
Apache License 2.0

Progressive Blue/Green deployment #622

Closed mortrevere closed 4 years ago

mortrevere commented 4 years ago

While performing a blue/green deployment, it seems impossible to run the load test against multiple instances of the service and deploy the new version progressively.

Let me explain: we have worker services (no incoming HTTP traffic, only consuming data from Kafka) that we would like to deploy in a canary-like approach. Let's say worker1 has 40 instances running and we are deploying a new version. How could we replace a single instance with the new version, observe how it performs based on some Prometheus metrics, and progressively roll out the new version while observing how the updated set of instances performs? And possibly roll back the deployment if it fails to meet the success criteria at some point (e.g. at 50% deployed)?

What I am able to get now is a standard blue/green deployment, with an additional single instance of worker1 being tested and rolled out at once after N successful iterations. Canary deployments won't work as they require a service mesh, and we are not directing HTTP traffic to these pods.

Any idea on how to tackle this using Flagger ?

JackTreble commented 4 years ago

Hey @mortrevere did you get anywhere with this? I am also looking to do the same thing and would really like to know how you got on 😄 Thanks

mortrevere commented 4 years ago

Hi @JackTreble, I think this is not in the Flagger philosophy/roadmap, so I ended up building a little something that does exactly this.

It will be open sourced soon, but the logic is quite simple: on init, do as Flagger does and duplicate the original deployment into a -primary, then scale the original to 0. Then watch that deployment for changes with a simple loop + diff. When a change is detected, duplicate the new "original" into a -canary deployment and slowly scale it up (and the -primary down) while watching Prometheus metrics every N seconds. If it fails, just scale the -primary back to the original number of replicas, and the -canary to 0.

All of this runs in a pod with the proper service account/role to call the k8s API, using the Python client.
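
For illustration, here is a minimal sketch of that scaling loop, assuming the official `kubernetes` Python client; the deployment names, replica counts, and the `canary_is_healthy` helper are placeholders (roughly matching the config example below), not the actual code:

    # Sketch of the progressive scale-up/scale-down loop described above.
    import time

    from kubernetes import client, config

    NAMESPACE = "xxxx"   # placeholder, as in the config below
    APP = "logstash-service"
    TOTAL = 40           # original replica count
    STEP = 4             # step: 10% of 40 replicas

    config.load_incluster_config()  # runs in-cluster with a service account
    apps = client.AppsV1Api()

    def scale(name: str, replicas: int) -> None:
        """Patch a Deployment's replica count via the k8s API."""
        apps.patch_namespaced_deployment_scale(
            name, NAMESPACE, {"spec": {"replicas": replicas}}
        )

    def canary_is_healthy() -> bool:
        """Placeholder for the Prometheus success checks (see below)."""
        raise NotImplementedError

    canary = 0
    while canary < TOTAL:
        canary = min(canary + STEP, TOTAL)
        scale(f"{APP}-canary", canary)
        scale(f"{APP}-primary", TOTAL - canary)
        time.sleep(120)  # check_success_step_duration
        if not canary_is_healthy():
            # Rollback: primary back to full size, canary to zero.
            scale(f"{APP}-primary", TOTAL)
            scale(f"{APP}-canary", 0)
            break

(Promotion of the -canary back into -primary after a full rollout is omitted here to keep the sketch short.)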

Config file looks like this (example for a logstash worker):

    prometheus-base-url: http://xxxxxxx/prometheus
    namespace: xxxx

    logstash-service:
      breakpoint: 50%
      step: 10%
      abort: 120s
      max_step_duration: 600s
      check_max_failures: 4
      check_success_step_duration: 120s
      success:
        - expr: rate(logstash_events_out{kubernetes_pod_name="<<pod>>",app="logstash-service"}[1m]) > 500

That <<pod>> tag is replaced by the -canary pod names, so you can automatically validate that they are working properly, and you can combine multiple metrics too. Here, it only checks that each logstash pod processes at least 500 messages/s.
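
In case it helps, a rough sketch of how that substitution and success check could be evaluated against the Prometheus HTTP API (using `requests`; the helper name and structure are made up for illustration):

    import requests

    PROM = "http://xxxxxxx/prometheus"  # prometheus-base-url from the config
    EXPR = ('rate(logstash_events_out{kubernetes_pod_name="<<pod>>",'
            'app="logstash-service"}[1m]) > 500')

    def pod_is_healthy(pod_name: str) -> bool:
        """Render <<pod>> into the expression and query Prometheus.

        A PromQL comparison like `... > 500` drops series that do not
        match, so an empty result means the criterion was not met.
        """
        expr = EXPR.replace("<<pod>>", pod_name)
        resp = requests.get(f"{PROM}/api/v1/query", params={"query": expr})
        resp.raise_for_status()
        return len(resp.json()["data"]["result"]) > 0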

All in all it's only 490 lines of Python and took us 3 days, but it's incredibly useful. I wish Flagger covered this (as it is pretty simple), but you can't have everything ...

mortrevere commented 4 years ago

Hey @JackTreble, it's out in case you still need it:

https://github.com/SEKOIA-IO/aviary