Hey @mortrevere, did you get anywhere with this? I am also looking to do the same thing and would really like to know how you got on 😄 Thanks
Hi @JackTreble, I think this is not in the Flagger philosophy/roadmap, so I ended up creating a little something that does exactly this.

It will be open sourced soon, but the logic is quite simple: on init, do as Flagger does and duplicate the original deployment into a `-primary`, then scale the original to 0. Then you watch changes to this deployment with a simple loop + diff. When it detects changes, you duplicate the new "original" into a `-canary` deployment, and slowly scale this one up (and the `-primary` down) while watching Prometheus metrics every N seconds. If it fails, just scale the `-primary` back to the original number of replicas, and the `-canary` to 0.
All of this runs in a pod with the proper service account/role to call the k8s API, using the Python client.
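For a sense of what that loop might look like, here is a minimal sketch using the official `kubernetes` Python client and the Prometheus HTTP API. The function names, the "non-empty result means success" convention, and the rollback shape are assumptions based on the description above, not the actual tool's code; breakpoint and abort handling are elided:

```python
import time
import requests
from kubernetes import client, config

PROM_URL = "http://xxxxxxx/prometheus"  # the prometheus-base-url from the config

config.load_incluster_config()  # runs in-cluster with a suitable service account/role
apps = client.AppsV1Api()

def scale(name, namespace, replicas):
    """Patch a deployment's replica count via the Kubernetes API."""
    apps.patch_namespaced_deployment_scale(
        name, namespace, {"spec": {"replicas": replicas}}
    )

def checks_pass(expr):
    """A PromQL comparison like 'rate(...) > 500' drops failing series, so a
    non-empty result is taken here to mean the check passed (an assumption)."""
    r = requests.get(f"{PROM_URL}/api/v1/query", params={"query": expr})
    r.raise_for_status()
    return bool(r.json()["data"]["result"])

def progressive_rollout(name, namespace, total, step, expr, check_interval=120):
    """Shift replicas from <name>-primary to <name>-canary step by step,
    rolling back if the Prometheus check fails at any point."""
    canary = 0
    while canary < total:
        canary = min(canary + step, total)
        scale(f"{name}-canary", namespace, canary)
        scale(f"{name}-primary", namespace, total - canary)
        time.sleep(check_interval)
        if not checks_pass(expr):
            scale(f"{name}-primary", namespace, total)  # restore the primary
            scale(f"{name}-canary", namespace, 0)       # drop the canary
            return False
    return True
```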
Config file looks like this (example for a logstash worker):
```yaml
prometheus-base-url: http://xxxxxxx/prometheus
namespace: xxxx
logstash-service:
  breakpoint: 50%
  step: 10%
  abort: 120s
  max_step_duration: 600s
  check_max_failures: 4
  check_success_step_duration: 120s
  success:
    - expr: rate(logstash_events_out{kubernetes_pod_name="<<pod>>",app="logstash-service"}[1m]) > 500
```
That `<<pod>>` tag is replaced by the `-canary` pod names, so you can automatically validate that they are properly working, and you can combine multiple metrics too. Here, it only checks that the logstash pod processes at least 500 messages/s.
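Hypothetically, that substitution could be as simple as listing the `-canary` pods and templating each name into the expression. A sketch, where the name-suffix filter and label selector are assumptions:

```python
import requests
from kubernetes import client, config

PROM_URL = "http://xxxxxxx/prometheus"

config.load_incluster_config()
core = client.CoreV1Api()

def canary_pods_ok(namespace, app_label, expr_template):
    """Substitute each -canary pod name into the PromQL template and
    require every resulting query to return at least one sample."""
    pods = core.list_namespaced_pod(namespace, label_selector=f"app={app_label}")
    canaries = [p.metadata.name for p in pods.items if "-canary" in p.metadata.name]
    for pod_name in canaries:
        expr = expr_template.replace("<<pod>>", pod_name)
        r = requests.get(f"{PROM_URL}/api/v1/query", params={"query": expr})
        r.raise_for_status()
        if not r.json()["data"]["result"]:
            return False  # comparison returned no samples: this pod failed the check
    return bool(canaries)  # no canary pods at all also counts as a failure
```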
All in all it's only 490 lines of Python and took us 3 days, but it's incredibly useful. I wish Flagger would cover this (as it is pretty simple), but you can't have everything...
Hey @JackTreble, it's out in case you still need it:
While performing a blue/green deployment, it seems impossible to run the load test against multiple instances of the service, deploying the new version in a progressive way.

Let me explain: we have worker services (no HTTP traffic incoming, only consuming data from Kafka) that we would like to deploy in a canary-like approach. Let's say `worker1` has 40 instances running and we are deploying a new version. How could we replace a single instance with the new version, observe how it is performing based on some Prometheus metrics, and progressively roll out the new version while observing how the updated set of instances performs? And possibly roll back the deployment if it fails to meet the success criteria at some point (like 50% deployed)?

What I am able to get now is a standard blue/green deployment, with an additional single instance of `worker1` being tested, and rolled out at once after N successful iterations. Canary deployments won't work as they require a service mesh, and we are not directing HTTP traffic to these pods.

Any idea on how to tackle this using Flagger?
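For what it's worth, the stepping described here reduces to a simple replica schedule. A sketch of the arithmetic for `worker1`, assuming 10% steps and a 50% checkpoint (values borrowed from the config above, not from Flagger):

```python
def replica_schedule(total, step_pct, breakpoint_pct):
    """Yield (canary, primary) replica pairs up to the breakpoint."""
    step = max(1, round(total * step_pct / 100))
    canary = 0
    while canary < total * breakpoint_pct / 100:
        canary = min(canary + step, total)
        yield canary, total - canary

# worker1: 40 replicas, 10% steps, decide rollback/promotion at the 50% mark
print(list(replica_schedule(40, 10, 50)))
# [(4, 36), (8, 32), (12, 28), (16, 24), (20, 20)]
```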