elastic / cloud-on-k8s

Elastic Cloud on Kubernetes

Improve ECK upgrade rollout #3479

Open pebrc opened 4 years ago

pebrc commented 4 years ago

Related: https://github.com/elastic/cloud-on-k8s/issues/479

Currently, ECK upgrade rollouts are entirely the user's responsibility. We provide no easy way to canary a new ECK version on a subset of resources, or to control the handover from the old version to the new one, other than effectively turning off reconciliation for a while to avoid many rolling restarts across the fleet of managed Elasticsearch clusters. See https://www.elastic.co/guide/en/cloud-on-k8s/1.2/k8s-upgrading-eck.html#k8s-beta-to-ga-rolling-restart

We should investigate whether we can improve the upgrade rollout story for ECK. One idea would be to opt into a canary of a new ECK version for a subset of resources e.g. by labeling the resources with the new controller version. For that to work we would need to make sure that two ECK versions could co-exist without affecting each other etc.

The output of this effort should be:

anyasabo commented 4 years ago

> One idea would be to opt into a canary of a new ECK version for a subset of resources e.g. by labeling the resources with the new controller version. For that to work we would need to make sure that two ECK versions could co-exist without affecting each other etc.

I'm +1 on adding a label selector to the operator config. With the selector the informer wouldn't even process events for resources without that label, so we shouldn't need to worry about simultaneous reconciles from different ECK instances.
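To make the selector idea concrete, here is a minimal sketch in plain Python of how equality-based label selectors could split resources between two operator instances. It is not ECK code: the label key `example.org/controller-version`, the `Resource` type, and the instance names are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical label key; the thread does not name one.
VERSION_LABEL = "example.org/controller-version"


@dataclass
class Resource:
    """Stand-in for a managed resource such as an Elasticsearch cluster."""
    name: str
    labels: dict = field(default_factory=dict)


def matches(selector: dict, labels: dict) -> bool:
    """Equality-based match: every selector key/value must appear in labels.

    An empty selector matches everything.
    """
    return all(labels.get(key) == value for key, value in selector.items())


def partition(resources, selectors):
    """Map each operator instance name to the resource names it would process."""
    return {
        instance: [r.name for r in resources if matches(selector, r.labels)]
        for instance, selector in selectors.items()
    }


def overlapping(seen):
    """Return resource names that more than one operator instance would process."""
    counts = {}
    for names in seen.values():
        for name in names:
            counts[name] = counts.get(name, 0) + 1
    return sorted(name for name, count in counts.items() if count > 1)


if __name__ == "__main__":
    fleet = [
        Resource("es-a", {VERSION_LABEL: "old"}),
        Resource("es-b", {VERSION_LABEL: "canary"}),
        Resource("es-c"),  # unlabeled: matched by neither selector below
    ]
    seen = partition(fleet, {
        "operator-old": {VERSION_LABEL: "old"},
        "operator-canary": {VERSION_LABEL: "canary"},
    })
    print(seen)
    print(overlapping(seen))
```

One thing the sketch surfaces: an instance with an empty selector would still see every resource, so under this simple equality scheme both instances need a selector for the sets to stay disjoint, and unlabeled resources are handled by neither.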

vasilievip commented 4 years ago

We are using Netflix Kayenta as a service to automatically analyze canary deployments of stateless services. When making changes to a running cluster, we want to avoid downtime if the changes break the cluster.


So, my input here: when applying changes, the operator should support canary deployment, and it should be possible to programmatically continue or roll back the deployment from a deployment pipeline, based on canary analysis data that the pipeline fetches from Kayenta (or any other canary analysis automation service; Kayenta is more mature than the others at this point).

https://github.com/helm/helm/issues/6572
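The continue-or-roll-back step described in the comment above could look roughly like the following sketch. It assumes the canary analysis service reports a single numeric score where higher is better. The `decide` function, the `Decision` names, and the threshold values are hypothetical, and no real Kayenta or ECK API is called; fetching the score is left to the pipeline.

```python
from enum import Enum


class Decision(Enum):
    CONTINUE = "continue"   # promote the change to more resources
    HOLD = "hold"           # inconclusive: keep the canary, do not promote
    ROLLBACK = "rollback"   # revert the canary


def decide(score: float,
           pass_threshold: float = 75.0,
           marginal_threshold: float = 50.0) -> Decision:
    """Turn a canary score into a pipeline decision.

    The thresholds are illustrative defaults, not values taken from any tool.
    """
    if marginal_threshold > pass_threshold:
        raise ValueError("marginal_threshold must not exceed pass_threshold")
    if score >= pass_threshold:
        return Decision.CONTINUE
    if score >= marginal_threshold:
        return Decision.HOLD
    return Decision.ROLLBACK


if __name__ == "__main__":
    for score in (92.0, 60.0, 12.5):
        print(score, decide(score).value)
```

For a pipeline to act on such a decision, the operator would need to expose some way to promote or revert a canary programmatically, which is the capability the comment is asking for.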