Cloud Native Change Management and Control

jinyalong commented 1 year ago

I want to implement cloud-native change management based on Kubernetes (K8s). This involves monitoring Deployment updates, verifying new pod versions, and using cloud-native monitoring components such as Prometheus to obtain pod metrics and time-series data. Then, using our own intelligent anomaly detection algorithm, we can detect anomalies in pod metrics and roll back changes in a timely manner. This process is similar to how Argo Rollouts handles deployments, but instead of operating directly on ReplicaSets as an extension capability, we want to develop a workload change control framework that extends change management in the cloud-native domain. We have defined two core CRDs, “ChangeWorkload” and “ChangePod”. We have developed our own Operator to sense changes in the cloud-native environment, and a SpringBoot control-side application that performs all verification logic and returns the results to the Operator.
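As a rough illustration of the "detect anomalies, then roll back" decision described above, here is a minimal Python sketch. It is not our actual algorithm: a plain z-score comparison stands in for the intelligent anomaly detection, and the function name `is_anomalous`, the `z_threshold` parameter, and the sample values are all hypothetical. In practice the two series would come from Prometheus queries for the old and new pod versions.

```python
from statistics import mean, pstdev


def is_anomalous(baseline, candidate, z_threshold=3.0):
    """Return True if the candidate metric series deviates from the baseline.

    baseline:  metric samples from pods of the previous version
    candidate: metric samples from pods of the new version
    A z-score on the candidate mean is only a placeholder for a real
    anomaly detection algorithm.
    """
    mu = mean(baseline)
    sigma = pstdev(baseline)
    if sigma == 0:
        # Flat baseline: any deviation at all counts as anomalous.
        return any(value != mu for value in candidate)
    return abs(mean(candidate) - mu) / sigma > z_threshold


# Hypothetical CPU usage ratios for old and new pods.
old_pods = [0.50, 0.52, 0.48, 0.51, 0.49]
new_pods_bad = [0.90, 0.95, 0.92]
new_pods_ok = [0.50, 0.51, 0.49]

if is_anomalous(old_pods, new_pods_bad):
    print("anomaly detected: pause and roll back the change")
```

The control-side application would turn a True result into a pause or rollback request to the API server.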

Regarding the process, we plan to start with Deployments and divide it into the following stages:

- Change awareness: When the podTemplate of a Deployment changes, it is treated as a new version going online and is processed.
- Pre-change verification: Run admission checks through a webhook mechanism, with configurable rules such as time-window restrictions (for example, not allowing changes at midnight).
- Change execution blocking (kubectl rollout pause deployment): When the control-side application detects a pod anomaly, the change is blocked directly through the API server.
- Post-change verification: Support customized verification after the change is completed.
- Change self-healing: Call the Deployment rollback operation directly via the API server when a version exception is detected.

Do you have any better ideas for the technical solutions for change management in a cloud-native environment? We welcome any suggestions or feedback.
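To make the first two stages more concrete, here is a minimal Python sketch of change awareness and a time-window pre-change check. It works on plain dictionaries shaped like a Deployment manifest rather than on a real Kubernetes client or webhook server, and the helper names (`pod_template_hash`, `is_new_version`, `in_blocked_window`, `admit_change`) and the blocked window are assumptions for illustration only.

```python
import hashlib
import json
from datetime import datetime, time


def pod_template_hash(deployment: dict) -> str:
    """Hash only spec.template, so that scaling does not count as a change."""
    template = deployment["spec"]["template"]
    canonical = json.dumps(template, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def is_new_version(old: dict, new: dict) -> bool:
    """Change awareness: a new version means the podTemplate differs."""
    return pod_template_hash(old) != pod_template_hash(new)


def in_blocked_window(now: time, start: time, end: time) -> bool:
    """True if now falls in [start, end), including windows that wrap midnight."""
    if start <= end:
        return start <= now < end
    return now >= start or now < end


def admit_change(old: dict, new: dict, now: datetime, blocked_windows):
    """Pre-change verification: reject podTemplate changes in blocked windows."""
    if not is_new_version(old, new):
        return True, "podTemplate unchanged"
    for start, end in blocked_windows:
        if in_blocked_window(now.time(), start, end):
            return False, f"changes are blocked between {start} and {end}"
    return True, "allowed"


def make_deployment(image: str, replicas: int) -> dict:
    """Build a tiny Deployment-like dict for the example."""
    return {
        "spec": {
            "replicas": replicas,
            "template": {"spec": {"containers": [{"name": "app", "image": image}]}},
        }
    }


# Hypothetical rule: no changes between 23:00 and 02:00.
windows = [(time(23, 0), time(2, 0))]
current = make_deployment("app:v1", 1)
updated = make_deployment("app:v2", 1)

allowed, reason = admit_change(current, updated, datetime(2024, 1, 1, 0, 30), windows)
print(allowed, reason)
```

In a real setup the same decision would sit behind a validating admission webhook, and the rule set would come from configuration rather than a hard-coded list.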

zachaller commented 1 year ago

If I am understanding this correctly, a step plugin might help, possibly along with an analysis plugin?

github-actions[bot] commented 1 year ago

This issue is stale because it has been open 60 days with no activity.

argoproj / argo-rollouts

Cloud Native Change Management and Control #2736