Implementing Disaster Recovery for an EKS Cluster

Please complete the following fields to help us fully understand the nature of your proposal.

Which format is the content best suited to?

Content Type:

Blog Post

Which AWS container service(s) does it cover?

AWS Container Service(s):

Amazon EKS

What will readers learn from the content?

Elevator Pitch:

Readers will learn how to implement disaster recovery for their EKS clusters by keeping EKS clusters in two or more regions in sync, in Active/Active, Active/Warm, or Active/Standby DR mode.

Content Outline/Description

Outline:

Currently there is no well-defined mechanism for implementing disaster recovery for an EKS cluster, and there is no native capability that helps customers achieve it. The aim of this blog is to provide a mechanism for disaster recovery in which changes to the primary cluster are replicated to one or more clusters in different regions. The scope of the blog is the replication of deployments and deployment-related state to other clusters, not the state of the Pods running in those clusters, which is workload specific.

The blog will show how we can leverage Dynamic Admission Controllers to synchronize deployments across multiple regions. A custom admission controller will intercept deployments of services, pods, secrets, etc., and will asynchronously trigger the same deployment in clusters in other regions by calling the relevant Kubernetes APIs in those clusters.
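To make the interception step concrete, here is a minimal sketch of the core of such a webhook, assuming a validating webhook that receives `admission.k8s.io/v1` AdmissionReview payloads. The names `handle_admission_review`, `sanitize_for_replication`, and `replication_queue` are hypothetical and chosen for illustration. The worker that would drain the queue and call the Kubernetes APIs of the other clusters is not shown, and neither is the HTTPS server that Kubernetes requires in front of a webhook.

```python
import copy
import queue

# Metadata that each API server assigns itself; copying it to another
# cluster would be rejected or misleading, so it is stripped first.
CLUSTER_SPECIFIC_METADATA = (
    "uid", "resourceVersion", "creationTimestamp",
    "managedFields", "generation", "selfLink",
)

# Hypothetical work queue. A background worker (not shown) would drain it
# and apply each object to the clusters in the other regions.
replication_queue = queue.Queue()


def sanitize_for_replication(obj):
    """Return a copy of the object without cluster-specific fields."""
    clean = copy.deepcopy(obj)
    clean.pop("status", None)
    metadata = clean.get("metadata", {})
    for field in CLUSTER_SPECIFIC_METADATA:
        metadata.pop(field, None)
    return clean


def handle_admission_review(review):
    """Queue the object for replication and admit the request unchanged."""
    request = review["request"]
    # Only CREATE and UPDATE carry the new object; dry runs are skipped
    # because they never persist anything in the primary cluster.
    if request.get("operation") in ("CREATE", "UPDATE") and not request.get("dryRun"):
        replication_queue.put(
            (request["operation"], sanitize_for_replication(request["object"]))
        )
    # The webhook never blocks the deployment: replication is asynchronous.
    return {
        "apiVersion": "admission.k8s.io/v1",
        "kind": "AdmissionReview",
        "response": {"uid": request["uid"], "allowed": True},
    }
```

Returning `allowed: True` immediately keeps the primary cluster's deployments independent of the health of the other regions, at the cost of the replica clusters lagging slightly behind. A request that passes this webhook can still be rejected later in the admission chain, so a real implementation would need to account for that.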

The blog will show how to operate the other clusters in three modes: Active/Active mode, in which the desired capacity of Pods is created in the replicated clusters; Active/Warm mode, in which only the minimum capacity is created; and Active/Standby mode, in which nodes are added temporarily to deploy the Pods on the target cluster and then scaled down to zero to avoid incurring costs.
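One way to picture the difference between the modes is as a rule for the steady-state replica count applied to a replicated Deployment. The sketch below is an illustration under stated assumptions, not the proposed implementation: `DRMode`, `target_replicas`, and the `warm_minimum` parameter are hypothetical names, "minimum" is assumed to be a configurable number, and Active/Standby is assumed to settle at zero replicas once the temporary deployment has been scaled down.

```python
from enum import Enum


class DRMode(Enum):
    ACTIVE_ACTIVE = "active/active"
    ACTIVE_WARM = "active/warm"
    ACTIVE_STANDBY = "active/standby"


def target_replicas(mode, desired, warm_minimum=1):
    """Steady-state replica count in a replica cluster for a given DR mode."""
    if desired < 0 or warm_minimum < 0:
        raise ValueError("replica counts must be non-negative")
    if mode is DRMode.ACTIVE_ACTIVE:
        # Full desired capacity, same as the primary cluster.
        return desired
    if mode is DRMode.ACTIVE_WARM:
        # Only the minimum, and never more than the primary asks for.
        return min(desired, warm_minimum)
    if mode is DRMode.ACTIVE_STANDBY:
        # Assumed steady state after the temporary deployment is scaled down.
        return 0
    raise ValueError(f"unknown DR mode: {mode!r}")
```

For a Deployment with six desired replicas, this rule yields six replicas in Active/Active, one in Active/Warm with the default minimum, and zero in Active/Standby.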

The blog will also show how we can leverage Amazon Route 53 to switch traffic between the active and DR clusters.
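As one possible shape for the switch, the sketch below builds a Route 53 change batch that uses failover routing, so that DNS answers move to the DR cluster when a health check on the primary fails. Failover routing is an assumption here; the proposal does not say which routing policy it will use. The function name `build_failover_change_batch` and all hostnames and IDs in the usage note are hypothetical. The resulting dictionary follows the `ChangeBatch` structure accepted by the Route 53 `ChangeResourceRecordSets` API.

```python
def build_failover_change_batch(record_name, primary_target, dr_target,
                                primary_health_check_id, ttl=60):
    """Build a ChangeBatch with PRIMARY and SECONDARY failover CNAME records."""

    def record(set_id, role, target, health_check_id=None):
        record_set = {
            "Name": record_name,
            "Type": "CNAME",
            "SetIdentifier": set_id,
            "Failover": role,  # "PRIMARY" or "SECONDARY"
            "TTL": ttl,
            "ResourceRecords": [{"Value": target}],
        }
        if health_check_id is not None:
            # Route 53 fails over when this health check reports unhealthy.
            record_set["HealthCheckId"] = health_check_id
        return {"Action": "UPSERT", "ResourceRecordSet": record_set}

    return {
        "Comment": "Failover between the active and DR EKS clusters",
        "Changes": [
            record("primary", "PRIMARY", primary_target, primary_health_check_id),
            record("dr", "SECONDARY", dr_target),
        ],
    }
```

With boto3, the batch could be applied with `boto3.client("route53").change_resource_record_sets(HostedZoneId=..., ChangeBatch=batch)`, where the targets would typically be the load balancer hostnames in front of each cluster. A short TTL keeps the failover window small at the cost of more DNS queries.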
