Summary

This proposal adds a synchronization barrier for deploying to multiple clusters in parallel, especially across multiple regions. The barrier is implemented as a custom inline canary analysis step that waits until the other clusters reach the same step.

Motivation

Many organizations maintain infrastructure in multiple geographical regions. This improves reliability: when one region goes out of service, production services are still available to users in the other regions. However, deploying to multiple regions necessitates deploying to multiple Kubernetes clusters. Argo Rollouts currently doesn't orchestrate deployment to multiple clusters.

Some organizations deploy to different regions in a sequence, i.e. deploy to one region, test, deploy to another region, and test again. With this pattern, the first region is treated essentially like a canary. However, this approach defeats the purpose of having multiple regions in the first place, because if the second region in the sequence goes out of service during deployment, all users will be served the new, potentially unstable version from the first region. This is not merely a theoretical risk, as it's not uncommon to test a canary version for a few days before it's released to all users.

Furthermore, if one of two regions is out of service for an extended period of time, a deployment to the remaining region may be required. However, in this case it's impossible to test the new version using the "regional canary" pattern described above, so the engineer won't be able to reap the benefits of a canary deployment.

A better approach is to use the canary deployment feature of Argo Rollouts. With this pattern, a canary pod is deployed to a given cluster and a percentage of traffic is routed to that pod. This pattern has the advantage that even if one of the regions goes down, it's still only a small percentage of traffic that gets served the canary version. Furthermore, the engineer can use canary deployment and testing even if only one region is available.

With the pattern above, it's no longer necessary to deploy to different regions in a sequence and the engineer can deploy to all regions in parallel, significantly reducing the deployment time (also known as the lead time to production). Improving that metric is important both for developer experience and site health, because it means that critical bugfixes reach the end user quicker.

However, when deploying to multiple regions in parallel, it's important to make sure that all regions wait until canary analysis finishes in all regions. This is where the synchronization barrier comes in. The barrier is a special Rollout step which waits until all regions reach that step. Such a barrier can be put after the canary analysis inline step, so that each region waits until canary analysis is complete.

Note that it's not necessary to run canary analysis in each region – doing so may be useful in some setups and not in others. This is why it should also be possible to have only one region perform canary analysis, while the others idly wait. In order to keep the implementation simple, we require that the synchronized Rollouts have the same number of steps. In order to achieve that, we sometimes need to use a no-operation step. This is illustrated in the diagram below:

flowchart TD;
  c0title[["`REGION 1`"]]
  c0step0("`deploy canary pod
_(setWeight: 1)_`")
  c0step1("`run canary analysis
_(inline analysis step,
with multiple template
checks in parallel, 
e.g. error rate and 
response time)_`")
  c0step2("`sync barrier
_(inline analysis step
waiting for
all regions)_`")
  c0step3("`deploy 100% pods
_(setWeight: 100)_`")
  c0title ~~~ c0step0
  c0step0 --> c0step1
  c0step1 --> c0step2
  c0step2 --> c0step3

  c1title[["`REGION 2`"]]
  c1step0("`do nothing
_(inline analysis step
not doing anything)_`")
  c1step1("`do nothing
_(inline analysis step
not doing anything)_`")
  c1step2("`sync barrier
_(inline analysis step
waiting for
all regions)_`")
  c1step3("`deploy 100% pods
_(setWeight: 100)_`")
  c1title ~~~ c1step0
  c1step0 --> c1step1
  c1step1 --> c1step2
  c1step2 --> c1step3

Proposal

The implementation described below has been used for around nine months in production at Priceline.com.

Use Cases

This proposal targets organizations that run production workloads in multiple Kubernetes clusters (such as in multiple regions), want to reap the site-health advantages of canary deployments using Argo Rollouts, and would like to parallelize deployments across clusters in order to minimize the lead time to production without sacrificing quality.

Security Considerations

The synchronization barrier requires different clusters to obtain information about each other. This needs to be implemented securely. At Priceline, we are using an intermediary server for communication between the clusters for additional security.

Risks and Mitigations

One important problem needs to be resolved in the implementation. There is a risk of a deadlock if one region still hasn't finished syncing to an older commit while another has already started syncing to a newer commit. In this scenario, the first region may be waiting for the second one to finish, while the second one is still waiting for the first one to finish. We resolve this problem by introducing a new Rollout annotation called argoproj.io/sequential-number, which increases with each commit. A region waits for another region only while that region's number is lower than its own, or while the numbers are equal and the other region has not yet reached the same step. A region that is already on a newer commit is never waited for.
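The waiting rule that the sequential number enables can be sketched in Python as follows. This is an illustration only: it mirrors the failureCondition of the analysis template shown in the Implementation section, and the names `RolloutState` and `must_wait` are hypothetical, not part of Argo Rollouts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RolloutState:
    """What one cluster knows about a Rollout (hypothetical helper type)."""
    sequential_number: int   # value of the argoproj.io/sequential-number annotation
    current_step_index: int  # value of status.currentStepIndex


def must_wait(own: RolloutState, target: RolloutState) -> bool:
    """Return True while `own` has to keep waiting at the barrier for `target`."""
    # Missing or invalid numbers never release the barrier.
    if own.sequential_number < 1 or target.sequential_number < 1:
        return True
    # The target is still on an older commit: wait for it to catch up.
    if target.sequential_number < own.sequential_number:
        return True
    # Same commit: wait until the target has reached this step.
    if target.sequential_number == own.sequential_number:
        return target.current_step_index < own.current_step_index
    # The target is already on a newer commit: do not wait for it.
    return False


# Deadlock scenario from the text: region 1 is still on commit 41,
# region 2 has already started on commit 42; both sit at step index 2.
region1 = RolloutState(sequential_number=41, current_step_index=2)
region2 = RolloutState(sequential_number=42, current_step_index=2)

print(must_wait(region1, region2))  # region 1 does not wait for the newer commit
print(must_wait(region2, region1))  # region 2 waits for region 1 to catch up
```

When two Rollouts carry different valid numbers, only the one with the higher number waits, so the circular wait described above cannot form.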

At Priceline, a GitHub Workflow populates this number before pushing a commit to the main branch. However, a more portable implementation would be to have Argo CD generate this number instead. At Priceline, the number corresponds to the number of commits since the beginning of the observed branch. However, it doesn't have to, as long as the number always increases with each commit. I am open to feedback and suggestions on how to implement this number.
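One way to derive such a number is to count the commits reachable from the branch head with `git rev-list --count`. The sketch below assumes git is installed and the code runs inside a checkout with full history; the function names and the injectable `run` parameter are hypothetical, and this is not necessarily how the Priceline workflow is written.

```python
import subprocess
from typing import Callable, Sequence


def _git(args: Sequence[str], cwd: str) -> str:
    """Run a git command in `cwd` and return its standard output."""
    result = subprocess.run(
        ["git", *args], cwd=cwd, check=True, capture_output=True, text=True
    )
    return result.stdout


def sequential_number(
    repo_dir: str = ".",
    ref: str = "HEAD",
    run: Callable[[Sequence[str], str], str] = _git,
) -> int:
    """Count the commits reachable from `ref`.

    On a branch whose history is never rewritten, this count grows with
    every new commit, which is the property the barrier relies on.
    """
    output = run(["rev-list", "--count", ref], repo_dir)
    return int(output.strip())
```

Calling `sequential_number()` in the repository root yields the value to write into the annotation. A shallow clone would produce a smaller count, so a CI job using this approach needs to fetch the full branch history first.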

Goals

Provide a rollout step that implements the synchronization barrier
Provide a rollout step for no-operation
Implement auto-generation of the argoproj.io/sequential-number Rollout annotation
Implement the intermediary server for cluster-to-cluster communication
Implement token generation and management in Helm and Kustomize

Implementation

First, we introduce the following convention: if an engineer wants to deploy to multiple clusters in parallel, they have to make the changes in a single commit.

Second, we implement the synchronization barrier as an analysis template that uses the web metric provider. The template calls the intermediary server, which in turn talks to the Kubernetes API of the other clusters.
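The proposal does not specify how the intermediary server is built. As a rough sketch of the request handling it would need, the Python function below parses the URL used by the analysis template, checks the bearer token for the cluster named in the X-Argo-Sync-Barrier header, and only then fetches the Rollout. The function name, the `tokens` mapping, and the `fetch_rollout` callback (standing in for the Kubernetes API call) are all assumptions made for illustration.

```python
import hmac
import re
from typing import Callable, Mapping, Tuple

# URL layout taken from the analysis template in this proposal.
_PATH = re.compile(r"^/api/argo-sync-barrier/namespace/([^/]+)/rollout/([^/]+)$")

# Hypothetical callback: (cluster, namespace, rollout name) -> Rollout as a dict.
FetchRollout = Callable[[str, str, str], dict]


def handle_request(
    path: str,
    headers: Mapping[str, str],
    tokens: Mapping[str, str],
    fetch_rollout: FetchRollout,
) -> Tuple[int, dict]:
    """Validate one barrier request and return (HTTP status, JSON body)."""
    match = _PATH.match(path)
    if match is None:
        return 404, {"error": "unknown path"}
    namespace, rollout_name = match.groups()

    # The target cluster is named in the X-Argo-Sync-Barrier header.
    # Assumption: header names arrive with exactly this capitalization.
    target = headers.get("X-Argo-Sync-Barrier", "")
    expected = tokens.get(target)
    presented = headers.get("Authorization", "")
    # Constant-time comparison avoids leaking token contents through timing.
    if expected is None or not hmac.compare_digest(
        presented.encode(), f"Bearer {expected}".encode()
    ):
        return 401, {"error": "unauthorized"}

    # Only now query the target cluster on the caller's behalf.
    return 200, fetch_rollout(target, namespace, rollout_name)
```

Returning the Rollout object itself matches what the template expects, since its conditions read `metadata.annotations` and `status.currentStepIndex` from the response. Keeping the cluster lookup behind a callback means cluster credentials stay on the server and never reach the calling cluster.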

The source code of the analysis template looks as follows:

{{- $argoSyncBarrierHost := .Values.argoSyncBarrierHost -}}
{{ range $target := .Values.targets }}
---
kind: ClusterAnalysisTemplate
apiVersion: argoproj.io/v1alpha1
metadata:
  name: sync-barrier.{{$target}}
spec:
  args:
  - name: argo-sync-barrier.token.{{$target}}
    valueFrom:
      secretKeyRef:
        name: argo-sync-barrier-secret
        key: {{$target}}.token
  - name: namespace.{{$target}}
  - name: rolloutName.{{$target}}
  - name: sequentialNumber
  - name: currentStepIndex
  - name: interval
    value: 30s
  metrics:
  - name: {{ printf "'Wait for %s'" ($target) }}
    provider:
      web:
        url: {{ printf "'https://%s/api/argo-sync-barrier/namespace/{{args.namespace.%s}}/rollout/{{args.rolloutName.%s}}'" ($argoSyncBarrierHost) ($target) ($target) }}
        timeoutSeconds: 120
        headers:
        - key: X-Argo-Sync-Barrier
          value: {{$target}}
        - key: Authorization
          value: {{ printf "'Bearer {{args.argo-sync-barrier.token.%s}}'" ($target) }}
        jsonPath: "{$}"
    successCondition: "false"
    # Fail (and keep polling) while the target is behind: its sequential number is lower,
    # or the numbers match and the target is at an earlier step. Numbers below 1 also fail.
    failureCondition: "{{ "asInt(result.metadata.annotations['argoproj.io/sequential-number']) < 1 || asInt({{args.sequentialNumber}}) < 1 || asInt(result.metadata.annotations['argoproj.io/sequential-number']) < asInt({{args.sequentialNumber}}) || (asInt(result.metadata.annotations['argoproj.io/sequential-number']) == asInt({{args.sequentialNumber}}) && result.status.currentStepIndex < {{args.currentStepIndex}})" }}"
    count: 2147483647  # MaxInt32
    interval: {{ "'{{args.interval}}'" }}
    inconclusiveLimit: 0
    failureLimit: 2147483646  # MaxInt32 - 1
    consecutiveErrorLimit: 2147483646  # MaxInt32 - 1
{{ end }}

The generation and management of the tokens need to be discussed.

Examples

Please find below an example of a Rollout definition:

metadata:
  labels:
    application: my-app
    cluster: cluster-name-1
spec:
  strategy:
    canary:
      steps:
        - setWeight: 1
        - analysis:
            analysisRunMetadata: {}
            args:
              - name: metricName
                value: 'Automatic canary verification'
              - name: application
                valueFrom:
                  fieldRef:
                    fieldPath: metadata.labels['application']
              - name: cluster
                valueFrom:
                  fieldRef:
                    fieldPath: metadata.labels['cluster']
              - name: stablePodHash
                valueFrom:
                  podTemplateHashValue: Stable
              - name: latestPodHash
                valueFrom:
                  podTemplateHashValue: Latest
            templates:
              - clusterScope: true
                templateName: canary.response-time.verification
              - clusterScope: true
                templateName: canary.error-rate.verification
        - analysis:
            analysisRunMetadata: {}
            args:
              - name: namespace.cluster-name-2
                value: my-ns-2
              - name: namespace.cluster-name-3
                value: my-ns-3
              - name: rolloutName.cluster-name-2
                value: my-app-2
              - name: rolloutName.cluster-name-3
                value: my-app-3
              - name: sequentialNumber
                valueFrom:
                  fieldRef:
                    fieldPath: metadata.annotations['argoproj.io/sequential-number']
              - name: currentStepIndex
                valueFrom:
                  fieldRef:
                    fieldPath: status.currentStepIndex
            dryRun:
              - metricName: .*
            templates:
              - clusterScope: true
                templateName: sync-barrier.cluster-name-2
              - clusterScope: true
                templateName: sync-barrier.cluster-name-3
        - setWeight: 100

Upgrade/Downgrade Strategy

There is no impact for users who don't use this feature.

Drawbacks

Currently unknown ;)

Alternatives

Given that the synchronization has to occur during the rollout, implementing the synchronization barrier logic as a rollout step is a natural choice.

argoproj / argo-rollouts

Synchronization barrier for deploying to multiple clusters in parallel #3770