argoproj / argo-rollouts

Progressive Delivery for Kubernetes
https://argo-rollouts.readthedocs.io/
Apache License 2.0

Canary service flaps after aborted Rollout for duration of abortScaleDownDelaySeconds #3078

Open derekddecker opened 1 year ago

derekddecker commented 1 year ago

Describe the bug

If a release is aborted, either by a failed AnalysisRun or manually, and abortScaleDownDelaySeconds is set to 0 (to keep the canary ReplicaSet alive indefinitely), the canary Service ends up in an unstable state: the Rollouts controller repeatedly flips the pod-template-hash selector on the canary Service back and forth between two hashes. In our case (ingress-nginx traffic routing), the flapping can be observed with curl using the canary header: some requests are served by the live release and others by the canary release.
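The curl check described above can be automated. The following is a minimal sketch, not something from the original report: it assumes the application returns something version-specific in the response body, and the URL in the usage comment is hypothetical. The header name and value (`X-Canary: yes`) are taken from the Rollout spec in this issue.

```python
from collections import Counter
from typing import Callable
import urllib.request


def fetch_body(url: str, header: str = "X-Canary", value: str = "yes") -> str:
    """Send one GET request carrying the canary header and return the body."""
    request = urllib.request.Request(url, headers={header: value})
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.read().decode("utf-8", errors="replace").strip()


def tally_responses(fetch: Callable[[], str], attempts: int = 20) -> Counter:
    """Call fetch repeatedly and count how often each distinct body comes back."""
    return Counter(fetch() for _ in range(attempts))


def is_flapping(tally: Counter) -> bool:
    """More than one distinct body suggests the canary route is not stable."""
    return len(tally) > 1


# Usage (hypothetical host; replace with the host of the stable ingress):
#   tally = tally_responses(lambda: fetch_body("http://sre-demo-app.example.com/"))
#   print(dict(tally), "flapping" if is_flapping(tally) else "stable")
```

If the body is not version-specific, `fetch_body` would need to read a response header or another marker instead.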

The bug also seems to negatively affect the argocd-application-controller (compute usage shoots up and the logs fill with repeated refreshes), and the Argo Dashboard UI slows to a crawl.

Worth noting we are using https://github.com/argoproj-labs/rollout-extension to embed the Rollouts UI into the Argo Dashboard due to the lack of auth/rbac in the Rollouts dashboard.

Additionally we use the ApplicationSet controller. Auto-sync is disabled.

To Reproduce

Given the following rollout:

  strategy:
    canary:
      abortScaleDownDelaySeconds: 0
      canaryService: sre-demo-app-unstable-canary
      maxUnavailable: 0
      stableService: sre-demo-app-unstable
      steps:
      - setCanaryScale:
          replicas: 1
      - pause: {}
      trafficRouting:
        nginx:
          additionalIngressAnnotations:
            canary-by-header: X-Canary
            canary-by-header-value: "yes"
          stableIngress: sre-demo-app-unstable

Abort the rollout during the pause step.
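To confirm the flapping directly on the Service after the abort, the selector can be sampled over time. This is a rough sketch under a few assumptions: `kubectl` is on the PATH and pointed at the right cluster, and the controller writes the hash under the `rollouts-pod-template-hash` selector key. The function names are illustrative only.

```python
import json
import subprocess
import time
from typing import Callable, List

# Selector key the Rollouts controller is assumed to manage on the Service.
HASH_KEY = "rollouts-pod-template-hash"


def kubectl_get_service(name: str, namespace: str) -> str:
    """Return the Service manifest as JSON text via kubectl."""
    result = subprocess.run(
        ["kubectl", "get", "service", name, "-n", namespace, "-o", "json"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


def selector_hash(service_json: str) -> str:
    """Extract the pod-template-hash from the Service selector ('' if absent)."""
    selector = json.loads(service_json).get("spec", {}).get("selector") or {}
    return selector.get(HASH_KEY, "")


def observe_hashes(get_json: Callable[[], str], samples: int,
                   interval: float = 0.5) -> List[str]:
    """Sample the selector hash `samples` times, sleeping `interval` seconds between reads."""
    seen = []
    for _ in range(samples):
        seen.append(selector_hash(get_json()))
        time.sleep(interval)
    return seen


def count_changes(hashes: List[str]) -> int:
    """Count how many consecutive samples differ from the previous one."""
    return sum(1 for a, b in zip(hashes, hashes[1:]) if a != b)


# Usage, with the names from this issue:
#   hashes = observe_hashes(
#       lambda: kubectl_get_service("sre-demo-app-unstable-canary", "sre"), samples=20)
#   print(hashes, count_changes(hashes))
```

Polling can miss flips that happen faster than the sampling interval, so a change count of zero does not prove the selector is stable. The controller logs are the more reliable signal.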

Expected behavior

Up for debate, but certainly not the current behavior. Personally, I would like the canary Service to keep routing traffic to the canary ReplicaSet, so that the aborted Rollout can be investigated further.

Version quay.io/argoproj/argo-rollouts:v1.5.1

Logs

argocd-application-controller example logs when in this state:

time="2023-10-04T22:25:46Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argocd/sre-demo-app-unstable
time="2023-10-04T22:25:46Z" level=info msg="Comparing app state (cluster: https://kubernetes.default.svc, namespace: sre)" application=argocd/sre-demo-app-unstable
time="2023-10-04T22:25:46Z" level=info msg="getRepoObjs stats" application=argocd/sre-demo-app-unstable build_options_ms=0 helm_ms=0 plugins_ms=0 repo_ms=0 time_ms=8 unmarshal_ms=7 version_ms=0
time="2023-10-04T22:25:46Z" level=info msg="No status changes. Skipping patch" application=argocd/sre-demo-app-unstable
time="2023-10-04T22:25:46Z" level=info msg="Reconciliation completed" application=argocd/sre-demo-app-unstable dedup_ms=0 dest-name=in-cluster dest-namespace=sre dest-server="https://kubernetes.default.svc" diff_ms=11 fields.level=1 git_ms=8 health_ms=0 live_ms=0 settings_ms=0 sync_ms=0 time_ms=30
time="2023-10-04T22:25:46Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argocd/sre-demo-app-unstable
time="2023-10-04T22:25:46Z" level=info msg="Comparing app state (cluster: https://kubernetes.default.svc, namespace: sre)" application=argocd/sre-demo-app-unstable
time="2023-10-04T22:25:46Z" level=info msg="getRepoObjs stats" application=argocd/sre-demo-app-unstable build_options_ms=0 helm_ms=0 plugins_ms=0 repo_ms=0 time_ms=6 unmarshal_ms=6 version_ms=0
time="2023-10-04T22:25:46Z" level=info msg="No status changes. Skipping patch" application=argocd/sre-demo-app-unstable
time="2023-10-04T22:25:46Z" level=info msg="Reconciliation completed" application=argocd/sre-demo-app-unstable dedup_ms=0 dest-name=in-cluster dest-namespace=sre dest-server="https://kubernetes.default.svc" diff_ms=10 fields.level=1 git_ms=6 health_ms=0 live_ms=0 settings_ms=0 sync_ms=0 time_ms=27
time="2023-10-04T22:25:46Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argocd/sre-demo-app-unstable
time="2023-10-04T22:25:46Z" level=info msg="Comparing app state (cluster: https://kubernetes.default.svc, namespace: sre)" application=argocd/sre-demo-app-unstable
time="2023-10-04T22:25:46Z" level=info msg="getRepoObjs stats" application=argocd/sre-demo-app-unstable build_options_ms=0 helm_ms=0 plugins_ms=0 repo_ms=0 time_ms=6 unmarshal_ms=6 version_ms=0
time="2023-10-04T22:25:46Z" level=info msg="No status changes. Skipping patch" application=argocd/sre-demo-app-unstable
time="2023-10-04T22:25:46Z" level=info msg="Reconciliation completed" application=argocd/sre-demo-app-unstable dedup_ms=0 dest-name=in-cluster dest-namespace=sre dest-server="https://kubernetes.default.svc" diff_ms=10 fields.level=1 git_ms=6 health_ms=0 live_ms=0 settings_ms=0 sync_ms=0 time_ms=25
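The refresh loop in the excerpt above can be quantified by counting completed reconciliations per timestamp. The sketch below is only an illustration: it assumes logfmt lines shaped like the ones shown here, with second-resolution `time=` fields, and the function name is made up.

```python
import re
from collections import Counter
from typing import Iterable

TIME_RE = re.compile(r'time="([^"]+)"')


def completions_per_second(
    lines: Iterable[str],
    marker: str = 'msg="Reconciliation completed"',
) -> Counter:
    """Count log lines containing `marker`, grouped by their time= value."""
    counts: Counter = Counter()
    for line in lines:
        if marker not in line:
            continue
        match = TIME_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Usage: pipe the application-controller logs into this script and print
# the busiest seconds.
#   import sys
#   for stamp, n in completions_per_second(sys.stdin).most_common(5):
#       print(stamp, n)
```

Applied to the excerpt above, this reports three completed reconciliations of the same application within a single second.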

kubectl logs -n argocd deployment/argo-rollouts | grep rollout=sre-demo-app-unstable

time="2023-10-04T22:25:46Z" level=info msg="Switched selector for service 'sre-demo-app-unstable-canary' from '54d577fdc6' to '559597cb64'" event_reason=SwitchService namespace=sre rollout=sre-demo-app-unstable
time="2023-10-04T22:25:46Z" level=info msg="No changes to canary ingress - skipping patch" ingress=sre-demo-app-unstable-sre-demo-app-unstable-canary namespace=sre rollout=sre-demo-app-unstable
time="2023-10-04T22:25:46Z" level=info msg="Skipping analysis: isAborted: true, promoteFull: false, rollbackToScaleDownDelay: false, initialDeploy: false" namespace=sre rollout=sre-demo-app-unstable
time="2023-10-04T22:25:46Z" level=info msg="No status changes. Skipping patch" generation=38 namespace=sre resourceVersion=239424538 rollout=sre-demo-app-unstable
time="2023-10-04T22:25:46Z" level=info msg="Reconciliation completed" generation=38 namespace=sre resourceVersion=239424538 rollout=sre-demo-app-unstable time_ms=48.166238
time="2023-10-04T22:25:46Z" level=info msg="Started syncing rollout" generation=38 namespace=sre resourceVersion=239424538 rollout=sre-demo-app-unstable
time="2023-10-04T22:25:46Z" level=info msg="syncing service" namespace=sre rollout=sre-demo-app-unstable service=sre-demo-app-unstable-canary
time="2023-10-04T22:25:46Z" level=info msg="subsFromAnnotations: map[kubectl.kubernetes.io/last-applied-configuration:{\"apiVersion\":\"argoproj.io/v1alpha1\",\"kind\":\"Rollout\",\"metadata\":{\"annotations\":{},\"labels\":{\"app.kubernetes.io/instance\":\"sre-demo-app-unstable\",\"app.kubernetes.io/managed-by\":\"Helm\",\"app.kubernetes.io/name\":\"sre-demo-app\",\"app.kubernetes.io/version\":\"1.16.1\",\"helm.sh/chart\":\"sre-demo-app-0.1.1\"},\"name\":\"sre-demo-app-unstable\",\"namespace\":\"sre\"},\"spec\":{\"analysis\":{\"successfulRunHistoryLimit\":1,\"unsuccessfulRunHistoryLimit\":1},\"selector\":{\"matchLabels\":{\"app.kubernetes.io/instance\":\"sre-demo-app-unstable\",\"app.kubernetes.io/name\":\"sre-demo-app\"}},\"strategy\":{\"canary\":{\"abortScaleDownDelaySeconds\":0,\"canaryService\":\"sre-demo-app-unstable-canary\",\"maxUnavailable\":0,\"stableService\":\"sre-demo-app-unstable\",\"steps\":[{\"setCanaryScale\":{\"replicas\":1}},{\"analysis\":{\"templates\":[{\"templateName\":\"sre-demo-app-smoke-test\"}]}}],\"trafficRouting\":{\"nginx\":{\"additionalIngressAnnotations\":{\"canary-by-header\":\"X-Canary\",\"canary-by-header-value\":\"yes\"},\"stableIngress\":\"sre-demo-app-unstable\"}}}},\"template\":{\"metadata\":{\"labels\":{\"app.kubernetes.io/instance\":\"sre-demo-app-unstable\",\"app.kubernetes.io/name\":\"sre-demo-app\"}},\"spec\":{\"containers\":[{\"env\":[{\"name\":\"PORT\",\"value\":\"8000\"},{\"name\":\"SLEEP\",\"value\":\"30ms\"},{\"name\":\"VERSION\",\"value\":\"gitsha.e450c10\"}],\"image\":\"image_repo:gitsha.e450c10\",\"imagePullPolicy\":\"Always\",\"livenessProbe\":{\"failureThreshold\":6,\"httpGet\":{\"path\":\"/\",\"port\":\"sre-demo-app\"},\"periodSeconds\":3},\"name\":\"sre-demo-app\",\"ports\":[{\"containerPort\":8000,\"name\":\"sre-demo-app\",\"protocol\":\"TCP\"}],\"readinessProbe\":{\"failureThreshold\":6,\"httpGet\":{\"path\":\"/\",\"port\":\"sre-demo-app\"},\"periodSeconds\":3},\"resources\":{\"requests\":{\"cpu\":\"800m\",\"memory\":\"10Mi\"}},\"securityContext\":{}}],\"securityContext\":{},\"serviceAccountName\":\"sre-demo-app\"}}}}\n rollout.argoproj.io/revision:17]" namespace=sre rollout=sre-demo-app-unstable
time="2023-10-04T22:25:46Z" level=info msg="Switched selector for service 'sre-demo-app-unstable-canary' from '559597cb64' to '54d577fdc6'" event_reason=SwitchService namespace=sre rollout=sre-demo-app-unstable
time="2023-10-04T22:25:46Z" level=info msg="Found 1 TrafficRouting Reconcilers" namespace=sre rollout=sre-demo-app-unstable
time="2023-10-04T22:25:46Z" level=info msg="Reconciling TrafficRouting with type 'Nginx'" namespace=sre rollout=sre-demo-app-unstable
time="2023-10-04T22:25:46Z" level=info msg="syncing service" namespace=sre rollout=sre-demo-app-unstable service=sre-demo-app-unstable-canary
time="2023-10-04T22:25:46Z" level=info msg="syncing service" namespace=sre rollout=sre-demo-app-unstable service=sre-demo-app-unstable-canary
time="2023-10-04T22:25:46Z" level=info msg="subsFromAnnotations: map[kubectl.kubernetes.io/last-applied-configuration:{\"apiVersion\":\"argoproj.io/v1alpha1\",\"kind\":\"Rollout\",\"metadata\":{\"annotations\":{},\"labels\":{\"app.kubernetes.io/instance\":\"sre-demo-app-unstable\",\"app.kubernetes.io/managed-by\":\"Helm\",\"app.kubernetes.io/name\":\"sre-demo-app\",\"app.kubernetes.io/version\":\"1.16.1\",\"helm.sh/chart\":\"sre-demo-app-0.1.1\"},\"name\":\"sre-demo-app-unstable\",\"namespace\":\"sre\"},\"spec\":{\"analysis\":{\"successfulRunHistoryLimit\":1,\"unsuccessfulRunHistoryLimit\":1},\"selector\":{\"matchLabels\":{\"app.kubernetes.io/instance\":\"sre-demo-app-unstable\",\"app.kubernetes.io/name\":\"sre-demo-app\"}},\"strategy\":{\"canary\":{\"abortScaleDownDelaySeconds\":0,\"canaryService\":\"sre-demo-app-unstable-canary\",\"maxUnavailable\":0,\"stableService\":\"sre-demo-app-unstable\",\"steps\":[{\"setCanaryScale\":{\"replicas\":1}},{\"analysis\":{\"templates\":[{\"templateName\":\"sre-demo-app-smoke-test\"}]}}],\"trafficRouting\":{\"nginx\":{\"additionalIngressAnnotations\":{\"canary-by-header\":\"X-Canary\",\"canary-by-header-value\":\"yes\"},\"stableIngress\":\"sre-demo-app-unstable\"}}}},\"template\":{\"metadata\":{\"labels\":{\"app.kubernetes.io/instance\":\"sre-demo-app-unstable\",\"app.kubernetes.io/name\":\"sre-demo-app\"}},\"spec\":{\"containers\":[{\"env\":[{\"name\":\"PORT\",\"value\":\"8000\"},{\"name\":\"SLEEP\",\"value\":\"30ms\"},{\"name\":\"VERSION\",\"value\":\"gitsha.e450c10\"}],\"image\":\"image_repo:gitsha.e450c10\",\"imagePullPolicy\":\"Always\",\"livenessProbe\":{\"failureThreshold\":6,\"httpGet\":{\"path\":\"/\",\"port\":\"sre-demo-app\"},\"periodSeconds\":3},\"name\":\"sre-demo-app\",\"ports\":[{\"containerPort\":8000,\"name\":\"sre-demo-app\",\"protocol\":\"TCP\"}],\"readinessProbe\":{\"failureThreshold\":6,\"httpGet\":{\"path\":\"/\",\"port\":\"sre-demo-app\"},\"periodSeconds\":3},\"resources\":{\"requests\":{\"cpu\":\"800m\",\"memory\":\"10Mi\"}},\"securityContext\":{}}],\"securityContext\":{},\"serviceAccountName\":\"sre-demo-app\"}}}}\n rollout.argoproj.io/revision:17]" namespace=sre rollout=sre-demo-app-unstable
time="2023-10-04T22:25:46Z" level=info msg="Switched selector for service 'sre-demo-app-unstable-canary' from '54d577fdc6' to '559597cb64'" event_reason=SwitchService namespace=sre rollout=sre-demo-app-unstable
time="2023-10-04T22:25:46Z" level=info msg="No changes to canary ingress - skipping patch" ingress=sre-demo-app-unstable-sre-demo-app-unstable-canary namespace=sre rollout=sre-demo-app-unstable
time="2023-10-04T22:25:46Z" level=info msg="Skipping analysis: isAborted: true, promoteFull: false, rollbackToScaleDownDelay: false, initialDeploy: false" namespace=sre rollout=sre-demo-app-unstable
time="2023-10-04T22:25:46Z" level=info msg="No status changes. Skipping patch" generation=38 namespace=sre resourceVersion=239424538 rollout=sre-demo-app-unstable
time="2023-10-04T22:25:46Z" level=info msg="Reconciliation completed" generation=38 namespace=sre resourceVersion=239424538 rollout=sre-demo-app-unstable time_ms=49.532897999999996
time="2023-10-04T22:25:46Z" level=info msg="Started syncing rollout" generation=38 namespace=sre resourceVersion=239424538 rollout=sre-demo-app-unstable
time="2023-10-04T22:25:46Z" level=info msg="syncing service" namespace=sre rollout=sre-demo-app-unstable service=sre-demo-app-unstable-canary
time="2023-10-04T22:25:46Z" level=info msg="subsFromAnnotations: map[kubectl.kubernetes.io/last-applied-configuration:{\"apiVersion\":\"argoproj.io/v1alpha1\",\"kind\":\"Rollout\",\"metadata\":{\"annotations\":{},\"labels\":{\"app.kubernetes.io/instance\":\"sre-demo-app-unstable\",\"app.kubernetes.io/managed-by\":\"Helm\",\"app.kubernetes.io/name\":\"sre-demo-app\",\"app.kubernetes.io/version\":\"1.16.1\",\"helm.sh/chart\":\"sre-demo-app-0.1.1\"},\"name\":\"sre-demo-app-unstable\",\"namespace\":\"sre\"},\"spec\":{\"analysis\":{\"successfulRunHistoryLimit\":1,\"unsuccessfulRunHistoryLimit\":1},\"selector\":{\"matchLabels\":{\"app.kubernetes.io/instance\":\"sre-demo-app-unstable\",\"app.kubernetes.io/name\":\"sre-demo-app\"}},\"strategy\":{\"canary\":{\"abortScaleDownDelaySeconds\":0,\"canaryService\":\"sre-demo-app-unstable-canary\",\"maxUnavailable\":0,\"stableService\":\"sre-demo-app-unstable\",\"steps\":[{\"setCanaryScale\":{\"replicas\":1}},{\"analysis\":{\"templates\":[{\"templateName\":\"sre-demo-app-smoke-test\"}]}}],\"trafficRouting\":{\"nginx\":{\"additionalIngressAnnotations\":{\"canary-by-header\":\"X-Canary\",\"canary-by-header-value\":\"yes\"},\"stableIngress\":\"sre-demo-app-unstable\"}}}},\"template\":{\"metadata\":{\"labels\":{\"app.kubernetes.io/instance\":\"sre-demo-app-unstable\",\"app.kubernetes.io/name\":\"sre-demo-app\"}},\"spec\":{\"containers\":[{\"env\":[{\"name\":\"PORT\",\"value\":\"8000\"},{\"name\":\"SLEEP\",\"value\":\"30ms\"},{\"name\":\"VERSION\",\"value\":\"gitsha.e450c10\"}],\"image\":\"image_repo:gitsha.e450c10\",\"imagePullPolicy\":\"Always\",\"livenessProbe\":{\"failureThreshold\":6,\"httpGet\":{\"path\":\"/\",\"port\":\"sre-demo-app\"},\"periodSeconds\":3},\"name\":\"sre-demo-app\",\"ports\":[{\"containerPort\":8000,\"name\":\"sre-demo-app\",\"protocol\":\"TCP\"}],\"readinessProbe\":{\"failureThreshold\":6,\"httpGet\":{\"path\":\"/\",\"port\":\"sre-demo-app\"},\"periodSeconds\":3},\"resources\":{\"requests\":{\"cpu\":\"800m\",\"memory\":\"10Mi\"}},\"securityContext\":{}}],\"securityContext\":{},\"serviceAccountName\":\"sre-demo-app\"}}}}\n rollout.argoproj.io/revision:17]" namespace=sre rollout=sre-demo-app-unstable
time="2023-10-04T22:25:46Z" level=info msg="Switched selector for service 'sre-demo-app-unstable-canary' from '559597cb64' to '54d577fdc6'" event_reason=SwitchService namespace=sre rollout=sre-demo-app-unstable
time="2023-10-04T22:25:46Z" level=info msg="Found 1 TrafficRouting Reconcilers" namespace=sre rollout=sre-demo-app-unstable
time="2023-10-04T22:25:46Z" level=info msg="Reconciling TrafficRouting with type 'Nginx'" namespace=sre rollout=sre-demo-app-unstable
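The flip-flop is easiest to see by extracting only the `SwitchService` messages from the controller logs. The sketch below assumes the message format shown in the excerpt above; the helper names are made up for illustration.

```python
import re
from typing import Iterable, List, Tuple

# Matches the "Switched selector" message emitted with event_reason=SwitchService.
SWITCH_RE = re.compile(
    r"Switched selector for service '(?P<service>[^']+)' "
    r"from '(?P<old>[^']*)' to '(?P<new>[^']*)'"
)


def selector_switches(lines: Iterable[str], service: str) -> List[Tuple[str, str]]:
    """Return (old_hash, new_hash) pairs for `service`, in log order."""
    switches = []
    for line in lines:
        match = SWITCH_RE.search(line)
        if match and match.group("service") == service:
            switches.append((match.group("old"), match.group("new")))
    return switches


def count_reversals(switches: List[Tuple[str, str]]) -> int:
    """Count switches that exactly undo the previous one (A to B, then B to A)."""
    return sum(
        1
        for prev, cur in zip(switches, switches[1:])
        if cur == (prev[1], prev[0])
    )


# Usage: pipe the output of the kubectl logs command above into this script.
#   import sys
#   switches = selector_switches(sys.stdin, "sre-demo-app-unstable-canary")
#   print(len(switches), "switches,", count_reversals(switches), "reversals")
```

For the excerpt above, this yields four switches alternating between `54d577fdc6` and `559597cb64`, three of which reverse the switch before them, all within the same second.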

github-actions[bot] commented 10 months ago

This issue is stale because it has been open 60 days with no activity.