argoproj / argo-rollouts

Progressive Delivery for Kubernetes
https://argo-rollouts.readthedocs.io/
Apache License 2.0

Argo rollout gets stuck in "Processing" after abort #2048

Closed cdlliuy closed 2 years ago

cdlliuy commented 2 years ago

Summary

Argo rollout gets stuck in "Processing" after abort.


I first deployed revision 1 successfully. Then, with revision 2 (assume it is a bad build), I manually aborted it. Then I updated the image for revision 3 (assume it is a fixed build).

I expected revision 3 to roll out alongside revision 1, but it didn't. The replicas of revision 3 were scaled down directly after deployment.

My rollout definition:

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  namespace: argo-rollouts
  name: helloworld-demo
  labels:
    app: helloworld-demo
    chart: argo-rollouts-helloworld
spec:
  replicas: 5
  strategy:
    canary:
      maxSurge: 100%
      maxUnavailable: 25%
      canaryMetadata:
        labels:
          app: helloworld-demo
          role: helloworld-demo-canary
      # metadata which will be attached to the stable pods
      stableMetadata:
        labels:
          app: helloworld-demo
          role: helloworld-demo-stable
      steps:
      - setWeight: 20
      - pause: {}
  revisionHistoryLimit: 2
  selector:
    matchLabels:
      app: helloworld-demo
  template:
    metadata:
      labels:
        app: helloworld-demo
    spec:
      shareProcessNamespace: true
      containers:
      - name: argo-rollouts-helloworld-demo
        image: <image-name>
        imagePullPolicy: Always
        env:
        - name: target
          value: test3

The issue happens only when canaryMetadata and stableMetadata are attached. If I remove them, the behavior is correct.

Diagnostics

Argo Rollouts version: 1.2.0


The log below reports the label rollouts-pod-template-hash as invalid, but that label is attached by Argo Rollouts itself, correct?

Logs for the entire controller:

time="2022-05-19T04:52:59Z" level=info msg="Started syncing rollout" generation=6 namespace=argo-rollouts resourceVersion=83100886 rollout=helloworld-demo
time="2022-05-19T04:52:59Z" level=error msg="roCtx.reconcile err ReplicaSet.apps \"helloworld-demo-5f5684759c\" is invalid: spec.template.metadata.labels: Invalid value: map[string]string{\"rollouts-pod-template-hash\":\"5f5684759c\"}: `selector` does not match template `labels`" generation=6 namespace=argo-rollouts resourceVersion=83100886 rollout=helloworld-demo
time="2022-05-19T04:52:59Z" level=info msg="Reconciliation completed" generation=6 namespace=argo-rollouts resourceVersion=83100886 rollout=helloworld-demo time_ms=21.291507000000003
time="2022-05-19T04:52:59Z" level=error msg="rollout syncHandler error: ReplicaSet.apps \"helloworld-demo-5f5684759c\" is invalid: spec.template.metadata.labels: Invalid value: map[string]string{\"rollouts-pod-template-hash\":\"5f5684759c\"}: `selector` does not match template `labels`" namespace=argo-rollouts rollout=helloworld-demo
time="2022-05-19T04:52:59Z" level=info msg="rollout syncHandler queue retries: 16 : key \"argo-rollouts/helloworld-demo\"" namespace=argo-rollouts rollout=helloworld-demo
E0519 04:52:59.403139       1 controller.go:174] ReplicaSet.apps "helloworld-demo-5f5684759c" is invalid: spec.template.metadata.labels: Invalid value: map[string]string{"rollouts-pod-template-hash":"5f5684759c"}: `selector` does not match template `labels`
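
For context, the API server rejects any ReplicaSet whose spec.selector does not select the labels on its own pod template. A minimal inspection sketch (shell), using the ReplicaSet name from the error above and the namespace from the manifest:

# print the selector and the pod template labels side by side; the update is
# rejected whenever the selector contains a label the template does not carry
kubectl -n argo-rollouts get rs helloworld-demo-5f5684759c \
  -o jsonpath='{.spec.selector.matchLabels}{"\n"}{.spec.template.metadata.labels}{"\n"}'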

Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.

harikrongali commented 2 years ago

@cdlliuy in the logs there is an issue with labels. In 1.2.0, extra validation was added for matching labels, whereas 1.1.x only threw a warning.

cdlliuy commented 2 years ago

@harikrongali,
do you mean I need to downgrade to 1.1.x to fix the issue?
Is this a regression in 1.2.x? Any plan to fix it?

harikrongali commented 2 years ago

No, this is not a bug; the validation now returns an error instead of a warning. It seems the ReplicaSet for revision 2 is invalid. You can delete the revision 2 ReplicaSet and the rollout will progress.
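
A minimal sketch of that manual cleanup (shell), assuming the ReplicaSet name reported in the controller error; the rollout.argoproj.io/revision annotation is assumed here as the revision marker on rollout-owned ReplicaSets:

# list the ReplicaSets owned by the rollout together with their revision number
kubectl -n argo-rollouts get rs -l app=helloworld-demo \
  -o custom-columns='NAME:.metadata.name,REVISION:.metadata.annotations.rollout\.argoproj\.io/revision'
# delete the invalid revision 2 ReplicaSet so the controller can reconcile again
kubectl -n argo-rollouts delete rs helloworld-demo-5f5684759c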

cdlliuy commented 2 years ago

@harikrongali, but the label reported as invalid in the log is rollouts-pod-template-hash: 5f5684759c, which is created by the rollout controller directly; it is not a user error.

Also, for "delete revision 2": is there a simple way to handle this through the rollout itself? Or does the user have to set up a separate script to detect which ReplicaSet is wrong and delete it manually, as sketched below?