argoproj / argo-rollouts

Progressive Delivery for Kubernetes
https://argo-rollouts.readthedocs.io/
Apache License 2.0

Old revision replicaset not scaled down to 0 pod #2793

Closed ngonemettle closed 1 year ago

ngonemettle commented 1 year ago

Describe the bug After following all the steps to promote a canary version to stable, Argo Rollouts still keeps pods from the older stable version.

To Reproduce Here are the Rollout, VirtualService and DestinationRule manifests applied:

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  labels:
    app.kubernetes.io/instance: test-app
    app.kubernetes.io/name: test-app
    app.kubernetes.io/part-of: test-app
    app_name: test-app
  name: test-app
  namespace: test
spec:
  replicas: 2
  selector:
    matchLabels:
      app_name: test-app
  strategy:
    canary:
      canaryMetadata:
        labels:
          release-type: canary
      stableMetadata:
        labels:
          release-type: stable
      steps:
      - setWeight: 10
      - pause:
          duration: 15m
      - setWeight: 25
      - pause:
          duration: 10m
      - setWeight: 50
      - pause:
          duration: 10m
      - setWeight: 75
      - pause:
          duration: 10m
      trafficRouting:
        istio:
          destinationRule:
            canarySubsetName: canary
            name: test-app
            stableSubsetName: stable
          virtualService:
            name: test-app
  template:
    metadata:
      annotations:
        ad.datadoghq.com/test-app.logs: |-
          [{
              "source": "java",
              "log_processing_rules": []
          }]
        prometheus.io/path: /metrics
        prometheus.io/port: "8081"
        prometheus.io/scrape: "true"
      labels:
        app_name: test-app
        sidecar.istio.io/inject: "true"
        tags.datadoghq.com/service: test-app
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app_name
                  operator: In
                  values:
                  - test-app
              topologyKey: topology.kubernetes.io/zone
            weight: 100
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app_name
                operator: In
                values:
                - test-app
            topologyKey: kubernetes.io/hostname
      containers:
      - image: eu.gcr.io/test/test-app:dev-bfe2da48af7ef4dd5352bbbc32331c7125d25e54-1683279237
        imagePullPolicy: IfNotPresent
        lifecycle:
          preStop:
            exec:
              command:
              - sleep
              - "10"
        livenessProbe:
          failureThreshold: 24
          httpGet:
            path: /health
            port: 8081
          initialDelaySeconds: 60
          timeoutSeconds: 30
        name: test-app
        ports:
        - containerPort: 8080
          name: http
          protocol: TCP
        readinessProbe:
          failureThreshold: 24
          httpGet:
            path: /health
            port: 8081
          initialDelaySeconds: 60
          timeoutSeconds: 30
        resources:
          limits:
            memory: 1400Mi
          requests:
            cpu: 200m
            memory: 1Gi
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop:
            - ALL
          readOnlyRootFilesystem: true
          seccompProfile:
            type: RuntimeDefault
      imagePullSecrets:
      - name: gcr-private-image-pull-secret
      initContainers:
      - image: eu.gcr.io/test/test:1.1.265
        imagePullPolicy: IfNotPresent
        name: wait-for-schema-registry
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop:
            - ALL
          readOnlyRootFilesystem: true
          runAsGroup: 1337
          runAsUser: 1337
          seccompProfile:
            type: RuntimeDefault
      securityContext:
        fsGroup: 1000
        runAsGroup: 1000
        runAsNonRoot: true
        runAsUser: 1000
        seccompProfile:
          type: RuntimeDefault
      serviceAccountName: test-app

---
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  labels:
    app_name: test-app
  name: test-app
  namespace: test
spec:
  host: test-app.test.svc.cluster.local
  subsets:
    - labels:
        app_name: test-app
      name: stable
    - labels:
        app_name: test-app
      name: canary

---
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  labels:
    app_name: test-app
  name: test-app
  namespace: test
spec:
  hosts:
    - test-app.test.svc.cluster.local
  http:
    - route:
        - destination:
            host: test-app.test.svc.cluster.local
            subset: stable
          weight: 100
        - destination:
            host: test-app.test.svc.cluster.local
            subset: canary
          weight: 0
---
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:

  name: test-app
  namespace: test
spec:
  host: test-app.test.svc.cluster.local
  subsets:
    - labels:
        app_name: test-app
      name: stable
    - labels:
        app_name: test-app
      name: canary

Expected behavior After a canary release is fully promoted, only pods from the new version should remain; pods from old revisions should be removed.

Version We see this issue with the argo-rollouts Helm chart versions 2.22.1 and 2.28.0.

Logs from argo rollout controller

{"generation":32,"level":"error","msg":"roCtx.reconcile err Pod \"test-6d87b88f4d-sjhgm\" is invalid: spec: Forbidden: pod updates may not change fields other than `spec.containers[*].image`, `spec.initContainers[*].image`, `spec.activeDeadlineSeconds`, `spec.tolerations` (only additions to existing tolerations) or `spec.terminationGracePeriodSeconds` (allow it to be set to 1 if it was previously negative)\n  core.PodSpec{\n  \tVolumes:        
\"eu.gcr.io/mettle-bank/perimener:1.1.265\", Env: {{Name: \"EXPECTED_READY_POD_COUNT\", Value: \"1\"}, {Name: 

}\n","namespace":"test","resourceVersion":"873342312","rollout":"test","time":"2023-05-18T14:41:42Z"}

Logs from eks

{
    "id": "AgAAAYgvU-TNhL_qmAAAAAAAAAAYAAAAAEFZZ3ZVX0VyQUFDc01CLTMyd016X0FBYwAAACQAAAAAMDE4ODJmNTQtMWExYy00MjFhLTkzMzAtMmQzZDYyY2MzOTRj",
    "content": {

        "service": "eks",
        "attributes": {
            "stageTimestamp_ms": 1684421207245,
            "annotations": {
                "authorization": {
                    "k8s": {
                        "io/reason": "RBAC: allowed by ClusterRoleBinding \"argo-rollouts\" of ClusterRole \"argo-rollouts\" to ServiceAccount \"argo-rollouts/argo-rollouts\"",
                        "io/decision": "allow"
                    }
                },
                "patch": {
                    "webhook": {
                        "admission": {
                            "k8s": {
                                "io/round_0_index_11": "{\"configuration\":\"kyverno-resource-mutating-webhook-cfg\",\"webhook\":\"mutate.kyverno.svc-fail\",\"patch\":[{\"op\":\"remove\",\"path\":\"/spec/containers/0/env/7\"},{\"op\":\"remove\",\"path\":\"/spec/containers/0/env/6\"},{\"op\":\"remove\",\"path\":\"/spec/containers/0/env/5\"},{\"op\":\"remove\",\"path\":\"/spec/containers/0/env/1\"},{\"op\":\"remove\",\"path\":\"/spec/containers/1/env/7\"},{\"op\":\"remove\",\"path\":\"/spec/containers/1/env/6\"},{\"op\":\"remove\",\"path\":\"/spec/containers/1/env/5\"},{\"op\":\"remove\",\"path\":\"/spec/containers/1/env/1\"},{\"op\":\"remove\",\"path\":\"/status/conditions/0/lastProbeTime\"},{\"op\":\"remove\",\"path\":\"/status/conditions/1/lastProbeTime\"},{\"op\":\"remove\",\"path\":\"/status/conditions/2/lastProbeTime\"},{\"op\":\"remove\",\"path\":\"/status/conditions/3/lastProbeTime\"},{\"op\":\"add\",\"path\":\"/spec/containers/0/env/0\",\"value\":{\"name\":\"_HOST_IP\",\"valueFrom\":{\"fieldRef\":{\"fieldPath\":\"status.hostIP\"}}}},{\"op\":\"add\",\"path\":\"/spec/containers/0/env/3\",\"value\":{\"name\":\"OTEL_SERVICE_NAME\",\"valueFrom\":{\"fieldRef\":{\"fieldPath\":\"metadata.labels['app.kubernetes.io/instance']\"}}}},{\"op\":\"add\",\"path\":\"/spec/containers/0/env/5\",\"value\":{\"name\":\"OTEL_EXPORTER_OTLP_ENDPOINT\",\"value\":\"http://$(_HOST_IP):4317\"}},{\"op\":\"add\",\"path\":\"/spec/containers/0/env/7\",\"value\":{\"name\":\"OTEL_EXPORTER_OTLP_PROTOCOL\",\"value\":\"grpc\"}},{\"op\":\"add\",\"path\":\"/spec/containers/1/env/0\",\"value\":{\"name\":\"_HOST_IP\",\"valueFrom\":{\"fieldRef\":{\"fieldPath\":\"status.hostIP\"}}}},{\"op\":\"add\",\"path\":\"/spec/containers/1/env/3\",\"value\":{\"name\":\"OTEL_SERVICE_NAME\",\"valueFrom\":{\"fieldRef\":{\"fieldPath\":\"metadata.labels['app.kubernetes.io/instance']\"}}}},{\"op\":\"add\",\"path\":\"/spec/containers/1/env/5\",\"value\":{\"name\":\"OTEL_EXPORTER_OTLP_ENDPOINT\",\"value\":\"http://$(_HOST_
IP):4317\"}},{\"op\":\"add\",\"path\":\"/spec/containers/1/env/7\",\"value\":{\"name\":\"OTEL_EXPORTER_OTLP_PROTOCOL\",\"value\":\"grpc\"}},{\"op\":\"replace\",\"path\":\"/metadata/annotations/policies.kyverno.io~1last-applied-patches\",\"value\":\"add-datadog-env.add-datadog-labels.kyverno.io: removed /status/conditions/3/lastProbeTime\\nadd-default-env-vars.otel-agent-add-default-env-vars.kyverno.io: added /spec/containers/1/env/7\\n\"}],\"patchType\":\"JSONPatch\"}"
                            }
                        }
                    }
                },
                "mutation": {
                    "webhook": {
                        "admission": {
                            "k8s": {
                                "io/round_0_index_11": "{\"configuration\":\"kyverno-resource-mutating-webhook-cfg\",\"webhook\":\"mutate.kyverno.svc-fail\",\"mutated\":true}"
                            }
                        }
                    }
                },
                "apiserver": {
                    "latency": {
                        "k8s": {
                            "io/response-write": "102.207µs",
                            "io/serialize-response-object": "146.788µs",
                            "io/total": "885.281827ms",
                            "io/etcd": "1.956717ms",
                            "io/mutating-webhook": "869.857905ms"
                        }
                    }
                }
            },
            "duration": 885000000,
            "responseObject": {
                "reason": "Invalid",
                "apiVersion": "v1",
                "code": 422,
                "kind": "Status",
                "details": {
                    "kind": "Pod",
                    "causes": [
                        {
                            "reason": "FieldValueForbidden",
                            "field": "spec",
                            "message": "Forbidden: pod updates may not change fields other than `spec.containers[*].image`, `spec.initContainers[*].image`, `spec.activeDeadlineSeconds`, `spec.tolerations` (only additions to existing tolerations) or `spec.terminationGracePeriodSeconds` (allow it to be set to 1 if it was previously negative)\n  core.PodSpec{\n  \tVolumes:        {{Name: \"aws-iam-token\", VolumeSource: {Projected: &{Sources: {{ServiceAccountToken: &{Audience: \"sts.amazonaws.com\", ExpirationSeconds: 86400, Path: \"token\"}}}, DefaultMode: &420}}}, {Name: \"istio-envoy\", VolumeSource: {EmptyDir: &{Medium: \"Memory\"}}}, {Name: \"istio-data\", VolumeSource: {EmptyDir: &{}}}, {Name: \"istio-podinfo\", VolumeSource: {DownwardAPI: &{Items: {{Path: \"labels\", FieldRef: &{APIVersion: \"v1\", FieldPath: \"metadata.labels\"}}, {Path: \"annotations\", FieldRef: &{APIVersion: \"v1\", FieldPath: \"metadata.annotations\"}}}, DefaultMode: &420}}}, ...},\n  \tInitContainers: {{Name: \"istio-validation\", Image: \"docker.io/istio/proxyv2:1.13.4\", Args: {\"istio-iptables\", \"-p\", \"15001\", \"-z\", ...}, Env: {}\n"
                        }
                    ],
                    "name": "test-app-6d87b88f4d-sjhgm"
                },
                "message": "Pod \"test-app-6d87b88f4d-sjhgm\" is invalid: spec: Forbidden: pod updates may not change fields other than `spec.containers[*].image`, `spec.initContainers[*].image`, `spec.activeDeadlineSeconds`, `spec.tolerations` (only additions to existing tolerations) or `spec.terminationGracePeriodSeconds` (allow it to be set to 1 if it was previously negative)\n  core.PodSpec{\n  \tVolumes:        {{Name: \"aws-iam-token\", VolumeSource: {Projected: &{Sources: {{ServiceAccountToken: &{Audience: \"sts.amazonaws.com\", ExpirationSeconds: 86400, Path: \"token\"}}}, DefaultMode: &420}}}, {Name: \"istio-envoy\", VolumeSource: {EmptyDir: &{Medium: \"Memory\"}}}, {Name: \"istio-data\", VolumeSource: {EmptyDir: &{}}}, {Name: \"istio-podinfo\", VolumeSource: {DownwardAPI: &{Items: {{Path: \"labels\", FieldRef: &{APIVersion: \"v1\", FieldPath: \"metadata.labels\"}}, {Path: \"annotations\", FieldRef: &{APIVersion: \"v1\", FieldPath: \"metadata.annotations\"}}}, DefaultMode: &420}}}, ...},\n  \tInitContainers: {{Name: \"istio-validation\", Image: \"docker.io/istio/proxyv2:1.13.4\", Args: {\"istio-iptables\", \"-p\", \"15001\", \"-z\", ...},",
                "status": "Failure"
            },
            "apiVersion": "audit.k8s.io/v1",
            "usr": {
                "uid": "52da0714-3f9b-4266-b135-cbb8874d9dbb",
                "extra": {
                    "authentication": {
                        "kubernetes": {
                            "io/pod-uid": [
                                "812f2809-e014-4b87-8990-c314f09ab562"
                            ],
                            "io/pod-name": [
                                "argo-rollouts-764767756c-q9wm2"
                            ]
                        }
                    }
                },
                "name": "system:serviceaccount:argo-rollouts:argo-rollouts",
                "groups": [
                    "system:serviceaccounts",
                    "system:serviceaccounts:argo-rollouts",
                    "system:authenticated"
                ],
                "id": "system:serviceaccount:argo-rollouts:argo-rollouts"
            },
            "requestReceivedTimestamp_ms": 1684421206360,
            "id": "37563848152292073641638029783155930647904799311199273151",
            "timestamp": 1684421207358,
            "auditID": "fa39da2f-a5e5-4549-94af-8c88b08eb77e",
            "requestReceivedTimestamp": "2023-05-18T14:46:46.360414Z",
            "objectRef": {
                "uid": "34748177-72a7-4fdc-873f-17f127e5c6d5",
                "apiVersion": "v1",
                "resource": "pods",
                "resourceVersion": "873303143",
                "name": "test-app-6d87b88f4d-sjhgm"
            },
            "level": "RequestResponse",
            "kind": "Event",
            "userAgent": "rollouts-controller/v0.0.0 (linux/amd64) kubernetes/$Format",
            "requestURI": "/api/v1/namespaces/test/pods/test-app-6d87b88f4d-sjhgm",
            "responseStatus": {
                "reason": "Invalid",
                "details": {
                    "kind": "Pod",
                    "causes": [
                        {
                            "reason": "FieldValueForbidden",
                            "field": "spec",
                            "message": "Forbidden: pod updates may not change fields other than `spec.containers[*].image`, `spec.initContainers[*].image`, `spec.activeDeadlineSeconds`, `spec.tolerations` (only additions to existing tolerations) or `spec.terminationGracePeriodSeconds` (allow it to be set to 1 if it was previously negative)\n  core.PodSpec{\n  \tVolumes:        {{Name: \"aws-iam-token\", VolumeSource: {Projected: &{Sources: {{ServiceAccountToken: &{Audience: \"sts.amazonaws.com\", ExpirationSeconds: 86400, Path: \"token\"}}}, DefaultMode: &420}}}, {Name: \"istio-envoy\", VolumeSource: {EmptyDir: &{Medium: \"Memory\"}}}, {Name: \"istio-data\", VolumeSource: {EmptyDir: &{}}}, {Name: \"istio-podinfo\", VolumeSource: {DownwardAPI: &{Items: {{Path: \"labels\", FieldRef: &{APIVersion: \"v1\", FieldPath: \"metadata.labels\"}}, {Path: \"annotations\", FieldRef: &{APIVersion: \"v1\", FieldPath: \"metadata.annotations\"}}}, DefaultMode: &420}}}, ...},\n  \tInitContainers: {{Name: \"istio-validation\", Image: \"docker.io/istio/proxyv2:1.13.4\", Args: {\"istio-iptables\", \"-p\", \"15001\", \"-z\", ...}, Env: {{}\n"
                        }
                    ],
                    "name": "test-app-6d87b88f4d-sjhgm"
                },
                "message": "Pod \"test-app-6d87b88f4d-sjhgm\" is invalid: spec: Forbidden: pod updates may not change fields other than `spec.containers[*].image`, `spec.initContainers[*].image`, `spec.activeDeadlineSeconds`, `spec.tolerations` (only additions to existing tolerations) or `spec.terminationGracePeriodSeconds` (allow it to be set to 1 if it was previously negative)\n  core.PodSpec{\n  \tVolumes:        {{Name: \"aws-iam-token\", VolumeSource: {Projected: &{Sources: {{ServiceAccountToken: &{Audience: \"sts.amazonaws.com\", ExpirationSeconds: 86400, Path: \"token\"}}}, DefaultMode: &420}}}, {Name: \"istio-envoy\", VolumeSource: {EmptyDir: &{Medium: \"Memory\"}}}, {Name: \"istio-data\", VolumeSource: {EmptyDir: &{}}}, {Name: \"istio-podinfo\", VolumeSource: {DownwardAPI: &{Items: {{Path: \"labels\", FieldRef: &{APIVersion: \"v1\", FieldPath: \"metadata.labels\"}}, {Path: \"annotations\", FieldRef: &{APIVersion: \"v1\", FieldPath: \"metadata.annotations\"}}}, DefaultMode: &420}}}, ...},\n  \tInitContainers: {{Name: \"istio-validation\", Image: \"docker.io/istio/proxyv2:1.13.4\", Args: {\"istio-iptables\", \"-p\", \"15001\", \"-z\", ...}, Env: {{}\n",
                "status": "Failure"
            },
            "stageTimestamp": "2023-05-18T14:46:47.245695Z",
            "sourceIPs": [
                "10.52.140.48"
            ],
            "stage": "ResponseComplete",
            "service": "eks",
            "http": {
                "url_details": {
                    "path": "/api/v1/namespaces/test/pods/test-app-6d87b88f4d-sjhgm"
                },
                "status_code": 422,
                "method": "update",
                "status_category": "warning",
                "useragent_details": {
                    "os": {
                        "family": "Linux"
                    },
                    "browser": {
                        "family": "Other"
                    },
                    "device": {
                        "family": "Other",
                        "category": "Desktop"
                    }
                }
            },
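The `patch` annotation in the audit event above is itself a JSON string that records which webhook mutated the request and which JSON Patch operations it applied. As a rough aid for reading such an entry, here is a minimal Python sketch that decodes a shortened sample of that value and lists only the operations touching the pod spec. The helper name `summarize_patch` and the three-operation sample are assumptions made for illustration; the webhook name and paths are copied from the log.

```python
import json

# Shortened sample of the annotation value from the audit log
# (only three of the recorded operations are kept for brevity).
sample_value = json.dumps({
    "configuration": "kyverno-resource-mutating-webhook-cfg",
    "webhook": "mutate.kyverno.svc-fail",
    "patch": [
        {"op": "remove", "path": "/spec/containers/0/env/7"},
        {"op": "remove", "path": "/status/conditions/0/lastProbeTime"},
        {"op": "add", "path": "/spec/containers/0/env/0",
         "value": {"name": "_HOST_IP",
                   "valueFrom": {"fieldRef": {"fieldPath": "status.hostIP"}}}},
    ],
    "patchType": "JSONPatch",
})


def summarize_patch(annotation_value):
    """Hypothetical helper: return the webhook name and the ops that touch /spec."""
    record = json.loads(annotation_value)
    spec_ops = [
        (op["op"], op["path"])
        for op in record["patch"]
        if op["path"].startswith("/spec/")
    ]
    return record["webhook"], spec_ops


webhook, spec_ops = summarize_patch(sample_value)
print(webhook)
for op, path in spec_ops:
    print(op, path)
```

Any operation under `/spec/` on an update request is worth a closer look, because most pod spec fields are immutable after creation, as the 422 response in this log states.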


zachaller commented 1 year ago

I have not seen this reproduced anywhere, and there are things in the logs that make me think some Kyverno policy, or something else you have configured, is blocking the update. Can you confirm that there is no policy in place that would affect pod updates?

ngonemettle commented 1 year ago

We have kyverno installed on our cluster but we don't have policies blocking pod updates. We do have ones adding labels.

mikebryant commented 1 year ago

I think I just tracked this down. We have policies that change Pods, which has so far always been fine. What's happened here is that we have a policy doing a JSON patch, which, unlike a strategic merge patch, doesn't inherently become a no-op when it's repeated. We haven't had any issues elsewhere, because nothing else updates Pods; they are only created or deleted. argo-rollouts mutates existing Pods here, but aims to touch only the metadata, which would be fine on its own, but it triggers the policy to apply again.

Though my initial attempt to fix this by setting

    preconditions:
      all:
      - key: "{{request.operation || 'BACKGROUND'}}"
        operator: Equals
        value: CREATE

doesn't seem to have worked, so I'm still a bit confused.
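To illustrate the point about a repeated JSON patch not being a no-op, here is a minimal Python sketch. It is not Kyverno's implementation: `apply_ops` is a hypothetical helper that supports only `add` and `remove` on list indices, `merge_by_name` only imitates the merge-by-key behaviour of a strategic merge patch on `env`, and the `APP_MODE` variable is made up. The `_HOST_IP` entry and the patch path are taken from the audit log.

```python
import copy


def apply_ops(env, ops):
    """Apply 'add'/'remove' JSON Patch ops that target a list index (tiny RFC 6902 subset)."""
    env = copy.deepcopy(env)
    for op in ops:
        index = int(op["path"].rsplit("/", 1)[1])
        if op["op"] == "add":
            # 'add' at a list index inserts and shifts later items; it never overwrites.
            env.insert(index, op["value"])
        elif op["op"] == "remove":
            del env[index]
        else:
            raise ValueError(f"unsupported op: {op['op']}")
    return env


def merge_by_name(env, additions):
    """Imitate a strategic merge on env: entries are keyed by 'name'."""
    merged = {entry["name"]: entry for entry in env}
    for entry in additions:
        merged[entry["name"]] = entry
    return list(merged.values())


host_ip = {"name": "_HOST_IP", "valueFrom": {"fieldRef": {"fieldPath": "status.hostIP"}}}
original_env = [{"name": "APP_MODE", "value": "prod"}]  # made-up starting env
ops = [{"op": "add", "path": "/spec/containers/0/env/0", "value": host_ip}]

after_create = apply_ops(original_env, ops)   # policy runs when the Pod is created
after_update = apply_ops(after_create, ops)   # policy runs again on a later update
print(len(after_create), len(after_update))   # 2 3
print(after_create == after_update)           # False: the spec changed on update

merged_once = merge_by_name(original_env, [host_ip])
merged_twice = merge_by_name(merged_once, [host_ip])
print(merged_once == merged_twice)            # True: repeating the merge is a no-op
```

In the audit log the re-applied policy appears to have produced both `remove` and `add` operations on `/spec/containers/*/env`, so the pod spec in the update request differed from the stored one, and the API server rejected it because those fields are immutable.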

ngonemettle commented 1 year ago

We have now fixed the Kyverno policy, and Argo Rollouts is working fine. I'm closing the issue. Thank you for your support.

zachaller commented 1 year ago

Awesome, glad it is working for you.