argoproj / argo-rollouts

Progressive Delivery for Kubernetes
https://argo-rollouts.readthedocs.io/
Apache License 2.0
2.61k stars 820 forks source link

experiments: recovered from panic, on v1.4 #2608

Open Lykathia opened 1 year ago

Lykathia commented 1 year ago

When running an experiment, a number of nil pointer exceptions are thrown - before the experiment eventually passes and moseys on its merry way.

We use istio in our environment.

It seems to resolve itself after a dozen+ failures.

To Reproduce

spec:
  analysis:
    successfulRunHistoryLimit: 1
    unsuccessfulRunHistoryLimit: 1
  progressDeadlineAbort: true
  replicas: 2
  revisionHistoryLimit: 5
  strategy:
    canary:
      steps:
        - experiment:
            analyses:
              - args:
                  - name: api-root-url
                    value: >-
                      http://example-preview.default.svc.cluster.local:8080
                  - name: request-timeout-milliseconds
                    value: '1000'
                name: example-smoke-test-analysis
                requiredForCompletion: true
                templateName: example-smoke-test-analysis
            duration: 5m
            templates:
              - metadata:
                  labels:
                    app.kubernetes.io/name: example-preview
                name: preview
                replicas: 1
                selector:
                  matchLabels:
                    app.kubernetes.io/name: example-preview
                specRef: canary
        - setWeight: 5
        - pause:
            duration: 10m
      trafficRouting:
        istio:
          destinationRule:
            canarySubsetName: canary
            name: example-mesh-destination
            stableSubsetName: stable
          virtualService:
            name: example-mesh-virtualservice
  workloadRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-svc-deployment

Expected behavior

The logs to not raise errors of nil pointer exceptions

Version

1.4

Logs

Recovered from panic: runtime error: invalid memory address or nil pointer dereference
goroutine 325 [running]:
runtime/debug.Stack()
    /usr/local/go/src/runtime/debug/stack.go:24 +0x65
github.com/argoproj/argo-rollouts/utils/controller.processNextWorkItem.func1.1.1()
    /go/src/github.com/argoproj/argo-rollouts/utils/controller/controller.go:149 +0x58
panic({0x21c1260, 0x3c0ff10})
    /usr/local/go/src/runtime/panic.go:884 +0x212
github.com/argoproj/argo-rollouts/rollout.(*rolloutContext).calculateWeightDestinationsFromExperiment(0xc00198ac00)
    /go/src/github.com/argoproj/argo-rollouts/rollout/trafficrouting.go:333 +0x1f7
github.com/argoproj/argo-rollouts/rollout.(*rolloutContext).reconcileTrafficRouting(0xc00198ac00)
    /go/src/github.com/argoproj/argo-rollouts/rollout/trafficrouting.go:176 +0x7c5
github.com/argoproj/argo-rollouts/rollout.(*rolloutContext).rolloutCanary(0xc00198ac00)
    /go/src/github.com/argoproj/argo-rollouts/rollout/canary.go:56 +0x166
github.com/argoproj/argo-rollouts/rollout.(*rolloutContext).reconcile(0xc00198ac00)
    /go/src/github.com/argoproj/argo-rollouts/rollout/context.go:86 +0xe7
github.com/argoproj/argo-rollouts/rollout.(*Controller).syncHandler(0xc0000e48c0, {0x29d81b0, 0xc000806c00}, {0xc0023f39b0, 0x27})
    /go/src/github.com/argoproj/argo-rollouts/rollout/controller.go:415 +0x4d3
github.com/argoproj/argo-rollouts/utils/controller.processNextWorkItem.func1.1()
    /go/src/github.com/argoproj/argo-rollouts/utils/controller/controller.go:153 +0x89
github.com/argoproj/argo-rollouts/utils/controller.processNextWorkItem.func1({0x29e4220?, 0xc000522520}, {0x25881fc, 0x7}, 0xc0009a7e70, {0x29d81b0, 0xc000806c00}, 0xc0005812c0?, {0x2093520, 0xc001bdde70})
    /go/src/github.com/argoproj/argo-rollouts/utils/controller/controller.go:157 +0x40b
github.com/argoproj/argo-rollouts/utils/controller.processNextWorkItem({0x29d81b0, 0xc000806c00}, {0x29e4220, 0xc000522520}, {0x25881fc, 0x7}, 0x0?, 0x0?)
    /go/src/github.com/argoproj/argo-rollouts/utils/controller/controller.go:171 +0xbf
github.com/argoproj/argo-rollouts/utils/controller.RunWorker(...)
    /go/src/github.com/argoproj/argo-rollouts/utils/controller/controller.go:104
github.com/argoproj/argo-rollouts/rollout.(*Controller).Run.func1()
    /go/src/github.com/argoproj/argo-rollouts/rollout/controller.go:336 +0xbe
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x0?)
    /go/pkg/mod/k8s.io/apimachinery@v0.24.2/pkg/util/wait/wait.go:155 +0x3e
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0x0?, {0x29b80e0, 0xc00150ccf0}, 0x1, 0xc00017d0e0)
    /go/pkg/mod/k8s.io/apimachinery@v0.24.2/pkg/util/wait/wait.go:156 +0xb6
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0x0?, 0x3b9aca00, 0x0, 0x0?, 0x0?)
    /go/pkg/mod/k8s.io/apimachinery@v0.24.2/pkg/util/wait/wait.go:133 +0x89
k8s.io/apimachinery/pkg/util/wait.Until(0x0?, 0x0?, 0x0?)
    /go/pkg/mod/k8s.io/apimachinery@v0.24.2/pkg/util/wait/wait.go:90 +0x25
created by github.com/argoproj/argo-rollouts/rollout.(*Controller).Run
    /go/src/github.com/argoproj/argo-rollouts/rollout/controller.go:335 +0xa7
zachaller commented 1 year ago

@Lykathia Can you try v1.4.1

Lykathia commented 1 year ago

@zachaller yep, PR in - will get it reviewed and tested on Monday and report back. Thanks!

Lykathia commented 1 year ago

1.4.1 doesn't not appear to have addressed the issue.

Recovered from panic: runtime error: invalid memory address or nil pointer dereference
goroutine 337 [running]:
runtime/debug.Stack()
    /usr/local/go/src/runtime/debug/stack.go:24 +0x65
github.com/argoproj/argo-rollouts/utils/controller.processNextWorkItem.func1.1.1()
    /go/src/github.com/argoproj/argo-rollouts/utils/controller/controller.go:149 +0x58
panic({0x21c24a0, 0x3c11f30})
    /usr/local/go/src/runtime/panic.go:884 +0x212
github.com/argoproj/argo-rollouts/rollout.(*rolloutContext).calculateWeightDestinationsFromExperiment(0xc0019b8c00)
    /go/src/github.com/argoproj/argo-rollouts/rollout/trafficrouting.go:333 +0x1f7
github.com/argoproj/argo-rollouts/rollout.(*rolloutContext).reconcileTrafficRouting(0xc0019b8c00)
    /go/src/github.com/argoproj/argo-rollouts/rollout/trafficrouting.go:176 +0x7c5
github.com/argoproj/argo-rollouts/rollout.(*rolloutContext).rolloutCanary(0xc0019b8c00)
    /go/src/github.com/argoproj/argo-rollouts/rollout/canary.go:56 +0x166
github.com/argoproj/argo-rollouts/rollout.(*rolloutContext).reconcile(0xc0019b8c00)
    /go/src/github.com/argoproj/argo-rollouts/rollout/context.go:86 +0xe7
github.com/argoproj/argo-rollouts/rollout.(*Controller).syncHandler(0xc000210700, {0x29d99f0, 0xc0002ea100}, {0xc002159dd0, 0x27})
    /go/src/github.com/argoproj/argo-rollouts/rollout/controller.go:415 +0x4d3
github.com/argoproj/argo-rollouts/utils/controller.processNextWorkItem.func1.1()
    /go/src/github.com/argoproj/argo-rollouts/utils/controller/controller.go:153 +0x89
github.com/argoproj/argo-rollouts/utils/controller.processNextWorkItem.func1({0x29e5a60?, 0xc000512c20}, {0x25899bc, 0x7}, 0xc000e4de70, {0x29d99f0, 0xc0002ea100}, 0xc000317080?, {0x2094760, 0xc002dc11d0})
    /go/src/github.com/argoproj/argo-rollouts/utils/controller/controller.go:157 +0x40b
github.com/argoproj/argo-rollouts/utils/controller.processNextWorkItem({0x29d99f0, 0xc0002ea100}, {0x29e5a60, 0xc000512c20}, {0x25899bc, 0x7}, 0x2094760?, 0x29b0250?)
    /go/src/github.com/argoproj/argo-rollouts/utils/controller/controller.go:171 +0xbf
github.com/argoproj/argo-rollouts/utils/controller.RunWorker(...)
    /go/src/github.com/argoproj/argo-rollouts/utils/controller/controller.go:104
github.com/argoproj/argo-rollouts/rollout.(*Controller).Run.func1()
    /go/src/github.com/argoproj/argo-rollouts/rollout/controller.go:336 +0xbe
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x3332303622?)
    /go/pkg/mod/k8s.io/apimachinery@v0.24.2/pkg/util/wait/wait.go:155 +0x3e
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0x3a22736c6562616c?, {0x29b9900, 0xc000d63590}, 0x1, 0xc00055d740)
    /go/pkg/mod/k8s.io/apimachinery@v0.24.2/pkg/util/wait/wait.go:156 +0xb6
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0x706d65742d646f70?, 0x3b9aca00, 0x0, 0x37?, 0x61746f6e6e61222c?)
    /go/pkg/mod/k8s.io/apimachinery@v0.24.2/pkg/util/wait/wait.go:133 +0x89
k8s.io/apimachinery/pkg/util/wait.Until(0x6a6f72706f677261?, 0x636e79732f6f692e?, 0x223a22657661772d?)
    /go/pkg/mod/k8s.io/apimachinery@v0.24.2/pkg/util/wait/wait.go:90 +0x25
created by github.com/argoproj/argo-rollouts/rollout.(*Controller).Run
    /go/src/github.com/argoproj/argo-rollouts/rollout/controller.go:335 +0xa7
zachaller commented 1 year ago

@Lykathia I would love to see your rollout object with the status field intact if you could share that it would be awesome. I also have a question if you are trying to use a weight on your experiment which I should be able to gather from the sharing of your ro object.

Lykathia commented 1 year ago

Sure thing! Some nouns edited and some annotations / labels removed

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  annotations:
    argocd.argoproj.io/sync-wave: '10'
    rollout.argoproj.io/revision: '46'
    rollout.argoproj.io/workload-generation: '52'
  creationTimestamp: '2022-06-03T17:50:51Z'
  generation: 39
  labels:
    app.kubernetes.io/instance: production-eks-prod-examplens-example
  name: example-rollout-c89ac2e8
  namespace: examplens
  resourceVersion: '614921626'
  uid: 86e21dfc-7adf-40f5-a9de-b5d368b0a843
spec:
  analysis:
    successfulRunHistoryLimit: 1
    unsuccessfulRunHistoryLimit: 1
  progressDeadlineAbort: true
  replicas: 2
  restartAt: '2022-11-09T19:40:00Z'
  revisionHistoryLimit: 5
  strategy:
    canary:
      steps:
        - experiment:
            analyses:
              - args:
                  - name: api-root-url
                    value: >-
                      http://example-rollout-preview.examplens.svc.cluster.local:8080
                  - name: request-timeout-milliseconds
                    value: '1000'
                name: example-smoke-test-analysis
                requiredForCompletion: true
                templateName: example-smoke-test-analysis
            duration: 5m
            templates:
              - metadata:
                  labels:
                    app.kubernetes.io/name: example-rollout-preview
                name: preview
                replicas: 1
                selector:
                  matchLabels:
                    app.kubernetes.io/name: example-rollout-preview
                specRef: canary
        - setWeight: 5
        - pause:
            duration: 10m
      trafficRouting:
        istio:
          destinationRule:
            canarySubsetName: canary
            name: example-mesh-destination-c833f030
            stableSubsetName: stable
          virtualService:
            name: example-mesh-virtualservice-c8e2aa97
  workloadRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-svc-deployment-c8e30830
status:
  HPAReplicas: 2
  availableReplicas: 2
  blueGreen: {}
  canary:
    weights:
      canary:
        podTemplateHash: 5ccdfd574c
        weight: 0
      stable:
        podTemplateHash: 5ccdfd574c
        weight: 100
  conditions:
    - lastTransitionTime: '2023-02-27T14:45:05Z'
      lastUpdateTime: '2023-02-27T14:45:05Z'
      message: Rollout has minimum availability
      reason: AvailableReason
      status: 'True'
      type: Available
    - lastTransitionTime: '2023-02-27T15:12:21Z'
      lastUpdateTime: '2023-02-27T15:12:21Z'
      message: Rollout is paused
      reason: RolloutPaused
      status: 'False'
      type: Paused
    - lastTransitionTime: '2023-02-27T15:12:32Z'
      lastUpdateTime: '2023-02-27T15:12:32Z'
      message: RolloutCompleted
      reason: RolloutCompleted
      status: 'True'
      type: Completed
    - lastTransitionTime: '2023-02-27T15:13:02Z'
      lastUpdateTime: '2023-02-27T15:13:02Z'
      message: Rollout is healthy
      reason: RolloutHealthy
      status: 'True'
      type: Healthy
    - lastTransitionTime: '2023-02-27T15:12:21Z'
      lastUpdateTime: '2023-02-27T15:13:02Z'
      message: >-
        ReplicaSet "example-rollout-c89ac2e8-5ccdfd574c" has successfully
        progressed.
      reason: NewReplicaSetAvailable
      status: 'True'
      type: Progressing
  currentPodHash: 5ccdfd574c
  currentStepHash: 75dcbd8f68
  currentStepIndex: 3
  observedGeneration: '39'
  phase: Healthy
  readyReplicas: 2
  replicas: 2
  restartedAt: '2022-11-09T19:40:00Z'
  selector: app.kubernetes.io/name=example-svc-c8333369
  stableRS: 5ccdfd574c
  updatedReplicas: 2
  workloadObservedGeneration: '52'

everything seems to complete fine and as expected, just the logs are exploding w/ the NPEs.

github-actions[bot] commented 1 year ago

This issue is stale because it has awaiting-response label for 5 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] commented 1 year ago

This issue is stale because it has been open 60 days with no activity.

zachaller commented 1 year ago

Maybe related https://github.com/argoproj/argo-rollouts/pull/2734

github-actions[bot] commented 1 year ago

This issue is stale because it has been open 60 days with no activity.

zimmertr commented 6 months ago

I saw similar exceptions occurring when calculateWeightDestinationsFromExperiment was called. My guess was that Argo was calling this function to determine the weight the experiment should have and it was returning NULL or an unexpected object or something. Then Argo would try and adjust the Virtual Service weight to this value, and Istio's admission controller would reject the non-zero/positive integer value and Argo would fail to catch the exception gracefully.

I solved it by adding a weight: # to my experiment.