argoproj / argo-rollouts

Progressive Delivery for Kubernetes
https://argo-rollouts.readthedocs.io/
Apache License 2.0
2.72k stars 850 forks source link

Argo rollouts manages virtualService field after a rollout is complete. Causes future changes to VS to not sync #3854

Open taer opened 16 hours ago

taer commented 16 hours ago

Checklist:

Describe the bug We use rollouts with istio traffic management. Initially, when argo-rollouts changed the weights in the virtual service, argocd would fight it, causing sync flapping. We used to have auto-sync on, so this was a problem. To remedy, we added this to the application

  ignoreDifferences:
  - group: '*'
    kind: '*'
    managedFieldsManagers:
    - rollouts-controller

This solved the sync issue.

Now, when a rollout occurs, Rollouts correctly modifies the VirtualService weights for the canary, and argo doesn't fight it because rollouts set this in the VS prior to changing the weights

    - apiVersion: networking.istio.io/v1alpha3
      fieldsType: FieldsV1
      fieldsV1:
        f:spec:
          f:http: {}
      manager: rollouts-controller
      operation: Update
      time: '2024-09-26T20:24:14Z'

All is well for the rollout. However, once one rollout happens, that managedField remains in the virtualService, making Argocd never see any changes to the VS unless we manually remove that managed field. Then it instantly sees it

To Reproduce reproducible deploy here: https://github.com/taer/argo-rollout-issue/tree/master

use the linked application. Deploy it. No VS changes occur. Change the VS http section. I changed the prefix

    - match:
        - uri:
            prefix: /headers

Prior to a rollout, changing this results in an expected diff. Now change the pod template. I edited

        - name: CHANGEME
          value: aa

to force a rollout. ArgoRollouts starts, deploys the new pod, then adds the ManagedField to the VS and starts to alter the weights for the canary deploy. When complete, the managed fields still exist. Now this means any future changes to the VirtualService in git are ignored.

Expected behavior

I expect the managedField argo rollouts inserted to be removed when the rollout is complete so that future virtual service changes can be noticed by argocd.

Version Argo-rollouts: 1.7.2 argo-rollouts-helm: 2.37.5 istio: 1.22.3

Logs

Logs are in the github repro repo. https://github.com/taer/argo-rollout-issue/blob/master/errorLog.log


Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.

taer commented 15 hours ago

I've traced the logic down to here: https://github.com/argoproj/argo-rollouts/blob/master/rollout/trafficrouting/istio/istio.go#L1501 func (r *Reconciler) RemoveManagedRoutes() error {

This smells like it wants to reset the managed routes. it's called from reconcileTrafficRouting https://github.com/argoproj/argo-rollouts/blob/master/rollout/trafficrouting.go#L176

        } else if rolloututil.IsFullyPromoted(c.rollout) {
            err := reconciler.RemoveManagedRoutes()
            if err != nil {
                return err
            }

I don't see any logs about RemoveManagedRoutes getting called, and that's about as far as I can make it. I'm wondering if the IsFullyPromoted is saying false ro.Status.StableRS == ro.Status.CurrentPodHash

in my rollout status, they are equal. Of course, so is the canary since it's complete

status:
  HPAReplicas: 1
  availableReplicas: 1
  blueGreen: {}
  canary:
    weights:
      canary:
        podTemplateHash: 689f7f6654
        serviceName: httpbin-canary
        weight: 0
      stable:
        podTemplateHash: 689f7f6654
        serviceName: httpbin
        weight: 100
  conditions:
    - lastTransitionTime: '2024-09-26T20:19:49Z'
      lastUpdateTime: '2024-09-26T20:19:49Z'
      message: Rollout has minimum availability
      reason: AvailableReason
      status: 'True'
      type: Available
    - lastTransitionTime: '2024-09-26T20:39:08Z'
      lastUpdateTime: '2024-09-26T20:39:08Z'
      message: RolloutCompleted
      reason: RolloutCompleted
      status: 'True'
      type: Completed
    - lastTransitionTime: '2024-09-26T20:39:08Z'
      lastUpdateTime: '2024-09-26T20:39:08Z'
      message: Rollout is paused
      reason: RolloutPaused
      status: 'False'
      type: Paused
    - lastTransitionTime: '2024-09-26T20:39:39Z'
      lastUpdateTime: '2024-09-26T20:39:39Z'
      message: Rollout is healthy
      reason: RolloutHealthy
      status: 'True'
      type: Healthy
    - lastTransitionTime: '2024-09-26T20:39:08Z'
      lastUpdateTime: '2024-09-26T20:39:39Z'
      message: ReplicaSet "httpbin-689f7f6654" has successfully progressed.
      reason: NewReplicaSetAvailable
      status: 'True'
      type: Progressing
  currentPodHash: 689f7f6654
  currentStepHash: 84d9c65666
  currentStepIndex: 2
  observedGeneration: '6'
  phase: Healthy
  readyReplicas: 1
  replicas: 1
  selector: app=httpbin
  stableRS: 689f7f6654
  updatedReplicas: 1