argoproj / argo-rollouts

Progressive Delivery for Kubernetes
https://argo-rollouts.readthedocs.io/
Apache License 2.0

Rollouts cycle into degraded state during blue-green pause #3843

Open miles-w-3 opened 6 days ago

miles-w-3 commented 6 days ago

Describe the bug When a Rollout using the blue-green deployment strategy is left in a suspended state for 15+ minutes, it repeatedly cycles into a Degraded state and is then set back to Suspended.

To Reproduce Trigger a blue-green preview for your rollout, then leave it in a suspended state. Eventually, you will see events cycling it from Suspended to Degraded and back.

Expected behavior The rollout remains in a consistent suspended state until resumed or aborted

Version first discovered on 2.32.2, still reproducible on master

Logs The controller log shows the rollout switching to the Degraded phase, then switching back to Paused

INFO[0917] Processing completed                          resource=default/fish
INFO[0917] Patched: {"status":{"conditions":[{"lastTransitionTime":"2024-09-11T05:30:19Z","lastUpdateTime":"2024-09-11T05:30:19Z","message":"Rollout has minimum availability","reason":"AvailableReason","status":"True","type":"Available"},{"lastTransitionTime":"2024-09-20T13:24:36Z","lastUpdateTime":"2024-09-20T13:24:36Z","message":"Rollout is not healthy","reason":"RolloutHealthy","status":"False","type":"Healthy"},{"lastTransitionTime":"2024-09-20T13:24:36Z","lastUpdateTime":"2024-09-20T13:24:36Z","message":"RolloutCompleted","reason":"RolloutCompleted","status":"False","type":"Completed"},{"lastTransitionTime":"2024-09-20T13:24:37Z","lastUpdateTime":"2024-09-20T13:24:37Z","message":"Rollout is paused","reason":"RolloutPaused","status":"True","type":"Paused"},{"lastTransitionTime":"2024-09-20T19:08:33Z","lastUpdateTime":"2024-09-20T19:08:33Z","message":"ReplicaSet \"fish-79bfcd94f7\" has timed out progressing.","reason":"ProgressDeadlineExceeded","status":"False","type":"Progressing"}],"message":"ProgressDeadlineExceeded: ReplicaSet \"fish-79bfcd94f7\" has timed out progressing.","phase":"Degraded"}}  generation=3 namespace=default resourceVersion=64792278 rollout=fish
INFO[0917] persisted to informer                         generation=3 namespace=default resourceVersion=64799182 rollout=fish
INFO[0917] Reconciliation completed                      generation=3 namespace=default resourceVersion=64792278 rollout=fish time_ms=97.999375
INFO[0917] Started syncing rollout                       generation=3 namespace=default resourceVersion=64799182 rollout=fish
INFO[0917] invalidated cache for resource in namespace: argo-rollouts with the name: argo-rollouts-notification-configmap
INFO[0917] Patched conditions: {"status":{"conditions":[{"lastTransitionTime":"2024-09-11T05:30:19Z","lastUpdateTime":"2024-09-11T05:30:19Z","message":"Rollout has minimum availability","reason":"AvailableReason","status":"True","type":"Available"},{"lastTransitionTime":"2024-09-20T13:24:36Z","lastUpdateTime":"2024-09-20T13:24:36Z","message":"Rollout is not healthy","reason":"RolloutHealthy","status":"False","type":"Healthy"},{"lastTransitionTime":"2024-09-20T13:24:36Z","lastUpdateTime":"2024-09-20T13:24:36Z","message":"RolloutCompleted","reason":"RolloutCompleted","status":"False","type":"Completed"},{"lastTransitionTime":"2024-09-20T13:24:37Z","lastUpdateTime":"2024-09-20T13:24:37Z","message":"Rollout is paused","reason":"RolloutPaused","status":"True","type":"Paused"},{"lastTransitionTime":"2024-09-20T19:08:33Z","lastUpdateTime":"2024-09-20T19:08:33Z","message":"Rollout is paused","reason":"RolloutPaused","status":"Unknown","type":"Progressing"}],"message":"BlueGreenPause","phase":"Paused"}}  generation=3 namespace=default resourceVersion=64799182 rollout=fish

I believe this is a bug where the logic that excludes paused states from the progress deadline check only looks for canary pauses, not blue-green pauses. I will try to add logic that also checks for a blue-green pause.


Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.

ipeacocks commented 4 days ago

It's arguably not a bug but a feature. You can increase progressDeadlineSeconds, which defaults to 10 minutes.

miles-w-3 commented 3 days ago

The progressDeadlineSeconds timer is not supposed to advance while the Rollout is paused, according to the spec here:

  # The maximum time in seconds in which a rollout must make progress during
  # an update, before it is considered to be failed. Argo Rollouts will
  # continue to process failed rollouts and a condition with a
  # ProgressDeadlineExceeded reason will be surfaced in the rollout status.
  # Note that progress will not be estimated during the time a rollout is
  # paused.
  # Defaults to 600s
  progressDeadlineSeconds: 600