argoproj / argo-rollouts

Progressive Delivery for Kubernetes
https://argo-rollouts.readthedocs.io/
Apache License 2.0

Aborting canary rollouts with ALB based traffic-routing causes toggling rollouts-pod-template-hash in preview service #3758

Open matbhe opened 3 months ago

matbhe commented 3 months ago

Describe the bug Aborting a running canary rollout with ALB traffic routing can cause the rollouts-pod-template-hash value in the preview service to toggle. This toggling probably destabilizes the ALB target group, which then results in 503 errors from the ALB.

To Reproduce Create a scenario as defined in the attached manifest.txt.

When everything is up and running (initial rollout healthy and ingresses reachable), do the following:

  1. Replace the container in the rollout with rollouts-demo:green and apply the change
  2. Wait until the rollout is in progress and the analysis template is running
  3. Abort the running rollout (kubectl argo rollouts abort abort-test-rollout)

For easier testing I've attached a simple test script (it should work if kubectl and the AWS CLI are working):

  1. Copy manifest.txt and testscript.txt to a local folder
  2. Verify that kubectl and the AWS CLI are working
  3. Adjust domains and ingress certificates in manifest.txt
  4. Adjust argo-rollouts namespace in testscript.txt
  5. Run ./testscript.txt (chmod..)

manifest.txt testscript.txt

Expected behavior Argo Rollouts should not toggle the rollouts-pod-template-hash value in the preview service and should not cause 503 errors in the ALB.

Version Tested with version 1.7.1 (release-1.7 branch)

Logs Controller rollouts-controller.log Shows the reconciliation loop...

Resource - Watch (preview service) resource-watch.log Shows the rollouts-pod-template-hash toggling...

TargetGroups target-groups.log Shows the flickering target groups

zachaller commented 3 months ago

This needs to be documented because we ran into the same issue. In order to use ALB you must use ping-pong and not have canaryService and stableService defined in the Rollout object. To mimic the same behavior, you can use ephemeral metadata and Kubernetes services with static selectors that match the labels from your ephemeral metadata.
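The workaround described above could look roughly like the following. This is a minimal sketch, not a tested configuration. It assumes that "ephemeral metadata" refers to the canaryMetadata and stableMetadata fields of the canary strategy. All resource names (demo-rollout, ping-service, pong-service, root-service, alb-ingress, demo-canary-only), the role label, and the ports are hypothetical. The ping, pong, and root Service objects and the Ingress are omitted.

```yaml
# Sketch only: every name, label, and port below is hypothetical.
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: demo-rollout
spec:
  replicas: 2
  selector:
    matchLabels:
      app: demo
  template:
    metadata:
      labels:
        app: demo
    spec:
      containers:
        - name: demo
          image: argoproj/rollouts-demo:blue
          ports:
            - containerPort: 8080
  strategy:
    canary:
      # Ping-pong replaces canaryService/stableService for ALB traffic routing.
      pingPong:
        pingService: ping-service
        pongService: pong-service
      trafficRouting:
        alb:
          ingress: alb-ingress
          rootService: root-service   # assumed to be the service referenced by the ingress rule
          servicePort: 80
      # Ephemeral metadata: labels the controller adds to canary or stable pods
      # for as long as they are in that role.
      canaryMetadata:
        labels:
          role: canary
      stableMetadata:
        labels:
          role: stable
      steps:
        - setWeight: 20
        - pause: {}
---
# Extra Service with a static selector that is not managed by the rollout.
# It follows whichever pods currently carry the ephemeral "role: canary" label.
apiVersion: v1
kind: Service
metadata:
  name: demo-canary-only
spec:
  selector:
    app: demo
    role: canary
  ports:
    - port: 80
      targetPort: 8080
```

A matching service with `role: stable` in its selector would track the stable pods in the same way. Because the controller never rewrites the selectors of these extra services, they are not subject to the rollouts-pod-template-hash toggling described in this issue.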

danil-smirnov commented 3 months ago

@zachaller Is there a way to have a K8s service that always points to the canary pods when using the ping-pong feature? We need this for testing purposes.

zachaller commented 3 months ago

Yes, you can use the ephemeralMetadata field and a service that manually selects it.

pshrm commented 3 months ago

> Yes, you can use the ephemeralMetadata field and a service that manually selects it.

So you suggest creating additional services that select pods based on ephemeral metadata, apart from the ping and pong service objects? At a minimum we would have 5 services.