Describe the bug

Currently, we have the following (simplified) setup:

During the initialization that is performed when migrating from a standard Deployment to a Canary, there is a slight downtime when we apply the following canary:
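(The reporter's actual manifests are not included in this excerpt. As a rough illustration only, with hypothetical names, port, and analysis settings, a Traefik-backed Flagger Canary of the kind described looks roughly like this:)

```yaml
# Illustrative sketch only -- not the reporter's actual configuration.
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: test-app
spec:
  provider: traefik
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: test-app        # the Deployment being migrated to a Canary
  service:
    port: 80
  analysis:
    interval: 30s
    threshold: 5
    stepWeight: 10
    maxWeight: 50
```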
This happens because the final part of the initialization is (from my understanding) done in this order:
1. Scale the replicas of the `test-app` Deployment down to 0
2. Update the `test-app` Service's selector labels to reference the primary pods
3. Update the `TraefikService` to point to the `test-app-primary` Service
Due to this, during steps 1 and 2 there is a slight time window in which a couple of 502 errors will be returned, since the `test-app` Service has no pods to reference.
To Reproduce
Apply the configurations above, use a load-testing tool such as https://github.com/tsenart/vegeta to send requests to the ingress, and observe the 502 errors being returned.
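For instance, a vegeta run of the following shape (the target URL is hypothetical; point it at your own ingress) would surface the 502s under "Status Codes" in the report while the migration happens:

```shell
# Hypothetical target host; replace with your ingress URL.
# Guarded so the snippet degrades gracefully where vegeta is not installed.
if command -v vegeta >/dev/null 2>&1; then
  echo "GET http://test-app.example.com/" \
    | vegeta attack -rate=50/s -duration=10s \
    | vegeta report   # non-200 responses appear under "Status Codes"
else
  echo "vegeta not installed; see https://github.com/tsenart/vegeta"
fi
```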
Expected behavior
No 502s returned when migrating to Canary.
Additional context
Flagger version: v1.31.0
Kubernetes version: 1.25
Service Mesh provider: traefik
Updating the Service before scaling down the Deployment (swapping steps 1 and 2) seems like a good option to fix this issue. I wouldn't mind submitting a PR with this fix, but I'd like to make sure that this would be a correct approach.
Hello @miguelvalerio, thanks for opening this issue and volunteering to fix it! Please take it on; your recommended fix should work. Looking forward to your PR :)