Open bpoland opened 3 years ago
@bpoland, could you elaborate on how you are triggering a restart, to help reproduce the behavior? Is the new pod a new version? It would be helpful if you could provide some output from get rollout <rollout name>. Thanks.
Ah sorry -- I just mean manually triggering a restart of pods with kubectl argo rollouts restart <name> or via the rollouts dashboard. With a Deployment it would be kubectl rollout restart deployment/<name>.
Here's the output before starting:
# kubectl argo rollouts get rollout example
Name: example
Namespace: default
Status: ✔ Healthy
Strategy: Canary
Step:
SetWeight: 100
ActualWeight: 100
Images: repo/example:994dab92f26ef64d40a3249389f82a7843fa14b4 (stable)
Replicas:
Desired: 1
Current: 1
Updated: 1
Ready: 1
Available: 1
NAME KIND STATUS AGE INFO
⟳ example Rollout ✔ Healthy 2d
├──# revision:3
│ └──⧉ example-57b9f46fb7 ReplicaSet ✔ Healthy 2h stable
│ └──□ example-57b9f46fb7-sks58 Pod ✔ Running 2h ready:2/2
├──# revision:2
│ └──⧉ example-6cc8d67b74 ReplicaSet • ScaledDown 1d
└──# revision:1
└──⧉ example-5f87868bcc ReplicaSet • ScaledDown 2d
Then I run kubectl argo rollouts restart example and right afterwards:
# kubectl argo rollouts get rollout example
Name: example
Namespace: default
Status: ◌ Progressing
Message: updated replicas are still becoming available
Strategy: Canary
Step:
SetWeight: 100
ActualWeight: 100
Images: repo/example:994dab92f26ef64d40a3249389f82a7843fa14b4 (stable)
Replicas:
Desired: 1
Current: 1
Updated: 1
Ready: 0
Available: 0
NAME KIND STATUS AGE INFO
⟳ example Rollout ◌ Progressing 2d
├──# revision:3
│ └──⧉ example-57b9f46fb7 ReplicaSet ◌ Progressing 2h stable
│ ├──□ example-57b9f46fb7-sks58 Pod ◌ Terminating 2h ready:2/2
│ └──□ example-57b9f46fb7-kh9kh Pod PodInitializing 4s ready:0/2
├──# revision:2
│ └──⧉ example-6cc8d67b74 ReplicaSet • ScaledDown 1d
└──# revision:1
└──⧉ example-5f87868bcc ReplicaSet • ScaledDown 2d
As you can see, the running pod is immediately terminated and we are left with 0 ready/available pods until the new one starts.
Hi @jessesuen. Could you comment on this? Is that expected behavior? If not, how could that be implemented in the case of Rollouts: do we need to provide some specific logic, or just repeat the canary/blue-green behavior?
Yes. The documentation is clear on this: https://argoproj.github.io/argo-rollouts/features/restart/
Note: Unlike deployments, where a "restart" is nothing but a normal rolling upgrade that happened to be triggered by a timestamp in the pod spec annotation, Argo Rollouts facilitates restarts by terminating pods and allowing the existing ReplicaSet to replace the terminated pods. This design choice was made in order to allow a restart to occur even when a Rollout was in the middle of a long-running blue-green/canary update (e.g. a paused canary). However, some consequences of this are:
- Restarting a Rollout which has a single replica will cause downtime since Argo Rollouts needs to terminate the pod in order to replace it.
- Restarting a rollout will be slower than a deployment's rolling update, since maxSurge is not used to bring up newer pods faster.
- maxUnavailable will be used to restart multiple pods at a time (starting in v0.10). But if maxUnavailable is 0, the controller will still restart pods one at a time.
I would love a way to handle this, but the hard part of this problem is we need to support restarts in the middle of canary & blue-green updates. This makes the problem much harder than rolling update's "restart" which is nothing but another update.
Ah thanks, I had missed that in the docs. Just trying to understand a bit better -- why would respecting maxSurge cause problems during a canary/blue-green update? It seems like removing and recreating pods would already change the balance of pods while it was happening.
Sorry I think I didn't convey a lot of context. So first it's important to understand that we have a requirement to perform rollout restarts in the middle of an update (e.g. a paused rollout).
Note that with the normal rollingUpdate strategy, a restart is just another update triggered via a pod annotation. This is how kubectl rollout restart DEPLOYMENT works: it simply annotates the pod template spec with a timestamp, and takes advantage of the fact that a new rolling update will happen, which restarts all pods indirectly.
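A minimal sketch of that annotation trick: the restart is roughly equivalent to a strategic-merge patch on the Deployment that just bumps a timestamp annotation in the pod template (the timestamp value below is a placeholder):

spec:
  template:
    metadata:
      annotations:
        # bumped on every restart; the pod template change then
        # triggers an ordinary rolling update
        kubectl.kubernetes.io/restartedAt: "2021-06-01T12:00:00Z"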
Rollouts cannot do this same trick with a timestamp in a pod annotation. As with a Deployment, any change to the pod template spec is considered an update, but unlike Deployments, a Rollout update goes through canary steps, analysis, etc. This is not what we want to do if the user is requesting just a restart of all their pods. So we had to invent a different mechanism to perform a restart. The mechanism is to evict/delete pods (going as low as maxUnavailable).
So because our restart mechanism is just deleting pods and allowing them to become recreated (as opposed to managing replicaset scale), we don't have a way to surge up -- at least not without complicating the code greatly.
Also, one technical limitation that we discovered about ReplicaSets is that if you scale a ReplicaSet up and then down, the ReplicaSet controller will choose the newest pod to delete when scaling down. In other words, if I have a ReplicaSet of size 5, and I want to restart all 5 pods, if I scale the ReplicaSet to 10 and then back down to 5, I will be left with the original 5 pods. This goes against our desire to restart the oldest pods (which have a creationTimestamp before the restart timestamp).
Thanks for the detailed info, that's very helpful!
It does seem like it would be possible to scale up the ReplicaSet and then manually evict the oldest pods as it does now, but I can see how that would complicate the code. I also understand this is an edge case that isn't the highest priority :)
kubectl-argorollout-restart:
#!/usr/bin/env bash
set -euo pipefail

if [ "$#" -ne "2" ]; then
  echo "usage: $0 NAMESPACE ROLLOUT"
  exit 2
fi

NAMESPACE="$1"
ROLLOUT="$2"

# Refuse to run against an explicitly paused rollout (.spec.paused=true).
paused="$(kubectl get rollout -ojsonpath="{.spec.paused}" -n "$NAMESPACE" "$ROLLOUT")"
if [ "$paused" = "true" ]; then
  echo "can't restart paused rollout, unpause by removing .spec.paused=true first" >&2
  exit 1
fi

# Bump the restartedAt annotation on the pod template; the pod template change
# triggers a normal update instead of in-place pod deletion.
kubectl patch rollout --type merge -n "$NAMESPACE" "$ROLLOUT" -p "$(cat <<EOF
{"spec": { "template": { "metadata": { "annotations": {
  "kubectl.kubernetes.io/restartedAt": "$(date --rfc-3339=seconds)"
}}}}}
EOF
)"
Hello,
In my case, argo rollouts restart causes random short downtime even with multiple replicas. It seems that traffic goes to the killed pod for a short time, which leads to 500 errors from the ingress controller (in my case the AWS ingress controller).
I guess the kubectl.kubernetes.io/restartedAt feature might not support zero downtime (stop forwarding traffic to the pod, send the kill signal, wait for the grace period, then kill: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-termination).
I hope that argo rollouts restart will follow the normal rollout process like kubectl rollout restart does.
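For reference, the "stop forwarding traffic, wait for the grace period, then kill" sequence linked above is tuned on the pod template itself; a minimal sketch of the relevant fields inside spec.template.spec (container name, image, and durations are only placeholders):

spec:
  template:
    spec:
      terminationGracePeriodSeconds: 30   # how long Kubernetes waits after SIGTERM before SIGKILL
      containers:
      - name: app                         # placeholder container name
        image: repo/example:tag           # placeholder image
        lifecycle:
          preStop:
            exec:
              # placeholder: keep the pod serving briefly so endpoints/ingress
              # stop routing to it before the process receives SIGTERM
              command: ["sleep", "10"]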
Restarting pods for whatever reason (e.g. hygiene purposes, reloading secrets) SHOULD be tested like other change operations, since there can be a lot of unexpected problems: cache stampede, an updated secret with an invalid value, etc. Better safe than sorry.
This issue is stale because it has been open 60 days with no activity.
Resolved it by using a workloadRef to a Deployment and controlling the rolling restart via the Deployment. The only downside is that replicas are managed by the Rollout, so the Deployment is not aware of how many replicas we are running on! But at least it solves the immediate termination.
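A minimal sketch of that setup, with placeholder names and values (the Rollout owns the replica count and strategy, while the pod template lives on the referenced Deployment):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: example
spec:
  replicas: 0                    # commonly set to 0 so that only the Rollout creates pods
  selector:
    matchLabels:
      app: example
  template:
    metadata:
      labels:
        app: example
    spec:
      containers:
      - name: app
        image: repo/example:tag  # placeholder image
---
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: example
spec:
  replicas: 2                    # replica count is owned here, not on the Deployment
  selector:
    matchLabels:
      app: example
  workloadRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example
  strategy:
    canary: {}

With this shape, kubectl rollout restart deployment/example bumps the annotation on the referenced pod template, and the Rollout picks that up as a regular update instead of deleting its pods in place.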
I am having the same issue with a Rollout that has 2 replicas. When I use restart, the Rollout terminates 1 currently running stable pod and creates a new one simultaneously.
I expected the behavior to be the same as when I roll out a new version: the canary pods start first, and after that the stable pod is terminated.
For the Blue Green strategy, the logic works fine when I restart the rollout: new Blue pods start first --> traffic is switched from Green to Blue --> Green pods are terminated.
@AnhQKatalon: did you try with a workloadRef to a Deployment?
Same here. How can I make Argo Rollouts restart pods one by one, and not terminate the old pod until the new one becomes healthy?
Define a PodDisruptionBudget. Or use canary with dynamicStableScale and incremental setWeight steps. Or use canary without steps but with maxUnavailable and maxSurge.
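A rough sketch of the last option plus a PodDisruptionBudget, with placeholder names and values:

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: example
spec:
  replicas: 2
  selector:
    matchLabels:
      app: example
  template:
    metadata:
      labels:
        app: example
    spec:
      containers:
      - name: app
        image: repo/example:tag  # placeholder image
  strategy:
    canary:
      # no steps: an update surges up and scales down much like a
      # Deployment rolling update
      maxSurge: 1
      maxUnavailable: 0
---
# A PodDisruptionBudget additionally limits how many pods voluntary
# disruptions (such as evictions) may take down at once.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: example-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: example             # must match the Rollout's pod labels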
This story only applies to single-replica Rollouts. The behavior you want is already supported. https://argo-rollouts.readthedocs.io/en/stable/features/canary/#mimicking-rolling-update
Hi, using a BlueGreen deployment with only 1 pod, the restart action via the ArgoCD UI kills my pod and creates a new one, causing a period of unavailability. I'm looking to have the same behavior as with a classic Deployment object using the "RollingUpdate" strategy.
Summary
In a Rollout with 1 replica, triggering a restart causes the running pod to be terminated immediately, at the same time that the new pod starts up. This happens even when explicitly setting maxUnavailable: 0 (the default is documented as 25%, which I would have expected to be rounded down to 0 pods). This leaves no pods available until the new one becomes ready. I know 1 replica is risky anyway, but triggering a restart of a Deployment with 1 pod does what I would expect: it starts a new pod, waits for it to become ready, then terminates the old pod.
Diagnostics
Rollouts 1.1
Message from the maintainers:
Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.