At the time of writing, Eviction used policy/v1beta1, but it has been changed to policy/v1 since 1.22:
The pod/eviction subresource now accepts policy/v1 Eviction requests in addition to policy/v1beta1 Eviction requests (https://github.com/kubernetes/kubernetes/pull/100724, @liggitt) [SIG API Machinery, Apps, Architecture, Auth, CLI, Storage and Testing] https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.22.md#api-change-9
It might be related to this: https://github.com/foriequal0/pod-graceful-drain/blob/main/internal/pkg/webhooks/eviction_mutator.go#L66. However, I have little experience with k8s API migration, so it might take time to prepare a new version.
https://github.com/foriequal0/pod-graceful-drain/pull/32
Would this PR fix this?
Any suggestions on how to install the version in the PR?
For example, normally I'd install it with Helm, but without a release I'm not sure how else to install it.
To test this locally you could:
Then initiate a cluster upgrade to 1.22. The control plane will upgrade fine; it's the worker nodes that can't be upgraded.
I'm happy to do all of this testing but I'm kind of blocked on how to upgrade my existing Helm-installed version of pod graceful drain to the version in the PR.
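If it helps, one way to try the PR without a published release might be to install the chart straight from a checkout of the PR branch; the chart path, release name, and namespace below are guesses:

git clone https://github.com/foriequal0/pod-graceful-drain
cd pod-graceful-drain
# fetch the PR branch (PR #32 mentioned above)
git fetch origin pull/32/head:pr-32 && git checkout pr-32
# install the chart from the working tree; the chart path is a guess
helm upgrade --install pod-graceful-drain ./charts/pod-graceful-drain --namespace kube-system

The catch is that the chart would still point at the published image, so the binary from the PR would also have to be built and pushed somewhere reachable, which is why a tagged pre-release is probably easier.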
Also I spun up a brand new test cluster with 1.21 and tested the 1.22 upgrade workflow without having pod graceful drain installed in the cluster. Everything got tainted, evicted and drained pretty quickly without errors.
There was 30 seconds of 504-related downtime on a test nginx deployment I had set up, but I'm 90% sure that was due to not having pod graceful drain installed! I just wanted to confirm the error is fully isolated to this project (which it seems to be).
If you were feeling semi-confident in this patch, and it's not possible to Helm install something from a branch, perhaps you could cut a beta release, like 0.0.8beta?
Okay. I've released v0.0.8-beta.1 here. Chart version is v0.0.10-beta.1.
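Upgrading an existing install to the beta should look roughly like this, assuming the chart repo was added under the alias pod-graceful-drain and the release lives in kube-system (both are guesses):

helm repo update
# pre-release versions are skipped unless pinned explicitly (or requested with --devel)
helm upgrade pod-graceful-drain pod-graceful-drain/pod-graceful-drain \
  --namespace kube-system --version 0.0.10-beta.1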
Thanks for making this release so quickly!
The short version is that everything mostly worked. The eviction aspect worked, so nodes were able to be drained. I was also able to perform a full 1.21 to 1.22 upgrade (control plane and worker nodes) with zero downtime to a running web service. However, when I manually recreated nodes a second time I experienced 504s for 90 seconds, and I was able to repeat this 504 issue twice. I'm not 100% sure whether this is related to Pod Graceful Drain or Terraform.
Here's a breakdown of what I tested and how I tested it.
Chart version installed: v0.0.10-beta.1
Argo CD in this case is our example app to test zero downtime deploys.
For testing that the app remains up I'm using https://github.com/nickjj/lcurl which makes a request to a host every 250ms and reports back the status code and a few other stats.
In all of the tests below I'm running lcurl https://argocd.example.com 0.25, where example.com is replaced with my real host name.
This is a basic sanity check to ensure things work normally independently of evicting pods and draining nodes.
Restart command: kubectl -n argocd rollout restart deployment.apps/argocd-server
I also ran kubectl get pods -n argocd --watch to keep an eye on the pods.
Without Pod Graceful Drain:
With Pod Graceful Drain:
This was done with Terraform's EKS module. I renamed the node group by appending -2 to its name. This supposedly creates a new set of nodes while doing all of the lower-level tainting, draining, evicting, etc. and deletes the old nodes afterwards.
I ran kubectl get pods -n argocd --watch to keep an eye on the pods and kubectl get nodes -o wide --watch to watch the nodes.
With Pod Graceful Drain:
This 90-second downtime is related to how long Argo CD takes to spin up. This period of downtime would increase depending on how long all apps take to come up.
I'm not sure if this is related to Pod Graceful Drain. I've never done a node upgrade before.
At least the pods can be evicted now which I think means your patch is a success. As for the downtime when nodes get re-created do you think that's related or unrelated to Pod Graceful Drain? Is there anything in Pod Graceful Drain that could maybe interfere with the draining process?
I was under the impression that once a node gets tainted, new pods will not be scheduled on it, and once new nodes capable of running them join your cluster, the old nodes get drained, which involves running duplicate copies of the pods on the new nodes; once that process finishes, the old nodes are terminated. In theory, during this process there would always be at least 1 pod running to ensure zero downtime?
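That "at least 1 pod" expectation is usually enforced with a PodDisruptionBudget, which makes the eviction API refuse to remove the last available pod. A minimal sketch for the Argo CD server (the selector assumes Argo CD's default labels):

kubectl apply -n argocd -f - <<'EOF'
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: argocd-server
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: argocd-server
EOF

Note that with a single replica, minAvailable: 1 blocks the eviction outright rather than keeping a pod up, so it only helps together with 2 or more replicas.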
I deleted the old 1.22 cluster and made a new 1.21 cluster.
I confirmed 1.21 is capable of running v0.0.10-beta.1 on its own without issues, independent of upgrading the cluster. I was able to achieve zero downtime deploys of Argo CD using the first rollout test from above.
This worked flawlessly. There were no 502s or 504s reported. There were 2,200 consecutive 200s reported while the pods were moved from the old nodes to the new nodes over roughly 13 minutes (+12 minutes to upgrade the control plane).
I confirmed 1.22 is capable of running v0.0.10-beta.1 on its own without issues, independent of upgrading the cluster. I was able to achieve zero downtime deploys of Argo CD using the first rollout test from above.
Just to see if the first time was a fluke, I did the same rename process as before and experienced the same 504 downtime for 90 seconds. It's interesting that a cluster upgrade had zero downtime but renaming the node group afterwards had downtime.
If you need any more details please let me know.
Thank you for the very detailed report! I'm happy to hear that it worked for both 1.21 and 1.22. I'll dig into the issue with renaming the node group.
I found the minimal 100% reproduction steps:
1. Run a pod (something that just runs sleep 30 is fine).
2. Run kubectl drain on the node that the pod is located on.

The normal rollout process spins up the additional pods first, then terminates the previous pods. It is gracefully controlled by the deployment controller or the replicaset controller. However, the eviction process seems to be different: pods are evicted from the node first (and the pods are drained), then the replicaset controller tries to reconcile the pod replica count later, without this being noticed by the eviction processors. I didn't recognize this until now.
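For reference, step 2 above is the stock drain command; running a watch next to it makes the "evict first, reconcile later" behavior easy to see (node name and namespace are placeholders):

# terminal 1: drain the node, which evicts every non-DaemonSet pod on it
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
# terminal 2: watch the replicaset recreate the evicted pod afterwards
kubectl get replicaset,pods -n <namespace> --watch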
Also, while I was trying to reproduce this, I found that the eviction behavior triggered a concurrency issue where evicted pods skip the admission webhooks: pod-graceful-drain is evicted first, and it is the only replica in the cluster. Other pods are not ready at this time, so subsequent evictions bypass pod-graceful-drain.
To mitigate this, I think we should have enough replicas and make sure they are distributed across multiple nodes. It is also good for availability in general; the eviction process is similar to general node failures. This might apply to pod-graceful-drain itself too.
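As a rough sketch of that mitigation on the example app from this thread (the deployment name and labels assume Argo CD's defaults), bumping the replicas and spreading them across nodes could look like:

# run 2 replicas and keep them on different nodes so one eviction can't take out both
kubectl -n argocd patch deployment argocd-server --patch '
spec:
  replicas: 2
  template:
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app.kubernetes.io/name: argocd-server
'

The same idea applies to pod-graceful-drain itself, since a lone webhook replica being evicted first is exactly the failure mode described above.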
In later versions, I might be able to temporarily increase the replica count of the deployment when a pod is requested to be evicted.
I'll release the binary v0.0.8 and the chart 0.0.10 to address k8s 1.22 soon. Can we close this issue and continue the discussion of these evicted pods here: https://github.com/foriequal0/pod-graceful-drain/issues/33?
Yep, sounds good.
Hi,
I've been using your package successfully for a few months now. Today I tried to upgrade my EKS cluster from 1.21 to 1.22 and I think this issue is related to pod graceful drain.
The worker nodes couldn't be drained because I got this type of error for a bunch of pods. I've included the exact error for all of the pods that were unable to be evicted, which resulted in the node not being able to be drained / upgraded.
It's a test cluster so this is basically everything running on the cluster:
Any tips on where to go from here?