foriequal0 / pod-graceful-drain

You don't need `lifecycle: { preStop: { exec: { command: ["sleep", "30"] } } }`

Prevent 5xx errors when evicting deployments that have a small replica count #33

Open foriequal0 opened 2 years ago

foriequal0 commented 2 years ago

If your deployments are well replicated, then this is not a problem. However, when they are not, there will still be ALB 5xx errors while they are evicted. These evictions can be triggered by kubectl drain, node termination, etc.

Evicted pods should be handled in this order (a sketch follows the list):

  1. The pod is requested to be evicted.
  2. Increase the replica count of the requested pod's deployment. However, you can't isolate the pod yet, since isolating it at this point would drain it before a replacement is ready.
  3. When the new pods are ready, isolate the pod, restore the replica count, and start the delayed deletion.
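
A minimal sketch of these steps with client-go, assuming the pod's owning Deployment is already known. The namespace, names, the `app` label used for isolation, and the 30-second drain delay are placeholders for illustration, not the library's actual implementation:

```go
// Sketch of the proposed eviction handling described above; not the library's code.
package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func handleEviction(ctx context.Context, client kubernetes.Interface, ns, deployName, podName string) error {
	deployments := client.AppsV1().Deployments(ns)
	pods := client.CoreV1().Pods(ns)

	// 1. The pod has been requested to be evicted (the request was intercepted).

	// 2. Temporarily raise the replica count so a replacement pod is created
	//    before the evicted pod stops receiving traffic.
	deploy, err := deployments.Get(ctx, deployName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	original := *deploy.Spec.Replicas
	surge := []byte(fmt.Sprintf(`{"spec":{"replicas":%d}}`, original+1))
	if _, err := deployments.Patch(ctx, deployName, types.StrategicMergePatchType, surge, metav1.PatchOptions{}); err != nil {
		return err
	}

	// 3a. Wait until the replacement pod is ready.
	for {
		d, err := deployments.Get(ctx, deployName, metav1.GetOptions{})
		if err != nil {
			return err
		}
		if d.Status.ReadyReplicas >= original+1 {
			break
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(2 * time.Second):
		}
	}

	// 3b. Isolate the evicted pod: remove the (hypothetical) "app" label so
	//     Services and the ALB target group stop routing to it.
	isolate := []byte(`{"metadata":{"labels":{"app":null}}}`)
	if _, err := pods.Patch(ctx, podName, types.StrategicMergePatchType, isolate, metav1.PatchOptions{}); err != nil {
		return err
	}

	// 3c. Restore the original replica count. The isolated pod no longer
	//     matches the Deployment's selector, so it is not counted or reaped.
	restore := []byte(fmt.Sprintf(`{"spec":{"replicas":%d}}`, original))
	if _, err := deployments.Patch(ctx, deployName, types.StrategicMergePatchType, restore, metav1.PatchOptions{}); err != nil {
		return err
	}

	// 3d. Delete the isolated pod after a drain delay, giving the load
	//     balancer time to deregister it.
	time.Sleep(30 * time.Second)
	return pods.Delete(ctx, podName, metav1.DeleteOptions{})
}

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}
	if err := handleEviction(context.Background(), client, "default", "my-app", "my-app-abc123"); err != nil {
		panic(err)
	}
}
```

The point of isolating by label (rather than deleting right away) is that the target group can deregister the pod while it keeps serving in-flight traffic, and the temporary surge replica keeps capacity from dropping in the meantime.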

Related comment: https://github.com/foriequal0/pod-graceful-drain/issues/31#issuecomment-1091618985

nickjj commented 2 years ago

Hi, I'm following up here with quotes from your linked comment.

> However, the eviction process seems to be different. Pods are evicted from the node first (and pods are drained), then the replicaset controller tries to reconcile the pod replica count later, without the eviction processors noticing. I hadn't recognized this until now.

It's interesting that it worked during a cluster upgrade, which did create new nodes, but not when recreating a node group independently. Is this process different between those two methods?

> I might be able to temporarily increase the replica count of the deployment when the pod is requested to be evicted in later versions.

This seems pretty scary, because with a manually sized cluster there might not be room to add +1 replica for every deployment at once. Or would this happen one deployment at a time, so there's only ever a +1 replica for one deployment at any given moment? If it's the second option, that might work, but it's still a little scary: you might have a node with 8 GB of memory running 2 pods that each require 3 GB, and the node couldn't fit a third pod, even temporarily.
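
To make that headroom concern concrete, here's a tiny sketch with the hypothetical numbers above; the scheduler effectively performs this request-versus-allocatable check:

```go
// Hypothetical capacity check for a temporary surge replica.
package main

import "fmt"

func main() {
	nodeAllocatableGB := 8.0
	podRequestGB := 3.0
	runningPods := 2.0

	freeGB := nodeAllocatableGB - runningPods*podRequestGB // 8 - 6 = 2 GB free
	fmt.Printf("free: %.0f GB, surge replica needs: %.0f GB, fits: %v\n",
		freeGB, podRequestGB, podRequestGB <= freeGB) // fits: false
}
```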

> To mitigate this, I think we should have enough replicas and make sure they are distributed across multiple nodes.

At the moment I'm running a 2-node cluster across 2 AZs with 2 replicas of everything. That's just enough safety for me to feel comfortable without breaking the bank. I suppose where I'm going with this is that ideally we shouldn't expect there to always be a large number of replicas spread across a bunch of nodes?

I'll admit most of what your library does is a bit above my pay grade in terms of Kubernetes knowledge. I have surface-level knowledge of draining / evicting from an end user's POV. In the end, I'm happy to do whatever testing you want and offer any use-case examples I can to help resolve this in a way that you feel is solid and safe at an implementation level.

foriequal0 commented 2 years ago

Thank you for the kind words! I am also not sure about the solution I came up with for this issue. Let's sit on it and think for some time.

Since you have 2 replicas of everything, I guess that the downtime during the node group renaming was due to the webhooks being skipped. Actually, I couldn't reliably reproduce downtime by renaming the node group. That's why I think you might have been lucky when you upgraded the cluster. I'll create another issue for this here: #34 :sweat_smile: I think that would be more appropriate for this situation.