Proper HA setup #34

Open foriequal0 opened 2 years ago

foriequal0 commented 2 years ago

During node draining, pod-graceful-drain itself may also be evicted. I deliberately chose to ignore webhook admission failures, since otherwise deployments would fail to progress. Because of this, pods that are evicted/deleted at that time can suffer downtime even with multiple replicas.

To fix this, pod-graceful-drain needs an HA setup. However, simply setting `replicas: 2` on its Deployment won't work: it does not currently behave correctly when multiple replicas of it are running.
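For context, "ignoring webhook admission failures" corresponds to registering the webhook with `failurePolicy: Ignore`, so the API server proceeds with deletions/evictions even when the webhook backend is unreachable. A purely illustrative sketch, not the project's actual manifest (the names, rules, and paths here are assumptions):

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: pod-graceful-drain-example        # hypothetical name
webhooks:
  - name: example.pod-graceful-drain.io   # hypothetical name
    # Ignore means: if the webhook pod is down (e.g. evicted during a drain),
    # the API server allows the request instead of rejecting it. This keeps
    # deployments progressing, but also means evictions are no longer delayed.
    failurePolicy: Ignore
    rules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        operations: ["DELETE"]
        resources: ["pods"]
    clientConfig:
      service:
        name: pod-graceful-drain          # hypothetical service name
        namespace: pod-graceful-drain     # hypothetical namespace
        path: /webhook                    # hypothetical path
    admissionReviewVersions: ["v1"]
    sideEffects: None
```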

nickjj commented 2 years ago

Hi,

I know you're not under any obligation to develop this, and I appreciate all you've done already, but is there any work in progress towards this or #33? At the moment this prevents us from being able to upgrade nodes without hard downtime.

I really wish I could contribute, but low-level custom Kubernetes add-ons are still above my pay grade. That may not be true forever, but it is for the foreseeable future.

foriequal0 commented 2 years ago

Hi, I'm looking for a way to implement this without external components, but at the moment I'm busy with work, so it will take a few months before I can get started on this issue.

nickjj commented 2 years ago

Hi,

Definitely not rushing you, but:

I'd love to keep using this tool, but not being able to upgrade the nodes on our cluster without downtime is becoming an issue: we can't apply security patches to the underlying EC2 boxes by creating new node groups without introducing downtime for our apps.

With that said, are you open to receiving USD donations to help expedite this ticket and any related tickets, with the goal of zero-downtime node upgrades while using pod-graceful-drain on your pods?

If so, feel free to email me at nick.janetakis@gmail.com from the email address listed in your GitHub profile and we can go from there. If not, that's totally ok too. I fully understand you're not obligated to maintain your open source work on anyone's terms except your own. :D

Also if you need any assistance in testing anything (regardless of your decision) let me know. Happy to do anything I can do.

foriequal0 commented 2 years ago

Hi,

Thank you for your consideration! But I don't think I'm able to accept your donation. I have a very tight schedule at my current job right now 😢, so I can't give you a specific time frame. However, I want you to know that I haven't given up on this issue. I'll try to squeeze it in wherever possible, since I also rely on this and have a bunch of pending security patches.

foriequal0 commented 2 years ago

Here's some news:

- Good: I told my boss that I'm going to work on this issue during work time.
- Bad: The schedule situation hasn't changed.
- Ugly: We've started crunching.

nickjj commented 2 years ago

It's all good. It is nice to see you're using this at work too.

nickjj commented 1 year ago

Hi,

Where do you stand on this since it's been a bit? I'm just probing for information, not trying to push you into anything!

On the bright side, your project is so wildly useful that we're happily using it and would love to continue using it.

foriequal0 commented 1 year ago

I've been busy launching a new service with my team, and I recently got some time to work on this project. I'm redesigning it from scratch. It has a simpler and sounder structure. I'll let you know when it's ready.

nkkowa commented 1 year ago

@nickjj potential workaround:

Using selectors/taints, you can run this on separate nodes from where the workloads requiring the graceful shutdown run.

For example, on AWS we have all our controllers running in one EKS ASG and our services running in another. For upgrades or scaling events, we can handle the two groups independently.
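A rough sketch of the kind of scheduling constraint being described, applied to pod-graceful-drain's pod spec (the label and taint names below are only illustrative, not something the chart ships with):

```yaml
# Illustrative pod-spec fragment; label/taint names are assumptions.
# Idea: pin pod-graceful-drain to a "controllers" node group, so draining or
# scaling the workload node group never evicts pod-graceful-drain itself.
nodeSelector:
  node-group: controllers              # hypothetical label on the controller nodes
tolerations:
  - key: dedicated                     # hypothetical taint keeping other workloads off
    operator: Equal
    value: controllers
    effect: NoSchedule
```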

nickjj commented 1 year ago

@nkkowa We have a small cluster with (2) nodes, no auto-scaling, and we run (2) replicas of most of our services. This tool always ends up running on one of the nodes. We haven't yet configured anything at the K8s level to guarantee one replica runs on each node because K8s has handled this pretty well out of the box.

Were you suggesting something that could still work with that in mind? I also usually update the cluster with Terraform (the EKS module), which handles the whole taint+drain+etc. process.

nkkowa commented 1 year ago

@nickjj Ah, without an architecture that has dedicated nodes for controllers/operators I don't think my setup would work for you 😞

foriequal0 commented 1 year ago

> I've been busy launching a new service with my team, and I recently got some time to work on this project. I'm redesigning it from scratch. It has a simpler and sounder structure. I'll let you know when it's ready.

It's taking more time to write tests and polish up non-critical features. The core functionality was almost finished in just the first few days. I'll pre-release it without those non-critical features soon.

nickjj commented 1 year ago

Sounds great. This is a critical tool in general, don't feel rushed!

I'd personally be happy to wait a bit longer to have extensive test coverage and the features that you deem worthy. What are the non-critical features, btw?

foriequal0 commented 1 year ago

Thank you. I was trying some minor leader-election-based optimizations.

nickjj commented 1 year ago

Hi,

Any updates on this? Not trying to be pushy in the slightest. EKS 1.23 will be end of life in a few months, and I was wondering whether I should upgrade the cluster with a minute of app downtime or potentially wait until this patch is available and do it without downtime.

foriequal0 commented 11 months ago

Hi,

I'm really sorry that I haven't replied to you for so long. I've had a hard time taking care of things around me due to personal/work issues. They are getting better, and I expect most of them will be resolved by the end of December this year.

I think you might be able to mitigate the downtime risk until then by carefully orchestrating the process. It might look like this:

  1. Prepare new worker nodes.
  2. Migrate pod-graceful-drain to one of those nodes. The rest should stay where they are during this step, and the migration should be quick enough. Any of the remaining pods could see downtime with 5xx errors if they're evicted during this step, since pod-graceful-drain might be unavailable (replicas < 1) or behave unexpectedly (replicas > 1), depending on how you migrate it.
  3. Check that pod-graceful-drain is working correctly once it is stable on its new node.
  4. Migrate the rest while pod-graceful-drain is stable.
  5. Remove the old worker nodes once all migration is finished.

nickjj commented 11 months ago

It's all good, hope things are going well. I ended up taking the downtime approach a few months ago; it was less than a minute or so of downtime, but it's very uncontrollable. Thanks for the checklist, hopefully in a few months it can go away!

nickjj commented 1 month ago

Hi,

Any updates on this? Sorry for following up again.

foriequal0 commented 1 month ago

Hi, sorry for replying so late. I'm preparing this PR: https://github.com/foriequal0/pod-graceful-drain/pull/40. My confidence has grown with this rewrite. I'll prepare an RC release soon.

nickjj commented 1 month ago

Thanks. When PR #40 is released, how do you see the upgrade process going for existing installations? Would it just be bumping the Helm chart version for pod-graceful-drain and deploying it, or would it be more involved? Would all deployments need to be re-deployed too, etc.?

foriequal0 commented 1 month ago

Just bumping the version and redeploying only pod-graceful-drain, or re-installing only pod-graceful-drain, would be okay. You can run multiple replicas on the cluster after that first redeployment.

foriequal0 commented 1 month ago

Hi. I've released v0.1.0-rc.3 recently. It still has some rough edges, but it's been fine in our development environment for a few days. Multiple replicas have worked as intended so far. We've been using it without touching the Helm values.yaml, so this was the upgrade command:

```sh
helm upgrade pod-graceful-drain pod-graceful-drain \
  --repo https://foriequal0.github.io/pod-graceful-drain \
  --version 0.1.0-rc.3
```

nickjj commented 1 month ago

Hi,

That's great news, thank you so much for working on this project. It has been a staple for us since we started using Kubernetes.

For rolling this out to production, do you suggest giving it a bit more time? For context, we don't have a dev or staging environment that runs our normal workload. We deploy straight to production. I can spin up a few dummy services in a test environment but as you know, that's never a great indicator of how something may work with your "real" workload.

Did the rough edges cause any side effects that could impact a deployment or service's availability?

foriequal0 commented 1 month ago

I don't think there will be any drastic changes before the release. Just a few tweaks and tests here and there are expected.

By "rough edges" I mean things that still nag at my confidence, such as logging, testing, redundant API requests, and some mild, generic "it works on my cluster" and "what if" kinds of concerns. I should be more confident, since there are now automated tests against multiple versions of actual clusters, unlike before (previously I had only tested manually against clusters). Ironically, the more you test, the more you worry, especially about the aspects you haven't tested.

Our dev cluster serves a team of ~15 developers who deploy and ~100 non-dev members. It has been fine for a couple of days with ~10 deployments daily. We don't strictly measure availability, but there were no complaints about it during deployments.

I'll let you know in a week or two, or if something happens.

nickjj commented 1 month ago

> Ironically, the more you test, the more you worry, especially about the aspects you haven't tested.

Right, the more you know how something works, the more you're surprised how any of it works.

Thanks for the details, I'll check it in a week or so and start using it.

nickjj commented 1 week ago

How are things looking on your end, stability-wise?

foriequal0 commented 4 days ago

We haven't seen any 503s or 504s in Sentry so far. I can't guarantee the stability of future RC versions (I have some TODOs before the release), so you'll want to pin to a specific version (v0.1.0-rc.3) until I release a stable v0.1.0.

nickjj commented 4 days ago

Ok thanks. I will roll out v0.1.0-rc.3 tomorrow. Assume no news is good news.

nickjj commented 3 days ago

I still received 503s and 504s during a node upgrade with rc3, but I think it was because I didn't change `replicaCount`, which defaults to 1. As part of upgrading to rc3, would we also need to set `replicaCount: 2` to achieve zero-downtime node upgrades?

foriequal0 commented 3 days ago

At least one pod-graceful-drain pod should be alive at all times. What you saw can happen if you're draining the node where pod-graceful-drain itself is running. You might need some combination of `replicaCount: 2`, a pod disruption budget, and anti-affinity. On my side, I can mitigate it by delaying self-termination when the pod is on the draining node.
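Purely as a sketch of that combination (the namespace and labels below are assumptions, not the chart's actual rendered values), it could look roughly like:

```yaml
# In the Helm values: replicaCount: 2 (the chart default of 1 is mentioned above).

# Illustrative PodDisruptionBudget so a drain never evicts both replicas at once.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: pod-graceful-drain
  namespace: pod-graceful-drain                    # hypothetical namespace
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: pod-graceful-drain   # hypothetical pod labels
---
# Illustrative anti-affinity fragment for the pod spec, to spread the two
# replicas across different nodes so one survives while the other's node drains.
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app.kubernetes.io/name: pod-graceful-drain   # hypothetical pod labels
        topologyKey: kubernetes.io/hostname
```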

nickjj commented 3 days ago

Thanks.

I did encounter a new issue. I've upgraded this cluster from 1.24 to 1.29, one minor version at a time over the years, while using v0.0.11 of pod-graceful-drain, and the worker nodes always upgraded in a hands-free way after ~8-10 minutes.

When upgrading from 1.29 to 1.30 I used rc3, and after 26 minutes it timed out with a PodEvictionFailure, saying it hit the max number of retries trying to evict a pod on a node. I don't know which pod failed.

But as soon as I downgraded back to v0.0.11 and re-ran the Terraform EKS module to upgrade the worker nodes from 1.29 to 1.30, it worked and finished in 9 minutes.

Did anything change that could cause rc3 to not allow pods to be evicted?