2i2c-org / infrastructure

Infrastructure for configuring and deploying our community JupyterHubs.
https://infrastructure.2i2c.org
BSD 3-Clause "New" or "Revised" License

Guidelines for using kops vs EKS #431

Closed · yuvipanda closed this 2 years ago

yuvipanda commented 3 years ago

We currently use kops to manage AWS clusters. This is primarily driven by the clunkiness of EKS and its lack of features (see https://github.com/aws/containers-roadmap/issues/724 for example), and it has worked out well for us. However, this does mean that we're responsible for the k8s master - and that's a pretty big responsibility! I also made that decision single-handedly at the time, and I think it's useful to properly evaluate it against some set criteria.

We initially started with EKS, and this issue documents some of the process behind switching to kops. However, it's not structured enough to give me confidence that it was the right thing to do, or to help re-evaluate the decision as EKS is fast-moving.

Resolution

We've decided to use EKS instead of kops for AWS. See this comment for a rationale: https://github.com/2i2c-org/infrastructure/issues/431#issuecomment-968495599

damianavila commented 3 years ago

I think making this comparison is a great idea! I will start filling in some of those empty buckets soon (and maybe add some other rows).

consideRatio commented 3 years ago

This is primarily driven by the clunkiness of EKS and its lack of features (see aws/containers-roadmap#724 for example), and it has worked out well for us.

I agree on some clunkiness, but to me it mostly relates to the idea of managed nodes, which I figured is quite unproblematic if avoided. I have not used kops though. I'll try to list all the observed clunkiness of the various options in this issue as we go on.

yuvipanda commented 3 years ago

There are also two real ways to use EKS - via Terraform or eksctl. pangeo-data/terraform-deploy took the Terraform approach, and our previous AWS setups used the eksctl approach. With Terraform, you pretty much end up needing to use https://registry.terraform.io/modules/terraform-aws-modules/eks/aws/latest if you want to use anything other than managed nodegroups. Otherwise, you need to use eksctl. Ideally, everything will be managed by Terraform - but I think right now, the only way to do that is to use that Terraform module. It's a pretty big module as well, and I remember running into enough (essential) clunkiness when working on the terraform-deploy repo to move away from it...
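For concreteness, a minimal sketch of the eksctl route (the cluster name, region, and node count below are hypothetical):

```bash
# Hypothetical example: create an EKS cluster with a self-managed
# (unmanaged) nodegroup, i.e. avoiding EKS managed nodes.
eksctl create cluster \
  --name example-hub \
  --region us-west-2 \
  --nodegroup-name core \
  --nodes 1 \
  --managed=false
```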

damianavila commented 3 years ago

There are also two real ways to use EKS - via Terraform or eksctl.

CloudFormation as well if you are really buried in AWS land :wink:.

Ideally, everything will be managed by Terraform

I like the idea of keeping things as agnostic of the cloud provider as we can, so Terraform feels tempting... even more so when we are already using Terraform for the GCP stuff. Although I have to say, having played with kops in the last few weeks also gave me that agnostic feeling I always like to experience. And the setup was pretty straightforward (even with no prior experience with kops), besides some issues because of the fast-moving configuration I was dealing with 😛.
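For reference, a minimal sketch of what that setup looks like (the state store bucket, cluster name, and zone are made up for illustration):

```bash
# Hypothetical kops bootstrap; the .k8s.local suffix enables gossip DNS.
export KOPS_STATE_STORE=s3://example-kops-state
kops create cluster \
  --name example.k8s.local \
  --zones us-west-2a \
  --node-count 1 \
  --yes
```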

One thing I am curious about (and I think experience will tell us) is the maintenance load that kops could potentially bring, and whether that is not too expensive a price to pay for the customizability you gain...

yuvipanda commented 3 years ago

One thing I am curious about (and I think experience will tell us) is the maintenance load that kops could potentially bring, and whether that is not too expensive a price to pay for the customizability you gain...

Yeah, I feel like this will be the ultimate differentiator. When a kops master fails, what do we do?

damianavila commented 3 years ago

When a kops master fails, what do we do?

We have a troubleshooting section here: https://kops.sigs.k8s.io/operations/troubleshoot/. But that's not enough. I think we should get experience from real failures... and that means being exposed to kops for some time. How could we accelerate that learning phase?
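For reference, the first-pass checks would look something like this (state store and cluster name are hypothetical):

```bash
# Hedged sketch: basic health checks on a kops cluster.
export KOPS_STATE_STORE=s3://example-kops-state
kops validate cluster --name example.k8s.local
kubectl get nodes -o wide
kubectl -n kube-system get pods   # on kops, control plane components run here
```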

yuvipanda commented 3 years ago

Something like https://github.com/Netflix/chaosmonkey maybe? I'm not sure.

damianavila commented 3 years ago

Yep, I was thinking of some tool to actually create random failures... so something like that project could help. But I am not sure either; maybe we are still too early for even that process.

yuvipanda commented 3 years ago

@damianavila yeah, I agree that we're still early for that

choldgraf commented 2 years ago

Just wanted to note this comment thread from the OpenScapes support stuff. It sounds like things with kops are complicated and require more special-casing and manual steps in general. What do people think about this proposal:

Proposal

Stop using kops, and migrate our current kops-based clusters to EKS as soon as we can. Remove documentation about kops and replace it with EKS-focused docs.

Rationale

I am sure that both options have pros/cons, and there are certain situations where one is better than the other. But right now, we are spread pretty thin in terms of the different clouds we use, and our bottleneck is human capacity, cloud-specific expertise, and information silos. We should be standardizing on the smallest possible subset of options for our deployments, and "choosing" between EKS vs. kops gives us an unnecessary degree of freedom that makes it harder for everybody to get on the same page. Moreover, it feels like kops requires more manual intervention, so EKS is the service we should use.

Next steps

I suggest we take the following next steps:

Thoughts?

Do others object to this proposal, or think we should take a different approach?

yuvipanda commented 2 years ago

I'm in favor, @choldgraf

yuvipanda commented 2 years ago

When we were trying out kops, EKS was about $74 a month for the control plane (now I think it's $44?). In addition, we would have had to run at least one node for our hub infra. Together, this meant that for smaller users (like Openscapes), the base cost of keeping the infrastructure running even with no users can be pretty high - a few hundred dollars a month. With kops, the idea was that we could run the hub infra on the master nodes as well, cutting down this cost significantly.

However, running a resilient kops control plane is actually more expensive than what EKS charges! You need big boxes; k8s control plane processes aren't cheap. We discovered this the hard way when CarbonPlan was scaling up their hub and the k8s API would just stop responding. See https://github.com/2i2c-org/infrastructure/issues/524 and https://github.com/2i2c-org/infrastructure/issues/526. This convinced us to move to EKS, as the cost-saving goal was actually not met.
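For a rough sense of scale (illustrative numbers, not actual quotes): a resilient kops control plane typically means three master nodes, and at, say, ~$30/month per instance that is already ~$90/month - above the ~$74/month EKS fee mentioned above, before counting the bigger boxes needed under load or the human time spent operating them.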

While Openscapes hasn't had this issue (they do not use their hub as heavily as CarbonPlan does), it takes a lot of effort to maintain infra for both EKS and kops. As such, we should just move everyone to EKS and abandon kops.

The issue of base cost reduction for users who only occasionally use their hubs is still present, however - and it is something we should tackle. But kops is not the solution.

consideRatio commented 2 years ago

A very big +1 for anything that makes us use fewer tech options - in practice, @choldgraf's suggestion! The cost of the added complexity certainly outweighs the cost of machines etc.

choldgraf commented 2 years ago

OK, I've updated the top comment here and rescoped https://github.com/2i2c-org/infrastructure/issues/737 to cover migrating to EKS. I'll close this one!

damianavila commented 2 years ago

Belated 👍 as well. I think kops is actually a pretty interesting beast, and I do think it could be an interesting approach in certain scenarios... but in our current state, we should build on top of and leverage cloud providers' infrastructure as long as we can, to be more efficient.