I think making this comparison is a great idea! I will start filling some of those empty buckets soon (and maybe add some other rows).
> This is primarily driven by clunkiness of EKS and lack of features (see aws/containers-roadmap#724 for example), and has worked out well for us.
I agree on some clunkiness, but to me it mostly relates to the idea of managed nodes, which is something I figured is quite unproblematic if avoided. I have not used kops myself, though. I'll try to list all the observed clunkiness of the various options together in this issue going forward.
There are also two real ways to use EKS - via Terraform or eksctl. pangeo-data/terraform-deploy took the terraform approach, and our previous AWS setups used the eksctl approach. For terraform, you pretty much end up needing to use https://registry.terraform.io/modules/terraform-aws-modules/eks/aws/latest if you want to use things other than managed nodegroups. Or you need to use eksctl. Ideally, everything will be managed by Terraform - but I think right now, the only way to do that is to use that terraform module. It's a pretty big module as well, and I remember running into enough (essential) clunkiness when working on the terraform-deploy repo to move away from it...
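For a rough sense of the eksctl route, here is a minimal, hypothetical sketch (the cluster name, region, and node sizes are placeholders, not our actual config; eksctl can also take a YAML ClusterConfig file instead of flags):

```bash
# Hypothetical sketch only: "example-hub" and the sizes below are placeholders.
# Depending on the eksctl version, nodegroups default to managed or unmanaged,
# so the --managed flag is set explicitly here.
eksctl create cluster \
  --name example-hub \
  --region us-west-2 \
  --nodegroup-name core \
  --node-type m5.large \
  --nodes 2 \
  --managed=false
```

The Terraform route wraps roughly the same resources behind the terraform-aws-modules/eks module instead.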
> There are also two real ways to use EKS - via Terraform or eksctl.
CloudFormation as well if you are really buried in AWS land :wink:.
> Ideally, everything will be managed by Terraform
I like the idea of keeping things as agnostic of the cloud provider as we can, so terraform feels tempting... even more so since we are already using terraform for the GCP stuff. Although I have to say, having played with kops in the last few weeks also gave me that agnostic feeling I always like to experience. And the setup was pretty straightforward (even without prior experience with kops), besides some issues because of the fast-moving configuration I was dealing with 😛 .
One thing I am curious about (and I think experience will tell us) is the maintenance load that kops could potentially bring, and whether that is too expensive a price to pay for the customizability you now have...
> One thing I am curious about (and I think experience will tell us) is the maintenance load that kops could potentially bring, and whether that is too expensive a price to pay for the customizability you now have...
Yeah, I feel like this will end up being the primary differentiator. When a kops master fails, what do we do?
> When a kops master fails, what do we do?
We have a troubleshooting section here: https://kops.sigs.k8s.io/operations/troubleshoot/ But that's not enough. I think we should get experience from real failures... and that means being exposed to kops for some time. How could we accelerate that learning phase?
Something like https://github.com/Netflix/chaosmonkey maybe? I'm not sure.
Yep, I was thinking of some tool to actually create random failures... so something like that project could help. But I am not sure either; maybe we are still too early for even that process.
@damianavila yeah, I agree that we're still early for that
Just wanted to note this comment thread from the OpenScapes support stuff. It sounds like things with kops are complicated and require more special-casing and manual steps in general. What do people think about this proposal:

Stop using kops, and migrate our current kops-based clusters to eks as soon as we can. Remove documentation about kops and replace it with eks-focused docs.
I am sure that both options have pros/cons, and certain situations where one is better than the other. But, right now we are spread pretty thin in terms of the different clouds we use, and our bottleneck is human capacity, cloud-specific expertise, and information silos. We should be standardizing on the smallest possible subset of options for our deployments, and "choosing" between EKS vs. kops gives us an unnecessary degree of freedom that makes it harder for everybody to get on the same page. Moreover, it feels like kops requires more manual intervention, so eks is the service we should use.
I suggest we take the following next steps:

- Decide whether to stop using kops in this issue.
- Replace the documentation about kops with EKS-focused docs instead.
- Migrate our current kops-based clusters to eks.
Do others object to this proposal, or think we should take a different approach?
I'm in favor, @choldgraf
When we were trying out kops, EKS was about $74 a month for the control plane (now I think it's $44?). In addition, we would have had to run at least one node for our hub infra. Together, this meant that for smaller users (like Openscapes), the base cost of keeping the infrastructure running even with no users can be pretty high - a few hundred dollars a month. With kops, the idea was that we could run the hub infra on the master nodes as well, cutting down this cost significantly.
However, running a resilient kops control plane is actually more expensive than what EKS charges! You need big boxes; k8s control plane processes aren't cheap. We discovered this the hard way when CarbonPlan was scaling up their hub and the k8s API would just stop responding. See https://github.com/2i2c-org/infrastructure/issues/524 and https://github.com/2i2c-org/infrastructure/issues/526. This convinced us to move to EKS, as the cost saving goal was actually not met.
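To make the cost point concrete: a highly available kops control plane means running several dedicated master instances yourself. A hypothetical sketch (the cluster name, state bucket, zones, and instance sizes are placeholders, and newer kops releases have been renaming the master-related flags to control-plane ones):

```bash
# Hypothetical example of an HA kops cluster: one master per listed master
# zone, each an always-on EC2 instance billed like any other node.
# The cluster name and S3 state bucket below are placeholders.
kops create cluster \
  --name example-hub.k8s.local \
  --state s3://example-kops-state-store \
  --zones us-west-2a,us-west-2b,us-west-2c \
  --master-zones us-west-2a,us-west-2b,us-west-2c \
  --master-size m5.large \
  --node-count 2 \
  --node-size m5.large
```

Three always-on masters of that size typically add up to more per month than the flat EKS control-plane fee, which is the trade-off described above.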
While Openscapes hasn't had this issue (they do not use their hub as much as CarbonPlan does), it takes a lot of effort to maintain infra for both EKS and kops. As such, we should just move everyone to EKS and abandon kops.
The issue of base cost reduction for users who only occasionally use their hubs is still present, however - and something we should tackle. But kops is not the solution.
A very big +1 for anything that makes us use fewer tech options - in practice: @choldgraf's suggestion! The cost of the added complexity certainly outweighs the cost of machines etc.
OK, I've updated the top comment here and rescoped https://github.com/2i2c-org/infrastructure/issues/737 to cover migrating to eks. I'll close this one!
Belated 👍 as well. I think kops is actually a pretty interesting beast and could be a good approach in certain scenarios... but in our current state, we should build on top of and leverage the cloud providers' infrastructure as long as we can to be more efficient.
We currently use kops to manage AWS clusters. This is primarily driven by clunkiness of EKS and lack of features (see https://github.com/aws/containers-roadmap/issues/724 for example), and has worked out well for us. However, this does mean that we're responsible for the k8s master - and that's a pretty big responsibility! I also made the decision single-handedly at that time, and it's useful to properly evaluate it with some set criteria I think.
We initially started with EKS, and this issue documents some of the process behind switching to kops. However, it's not structured enough to give me confidence that it is the right thing to do, and to help re-evaluate the decision as EKS is fast moving.

Resolution:

We've decided to use EKS instead of kops for AWS. See this comment for a rationale: https://github.com/2i2c-org/infrastructure/issues/431#issuecomment-968495599