aurae-runtime / architecture

Design Docs and Decisions. Where the magic happens.
Apache License 2.0

Discussion: why not Kubernetes? #6

Closed: inc0 closed this issue 1 year ago

inc0 commented 2 years ago

To be clear, I'm not advocating for k8s; I just think we will be asked this question and we should have a robust answer. This is intended more as food for thought. I hope we can gather a good list of the gaps that actually prevent us from achieving what we want on Kube, and that it helps us make sure we won't fall into the same traps with Aurae.

According to #4, Aurae's main difference from Kubernetes is an app manifest that is an actual programming language (as opposed to YAML). That provides the flexibility to define the app runtime dynamically, and allows logic like "get this image and push it to the core Docker registry". All of that is wonderful utility for users. Aurae's value, as of today, is on the client and user-experience side, which is exactly where k8s falls short.
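
As a rough illustration of what "manifest as code" could look like, here is a minimal, entirely hypothetical Rust sketch; none of these functions or registry addresses exist in Aurae today, they are invented only to show imperative logic living in the manifest:

// Hypothetical sketch: imperative deployment logic in the manifest itself.
// `pull`, `push`, and the registry address are invented for illustration.
struct Image {
    reference: String,
}

fn pull(reference: &str) -> Image {
    println!("pulling {reference}");
    Image { reference: reference.to_string() }
}

fn push(image: &Image, registry: &str) {
    println!("pushing {} to {registry}", image.reference);
}

fn main() {
    // Ordinary control flow in place of YAML: fetch an image, push it to the
    // "core" registry, and from here a real manifest could go on to deploy it.
    let image = pull("docker.io/library/nginx:1.25");
    push(&image, "registry.aurae.local");
}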

I also think this could be achieved with a very opinionated deployment of k8s (a registry set up with all the certificates for image pushing, storage set up for both block and object, etc.) and a well-crafted client library that understands these components and can communicate with them. Imagine a client library (in any programming language) that queries the k8s API, pulls the addresses and certificates for the Docker registry, and sets up port forwarding or a network tunnel. With that, a few other components like it, and some magic executor inside the pod for dynamic changes to pod apps, Kubernetes becomes an under-the-hood state machine that handles all the infra (runtime, networking, storage), and Aurae becomes the platform that takes it plus other components on top of it and gives the user a coherent and convenient experience.
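
A minimal sketch of that client-library idea, assuming the real kube and k8s-openapi crates (plus tokio and anyhow in Cargo.toml); the aurae-system namespace and core-registry-tls secret are made up for illustration:

// Treat Kubernetes as the under-the-hood state machine: the client pulls the
// registry's TLS material out of the cluster so the user never has to.
use k8s_openapi::api::core::v1::Secret;
use kube::{Api, Client};

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    // Connect to whatever cluster the local kubeconfig points at.
    let client = Client::try_default().await?;

    // Hypothetical secret holding certificates for the "core" registry.
    let secrets: Api<Secret> = Api::namespaced(client, "aurae-system");
    let registry_tls = secrets.get("core-registry-tls").await?;

    if let Some(data) = registry_tls.data {
        for key in data.keys() {
            println!("found registry credential: {key}");
        }
    }

    // A real client library would now set up a port forward or tunnel to the
    // registry and push images on behalf of the user.
    Ok(())
}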

Pros of this approach would be:

Now, let me start with things that I personally see as potential gaps in Kube that can't be solved by simple wrappers/operators/admission controllers/clients:

// Illustrative only: Pod, Cron, CronTab, Scheduling, and Node are hypothetical
// types showing what a code-based manifest could express.
#[derive(Pod)]
struct CronDaemonset {}

impl Cron for CronDaemonset {
    fn crontab(&self) -> CronTab {
        CronTab::from("0 8 * * *")
    }
}

impl Scheduling for CronDaemonset {
    // Run on every node: the selection logic is just a function.
    fn select_nodes(nodes: Vec<Node>) -> Vec<Node> {
        nodes
    }
}

I'm sure there are other examples of these, let's think on them a bit please.

krisnova commented 2 years ago

This is a really great callout. I have pages of notes on this, and for the most part they all exist only in my head at this point. I had been planning on putting together a blog post on the topic before I open sourced things; however, the initial interest was too high for me to keep managing this in private any longer.

The main reasons in my mind:

Duplication of "Userspace"

In my opinion, Kubernetes has re-created Linux userspace and introduced a completely new instance of many of the same problems. For example, a Kubernetes administrator is tasked with managing a containerd runtime as well as the containers in Kubernetes, and with managing the corresponding config for both the "host" services and the "kube" services. Why do we have config in /etc on the host as well as a ConfigMap for a user?

Problems with centralized etcd

Where do I even start here? Etcd has caused problems since day 1, in my opinion. I think the database itself is fantastic; I think what Kubernetes did with it was a bit problematic. How do we address split-brain leader elections and sync-consistency issues? Scale concerns? Managing credentials? A single point of failure? And so on.

Centralized decision making breaks at the edge

What happens if we want to schedule work on a node and that node goes "offline"? The entire "state" of Kubernetes comes with a network dependency. What are we gaining by having a centralized data store when we could instead just write the state directly where it runs?

The more I run production infrastructure, the more I am convinced that simple systems are the correct systems for scale. There is a lot of complexity that goes into having node level configuration managed centrally. I think we can do better.
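
A tiny sketch of "write the state directly where it runs", using only the standard library; the path and the JSON record are assumptions, not anything Aurae has decided:

use std::fs;
use std::io::Write;

fn main() -> std::io::Result<()> {
    // Hypothetical node-local state directory; a real daemon would choose this.
    let state_dir = "/var/lib/aurae/state";
    let desired = r#"{"service":"nginx","replicas":1,"restart":"always"}"#;

    fs::create_dir_all(state_dir)?;
    let mut f = fs::File::create(format!("{state_dir}/nginx.json"))?;
    f.write_all(desired.as_bytes())?;

    // If the network partitions, this node can still reconcile: the desired
    // state lives next to the workload, not behind a remote API server.
    Ok(())
}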

Ingress needs some love

I think we made a few mistakes with the relationship between ingress and namespaces. I also think we made a few assumptions about traffic that aren't as realistic as they could be.

How do we manage Ingress, TLS, and the subsequent routing in a way that simplifies things and provides powerful capabilities to platform teams, without getting into the iptables and kube-proxy business?

Infrastructure

Kube called out infrastructure on day 1, yet it required bespoke infrastructure to bring a cluster online. How do we improve the story from "a single node" to "a monolithic cluster" without elaborate installers, DNS requirements, TLS management, token management, firewall rules, and so on?

We should be able to assemble a mesh of nodes as we need them and leave the large, clunky cluster-management business out of it. Joining a node should not require an army of controllers and auth management to manage and repair. How do we improve the infrastructure story for ops teams without the complexity?

YAML

We put YAML in front of app teams and expect them to manage one of the most complex systems on the internet. I think we can do better with some thought. My experience has convinced me that app teams want something Turing complete. Every IaC project has eventually made its way to Turing completeness: Terraform, Chef, Puppet, Helm. Just because we CAN express things in YAML doesn't mean we SHOULD.
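
A toy example of why Turing completeness matters for manifests: generating ten near-identical services with a loop, something that in YAML becomes ten copied blocks or a templating layer. The Service type is hypothetical:

struct Service {
    name: String,
    image: String,
    replicas: u32,
}

fn main() {
    // A loop instead of copy-pasted YAML blocks.
    let services: Vec<Service> = (0..10)
        .map(|i| Service {
            name: format!("worker-{i}"),
            image: "registry.local/worker:v1".to_string(),
            replicas: if i == 0 { 3 } else { 1 },
        })
        .collect();

    for s in &services {
        println!("{} -> {} x{}", s.name, s.image, s.replicas);
    }
}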

Toggle Applications

We can't toggle things on/off. How do you kubectl apply -f without actually turning things on? How do you kubectl apply -f myservice.yaml without actually creating a load balancer? How do you write the state to the cluster without actually reconciling it?
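
One way the toggle idea could look, sketched with hypothetical types (nothing here is an existing Aurae or Kubernetes API): the desired state is always recorded, but reconciliation only happens once the flag is flipped on:

struct LoadBalancer {
    name: String,
    // false = record the intent, but do not create anything yet.
    enabled: bool,
}

fn reconcile(lb: &LoadBalancer) {
    if !lb.enabled {
        println!("{}: state recorded, reconciliation skipped", lb.name);
        return;
    }
    println!("{}: creating load balancer", lb.name);
}

fn main() {
    // The moral equivalent of `kubectl apply -f myservice.yaml` with no side effects.
    let lb = LoadBalancer { name: "myservice".into(), enabled: false };
    reconcile(&lb);
}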

We don't want to manage YAML we want to build systems

I think managing YAML is a consequence of people doing what they need with the wrong tools. We don't actually want to manage YAML; it is just the best thing we have at the moment. What we really want is a distributed editor and the ability to realize the consequences of a change without a tremendous amount of cognitive burden. We want a Google Docs-like experience with auth and identity long before we want to manage thousands of repositories.

Simplicity

Rust forces you to write small, organic systems. Object-oriented Go is going to produce monolithic, object-oriented systems. We literally built a Java-style monolith and convinced people it was going to move them away from the monolith model.

The language DOES have an influence on the shape of the project.

Config Maps, Namespaces, Labels, Annotations

These all feel like sorting paradigms that could be solved with a simple key/value architecture and a query syntax. Do we really need custom-baked primitives for all of our labeling and flagging concerns? Every time I see a project use an annotation I cringe, because it's just moving variables up to yet another paradigm.
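
A sketch of "just key/value plus a query", using only the standard library; the key names and the prefix-as-query convention are invented for illustration:

use std::collections::BTreeMap;

fn main() {
    // One flat key/value space instead of ConfigMaps, labels, and annotations.
    let mut meta: BTreeMap<String, String> = BTreeMap::new();
    meta.insert("app/name".into(), "nginx".into());
    meta.insert("app/tier".into(), "frontend".into());
    meta.insert("config/worker_processes".into(), "4".into());

    // The "query syntax" here is just a key prefix; a real system could
    // expose something richer over the same flat storage model.
    for (k, v) in meta.iter().filter(|(k, _)| k.starts_with("app/")) {
        println!("{k} = {v}");
    }
}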

Multi Tenancy

The only way Kubernetes pays for itself is by running large, organic clusters with a plethora of tenants. There are some security concerns with this, and I don't think we isolate workloads as well as we could. That is why I want to bake in a MicroVM architecture in place of namespaces.


No kubernetes engineers were harmed in the making of this post. I am traumatized and exhausted and I literally just brain dumped into this github issue. My wife and my dog look concerned. I am going to take a break.

Nothing but love and respect to everyone in Kube. We wouldn't be here if it wasn't for them.

🏳️‍⚧️ 🏳️‍🌈

krisnova commented 2 years ago

I think it's also important to call out that I have certainly identified the problems, and my small amount of research has my "intuition" pointing towards the Aurae/mesh model. I am going to follow the dopamine here and see what happens.

Just because I identified a problem doesn't mean I have a plan for it (yet). It also doesn't mean that whatever we decide will somehow be better than what we already have in Kube.

inc0 commented 2 years ago

I see 3 aspects to this discussion:

  1. How the cluster works at a low level: what nodes do, how networking works, etc.
  2. What the end-user experience looks like: the aurae library, DSL/language clients.
  3. What features the infra provides: image handling, storage (block? object? both?).

I'd say the mesh discussion fits under 1, and that's where most of the experimentation versus k8s would happen. That's also where we could potentially just use k8s (at least in the near term). 2 would be the biggest end-user value add that Aurae can possibly bring; this is where k8s really falls short. 3 would be a bit of both: once we define what we want to offer in AuraeDSL (I'll call it that for the sake of argument, meaning the client library and general UX part), we'll have to figure out what infra is needed to address it and whether or not k8s can even provide it.

inc0 commented 2 years ago

Another discussion topic: how does Aurae relate to Nomad? I don't have a ton of experience with it, but it's supposed to be a simpler runtime than k8s, so maybe it would be a good option as the infra fabric under AuraeDSL.

krisnova commented 2 years ago

More future architecture in this document.

krisnova commented 2 years ago

Please open a new issue for Nomad!

krisnova commented 1 year ago

Closing in favor of: https://medium.com/@kris-nova/why-fix-kubernetes-and-systemd-782840e50104