Alternative approach - Githubissues

micw commented 5 years ago

Hello, I like the idea of managing stateful lxc-containers with kubernetes. However, after reading the drawbacks and workarounds, I wonder if this is the "best" way to achieve this.

A while ago, I created a small project that runs LXC within a container (https://github.com/micw/docker-lxc). It is just a PoC with many glitches, but it works out of the box with an unmodified kubernetes and it runs stably (I have been running my daily-used archlinux as a remote desktop on it for >1 year). One drawback is that the lxc-container dies if the docker container is restarted.

But I wondered whether it would be possible to combine both approaches, and I came up with the following ideas:

- Instead of implementing this as a CRI, a service could just listen to k8s events and start/stop lxd containers.
- To start an lxc-container, a special k8s docker container could be started. It's just a dummy, but the event listening service could be triggered by the creation/deletion/change of such containers and update the lxc containers accordingly.
- The dummy container could also be used to proxy logs and console between lxc and kubernetes: it should contain an entrypoint that tails the lxc container logs, and it should contain a modified /bin/sh that connects to the lxc container.

What do you think about this?

This approach would also allow running the event listening service as well as lxd as docker containers, removing all special requirements for the deployment.

automaticserver commented 5 years ago

Listening for K8s events means creating a kubelet. Writing a kubelet takes a lot of code. We've checked the complexity of writing an LXC shim or our own kubelet, and it was clear that we could never do that in a reasonable amount of time.

Internally we've packaged LXE, of course, and additionally have an enterprise extension to keep containers safe (it filters unnecessary delete commands from kubelet).

Kubelet really does a lot, please take a look at it: it checks your local system for RAM, CPU, and filesystems, it mounts stuff, it creates new filesystems to mount, it organizes networking, it evicts pods, it delivers host metrics. It does so much stuff, it's unbelievable.

So overall: we're really happy with LXE and we can't recommend writing your own kubelet.

micw commented 5 years ago

Listening for K8s events means creating a kubelet

No, it just means to listen to the events. There are multiple ways to do so.

- One option is to add a lifecycle listener to the container you want to watch.
- Another option is to connect to the API and watch for changes (like ingress controllers do).
- Another option is an admission controller webhook.
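
To make the second option a bit more concrete, here is a minimal sketch of how such a listener could turn API watch events into lxc actions. It only relies on the fact that a watch stream delivers one JSON object per line with a `type` (ADDED, MODIFIED, DELETED, ...) and the affected `object`. The label name `lxc-placeholder` and the returned action tuples are assumptions made up for illustration; nothing here talks to a real cluster or to lxc.

```python
import json

# Hypothetical label that marks a pod as a dummy/placeholder for an lxc container.
PLACEHOLDER_LABEL = "lxc-placeholder"


def decode_watch_line(line):
    """Decode one line of a watch stream into (event_type, object)."""
    event = json.loads(line)
    return event["type"], event["object"]


def plan_action(event_type, pod):
    """Map a watch event for a placeholder pod to an lxc action, or None."""
    labels = pod.get("metadata", {}).get("labels") or {}
    if labels.get(PLACEHOLDER_LABEL) != "true":
        return None  # not one of the dummy pods, ignore it
    name = pod["metadata"]["name"]
    if event_type == "ADDED":
        return ("create", name)
    if event_type == "MODIFIED":
        return ("update", name)
    if event_type == "DELETED":
        return ("delete", name)
    return None  # other event types (e.g. BOOKMARK, ERROR) need no action
```

A real service would feed each line of the watch response through `decode_watch_line` and hand the resulting action to whatever starts and stops the lxc containers.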

automaticserver commented 5 years ago

Then you will never create a pod object if you just listen. Kubelet needs to know there is a pod, and it knows this because there is a CRI implementation that returns success and reports the right things.

And the other way round: LXE works, so why think about a different way?

micw commented 5 years ago

I meant it differently: I thought of running an unmodified kubernetes, including kubelet and the default cri or containerd, which launches a normal kubernetes pod. This pod runs a special image that is the interface to lxc.

And the other way round: LXE works, so why think about a different way?

Because this way some limitations could be removed: the special image name handling, the need of a custom configured kubelet, coexistence with docker containers on the same nodes

Edit: PS: I just want to share my thoughts about it and discuss; it's not an attempt to make you change things ;)

automaticserver commented 5 years ago

I see no big deal with 2 different types of nodes, especially in a virtual environment.

That kubernetes exposes interfaces of the container runtime is a design flaw, ok. But I don't get it: what is a "normal pod"? A docker container? And this docker contains another container runtime, LXC?

micw commented 5 years ago

what is a "normal pod"? A docker container?

yes

And this docker contains another container runtime, LXC?

No, my idea is to have this just as a "placeholder". If such a container is created, the event listener triggers the creation of an lxc container. If it's deleted, the listener deletes the lxc container. The lxc container could use the same cgroups as the docker container, so it would inherit the network.

dionysius commented 5 years ago

Hi @micw! I'm a bit late to the conversation. Thanks for your thoughts, it's always good to have a different perspective. The "best" way usually depends on the project's goal. :) I get exactly what you mean, but let me show you some new challenges:

First, the hierarchy of a pod will look like this:

Pod (docker pause-container)
- Container (effective docker container) either dummy or some LXC magic thanks to privileged
- LXC Container (effective lxc container)
- (+other containers if there are more containers in the podspec)
- (+other containers per effective docker container)

This list is independent of how it's implemented; it just shows what the result of what you've described looks like. So, the challenges are (just to name a few):

Networking: You have additional/different challenges in networking. The Pod is by definition where the PodNetwork is set up: it gets an IP address and sets up the interface. The effective docker containers inherit this network namespace from the pod, so they inherit the interfaces. Now the lxc containers also have to inherit this network namespace; otherwise, how would a network packet arrive at the lxc container? (This is why lxe is still 1 container per pod. I have made some attempts, but they have failed so far. It is possible in theory, but that's out of scope for now.) You have to somehow inherit (or forward) the network configuration (or traffic), because the CRI implementation has to report the IP for the PodStatus to kubelet (without an IP, kubelet will kill the pod and try again). In short, there are some challenges here network-wise.
Limits: You want the resource requests and limits to be set on the lxc container, not the docker container, but what you write in the podSpec applies to the docker container because the CRI shim is told to do so. You can probably copy these settings to lxc somehow, but then the same limits are set twice. Consider a node under load: is it still fair among all containers? Now you might think: well, I'll define these options somewhere else. That leads to the next point:
Image name and additional parameters: Where do you write which lxc image you want? In the podSpec you have to write your dummy docker container, otherwise this container will not be created. So how do you know which lxc image to start the lxc container from? How do you describe the remote, limits, volume mounts, serviceaccount tokens, hostport, ..., etc.? Where do you write all of these? You have a few options: pod-scoped are labels and annotations (using downwardAPI), and container-scoped is env; there are no labels for containers and such. You can go further by defining a configmap and assigning that as a volume. But does this still sound easier overall when you have to describe multiple resources to get to that lxc container?
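
To illustrate what "copying the limits" from the podSpec would involve, here is a rough sketch that reads `resources.limits` from a container entry and converts the Kubernetes quantity strings into plain numbers. It is only an illustration under assumptions: it handles a subset of the quantity suffixes, the function names are made up, and how the resulting numbers would map onto lxc/lxd settings is deliberately left open. It also does nothing about the double-accounting problem described above.

```python
from decimal import Decimal

# Subset of Kubernetes quantity suffixes; binary suffixes are checked first.
_FACTORS = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
}


def cpu_millicores(quantity):
    """Convert a CPU quantity such as "500m" or "0.5" to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(Decimal(quantity) * 1000)


def memory_bytes(quantity):
    """Convert a memory quantity such as "256Mi" or "1G" to bytes."""
    for suffix, factor in _FACTORS.items():
        if quantity.endswith(suffix):
            return int(Decimal(quantity[: -len(suffix)]) * factor)
    return int(Decimal(quantity))  # plain number of bytes


def container_limits(container_spec):
    """Extract the limits of one podSpec container entry as plain numbers."""
    limits = container_spec.get("resources", {}).get("limits", {})
    result = {}
    if "cpu" in limits:
        result["cpu_millicores"] = cpu_millicores(limits["cpu"])
    if "memory" in limits:
        result["memory_bytes"] = memory_bytes(limits["memory"])
    return result
```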

So, if you now think you don't want to tie the lxc container to the pod that tightly, you will lose all the benefits of running a pod (scheduling, metrics, ...). I kind of drifted into sounding negative, but I don't mean it that way; I just compiled the things that come to mind after working on this project for a while now.

I'm also very open for supporting OCI, someone making a remote hub for the community, or finding a convenient way to directly "convert"(?) the OCI to a lxc container - I would find a way to incorporate that into LXE so we can have more flexibility. That will make a lot of things easier. But: If you manage to handle OCI, you can inject that to cri-o as plugin probably as long as you're happy with OCI images.

micw commented 5 years ago

@dionysius Thank you for that great feedback. In https://github.com/micw/docker-lxc I actually launch lxc within the container, giving exactly the structure you describe. LXC creates its cgroups as children of that container's cgroup. I start it in "Host network" mode, so it inherits the pod's network. Limits are also inherited from the pod. The image name of the lxc container could be passed as env (alternatively as an annotation or label). By the way, that's also something you could do in your current implementation: define a kind of marker image name ("lxe") and expect the image in the LXC_IMAGE env var.
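
As a sketch of the marker idea (not how LXE actually resolves images): given one container entry from a podSpec, check for the marker image name and read the real lxc image from the LXC_IMAGE env var. The function name and the error handling are assumptions for illustration.

```python
MARKER_IMAGE = "lxe"         # marker image name suggested above
IMAGE_ENV_VAR = "LXC_IMAGE"  # env var expected to carry the lxc image


def resolve_lxc_image(container_spec):
    """Return the lxc image for a marker container, or None for other images."""
    if container_spec.get("image") != MARKER_IMAGE:
        return None  # an ordinary container, nothing to resolve
    for entry in container_spec.get("env") or []:
        if entry.get("name") == IMAGE_ENV_VAR and entry.get("value"):
            return entry["value"]
    raise ValueError(f"marker image used without {IMAGE_ENV_VAR} set")
```

With this, a container entry with `image: lxe` and an `env` entry named `LXC_IMAGE` would resolve to that entry's value, while any other image name is left alone.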

dionysius commented 5 years ago

Yeah, I saw that. Well, this project's goal is to use the lxd toolchain and not to use or depend on docker.

automaticserver / lxe

Alternative approach #8