need --runtime flag support
Closed: mateuszdrab closed this issue 4 months ago.
Hi, I'd like to start this discussion again. For my own purposes I solved this problem like this: https://github.com/KWasm/cri-dockerd/pull/2
There are two problems; in this example I use the containerd-shim-spin:

1. The `RunPodSandbox` call contains the runtimeClassName, e.g. `spin`, but Docker needs `io.containerd.spin.v1`. That value is then passed to containerd, which looks for a binary called `containerd-shim-spin-v1`.
2. The `RunPodSandbox` call contains the runtimeClassName and starts a pause container which should be handled by runc, but the `CreateContainer` call does not contain the runtimeClassName, even though it is needed there to create the container with the right runtime.

As a solution to the first problem I introduced the `--runtime-handler` flag, which is a `map[string]string` and is used to define the mapping at startup, e.g. `--runtime-handler spin=io.containerd.spin.v1`.

For the second problem, I had no better idea than to store the runtime handler as a label on the pause Docker container which represents the pod. Maybe someone has a better idea.

What do you think?
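As an aside, a repeatable `name=runtime` mapping flag like the one described above is straightforward to model in Go (cri-dockerd's language). The sketch below is illustrative only and does not reproduce the linked PR's actual code:

```go
package main

import (
	"fmt"
	"strings"
)

// runtimeHandlerFlag collects repeated "--runtime-handler name=runtime"
// values, e.g. "spin=io.containerd.spin.v1", into a map.
// Hypothetical sketch; names here are not taken from the actual PR.
type runtimeHandlerFlag map[string]string

// String and Set make the type satisfy the flag.Value interface,
// so it could be registered with flag.Var.
func (f runtimeHandlerFlag) String() string { return fmt.Sprint(map[string]string(f)) }

func (f runtimeHandlerFlag) Set(value string) error {
	name, runtime, ok := strings.Cut(value, "=")
	if !ok || name == "" || runtime == "" {
		return fmt.Errorf("expected name=runtime, got %q", value)
	}
	f[name] = runtime
	return nil
}

func main() {
	handlers := runtimeHandlerFlag{}
	// Equivalent to passing: --runtime-handler spin=io.containerd.spin.v1
	if err := handlers.Set("spin=io.containerd.spin.v1"); err != nil {
		panic(err)
	}
	fmt.Println(handlers["spin"]) // io.containerd.spin.v1
}
```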
Ok, so:

1. Docker does not "need" that particular formatting. containerd-compatible runtimes do, but they are two separate things, which is part of the reason that this was not initially addressed. The user stories need some scoping. When you did this, did you also configure it in daemon.json, or pass a container-compatible runtime blindly?
2. We could make pause configurable, but "use foo.bar as a default runtime if the container doesn't specify" is also a Docker config option, which would apply to pause here (for your particular use case, globally -- in general, the "normal" runtime is almost certainly the "right" one for a container as do-nothing as pause).

From a practical POV, the kind of validation in your PR doesn't really belong in cri-dockerd. Installation/management of alternative runtimes (either the actual binary or the configuration of /etc/docker/daemon.json) is an administrator concern, and passing it off to Docker and letting it try to spawn a container is a more coherent workflow than trying to fail slightly sooner unless the right set of options is specified in a second location (no matter what cri-dockerd thinks, if Docker itself doesn't know about the alternative runtime, it's still going to fail there).
The `--runtime` flag for `docker` still checks its config. We can pass it here, but the validation/regexp logic, allowing a different place to set a default (which is not in daemon.json), and others aren't in the right scope of responsibility here, since they must be set in Docker anyway, and it's the source of authority.
@evol262 This is a bit more complicated, for two reasons, around setting a default runtime in the daemon config and around `--runtime` in the individual container creation -- we special-case containerd shim names.

In short, the config for setting a default runtime is currently a bit busted, as the code is so brittle that it's blocked on a major refactor of config-handling. So for now, `--runtime` is the way to do this, unless you want to change the path of runc or use a drop-in replacement like crun.
Thanks, @evol262!
I just realized that I was confused by the term "alternative runtimes"; it means OCI-compliant runtimes other than runc, like crun, youki, or gvisor. Right? And they can be configured in daemon.json and passed via `docker run --runtime <name_configured_in_daemon.json>`.

What I am working on is using alternative containerd shims, as described in https://docs.docker.com/desktop/wasm/. And alternative containerd shims are also passed via `docker run --runtime <io.containerd.ShimName.Version>`.
Thanks @neersighted, would it be an option to pass the `runtimeClassName` as it is, and we change the daemon.json so that it is possible to add containerd shims as runtimes, like:

```
"runtimes": {
  "spin": {
    "shim": "io.containerd.spin.v1"
  },
  ...
```

That would address @evol262's concern that Docker should be the leading system for managing the runtimes (and shims).
I think it would make sense to pass values through without any opinion in cri-dockerd. I just explained why daemon.json is not an option right now -- that will take some time, upstream, to fix.
I also want to push back on your config snippet -- `shim` is not the right level of abstraction, as to containerd, the shim is the runtime. Instead we've leaked the `io.containerd.runc.v2` implementation details (runc, crun, other compatible implementations) into Moby, which was a mistake caused by the migration from the v1 to v2 runtime APIs.
(or as @corhere explained it to me, the shim is the OCI runtime and runc is just an implementation detail; this makes sense when you consider the Kata and gVisor process models)
In short, no opinions in cri-dockerd makes sense, and the work on making alternate runtimes easier to use continues upstream, though most of the people involved in making it happen are working on other issues right now. I can bring it up in today's maintainer's call to see where people are at on returning to it.
Understood, I can see that there is no way to solve this problem in cri-dockerd.
For my purpose, a hacky solution is good enough, so I have no need to push further on this topic.
Thanks for the discussion 😊
> 1. Docker does not "need" that particular formatting. containerd-compatible runtimes do, but they are two separate things, which is part of the reason that this was not initially addressed. The user stories need some scoping. When you did this, did you also configure it in daemon.json, or pass a container-compatible runtime blindly?
Blindly passing a containerd-compatible runtime name is the correct way to use it. Any shimv2 runtime on containerd's `PATH` can be used with Docker without any daemon.json configuration. See https://github.com/moby/moby/pull/43800 for more details. From the Engine API client's perspective, there is no meaningful distinction between shimv2 runtimes and configured-in-daemon.json runtimes.
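As background, containerd resolves a shimv2 runtime name to a binary on `$PATH` by convention: `io.containerd.spin.v1` becomes `containerd-shim-spin-v1`. A rough Go illustration of that convention (not containerd's actual code, which also handles explicit paths and other cases):

```go
package main

import (
	"fmt"
	"strings"
)

// shimBinaryName mirrors containerd's naming convention for shimv2
// runtimes: the last two dot-separated components become the shim
// binary's name and version, e.g.
//   io.containerd.spin.v1 -> containerd-shim-spin-v1
// Illustrative reimplementation only.
func shimBinaryName(runtime string) (string, error) {
	parts := strings.Split(runtime, ".")
	if len(parts) < 2 {
		return "", fmt.Errorf("%q is not a shimv2 runtime name", runtime)
	}
	name := parts[len(parts)-2]
	version := parts[len(parts)-1]
	return "containerd-shim-" + name + "-" + version, nil
}

func main() {
	for _, r := range []string{"io.containerd.spin.v1", "io.containerd.runc.v2"} {
		bin, _ := shimBinaryName(r)
		fmt.Printf("%s -> %s\n", r, bin)
	}
}
```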
> 2. in general, the "normal" runtime is almost certainly the "right" one for a container as do-nothing as pause
I disagree. Mixing runtimes for containers in a pod could break Kata and other runtimes which aren't namespace-based. Kata, for instance, runs all containers in a pod inside a single VM and likely wouldn't take too kindly to being asked to start a container in a sandbox it didn't also create.
> In short the config for setting a default runtime is currently a bit busted as the code is so brittle that it's blocked on a major refactor of config-handling. So for now, `--runtime` is the way to do this, unless you want to change the path of runc/use a drop-in replacement like crun.
> would it be an option to pass the ~~runtimeClassName~~ RuntimeClass handler as it is ~~and we change the daemon.json~~ so that it is possible to add containerd shims as runtimes

FTFY. Kubernetes RuntimeClass objects are already a mapping from names to runtimes (runtimeClassName -> handler) and the Docker daemon has a mapping from runtimes to configs (explicit daemon.json config for legacy, implicit identity mappings for now for shimv2), so there is no need to put another layer of mapping names to runtimes into cri-dockerd.
Looks like I missed the follow-up PR, thanks for clarifying @corhere!
> Blindly passing a containerd-compatible runtime name is the correct way to use it. Any shimv2 runtime on containerd's PATH can be used with Docker without any daemon.json configuration. See moby/moby#43800 for more details. From the Engine API client's perspective, there is no meaningful distinction between shimv2 runtimes and configured-in-daemon.json runtimes.
Blindly passing a containerd-compatible runtime name is the correct way to use it for moby. cri-dockerd can take it as a given that the administrator has hopefully set up the right node annotations so the scheduler will place it somewhere that it's configured, but the feedback loop around failing to schedule/start when blindly passing is terrible, and a richer way to do it is much better.
> I disagree. Mixing runtimes for containers in a pod could break Kata and other runtimes which aren't namespace-based. Kata, for instance, runs all containers in a pod inside a single VM and likely wouldn't take too kindly to being asked to start a container in a sandbox it didn't also create.
It's not mixing runtimes for containers. The only purpose of the pod-infra-container image is to create a namespace which the containers can graft onto, which is completely irrelevant with Kata anyway, since the isolation is via KVM. Support for Kata with pods is finicky enough anyway, because the container's config needs to be annotated to tell Kata how to handle it regardless -- something which cri-dockerd can't do anyway, since docker/moby handles that passthrough to containerd. It's not in the scope here.
> Blindly passing a containerd-compatible runtime name is the correct way to use it for moby. cri-dockerd can take it as a given that the administrator has hopefully set up the right node annotations so the scheduler will place it somewhere that it's configured, but the feedback loop around failing to schedule/start when blindly passing is terrible, and a richer way to do it is much better.
> From a practical POV, the kind of validation in your PR doesn't really belong in cri-dockerd. Installation/management of alternative runtimes (either the actual binary or the configuration of /etc/docker/daemon.json) is an administrator concern, and passing it off to Docker and letting it try to spawn a container is a more coherent workflow than trying to fail slightly sooner unless the right set of options is specified in a second location (no matter what cri-dockerd thinks, if docker itself doesn't know about the alternative runtime, it's still going to fail there).
So should cri-dockerd validate the name of the runtime it passes to the daemon, or not?
> It's not mixing runtimes for containers.
What were you suggesting, then? I read it as "don't bother starting the pause container with the pod's configured runtime; using the daemon's default runtime to start pause containers is always fine, even when the workload containers are to be started with a different runtime." Using a different runtime to start the sandbox container than the pod's application containers would lead to broken behaviour with some runtimes which use isolation techniques other than Linux namespaces.
> The only purpose of the pod-infra-container-image is to create a namespace which the containers can graft onto, which is completely irrelevant with kata anyway, since the isolation is via KVM
That's the only purpose of the pause container with namespace-based runtimes. The image may be irrelevant to Kata, but the act of starting a pause container is necessary for other purposes.
> support for Kata with pods is finicky enough anyway, because the container's config needs to be annotated to tell Kata how to handle it regardless -- something which cri-dockerd can't do anyway, since docker/moby handles that passthrough to containerd. It's not in the scope here.
This feature request is titled "Support for alternative runtimes." Kata is an alternative runtime. While it may be the case that extra work above and beyond what is being considered here may be needed before cri-dockerd can be compatible with Kata or some other runtime, I don't see why that would justify making design decisions which would be actively hostile to allowing cri-dockerd to be compatible with a subset of runtimes in the future.
As an aside, I have been working on adding first-class support for Kata to moby. While cri-dockerd can't ask the daemon to add OCI annotations today, it'd be fairly trivial to add that functionality into the daemon. Kata is already ready for us, too.
> So should cri-dockerd validate the name of the runtime it passes to the daemon, or not?
No, but frankly, "just pass it blindly and hope it doesn't bomb with no meaningful feedback" when we pass unsanitized input to Docker isn't great either.
> What were you suggesting, then? I read it as "don't bother starting the pause container with the pod's configured runtime; using the daemon's default runtime to start pause containers is always fine, even when the workload containers are to be started with a different runtime." Using a different runtime to start the sandbox container than the pod's application containers would lead to broken behaviour with some runtimes which use isolation techniques other than Linux namespaces.
I'm suggesting that this is not even remotely the right place to have that discussion, that k8s doesn't actually care about the "sandbox container" except insofar as it provides somewhere the rest of them can hook onto (which could also be a VM), but that the discussion is utterly moot until we can ask the daemon to add OCI annotations.
I'm suggesting "don't bother getting into thorny discussions about edge cases which are not even technically fulfillable yet, and don't cherry-pick a driver which has a different isolation model as an edge case." That discussion can happen once the rest of the pieces are there.
> That's the only purpose of the pause container with namespace-based runtimes. The image may be irrelevant to Kata, but the act of starting a pause container is necessary for other purposes.
@corhere. Stop. That is the purpose of the pause container in the kubernetes model. It provides a holding point for a shared process/network namespace and a nominal PID1. The "act of starting a pause container" is functionally equivalent to "tell Kata via annotations to start a new VM and stick this container in it".
Really, this discussion is not appropriate here. It is mixing paradigms in a bad way between what k8s does/expects and other container stuff.
> This feature request is titled "Support for alternative runtimes." Kata is an alternative runtime. While it may be the case that extra work above and beyond what is being considered here may be needed before cri-dockerd can be compatible with Kata or some other runtime, I don't see why that would justify making design decisions which would be actively hostile to allowing cri-dockerd to be compatible with a subset of runtimes in the future.
Again, stop. Step away from the GH issue. "We're not going to do anything right now because there is not enough in Docker to hook onto to provide a user experience or even develop user stories" is not actively hostile. It is, quite literally, the status quo with the current state.
> As an aside, I have been working on adding first-class support for Kata to moby. While cri-dockerd can't ask the daemon to add OCI annotations today, it'd be fairly trivial to add that functionality into the daemon. Kata is already ready for us, too.
Great. We'll wait for it.
To be perfectly clear, the scope of this is small. Hypothetical discussions about what could or could not (or should or should not) be done about firecracker, kata, and other container drivers with different isolation models is a waste of time and energy for everyone involved until whatever time the support is actually in the daemon, exposed in the API, and widely-available enough that there's an actual user request for fulfilling it.
Kata/Firecracker need to start some container just for coherency with what the k8s domain model says should be there, but "these other container drivers which aren't in a state where cri-dockerd could meaningfully consume them anyway don't use namespaces/cgroups for isolation" is both true and meaningless at this point in time.
In the meantime, the questions are narrow: if we actually get back some `err` from the daemon when the runtime is unavailable, then we could start doing it. If not, then it waits (as it has for the past 6 months) until there is one, for our own sanity.
The `/containers/{}/start` request will return an HTTP 500 response if the runtime is unavailable.

```
< HTTP/1.1 500 Internal Server Error
< Api-Version: 1.43
< Content-Type: application/json
< Docker-Experimental: false
< Ostype: linux
< Server: Docker/dev (linux)
< Date: Fri, 17 Feb 2023 03:41:04 GMT
< Content-Length: 218
<
{"message":"failed to create task for container: failed to start shim: failed to resolve runtime path: runtime \"io.containerd.foo.bar\" binary not installed \"containerd-shim-foo-bar\": file does not exist: unknown"}
```
Thanks @corhere. Getting an `err` is much better than a less descriptive result we'd need to suss out, and while I don't love the "match an error against some regexp" stuff we inherited from the original dockershim, that's easily consistent enough to match up and provide a more concise message back to k8s.
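For illustration, the kind of regexp matching alluded to here might look like the following Go sketch. The error wording it matches is taken from the 500 response above and is a containerd implementation detail that may change between releases:

```go
package main

import (
	"fmt"
	"regexp"
)

// missingRuntimeRE matches the daemon error shown above and captures the
// runtime name and the shim binary. Sketch only, not cri-dockerd's code.
var missingRuntimeRE = regexp.MustCompile(`runtime "([^"]+)" binary not installed "([^"]+)"`)

// conciseRuntimeError shortens the verbose daemon message for reporting
// back to Kubernetes, or returns the input unchanged if it doesn't match.
func conciseRuntimeError(msg string) string {
	if m := missingRuntimeRE.FindStringSubmatch(msg); m != nil {
		return fmt.Sprintf("runtime %s is not installed on this node (missing %s)", m[1], m[2])
	}
	return msg
}

func main() {
	daemonMsg := `failed to create task for container: failed to start shim: failed to resolve runtime path: runtime "io.containerd.foo.bar" binary not installed "containerd-shim-foo-bar": file does not exist: unknown`
	fmt.Println(conciseRuntimeError(daemonMsg))
}
```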
> Hypothetical discussions about what could or could not (or should or should not) be done about firecracker, kata, and other container drivers with different isolation models is a waste of time and energy for everyone involved until whatever time the support is actually in the daemon, exposed in the API, and widely-available enough that there's an actual user request for fulfilling it.
`docker run --runtime io.containerd.kata.v2` works today with v23.0.1. This thread is the user request. The only thing hypothetical is support from cri-dockerd. Are there any other blockers on the daemon side aside from the annotation?
> FTFY. Kubernetes RuntimeClass objects are already a mapping from names to runtimes (runtimeClassName -> handler) and the Docker daemon has a mapping from runtimes to configs (explicit daemon.json config for legacy, implicit identity mappings for now for shimv2) so there is no need to put another layer of mapping names to runtimes into cri-dockerd.
I agree that cri-dockerd should not have a mapping, and I just hacked something together for a demo. But in my example I created a Pod with `runtimeClassName: wasmedge`, which is mapped to a handler by the following runtime class:

```yaml
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: wasmedge
handler: io.containerd.wasmedge.v1
```

That leads to the error that the handler is not valid:

```
The RuntimeClass "wasmedge" is invalid: handler: Invalid value: "io.containerd.wasmedge.v1": a lowercase RFC 1123 label must consist of lower case alphanumeric characters or '-', and must start and end with an alphanumeric character (e.g. 'my-name', or '123-abc', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?')
```
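The rejection can be reproduced with the exact regex the error message cites; a minimal Go sketch:

```go
package main

import (
	"fmt"
	"regexp"
)

// rfc1123Label is the validation regex quoted in the Kubernetes error
// above: a lowercase RFC 1123 label. Dots are not allowed, which is why
// a shimv2 name cannot be used directly as a RuntimeClass handler.
var rfc1123Label = regexp.MustCompile(`^[a-z0-9]([-a-z0-9]*[a-z0-9])?$`)

func main() {
	for _, h := range []string{"wasmedge", "io.containerd.wasmedge.v1"} {
		fmt.Printf("%-26s valid=%v\n", h, rfc1123Label.MatchString(h))
	}
}
```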
@corhere When I can't have `io.containerd.wasmedge.v1` as the handler value, which valid value can I pass to Docker as the `runtime` that is explicitly or implicitly mapped to `io.containerd.wasmedge.v1`? (To run this example on Kubernetes.)
> That leads to the error that the handler is not valid
That's unfortunate. I misread the Kubernetes docs and overlooked that dots are disallowed in handler strings. So there will need to be a way to configure aliases for shimv2 runtimes in the daemon to unblock this.
I've opened https://github.com/moby/moby/issues/45030 to track any necessary daemon-side work.
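For a sense of what such an alias might look like, here is a hedged daemon.json sketch, assuming the daemon-side work lands with a shape similar to the existing `runtimes` config (the `runtimeType` field name is illustrative, not a shipped schema at the time of this thread):

```json
{
  "runtimes": {
    "wasmedge": {
      "runtimeType": "io.containerd.wasmedge.v1"
    }
  }
}
```

A RuntimeClass could then use the RFC 1123-safe handler name `wasmedge`, while the daemon resolves it to the shimv2 runtime.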
> `docker run --runtime io.containerd.kata.v2` works today with v23.0.1. This thread is the user request. The only thing hypothetical is support from cri-dockerd. Are there any other blockers on the daemon side aside from the annotation?
That is a blocker, yes.
The other blocker is "other than a single query for discussion six months ago, there was no work done on this because the amount of development time for it is small, and 'nice to have' questions for discussion about eventual implementation with no concrete use case don't get prioritized."
Patches welcome. When the daemon-side work is done would be a good time. Not to put too fine a point on it, but the speed of development here is often driven by the fact that there's only a very small amount of people writing patches, and the discrete needs (rather than asks, mostly) tend to get the focus.
moby/moby#45032 got out really fast (thanks!), but it's a poor solution for us, even after moby/moby#45032 is applied. cri-dockerd doesn't have a meaningful way to actually figure out what the backing runtime is. `api.types.Runtime` isn't surfaced in the client. cri-dockerd can pass a bare string, but that's potentially just an alias, and we have no real idea whether or not we should add them at all, despite the ability to get/set them. It's still more or less blind.

It's true that this more or less matches the containerd experience, and it's a fine starting point, but it's also true that we have room to do a little bit better even by surfacing `daemon.config.Config.Runtimes` up to the client.
The daemon runtimes config is surfaced to the client at `GET /info`:

```
# curl -s --unix-socket /var/run/docker.sock http://./v1.42/info | jq .Runtimes
{
  "crun": {
    "runtimeType": "io.containerd.runc.v2",
    "options": {
      "BinaryName": "/usr/local/bin/crun"
    }
  },
  "runc": {
    "path": "runc"
  }
}
```
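A minimal Go sketch of decoding that `Runtimes` payload on the client side. The struct here is hand-rolled for illustration and follows the JSON shape shown above; it is not the daemon's own type, which (as noted in this thread) isn't surfaced in the Go client:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// runtimeConfig models one entry of the Runtimes map returned by GET /info.
// Field names follow the JSON shown above; illustrative only.
type runtimeConfig struct {
	Path        string                 `json:"path,omitempty"`
	RuntimeType string                 `json:"runtimeType,omitempty"`
	Options     map[string]interface{} `json:"options,omitempty"`
}

// parseRuntimes decodes the .Runtimes payload from /info.
func parseRuntimes(data []byte) (map[string]runtimeConfig, error) {
	var runtimes map[string]runtimeConfig
	if err := json.Unmarshal(data, &runtimes); err != nil {
		return nil, err
	}
	return runtimes, nil
}

func main() {
	// The same payload as the curl output above.
	payload := []byte(`{
	  "crun": {"runtimeType": "io.containerd.runc.v2",
	           "options": {"BinaryName": "/usr/local/bin/crun"}},
	  "runc": {"path": "runc"}
	}`)
	runtimes, err := parseRuntimes(payload)
	if err != nil {
		panic(err)
	}
	fmt.Println(runtimes["crun"].RuntimeType) // io.containerd.runc.v2
	fmt.Println(runtimes["runc"].Path)        // runc
}
```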
Yes, but speaking of potential RPC overhead, it's not exactly GraphQL, and there are a lot of fields to serialize/deserialize if we potentially hit that every single time a container is scheduled with some custom `RuntimeClass`. Or we take the risk of caching it aside in cri-dockerd and hoping that it stays valid.
Any update?

I found that `pod.RuntimeClass` is never set to anything other than runc; it is always runc.

The default-runtime is runc, and the runtime list:

```
# curl -s --unix-socket /var/run/docker.sock http://./v1.42/info | jq .Runtimes
{
  "io.containerd.runc.v2": {
    "path": "runc"
  },
  "nvidia": {
    "path": "nvidia-container-runtime"
  },
  "runc": {
    "path": "runc"
  }
}
```
```yaml
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: docker
---
apiVersion: v1
kind: Pod
metadata:
  ....
spec:
  runtimeClassName: nvidia
```

```
$ docker inspect be4d | grep -i runtime
        "Runtime": "runc",
```
Docker recently started supporting alternative runtimes with the `--runtime` flag. This allows users to use runtimes other than runc, such as Kata (which I'd like to use on some pods due to the additional benefits of kernel isolation on external-facing applications). Additionally, cri-dockerd support was just merged into K3s, allowing users to continue to use Docker after upgrading to Kubernetes 1.24 and above.

Unfortunately, those who choose to keep using Docker with Kubernetes will be constrained by the lack of alternative runtime support, meaning native Kubernetes support in the form of RuntimeClasses is made redundant, as the CRI layer is a bottleneck.

I think adding support for passing through the requested runtime should not be too difficult, so I was wondering if it's something that could be implemented?