need --runtime flag support
Closed: mateuszdrab closed this issue 4 months ago.
Hi, I'd like to start this discussion again. For my own purposes I solved this problem like this: https://github.com/KWasm/cri-dockerd/pull/2
There are two problems; in this example I use the containerd-shim-spin:

1. The `RunPodSandbox` call contains the runtimeClassName, e.g. `spin`, but Docker needs `io.containerd.spin.v1`. That value is then passed to containerd, which looks for a binary called `containerd-shim-spin-v1`.
2. The `RunPodSandbox` call contains the runtimeClassName and starts a pause container which should be handled by runc, but the `CreateContainer` call does not contain the runtimeClassName, even though it is needed there to create the container with the right runtime.

As a solution to the first problem I introduced the `--runtime-handler` flag, which is a `map[string]string` and is used to define the mapping at startup, e.g. `--runtime-handler spin=io.containerd.spin.v1`.

For the second problem, I had no better idea than to store the runtime handler as a label on the pause Docker container which represents the pod. Maybe someone has a better idea.

What do you think?
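As an aside, a repeatable `name=runtime` mapping flag like the one described above is straightforward to model in Go (cri-dockerd's language). The sketch below is illustrative only and does not reproduce the linked PR's actual code:

```go
package main

import (
	"fmt"
	"strings"
)

// runtimeHandlerFlag collects repeated "--runtime-handler name=runtime"
// values, e.g. "spin=io.containerd.spin.v1", into a map.
// Hypothetical sketch; names here are not taken from the actual PR.
type runtimeHandlerFlag map[string]string

// String and Set make the type satisfy the flag.Value interface,
// so it could be registered with flag.Var.
func (f runtimeHandlerFlag) String() string { return fmt.Sprint(map[string]string(f)) }

func (f runtimeHandlerFlag) Set(value string) error {
	name, runtime, ok := strings.Cut(value, "=")
	if !ok || name == "" || runtime == "" {
		return fmt.Errorf("expected name=runtime, got %q", value)
	}
	f[name] = runtime
	return nil
}

func main() {
	handlers := runtimeHandlerFlag{}
	// Equivalent to passing: --runtime-handler spin=io.containerd.spin.v1
	if err := handlers.Set("spin=io.containerd.spin.v1"); err != nil {
		panic(err)
	}
	fmt.Println(handlers["spin"]) // io.containerd.spin.v1
}
```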
Ok, so:

1. Docker does not "need" that particular formatting. containerd-compatible runtimes do, but they are two separate things, which is part of the reason that this was not initially addressed. The user stories need some scoping. When you did this, did you also configure it in daemon.json, or pass a container-compatible runtime blindly?
2. We could make pause configurable, but "use foo.bar as a default runtime if the container doesn't specify" is also a Docker config option, which would apply to pause here (for your particular use case, globally -- in general, the "normal" runtime is almost certainly the "right" one for a container as do-nothing as pause).

From a practical POV, the kind of validation in your PR doesn't really belong in cri-dockerd. Installation/management of alternative runtimes (either the actual binary or the configuration of /etc/docker/daemon.json) is an administrator concern, and passing it off to Docker and letting it try to spawn a container is a more coherent workflow than trying to fail slightly sooner unless the right set of options is specified in a second location (no matter what cri-dockerd thinks, if Docker itself doesn't know about the alternative runtime, it's still going to fail there).
The `--runtime` flag for `docker` still checks its config. We can pass it here, but the validation/regexp logic, allowing a different place to set a default (which is not in daemon.json), and others aren't in the right scope of responsibility here, since they must be set in Docker anyway, and it's the source of authority.
@evol262 This is a bit more complicated, for two reasons, around setting a default runtime in the daemon config and around `--runtime` in the individual container creation -- we special-case containerd shim names.

In short, the config for setting a default runtime is currently a bit busted, as the code is so brittle that it's blocked on a major refactor of config-handling. So for now, `--runtime` is the way to do this, unless you want to change the path of runc or use a drop-in replacement like crun.
Thanks, @evol262!
I just realized that I was confused by the term "alternative runtimes"; it means OCI-compliant runtimes other than runc, like crun, youki, or gvisor. Right? And they can be configured in daemon.json and passed via `docker run --runtime <name_configured_in_daemon.json>`.

What I am working on is using alternative containerd shims, as described in https://docs.docker.com/desktop/wasm/. And alternative containerd shims are also passed via `docker run --runtime <io.containerd.ShimName.Version>`.
Thanks @neersighted, would it be an option to pass the `runtimeClassName` as it is, and we change the daemon.json so that it is possible to add containerd shims as runtimes, like:

```
"runtimes": {
  "spin": {
    "shim": "io.containerd.spin.v1"
  },
  ...
```

That would address @evol262's concern that Docker should be the leading system for managing the runtimes (and shims).
I think it would make sense to pass values through without any opinion in cri-dockerd. I just explained why daemon.json is not an option right now -- that will take some time, upstream, to fix.
I also want to push back on your config snippet -- `shim` is not the right level of abstraction, as to containerd, the shim is the runtime. Instead we've leaked the `io.containerd.runc.v2` implementation details (runc, crun, other compatible implementations) into Moby, which was a mistake caused by the migration from the v1 to v2 runtime APIs.
(or as @corhere explained it to me, the shim is the OCI runtime and runc is just an implementation detail; this makes sense when you consider the Kata and gVisor process models)
In short, no opinions in cri-dockerd makes sense, and the work on making alternate runtimes easier to use continues upstream, though most of the people involved in making it happen are working on other issues right now. I can bring it up in today's maintainer's call to see where people are at on returning to it.
Understood, I can see that there is no way to solve this problem in cri-dockerd.
For my purpose, a hacky solution is good enough, so I have no need to push further on this topic.
Thanks for the discussion 😊
> 1. Docker does not "need" that particular formatting. containerd-compatible runtimes do, but they are two separate things, which is part of the reason that this was not initially addressed. The user stories need some scoping. When you did this, did you also configure it in daemon.json, or pass a container-compatible runtime blindly?
Blindly passing a containerd-compatible runtime name is the correct way to use it. Any shimv2 runtime on containerd's `PATH` can be used with Docker without any daemon.json configuration. See https://github.com/moby/moby/pull/43800 for more details. From the Engine API client's perspective, there is no meaningful distinction between shimv2 runtimes and configured-in-daemon.json runtimes.
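As background, containerd resolves a shimv2 runtime name to a binary on `$PATH` by convention: `io.containerd.spin.v1` becomes `containerd-shim-spin-v1`. A rough Go illustration of that convention (not containerd's actual code, which also handles explicit paths and other cases):

```go
package main

import (
	"fmt"
	"strings"
)

// shimBinaryName mirrors containerd's naming convention for shimv2
// runtimes: the last two dot-separated components become the shim
// binary's name and version, e.g.
//   io.containerd.spin.v1 -> containerd-shim-spin-v1
// Illustrative reimplementation only.
func shimBinaryName(runtime string) (string, error) {
	parts := strings.Split(runtime, ".")
	if len(parts) < 2 {
		return "", fmt.Errorf("%q is not a shimv2 runtime name", runtime)
	}
	name := parts[len(parts)-2]
	version := parts[len(parts)-1]
	return "containerd-shim-" + name + "-" + version, nil
}

func main() {
	for _, r := range []string{"io.containerd.spin.v1", "io.containerd.runc.v2"} {
		bin, _ := shimBinaryName(r)
		fmt.Printf("%s -> %s\n", r, bin)
	}
}
```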
> 2. in general, the "normal" runtime is almost certainly the "right" one for a container as do-nothing as pause
I disagree. Mixing runtimes for containers in a pod could break Kata and other runtimes which aren't namespace-based. Kata, for instance, runs all containers in a pod inside a single VM and likely wouldn't take too kindly to being asked to start a container in a sandbox it didn't also create.
> In short the config for setting a default runtime is currently a bit busted as the code is so brittle that it's blocked on a major refactor of config-handling. So for now, `--runtime` is the way to do this, unless you want to change the path of runc/use a drop-in replacement like crun.
> would it be an option to pass the ~~runtimeClassName~~ RuntimeClass handler as it is ~~and we change the daemon.json~~ so that it is possible to add containerd shims as runtimes

FTFY. Kubernetes RuntimeClass objects are already a mapping from names to runtimes (runtimeClassName -> handler) and the Docker daemon has a mapping from runtimes to configs (explicit daemon.json config for legacy, implicit identity mappings for now for shimv2), so there is no need to put another layer of mapping names to runtimes into cri-dockerd.
Looks like I missed the follow-up PR, thanks for clarifying @corhere!
> Blindly passing a containerd-compatible runtime name is the correct way to use it. Any shimv2 runtime on containerd's PATH can be used with Docker without any daemon.json configuration. See moby/moby#43800 for more details. From the Engine API client's perspective, there is no meaningful distinction between shimv2 runtimes and configured-in-daemon.json runtimes.
Blindly passing a containerd-compatible runtime name is the correct way to use it for moby. cri-dockerd can take it as a given that the administrator has hopefully set up the right node annotations so the scheduler will place it somewhere that it's configured, but the feedback loop around failing to schedule/start when blindly passing is terrible, and a richer way to do it is much better.
> I disagree. Mixing runtimes for containers in a pod could break Kata and other runtimes which aren't namespace-based. Kata, for instance, runs all containers in a pod inside a single VM and likely wouldn't take too kindly to being asked to start a container in a sandbox it didn't also create.
It's not mixing runtimes for containers. The only purpose of the pod-infra-container image is to create a namespace which the containers can graft onto, which is completely irrelevant with Kata anyway, since the isolation is via KVM. Support for Kata with pods is finicky enough anyway, because the container's config needs to be annotated to tell Kata how to handle it regardless -- something which cri-dockerd can't do anyway, since docker/moby handles that passthrough to containerd. It's not in the scope here.
> Blindly passing a containerd-compatible runtime name is the correct way to use it for moby. cri-dockerd can take it as a given that the administrator has hopefully set up the right node annotations so the scheduler will place it somewhere that it's configured, but the feedback loop around failing to schedule/start when blindly passing is terrible, and a richer way to do it is much better.
> From a practical POV, the kind of validation in your PR doesn't really belong in cri-dockerd. Installation/management of alternative runtimes (either the actual binary or the configuration of /etc/docker/daemon.json) is an administrator concern, and passing it off to Docker and letting it try to spawn a container is a more coherent workflow than trying to fail slightly sooner unless the right set of options is specified in a second location (no matter what cri-dockerd thinks, if docker itself doesn't know about the alternative runtime, it's still going to fail there).
So should cri-dockerd validate the name of the runtime it passes to the daemon, or not?
> It's not mixing runtimes for containers.
What were you suggesting, then? I read it as "don't bother starting the pause container with the pod's configured runtime; using the daemon's default runtime to start pause containers is always fine, even when the workload containers are to be started with a different runtime." Using a different runtime to start the sandbox container than the pod's application containers would lead to broken behaviour with some runtimes which use isolation techniques other than Linux namespaces.
> The only purpose of the pod-infra-container-image is to create a namespace which the containers can graft onto, which is completely irrelevant with kata anyway, since the isolation is via KVM
That's the only purpose of the pause container with namespace-based runtimes. The image may be irrelevant to Kata, but the act of starting a pause container is necessary for other purposes.
> support for Kata with pods is finicky enough anyway, because the container's config needs to be annotated to tell Kata how to handle it regardless -- something which cri-dockerd can't do anyway, since docker/moby handles that passthrough to containerd. It's not in the scope here.
This feature request is titled "Support for alternative runtimes." Kata is an alternative runtime. While it may be the case that extra work above and beyond what is being considered here may be needed before cri-dockerd can be compatible with Kata or some other runtime, I don't see why that would justify making design decisions which would be actively hostile to allowing cri-dockerd to be compatible with a subset of runtimes in the future.
As an aside, I have been working on adding first-class support for Kata to moby. While cri-dockerd can't ask the daemon to add OCI annotations today, it'd be fairly trivial to add that functionality into the daemon. Kata is already ready for us, too.
> So should cri-dockerd validate the name of the runtime it passes to the daemon, or not?
No, but frankly, "just pass it blindly and hope it doesn't bomb with no meaningful feedback" when we pass unsanitized input to Docker isn't great either.
> What were you suggesting, then? I read it as "don't bother starting the pause container with the pod's configured runtime; using the daemon's default runtime to start pause containers is always fine, even when the workload containers are to be started with a different runtime." Using a different runtime to start the sandbox container than the pod's application containers would lead to broken behaviour with some runtimes which use isolation techniques other than Linux namespaces.
I'm suggesting that this is not even remotely the right place to have that discussion, that k8s doesn't actually care about the "sandbox container" except insofar as it provides somewhere the rest of them can hook onto (which could also be a VM), but that the discussion is utterly moot until we can ask the daemon to add OCI annotations.
I'm suggesting "don't bother getting into thorny discussions about edge cases which are not even technically fulfillable yet, and don't cherry-pick a driver which has a different isolation model as an edge case." That discussion can happen once the rest of the pieces are there.
> That's the only purpose of the pause container with namespace-based runtimes. The image may be irrelevant to Kata, but the act of starting a pause container is necessary for other purposes.
@corhere. Stop. That is the purpose of the pause container in the kubernetes model. It provides a holding point for a shared process/network namespace and a nominal PID1. The "act of starting a pause container" is functionally equivalent to "tell Kata via annotations to start a new VM and stick this container in it".
Really, this discussion is not appropriate here. It is mixing paradigms in a bad way between what k8s does/expects and other container stuff.
> This feature request is titled "Support for alternative runtimes." Kata is an alternative runtime. While it may be the case that extra work above and beyond what is being considered here may be needed before cri-dockerd can be compatible with Kata or some other runtime, I don't see why that would justify making design decisions which would be actively hostile to allowing cri-dockerd to be compatible with a subset of runtimes in the future.
Again, stop. Step away from the GH issue. "We're not going to do anything right now because there is not enough in Docker to hook onto to provide a user experience or even develop user stories" is not actively hostile. It is, quite literally, the status quo with the current state.
> As an aside, I have been working on adding first-class support for Kata to moby. While cri-dockerd can't ask the daemon to add OCI annotations today, it'd be fairly trivial to add that functionality into the daemon. Kata is already ready for us, too.
Great. We'll wait for it.
To be perfectly clear, the scope of this is small. Hypothetical discussions about what could or could not (or should or should not) be done about firecracker, kata, and other container drivers with different isolation models is a waste of time and energy for everyone involved until whatever time the support is actually in the daemon, exposed in the API, and widely-available enough that there's an actual user request for fulfilling it.
Kata/Firecracker need to start some container just for coherency with what the k8s domain model says should be there, but "these other container drivers which aren't in a state where cri-dockerd could meaningfully consume them anyway don't use namespaces/cgroups for isolation" is both true and meaningless at this point in time.
In the meantime, the questions are narrow: if we actually get back some `err` from the daemon when the runtime is unavailable, then we could start doing it. If not, then it waits (as it has for the past 6 months) until there is one, for our own sanity.
The `/containers/{}/start` request will return an HTTP 500 response if the runtime is unavailable.

```
< HTTP/1.1 500 Internal Server Error
< Api-Version: 1.43
< Content-Type: application/json
< Docker-Experimental: false
< Ostype: linux
< Server: Docker/dev (linux)
< Date: Fri, 17 Feb 2023 03:41:04 GMT
< Content-Length: 218
<
{"message":"failed to create task for container: failed to start shim: failed to resolve runtime path: runtime \"io.containerd.foo.bar\" binary not installed \"containerd-shim-foo-bar\": file does not exist: unknown"}
```
Thanks @corhere. Getting an `err` is much better than a less descriptive result we'd need to suss out, and while I don't love the "match an error against some regexp" stuff we inherited from the original dockershim, that's easily consistent enough to match up and provide a more concise message back to k8s.
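For illustration, the kind of regexp matching alluded to here might look like the following Go sketch. The error wording it matches is taken from the 500 response above and is a containerd implementation detail that may change between releases:

```go
package main

import (
	"fmt"
	"regexp"
)

// missingRuntimeRE matches the daemon error shown above and captures the
// runtime name and the shim binary. Sketch only, not cri-dockerd's code.
var missingRuntimeRE = regexp.MustCompile(`runtime "([^"]+)" binary not installed "([^"]+)"`)

// conciseRuntimeError shortens the verbose daemon message for reporting
// back to Kubernetes, or returns the input unchanged if it doesn't match.
func conciseRuntimeError(msg string) string {
	if m := missingRuntimeRE.FindStringSubmatch(msg); m != nil {
		return fmt.Sprintf("runtime %s is not installed on this node (missing %s)", m[1], m[2])
	}
	return msg
}

func main() {
	daemonMsg := `failed to create task for container: failed to start shim: failed to resolve runtime path: runtime "io.containerd.foo.bar" binary not installed "containerd-shim-foo-bar": file does not exist: unknown`
	fmt.Println(conciseRuntimeError(daemonMsg))
}
```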
> Hypothetical discussions about what could or could not (or should or should not) be done about firecracker, kata, and other container drivers with different isolation models is a waste of time and energy for everyone involved until whatever time the support is actually in the daemon, exposed in the API, and widely-available enough that there's an actual user request for fulfilling it.
`docker run --runtime io.containerd.kata.v2` works today with v23.0.1. This thread is the user request. The only thing hypothetical is support from cri-dockerd. Are there any other blockers on the daemon side aside from the annotation?
> FTFY. Kubernetes RuntimeClass objects are already a mapping from names to runtimes (runtimeClassName -> handler) and the Docker daemon has a mapping from runtimes to configs (explicit daemon.json config for legacy, implicit identity mappings for now for shimv2) so there is no need to put another layer of mapping names to runtimes into cri-dockerd.
I agree that cri-dockerd should not have a mapping, and I just hacked something together for a demo. But in my example I created a Pod with `runtimeClassName: wasmedge`, which is mapped to a handler by the following runtime class:

```yaml
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: wasmedge
handler: io.containerd.wasmedge.v1
```

That leads to the error that the handler is not valid:

```
The RuntimeClass "wasmedge" is invalid: handler: Invalid value: "io.containerd.wasmedge.v1": a lowercase RFC 1123 label must consist of lower case alphanumeric characters or '-', and must start and end with an alphanumeric character (e.g. 'my-name', or '123-abc', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?')
```
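The rejection can be reproduced with the exact regex the error message cites; a minimal Go sketch:

```go
package main

import (
	"fmt"
	"regexp"
)

// rfc1123Label is the validation regex quoted in the Kubernetes error
// above: a lowercase RFC 1123 label. Dots are not allowed, which is why
// a shimv2 name cannot be used directly as a RuntimeClass handler.
var rfc1123Label = regexp.MustCompile(`^[a-z0-9]([-a-z0-9]*[a-z0-9])?$`)

func main() {
	for _, h := range []string{"wasmedge", "io.containerd.wasmedge.v1"} {
		fmt.Printf("%-26s valid=%v\n", h, rfc1123Label.MatchString(h))
	}
}
```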
@corhere When I can't have `io.containerd.wasmedge.v1` as the handler value, which valid value can I pass to Docker as the `runtime` that is explicitly or implicitly mapped to `io.containerd.wasmedge.v1`? (To run this example on Kubernetes.)
> That leads to the error that the handler is not valid
That's unfortunate. I misread the Kubernetes docs and overlooked that dots are disallowed in handler strings. So there will need to be a way to configure aliases for shimv2 runtimes in the daemon to unblock this.
I've opened https://github.com/moby/moby/issues/45030 to track any necessary daemon-side work.
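For a sense of what such an alias might look like, here is a hedged daemon.json sketch, assuming the daemon-side work lands with a shape similar to the existing `runtimes` config (the `runtimeType` field name is illustrative, not a shipped schema at the time of this thread):

```json
{
  "runtimes": {
    "wasmedge": {
      "runtimeType": "io.containerd.wasmedge.v1"
    }
  }
}
```

A RuntimeClass could then use the RFC 1123-safe handler name `wasmedge`, while the daemon resolves it to the shimv2 runtime.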
> `docker run --runtime io.containerd.kata.v2` works today with v23.0.1. This thread is the user request. The only thing hypothetical is support from cri-dockerd. Are there any other blockers on the daemon side aside from the annotation?
That is a blocker, yes.
The other blocker is "other than a single query for discussion six months ago, there was no work done on this because the amount of development time for it is small, and 'nice to have' questions for discussion about eventual implementation with no concrete use case don't get prioritized."
Patches welcome. When the daemon-side work is done would be a good time. Not to put too fine a point on it, but the speed of development here is often driven by the fact that there's only a very small amount of people writing patches, and the discrete needs (rather than asks, mostly) tend to get the focus.
moby/moby#45032 got out really fast (thanks!), but it's a poor solution for us, even after moby/moby#45032 is applied. cri-dockerd doesn't have a meaningful way to actually figure out what the backing runtime is. `api.types.Runtime` isn't surfaced in the client. cri-dockerd can pass a bare string, but that's potentially just an alias, and we have no real idea whether or not we should add them at all, despite the ability to get/set them. It's still more or less blind.

It's true that this more or less matches the containerd experience, and it's a fine starting point, but it's also true that we have room to do a little bit better even by surfacing `daemon.config.Config.Runtimes` up to the client.
The daemon runtimes config is surfaced to the client at `GET /info`:

```
# curl -s --unix-socket /var/run/docker.sock http://./v1.42/info | jq .Runtimes
{
  "crun": {
    "runtimeType": "io.containerd.runc.v2",
    "options": {
      "BinaryName": "/usr/local/bin/crun"
    }
  },
  "runc": {
    "path": "runc"
  }
}
```
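A minimal Go sketch of decoding that `Runtimes` payload on the client side. The struct here is hand-rolled for illustration and follows the JSON shape shown above; it is not the daemon's own type, which (as noted in this thread) isn't surfaced in the Go client:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// runtimeConfig models one entry of the Runtimes map returned by GET /info.
// Field names follow the JSON shown above; illustrative only.
type runtimeConfig struct {
	Path        string                 `json:"path,omitempty"`
	RuntimeType string                 `json:"runtimeType,omitempty"`
	Options     map[string]interface{} `json:"options,omitempty"`
}

// parseRuntimes decodes the .Runtimes payload from /info.
func parseRuntimes(data []byte) (map[string]runtimeConfig, error) {
	var runtimes map[string]runtimeConfig
	if err := json.Unmarshal(data, &runtimes); err != nil {
		return nil, err
	}
	return runtimes, nil
}

func main() {
	// The same payload as the curl output above.
	payload := []byte(`{
	  "crun": {"runtimeType": "io.containerd.runc.v2",
	           "options": {"BinaryName": "/usr/local/bin/crun"}},
	  "runc": {"path": "runc"}
	}`)
	runtimes, err := parseRuntimes(payload)
	if err != nil {
		panic(err)
	}
	fmt.Println(runtimes["crun"].RuntimeType) // io.containerd.runc.v2
	fmt.Println(runtimes["runc"].Path)        // runc
}
```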
Yes, but speaking of potential RPC overhead, it's not exactly GraphQL, and there are a lot of fields to serialize/deserialize if we potentially hit that every single time a container is scheduled with some custom `RuntimeClass`. Or we take the risk of caching it aside in cri-dockerd and hoping that it stays valid.
Any update?

I found that `pod.RuntimeClass` is never set to anything other than runc; it is always runc.

The default-runtime is runc, and the runtime list:

```
# curl -s --unix-socket /var/run/docker.sock http://./v1.42/info | jq .Runtimes
{
  "io.containerd.runc.v2": {
    "path": "runc"
  },
  "nvidia": {
    "path": "nvidia-container-runtime"
  },
  "runc": {
    "path": "runc"
  }
}
```
```yaml
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: docker
---
apiVersion: v1
kind: Pod
metadata:
  ....
spec:
  runtimeClassName: nvidia
```

```
$ docker inspect be4d | grep -i runtime
        "Runtime": "runc",
```
Docker recently started supporting alternative runtimes with the `--runtime` flag. This allows users to use runtimes other than runc, such as Kata (which I'd like to use on some pods due to the additional benefits of kernel isolation on external-facing applications). Additionally, cri-dockerd support was just merged into K3s, allowing users to continue to use Docker after upgrading to Kubernetes 1.24 and above.

Unfortunately, those who choose to keep using Docker with Kubernetes will be constrained by the lack of alternative runtime support, meaning native Kubernetes support in the form of RuntimeClasses is made redundant, as the CRI layer is a bottleneck.

I think adding support for passing through the requested runtime should not be too difficult, so I was wondering if it's something that could be implemented?