containers / podman

Podman: A tool for managing OCI containers and pods.
https://podman.io
Apache License 2.0

[Feature]: Implement the shimv2 protocol into C to support newer versions of Kata Containers #17070

Closed raballew closed 3 months ago

raballew commented 1 year ago

Feature request description

The bottom line is that Podman knows how to exec an OCI runtime based on runc. It can work with runc, crun, krun, gVisor and kata (v1). No one has implemented an OCI runtime that talks the shimv2 protocol, which leaves users with old versions of Kata Containers. These versions are no longer supported, and their use is highly discouraged by the maintainers (https://github.com/kata-containers/kata-containers/issues/722#issuecomment-842556935).

The goal is to make the first step to achieve parity in Kata Container workflows where podman is used locally during development and deployed on a cluster for production purposes.

Suggest potential solution

No response

Have you considered any alternatives?

The only potential alternative is using Kata Containers v1 which still supports the earlier CRI interface as described here https://github.com/kata-containers/kata-containers/issues/1133#issuecomment-731153881

There is also an open issue to re-add OCI CLI command for Podman at Kata Containers but there seems to be no progress so far (https://github.com/kata-containers/kata-containers/issues/722).

Additionally, Kata Containers v1 is no longer supported and its use is highly discouraged by its maintainers, leaving no viable alternative.

Additional context

shimv2 API is similar but not equal to the OCI runtime API as it provides different verbs during the container life cycle. While both are used to manage containers, the value added by shimv2 is that it can launch both Pods (multiple containers) and OCI compatible containers with a single runtime shim.

Because shimv2 is not a standard and is quite specific to containerd, it was decided not to implement special-case code in Podman, as described here https://github.com/containers/podman/issues/8579

giuseppe commented 1 year ago

IMO this should be done with a different program that behaves like an OCI runtime and it talks internally to kata using the shimv2 protocol. In this way it can be used transparently with Podman without requiring any change.

I would not add the shimv2 protocol to Podman itself for a couple of reasons: 1) it adds more dependencies that affect the binary size, and 2) it is a containerd thing, it is not a standard.
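The separate-program idea above can be sketched as a thin translation layer: take a runc-style command line and turn it into a call on the shim's Task service. The following is a minimal sketch under stated assumptions, not an existing tool. The verb-to-RPC pairing, the `plan_call` helper, and the request dictionary shape are all hypothetical; only the verb names (from the runc CLI) and the RPC names (from containerd's shimv2 Task service) are taken from the real interfaces. Flags such as `--bundle` are ignored here.

```python
# Hypothetical mapping from runc-style CLI verbs to containerd shimv2
# Task service RPC names. The pairing is an illustrative assumption.
OCI_TO_SHIMV2 = {
    "create": "Create",
    "start": "Start",
    "state": "State",
    "kill": "Kill",
    "delete": "Delete",
    "exec": "Exec",
    "pause": "Pause",
    "resume": "Resume",
    "ps": "Pids",
    "update": "Update",
}


def plan_call(argv):
    """Translate [verb, container_id, ...] into an (rpc_name, request) pair.

    The request is a plain dict standing in for the protobuf message that
    a real client would have to build (assumption).
    """
    if len(argv) < 2:
        raise ValueError("usage: <verb> <container-id> [args...]")
    verb, container_id, *rest = argv
    if verb not in OCI_TO_SHIMV2:
        raise ValueError(f"unsupported verb: {verb}")
    request = {"id": container_id}
    if verb == "kill":
        # runc defaults to SIGTERM when no signal is given.
        request["signal"] = rest[0] if rest else "SIGTERM"
    return OCI_TO_SHIMV2[verb], request
```

With something like this in front of the shim, Podman would keep invoking `create`, `start`, `state`, `kill` and `delete` as it does for any OCI runtime, and only the wrapper would need to know about shimv2.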

mheon commented 1 year ago

I tend to agree. The Kata runtime seems like it could provide an OCI-compatible CLI interface, which would resolve this problem.

rhatdan commented 1 year ago

Or someone could create a CLI library that can talk to shim v2 and use crun to parse the OCI Runtime Specification. Link them together and create crun-kata.

adrecord commented 1 year ago

From an outsider's perspective, I agree that it makes sense for this to be a separate program that presents an OCI compliant runtime, but talks shimv2. Some comments (1, 2) on the kata issue tracker indicate that @c3d and @dgibson had been working on it, at least in the past.

Like others, I was burned by kata's breaking change to no longer support the OCI runtime spec. I'm now using krun (crun built with libkrun support), which gives me similar VM isolation via an OCI runtime, the way kata 1.x worked. If anyone sees this issue wishing podman worked with kata, you may wanna give that a try.

c3d commented 1 year ago

This is the fourth "notification" I receive about Kata Containers + podman this week. Probably time to revive my effort on writing an OCI-compliant wrapper for the shimv2 interface (what we called "ociplex" in the past).

dgibson commented 1 year ago

> This is the fourth "notification" I receive about Kata Containers + podman this week. Probably time to revive my effort on writing an OCI-compliant wrapper for the shimv2 interface (what we called "ociplex" in the past).

Yeah, might be. I really have zero interest in working on this, but you're welcome to the scant bits of progress I made when I was working on it.

> Like others, I was burned by kata's breaking change to no longer support the OCI runtime spec. I'm now using krun (crun built with libkrun support), which gives me similar VM isolation via an OCI runtime, the way kata 1.x worked. If anyone sees this issue wishing podman worked with kata, you may wanna give that a try.

Overall, I think libkrun has a considerably better thought out design than Kata. Note however that there is a difference in model: Kata works on a one-VM-per-pod model whereas libkrun works on a one-VM-per-container model. libkrun's (very clever) in-kernel socket interception stuff makes it possible to do that with mostly similar semantics, but there are likely quite a few edge cases where they will behave differently. Perhaps more importantly, the performance characteristics may be significantly different between the two for pods with more than one container.
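To make the difference in model concrete, here is a small sketch that counts how many VMs each model would start for the same workload. The `vm_count` helper and the model names are made up for illustration; the sketch only restates the one-VM-per-pod versus one-VM-per-container distinction described above and says nothing about actual resource usage.

```python
def vm_count(containers_per_pod, model):
    """Count the VMs needed for a workload under a given isolation model.

    containers_per_pod: list with the number of containers in each pod.
    model: "per-pod" (the Kata model) or "per-container" (the libkrun model).
    """
    if model == "per-pod":
        return len(containers_per_pod)
    if model == "per-container":
        return sum(containers_per_pod)
    raise ValueError(f"unknown model: {model}")


# Three pods holding 1, 3 and 2 containers respectively.
workload = [1, 3, 2]
per_pod = vm_count(workload, "per-pod")
per_container = vm_count(workload, "per-container")
```

For single-container pods the two counts are equal, which is why the difference mostly matters for pods with more than one container.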

c3d commented 1 year ago

I spent a little bit of additional time on this today. Here is the summary of my findings:

  1. The youki project, which @dgibson and I started with, does not seem to support rootless yet. I don't get the exact same message, instead I get OCI runtime error: youki: Error: metadata for /run/youki does not possess the expected attributes. With root, I got it working.
  2. Interfacing with the shimv2 interface requires support for Google protobuf. There is a C version of protobuf, weirdly enough, so that seems like a good starting point. It took me a little while to get the right sequence of include paths to get it to build the protobuf interface for the shim, but I got this working.
  3. The part that concerns me a bit is that the shimv2 interface mandates a --publish-binary option to publish events. It's not very clear to me how that particular part of the protocol works, but it seems asynchronous in nature, and therefore at odds with the existing command-line interface. My guess is that the binary will have to pass itself as the --publish-binary and wait for events to turn an async protocol into a synchronous one. So unlike the krun case, I'm not entirely sure we can do that with a shared lib that we load on demand, although the crun option parsing seems flexible enough to allow me to use crun as the binary. This needs further investigation.

rhatdan commented 1 year ago

@giuseppe @flouthoc FYI

dgibson commented 1 year ago

> I spent a little bit of additional time on this today. Here is the summary of my findings:
>
> 1. The `youki` project, which @dgibson and I started with, does not seem to [support rootless yet](https://github.com/containers/youki/pull/1171). I don't get the exact same message, instead I get `OCI runtime error: youki: Error: metadata for /run/youki does not possess the expected attributes`. With root, I got it working.

Not all that surprising. Note that I wasn't anticipating using the guts of youki at all - pretty much just its command line parsing code (which is pretty well contained, particularly after some changes I pushed).

> 2. Interfacing with the shimv2 interface requires support for Google protobuf. There is a [C version of protobuf](https://github.com/protobuf-c/protobuf-c), weirdly enough, so that seems like a good starting point. It took me a little while to get the right sequence of include paths to get it to build the protobuf interface for the shim, but I got this working.

AIUI, the Kata runtime already uses Google protobuf / gRPC to talk to the agent. We should be able to use the same library to talk shimv2.

> 3. The part that concerns me a bit is that the [shimv2 interface](https://github.com/containerd/containerd/tree/main/runtime/v2) mandates a `--publish-binary` option to publish events. It's not very clear to me how that particular part of the protocol works, but it seems asynchronous in nature, and therefore at odds with the existing command-line interface. My guess is that the binary will have to pass itself as the `--publish-binary` and wait for events to turn an async protocol into a synchronous one.

Well.. it's "mandated" in the sense that a shimv2 server must support the option but not, AFAICT, in the sense that a shimv2 client must use it. We should be in the latter case, so I think we can just ignore it.

> So unlike the `krun` case, I'm not entirely sure we can do that with a shared lib that we load on demand, although the `crun` option parsing seems flexible enough to allow me to use `crun` as the binary. This needs further investigation.

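If a client really can ignore `--publish-binary`, one way a synchronous CLI wrapper might avoid the event stream is to poll the shim for the container status until it reaches the expected value. The sketch below shows only that polling loop. It is an assumption about how such a wrapper could work, not something described by the shimv2 documentation: `get_status` is a hypothetical callable standing in for a shimv2 `State` request, and the status strings are placeholders.

```python
import time


def wait_for_status(get_status, wanted, timeout=10.0, interval=0.1):
    """Poll get_status() until it returns `wanted` or the timeout expires.

    get_status is a hypothetical callable that would wrap a `State` request
    to the shim; it takes no arguments and returns a status string.
    """
    deadline = time.monotonic() + timeout
    while True:
        status = get_status()
        if status == wanted:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"status is still {status!r}, expected {wanted!r}"
            )
        time.sleep(interval)
```

A `start` command could then issue the `Start` request and call `wait_for_status(query, "running")` before returning, at the cost of extra round trips compared to an event-driven design.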
github-actions[bot] commented 1 year ago

A friendly reminder that this issue had no activity for 30 days.

rhatdan commented 1 year ago

Any updates?

c3d commented 1 year ago

Well, I have mostly focused on reviving ociplex, which is a different approach using a Rust wrapper derived from youki. At least for now, the C variant is on hold.

On the ociplex side, I implemented the totality of the CLI parsing for documented runc commands, as well as one that is not publicly documented (runc features), and the symmetric "CLI backend". I also implemented the skeleton of the ShimV2 back-end, and found a crate that seems to be doing most of what I want.

I also ran various experiments to try and figure out the exact interaction between containerd and the runtimes, because I saw a number of differences testing with ctr, notably when there is an error along the way. At the moment, for example, I am a bit at a loss regarding the correct way to clean up if something fails during container start. In that scenario, Kata Containers seems to leave some processes running behind (qemu and virtiofsd), but I found no proper way to shut them down other than killing them.

I also tried unsuccessfully to strace the runtime to check which files and sockets were opened, and in which order. This is because in the experiments described above, there was at least one case where it seemed like the runtime engine did not consistently provide the socket file path to the shim v2, and I wanted to see where the defaults came from. I was unsuccessful because if I write a script that wraps the runtime, things work fine, but if that script runs strace on the process, then everything seems to hang and it's unclear why. The strace files show several processes that appear to be in a non-busy wait loop (e.g. I will see repeated epoll or futex syscalls), but I don't know exactly what they are trying to do. Clearly, as a user, I can't interact with the container (even after waiting for hours).
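The kind of wrapper described above can be sketched as a small script that is installed in place of the runtime and execs the real binary, optionally under `strace`. Everything specific here is an assumption made for illustration: the `REAL_RUNTIME` path, the `TRACE_DIR` location, the `TRACE_RUNTIME` environment variable, and the helper names. The sketch does not address the hang reported when tracing is enabled.

```python
import os
import sys

REAL_RUNTIME = "/usr/local/bin/real-runtime"  # hypothetical path to the wrapped binary
TRACE_DIR = "/tmp/runtime-trace"              # hypothetical log directory


def build_command(args, trace=False, pid=None):
    """Build the argv used to run the real runtime, optionally under strace."""
    pid = os.getpid() if pid is None else pid
    cmd = [REAL_RUNTIME, *args]
    if trace:
        log = os.path.join(TRACE_DIR, f"strace.{pid}.log")
        # -f follows child processes, -o writes the trace to a file.
        cmd = ["strace", "-f", "-o", log, *cmd]
    return cmd


def exec_runtime(args):
    """Replace the current process with the (possibly traced) runtime."""
    os.makedirs(TRACE_DIR, exist_ok=True)
    trace = os.environ.get("TRACE_RUNTIME") == "1"  # hypothetical switch
    cmd = build_command(args, trace=trace)
    os.execvp(cmd[0], cmd)
```

A wrapper script would call `exec_runtime(sys.argv[1:])` as its last step, so the container engine sees the same exit status and standard streams as if it had run the real binary directly.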

The current work is on implementing the shim v2 interface from ociplex and see where that leads.

struanb commented 1 year ago

I'm sharing the following in case it's helpful to anyone who, like us, needed VM isolation for their container workloads, a virtiofs-based solution (ruling out Kata v1) and compatibility with docker run or podman run (ruling out Kata v2).

Due to issues like this one, experienced using Docker/Podman CLI to launch Kata Containers, we built RunCVM (Run Container VM): an experimental open-source Docker container runtime, for launching standard container workloads in VMs.

Please note that RunCVM is not a direct competitor to Kata: as an experimental runtime, RunCVM cannot offer the same levels of stability and support as Kata. However RunCVM may be suitable for some use cases and is compatible with docker run today (with experimental support for podman run). Like Kata v2, RunCVM is also virtiofs-based for speed. RunCVM has minimal system dependencies: it relies on the Linux KVM module, and can even be installed in a GitHub Codespace.

rhatdan commented 1 year ago

Does it support the same CLI as runc and crun?

struanb commented 1 year ago

> Does it support the same CLI as runc and crun?

Yes, it piggybacks RUNC to make KVM launch inside a container.

Luap99 commented 3 months ago

I am closing this, as I really do not see us doing this in Podman anyway. If someone makes a CLI-compatible OCI runtime, it should work without Podman changes.

marc-gizmo commented 2 months ago

Hello, I'm confused about the Kata/Podman development and the documentation.

If I understand correctly, the Kata runtime will no longer be supported by Podman, at least not directly. But can I use the CRI interface with Podman to run Kata Containers?

My understanding from https://docs.redhat.com/en/documentation/openshift_container_platform/3.11/html/cri-o_runtime/use-crio-engine is that I can use Podman with Kata using the following stack:

Podman -> CRI-O -> KATA

The following part of the documentation seems to imply that it would work:

There is little need for direct command-line contact with CRI-O. However, to provide full access to CRI-O for testing and monitoring, and to provide features you expect with Docker that CRI-O does not offer, a set of container-related command-line tools are available. These tools replace and extend what is available with the docker command and service. Tools include:

crictl - For troubleshooting and working directly with CRI-O container engines
runc - For running container images
podman - For managing pods and container images (run, stop, start, ps, attach, exec, etc.) outside of the container engine

I understand that CRI-O officially supports the Kata runtime; what confuses me is the relation between CRI-O and Podman (and maybe the overlap/complementarity of the crictl/runc tools).

If Podman does not support CRI-O in the way I understood, should Podman be dropped completely in favor of containerd or crictl to run Kata Containers?

Regards, Marc

mheon commented 2 months ago

You are incorrect. Podman does not use CRI-O as a backend; we have our own backend code. Podman cannot interact with the CRI at all. We interact with OCI runtimes directly. Until Kata creates an OCI compatible runtime, Podman will not be able to support it.

rhatdan commented 2 months ago

If you want similar functionality to Kata take a look at krun.

struanb commented 2 months ago

Thanks for the krun tip, it looks interesting.

Also for similar functionality to Kata, look at https://github.com/newsnowlabs/runcvm, which has experimental support for Podman.