Open jiangliu opened 1 year ago
I like the idea of not adding additional communication channels.
What kind of things do you expect to be hypervisor-specific that led you to propose a hypervisor plugin in step 5? At a first glance, it seems like the generic fs/block-device sharing would suffice; I'm curious to hear what else would be needed.
FYI, there were efforts to extend the containerd Mount structure: https://github.com/containerd/containerd/issues/7055 https://github.com/containerd/containerd/pull/6746
As I understand it, that containerd requirement isn't necessary for the PeerPod case: we can have runc pods use a normal snapshotter and PeerPods use the remote snapshotter, as long as we don't use the remote snapshotter for runc pods or any kata pods on the host. @jiangliu
Hey @jiangliu - thanks for this proposal - if I understand it correctly then it looks good. I guess my initial concerns are:
On the cri-o support I might have a workaround that we can use in peer pods:
If I understand correctly then the 'image-offload' approach works by adding an optional `ExtraOption` message to the `CreateContainer` request. I'm assuming that these are pretty similar, information-wise, to the `PullImageRequest` we currently have as a separate endpoint, and will do some pretty similar logic to the current image pull in CCv0. Assuming this is correct, it seems like the only extra thing the remote snapshotter would do is append those `ExtraOption`s for image offload to the `CreateContainerRequest` body. This makes me wonder if (similar to the proxyless proposal), in the cloud-api-adaptor's logic, before we send the `CreateContainer` request to the kata-agent, we could look at it and check whether it has the `ExtraOption` set, and if not append it and send it on to the kata-agent. That way the remote-snapshotter and the cloud-api-adaptor without a snapshotter would result in the same message to the kata-agent.
Does that make sense, or have I got something wrong?
I drew the flow for the containerd and cri-o cases below, after discussing with @stevenhorsman @magowan and @jiangliu:
Regarding the cri-o case: I think it can actually be closer to the containerd workflow. The fact is that we already have kata-specific code in cri-o itself to manage the kata runtime. Changing this code to add some extra options to the CreateContainer request is pretty easy, and would probably be accepted by the cri-o community as it would not change the behaviour of any other containers - only kata ones.
The proposal we are working on, suggested by the cri-o maintainers, is to let cri-o pull the image in a separate volume, then share this volume to the remote VM that can do attestation on the container image, and use it. We could use this ExtraOption mechanism to provide the needed information for mounting this remote volume. It would require limited change to cri-o and/or the kata runtime to setup the volume share and add the ExtraOptions, but then I think we can use the current proposal pretty much as is.
Here is what it would look like - please share any comments.
I like the idea of not adding additional communication channels.
What kind of things do you expect to be hypervisor-specific that led you to propose a hypervisor plugin in step 5? At a first glance, it seems like the generic fs/block-device sharing would suffice; I'm curious to hear what else would be needed.
For example, we implement image data lazy loading in virtio-fs/blk backend drivers, or encrypt/decrypt cached image data in virtio backend drivers.
Regarding the cri-o case: I think it can actually be closer to the containerd workflow. The fact is that we already have kata-specific code in cri-o itself to manage the kata runtime. Changing this code to add some extra options to the CreateContainer request is pretty easy, and would probably be accepted by the cri-o community as it would not change the behaviour of any other containers - only kata ones.
The proposal we are working on, suggested by the cri-o maintainers, is to let cri-o pull the image in a separate volume, then share this volume to the remote VM that can do attestation on the container image, and use it. We could use this ExtraOption mechanism to provide the needed information for mounting this remote volume. It would require limited change to cri-o and/or the kata runtime to setup the volume share and add the ExtraOptions, but then I think we can use the current proposal pretty much as is.
Here is what it would look like - please share any comments.
That's almost the same as the containerd + nydus-snapshotter case. The nydus-snapshotter actually executes three commands to generate a raw disk image from the image layer tar files:
nydus-image create --type tar-tarfs --bootstrap /var/lib/containerd-nydus/snapshots/7/fs/image/layer.boot.tarfs.tmp --blob-id 4a85ce26214d83c77b5464631a67c71e1c2793b655261befe52ba0e20ffc3bd1 --blob-dir /var/lib/containerd-nydus/cache /var/lib/containerd-nydus/snapshots/7/fs/layer_7_tar.fifo
nydus-image merge --bootstrap /var/lib/containerd-nydus/snapshots/7/fs/image/image.boot.tarfs.tmp /var/lib/containerd-nydus/snapshots/1/fs/image/layer.boot /var/lib/containerd-nydus/snapshots/2/fs/image/layer.boot /var/lib/containerd-nydus/snapshots/3/fs/image/layer.boot /var/lib/containerd-nydus/snapshots/4/fs/image/layer.boot /var/lib/containerd-nydus/snapshots/5/fs/image/layer.boot /var/lib/containerd-nydus/snapshots/6/fs/image/layer.boot /var/lib/containerd-nydus/snapshots/7/fs/image/layer.boot
nydus-image export --block --localfs-dir /var/lib/containerd-nydus/cache --bootstrap /var/lib/containerd-nydus/snapshots/7/fs/image/image.boot --output /var/lib/containerd-nydus/cache/4a85ce26214d83c77b5464631a67c71e1c2793b655261befe52ba0e20ffc3bd1.image.disk.tarfs.tmp --verity
nydus image export command, stdout: dm-verity options: --no-superblock --format=1 -s \"\" --hash=sha256 --data-block-size=512 --hash-block-size=4096 --data-blocks 379918 --hash-offset 194519040 5676d0482c2128e1ae1f59d4b1f253de57158856471b7b5a1bfc0501823bd632
That's one way to generate a separate volume :) And it's possible to integrate this functionality into the erofs-utils package in the future.
Hey @jiangliu - thanks for this proposal - if I understand it correctly then it looks good. I guess my initial concerns are:
- Is there a dependency on the containerd issue & PR you linked as they look quite old and I don't want to swap one containerd fork for another one!
No, it doesn't depend on those two issues. That would just be one more step to formalize the `ExtraOptions` in the containerd community.
- Cri-o support as they don't have the remote snapshotter
On the cri-o support I might have a workaround that we can use in peer pods: If I understand correctly then the 'image-offload' approach works by adding an optional `ExtraOption` message to the `CreateContainer` request. I'm assuming that these are pretty similar, information-wise, to the `PullImageRequest` we currently have as a separate endpoint, and will do some pretty similar logic to the current image pull in CCv0. Assuming this is correct, it seems like the only extra thing the remote snapshotter would do is append those `ExtraOption`s for image offload to the `CreateContainerRequest` body. This makes me wonder if (similar to the proxyless proposal), in the cloud-api-adaptor's logic, before we send the `CreateContainer` request to the kata-agent, we could look at it and check whether it has the `ExtraOption` set, and if not append it and send it on to the kata-agent. That way the remote-snapshotter and the cloud-api-adaptor without a snapshotter would result in the same message to the kata-agent.
Does that make sense, or have I got something wrong?

I feel we have a solution for it now :)
@jiangliu, @arronwy,
I really would like to have this crystal clear: does this solution, for the local case, depend on virtio-fs / virtio-9p? If so, why?
Depending on virtio-fs / virtio-9p would be a no-go for us, as it'd generate issues with:
- all the current TEE approaches, as virtio-fs is not supported;
- using Cloud Hypervisor, as 9p support will never ever be implemented there;
- and, realistically speaking, 9p should not even be considered as a solution, as modern distros have been dropping it, and it's always been a security nightmare.
I just would like to make sure this is well known, and that it's taken into consideration when deciding on the path we'll be taking.
Regarding the cri-o case: I think it can actually be closer to the containerd workflow. The fact is that we already have kata-specific code in cri-o itself to manage the kata runtime. Changing this code to add some extra options to the CreateContainer request is pretty easy, and would probably be accepted by the cri-o community as it would not change the behaviour of any other containers - only kata ones.
The proposal we are working on, suggested by the cri-o maintainers, is to let cri-o pull the image in a separate volume, then share this volume to the remote VM that can do attestation on the container image, and use it. We could use this ExtraOption mechanism to provide the needed information for mounting this remote volume. It would require limited change to cri-o and/or the kata runtime to setup the volume share and add the ExtraOptions, but then I think we can use the current proposal pretty much as is.
Here is what it would look like - please share any comments.
If you're going to pull from the host in cri-o, you may as well do the same in containerd. This looks very similar to what I proposed in slack (https://cloud-native.slack.com/archives/C050S70D9SR/p1686914484876879?thread_ts=1682361194.716389&cid=C050S70D9SR) for the remote-snapshotter.
Once there's a way to expose a disk to a remote VM, the snapshotter can use this functionality as well and the two workflows will be almost identical.
I think sharing a volume on the host to a nested VM is easier than to a remote VM. What about we set it as an enhancement direction for the CRI-O case for remote VMs, and target the 1st goal as below: @jiangliu @stevenhorsman @littlejawa @wedsonaf @fidencio
@jiangliu, @arronwy,
I really would like to have this crystal clear: does this solution, for the local case, depend on virtio-fs / virtio-9p? If so, why?
Depending on virtio-fs / virtio-9p would be a no-go for us, as it'd generate issues with:
- all the current TEE approaches, as virtio-fs is not supported;
- using Cloud Hypervisor, as 9p support will never ever be implemented there;
- and, realistically speaking, 9p should not even be considered as a solution, as modern distros have been dropping it, and it's always been a security nightmare.
I just would like to make sure this is well known, and this is taken into consideration when deciding on the path we'll be taking.
This proposal is designed for two cases: 1) sharing images on the host for CoCo - this does not depend on virtio-fs/9p; 2) providing images for normal Kata containers - all of virtio-fs/virtio-blk/virtio-pmem will be supported. We prefer virtio-fs/virtio-pmem due to DAX, which is critical for our use cases.
I think sharing a volume on the host to a nested VM is easier than to a remote VM. What about we set it as an enhancement direction for the CRI-O case for remote VMs, and target the 1st goal as below: @jiangliu @stevenhorsman @littlejawa @wedsonaf @fidencio
Yeah, creating a volume for peer pods may require invoking cloud OpenAPIs with a complex flow, or multi-attach-capable cloud storage.
Another point I'd like to clarify: none of this should be specific to Nydus, as the comments in the `ExtraOption` snippet seem to suggest. I think we should have generic options prefixed with `io.katacontainers` like in https://github.com/kata-containers/kata-containers/pull/7106 -- this allows for a generic implementation without the need for snapshotter-specific code sprinkled throughout the codebase.
Another point I'd like to clarify: none of this should be specific to Nydus, as the comments in the `ExtraOption` snippet seem to suggest. I think we should have generic options prefixed with `io.katacontainers` like in kata-containers/kata-containers#7106 -- this allows for a generic implementation without the need for snapshotter-specific code sprinkled throughout the codebase.
I feel that's not a blocking issue. Currently the extension is nydus-specific; we should cooperate to make it a Kata extension with a neutral prefix, if we can make the extension generic enough.
@jiangliu I finally have some time to look at this proposal. In particular I am interested in what tests we will be pulling from CCv0 and what will be implemented.
I would like to understand how the 'sharing images on host for CoCo' case is going to deal with encrypted images. The image layers will be mounted as encrypted disks on the guest, is that it? Then kata-agent delegates decryption of the layers to image-rs?
Recently there have been several proposals about image management for CoCo from the community. This is a refined version of [RFC] Proposal for Image Management on CoCo, which aims to cover both host sharing and peer pods.
Scenarios
Design
The core idea is to simplify the design by using the kubelet/containerd `Mount.Option` field, instead of building a new communication channel such as a CRI-proxy or the [RFC] WIP: Proxyless Image-pull on guest proposal for peer pods. We add an `extraoption` field to `Mount.Info` by extending the existing ExtraOption mechanism, as follows:

The overall architecture looks like:

The above architecture works for both Kata and CoCo as:
- The remote snapshotter returns a `Mount` array to containerd with the `ExtraOption` field.
- containerd passes the `Mount` array returned by the remote snapshotter to the kata runtime.
- The hypervisor plugin handles the `Mount` array. Different hypervisor plugins may have different capabilities; for example, qemu may only support `raw-image-block` and `raw-layer-block`, while Dragonball may support all of them. A hypervisor may share data from host to guest through virtio-blk/pmem/fs.
- The kata-agent may handle the `ExtraOption` itself, or call image-rs to handle it.

Proposal for Peer Pods
We need to introduce a remote snapshotter and enhance kata-runtime for Peer Pods.
Proposal for Host sharing through block device
Cooperation
This proposal can be combined with Image pulling on the host to define a generic protocol for Kata Containers, with support of virtio-blk/mem/fs and Kata 2.x/3.x.