Open jiangliu opened 1 year ago
I like the idea of not adding additional communication channels.
What kind of things do you expect to be hypervisor-specific that led you to propose a hypervisor plugin in step 5? At a first glance, it seems like the generic fs/block-device sharing would suffice; I'm curious to hear what else would be needed.
FYI, there were efforts to extend the containerd Mount structure: https://github.com/containerd/containerd/issues/7055 https://github.com/containerd/containerd/pull/6746
As I understand it, that containerd requirement isn't necessary for the PeerPod case: we can have runc pods use a normal snapshotter and PeerPods use the remote snapshotter, as long as we don't use the remote snapshotter for runc pods or any kata pods on the host. @jiangliu
Hey @jiangliu - thanks for this proposal - if I understand it correctly then it looks good. I guess my initial concerns are:
On the cri-o support I might have a workaround that we can use in peer pods:
If I understand correctly then the 'image-offload' approach works by adding an optional `ExtraOption` message to the `CreateContainer` request. I'm assuming that these are pretty similar, information-wise, to the `PullImageRequest` we currently have as a separate endpoint, and will do some pretty similar logic to the current image pull in CCv0. Assuming this is correct, it seems like the only extra thing the remote snapshotter would do is append those `ExtraOption`s for image offload to the `CreateContainerRequest` body. This makes me wonder if (similar to the proxyless proposal), in the cloud-api-adaptor's logic, before we send the `CreateContainer` request to the kata-agent, we could look at it and check whether it has the `ExtraOption` set, and if not append it and send it on to the kata-agent. That way the remote-snapshotter and the cloud-api-adaptor without a snapshotter would result in the same message to the kata-agent.
Does that make sense, or have I got something wrong?
I drew the flow for the containerd and cri-o cases below, after discussing with @stevenhorsman @magowan and @jiangliu:
Regarding the cri-o case: I think it can actually be closer to the containerd workflow. The fact is that we already have kata-specific code in cri-o itself to manage the kata runtime. Changing this code to add some extra options to the CreateContainer request is pretty easy, and would probably be accepted by the cri-o community as it would not change the behaviour of any other containers - only kata ones.
The proposal we are working on, suggested by the cri-o maintainers, is to let cri-o pull the image in a separate volume, then share this volume to the remote VM that can do attestation on the container image, and use it. We could use this ExtraOption mechanism to provide the needed information for mounting this remote volume. It would require limited change to cri-o and/or the kata runtime to setup the volume share and add the ExtraOptions, but then I think we can use the current proposal pretty much as is.
Here is what it would look like - please share any comments.
I like the idea of not adding additional communication channels.
What kind of things do you expect to be hypervisor-specific that led you to propose a hypervisor plugin in step 5? At a first glance, it seems like the generic fs/block-device sharing would suffice; I'm curious to hear what else would be needed.
For example, we implement image data lazy loading in virtio-fs/blk backend drivers, or encrypt/decrypt cached image data in virtio backend drivers.
Regarding the cri-o case: I think it can actually be closer to the containerd workflow. The fact is that we already have kata-specific code in cri-o itself to manage the kata runtime. Changing this code to add some extra options to the CreateContainer request is pretty easy, and would probably be accepted by the cri-o community as it would not change the behaviour of any other containers - only kata ones.
The proposal we are working on, suggested by the cri-o maintainers, is to let cri-o pull the image in a separate volume, then share this volume to the remote VM that can do attestation on the container image, and use it. We could use this ExtraOption mechanism to provide the needed information for mounting this remote volume. It would require limited change to cri-o and/or the kata runtime to setup the volume share and add the ExtraOptions, but then I think we can use the current proposal pretty much as is.
Here is what it would look like - please share any comments.
That's almost the same as the containerd + nydus-snapshotter case. The nydus-snapshotter actually executes three commands to generate a raw disk image from the image layer tar files:
nydus-image create --type tar-tarfs --bootstrap /var/lib/containerd-nydus/snapshots/7/fs/image/layer.boot.tarfs.tmp --blob-id 4a85ce26214d83c77b5464631a67c71e1c2793b655261befe52ba0e20ffc3bd1 --blob-dir /var/lib/containerd-nydus/cache /var/lib/containerd-nydus/snapshots/7/fs/layer_7_tar.fifo
nydus-image merge --bootstrap /var/lib/containerd-nydus/snapshots/7/fs/image/image.boot.tarfs.tmp /var/lib/containerd-nydus/snapshots/1/fs/image/layer.boot /var/lib/containerd-nydus/snapshots/2/fs/image/layer.boot /var/lib/containerd-nydus/snapshots/3/fs/image/layer.boot /var/lib/containerd-nydus/snapshots/4/fs/image/layer.boot /var/lib/containerd-nydus/snapshots/5/fs/image/layer.boot /var/lib/containerd-nydus/snapshots/6/fs/image/layer.boot /var/lib/containerd-nydus/snapshots/7/fs/image/layer.boot
nydus-image export --block --localfs-dir /var/lib/containerd-nydus/cache --bootstrap /var/lib/containerd-nydus/snapshots/7/fs/image/image.boot --output /var/lib/containerd-nydus/cache/4a85ce26214d83c77b5464631a67c71e1c2793b655261befe52ba0e20ffc3bd1.image.disk.tarfs.tmp --verity
nydus image export command, stdout: dm-verity options: --no-superblock --format=1 -s \"\" --hash=sha256 --data-block-size=512 --hash-block-size=4096 --data-blocks 379918 --hash-offset 194519040 5676d0482c2128e1ae1f59d4b1f253de57158856471b7b5a1bfc0501823bd632
That's one way to generate a separate volume :) And it's possible to integrate this functionality into the erofs-utils package in the future.
Hey @jiangliu - thanks for this proposal - if I understand it correctly then it looks good. I guess my initial concerns are:
- Is there a dependency on the containerd issue & PR you linked as they look quite old and I don't want to swap one containerd fork for another one!
No, it doesn't depend on those two issues. That would just be one more step to formalize the `ExtraOptions` in the containerd community.
- Cri-o support as they don't have the remote snapshotter
On the cri-o support I might have a workaround that we can use in peer pods: If I understand correctly then the 'image-offload' approach works by adding an optional `ExtraOption` message to the `CreateContainer` request. I'm assuming that these are pretty similar, information-wise, to the `PullImageRequest` we currently have as a separate endpoint, and will do some pretty similar logic to the current image pull in CCv0. Assuming this is correct, it seems like the only extra thing the remote snapshotter would do is append those `ExtraOption`s for image offload to the `CreateContainerRequest` body. This makes me wonder if (similar to the proxyless proposal), in the cloud-api-adaptor's logic, before we send the `CreateContainer` request to the kata-agent, we could look at it and check whether it has the `ExtraOption` set, and if not append it and send it on to the kata-agent. That way the remote-snapshotter and the cloud-api-adaptor without a snapshotter would result in the same message to the kata-agent.
Does that make sense, or have I got something wrong?

I feel we have a solution for it now :)
@jiangliu, @arronwy,
I really would like to have this crystal clear: does this solution, for the local case, depend on virtio-fs / virtio-9p? If so, why?
Depending on virtio-fs / virtio-9p would be a no-go for us, as it'd generate issues with:
- all the current TEE approaches, as virtio-fs is not supported;
- using Cloud Hypervisor, as 9p support will never ever be implemented there;
- and, realistically speaking, 9p should not even be considered as a solution, as modern distros have been dropping it, and it's always been a security nightmare.
I just would like to make sure this is well known, and that it's taken into consideration when deciding on the path we'll be taking.
Regarding the cri-o case: I think it can actually be closer to the containerd workflow. The fact is that we already have kata-specific code in cri-o itself to manage the kata runtime. Changing this code to add some extra options to the CreateContainer request is pretty easy, and would probably be accepted by the cri-o community as it would not change the behaviour of any other containers - only kata ones.
The proposal we are working on, suggested by the cri-o maintainers, is to let cri-o pull the image in a separate volume, then share this volume to the remote VM that can do attestation on the container image, and use it. We could use this ExtraOption mechanism to provide the needed information for mounting this remote volume. It would require limited change to cri-o and/or the kata runtime to setup the volume share and add the ExtraOptions, but then I think we can use the current proposal pretty much as is.
Here is what it would look like - please share any comments.
If you're going to pull from the host in cri-o, you may as well do the same in containerd. This looks very similar to what I proposed in slack (https://cloud-native.slack.com/archives/C050S70D9SR/p1686914484876879?thread_ts=1682361194.716389&cid=C050S70D9SR) for the remote-snapshotter.
Once there's a way to expose a disk to a remote VM, the snapshotter can use this functionality as well and the two workflows will be almost identical.
I think sharing a volume on the host to a nested VM is easier than to a remote VM. What about we set it as an enhancement direction for the CRI-O case for remote VMs, and target the 1st goal as below: @jiangliu @stevenhorsman @littlejawa @wedsonaf @fidencio
@jiangliu, @arronwy,
I really would like to have this crystal clear: does this solution, for the local case, depend on virtio-fs / virtio-9p? If so, why?
Depending on virtio-fs / virtio-9p would be a no-go for us, as it'd generate issues with:
- all the current TEE approaches, as virtio-fs is not supported;
- using Cloud Hypervisor, as 9p support will never ever be implemented there;
- and, realistically speaking, 9p should not even be considered as a solution, as modern distros have been dropping it, and it's always been a security nightmare.
I just would like to make sure this is well known, and this is taken into consideration when deciding on the path we'll be taking.
This proposal is designed for two cases: 1) sharing images on the host for CoCo - this does not depend on virtio-fs/9p; 2) providing images for normal Kata containers - all of virtio-fs/virtio-blk/virtio-pmem will be supported. We prefer virtio-fs/virtio-pmem due to DAX, which is critical for our use cases.
I think sharing a volume on the host to a nested VM is easier than to a remote VM. What about we set it as an enhancement direction for the CRI-O case for remote VMs, and target the 1st goal as below: @jiangliu @stevenhorsman @littlejawa @wedsonaf @fidencio
Yeah, creating a volume for peer pods may require invoking cloud OpenAPIs with a complex flow, or multi-attach-capable cloud storage.
Another point I'd like to clarify: none of this should be specific to Nydus, as the comments in the `ExtraOption` snippet seem to suggest. I think we should have generic options prefixed with `io.katacontainers` like in https://github.com/kata-containers/kata-containers/pull/7106 -- this allows for a generic implementation without the need for snapshotter-specific code sprinkled throughout the codebase.
Another point I'd like to clarify: none of this should be specific to Nydus, as the comments in the `ExtraOption` snippet seem to suggest. I think we should have generic options prefixed with `io.katacontainers` like in kata-containers/kata-containers#7106 -- this allows for a generic implementation without the need for snapshotter-specific code sprinkled throughout the codebase.
I feel that's not a blocking issue. Currently the extension is nydus-specific; we should cooperate to make it a Kata extension with a neutral prefix, if we can make the extension generic enough.
@jiangliu I finally have some time to look at this proposal. In particular I am interested in what tests we will be pulling from CCv0 and what will be implemented.
I would like to understand how the 'sharing images on host for CoCo' case is going to deal with encrypted images. The image layers will be mounted as encrypted disks on the guest, is that it? Then kata-agent delegates decryption of the layers to image-rs?
Recently there have been several proposals about image management for CoCo from the community. This is a refined version of [RFC] Proposal for Image Management on CoCo, which aims to cover both host sharing and peer pods.
Scenarios
Design
The core idea is to simplify the design by using the kubelet/containerd `Mount.Option` field, instead of building a new communication channel such as a CRI-proxy or the [RFC] WIP: Proxyless Image-pull on guest proposal for peer pods. We add an `extraoption` field to `Mount.Info` by extending the existing ExtraOption mechanism, as follows:

The overall architecture looks like:

The above architecture works for both Kata and CoCo as:
- The remote snapshotter returns a `Mount` array to containerd with the `ExtraOption` field.
- containerd passes the `Mount` array returned by the remote snapshotter to the kata runtime.
- The hypervisor plugin handles the `Mount` array. Different hypervisor plugins may have different capabilities; for example, qemu may only support `raw-image-block` and `raw-layer-block`, while Dragonball may support all of them. A hypervisor may share data from host to guest through virtio-blk/pmem/fs.
- The kata-agent may handle the `ExtraOption` itself, or call image-rs to handle it.

Proposal for Peer Pods
We need to introduce a remote snapshotter and enhance kata-runtime for Peer Pods.
Proposal for Host sharing through block device
Cooperation
This proposal can be combined with Image pulling on the host to define a generic protocol for Kata Containers, with support of virtio-blk/mem/fs and Kata 2.x/3.x.