Add support Edit Container Mount for kata/runtime-rs using direct volume in K8S/CSI scenario

Apokleos commented 1 year ago

In the Kubernetes/CSI scenario, users need to use CSI to provide storage volumes to regular containers. After CSI completes the relevant operations, kubelet provides a HostPath and ContainerPath, but does not include a Mount Type. In other words, the Container Mount information that Containerd can retrieve through CRI does not include a type. When converting to the OCI Spec, the Mount type is set to "bind" by default.

Unfortunately,this solution does not work properly in the Kata Containers 3.0/runtime-rs with Direct Volume scenario. When using a direct volume, Kata Pod needs to determine the type of DirectVolume based on Mount.Type after obtaining mount information. Then, it needs to choose the appropriate way to insert the backend device into the Guest, such as directvol, vfiovol, or spdkvol. Finally, the mount operation is completed in the Guest.

Kata3.0/runtime-rs using direct volume with containerd ctr, more details can be seen how-to-run-kata-containers-with-kinds-of-Block-Volumes An example as below:

$ cat ./kubelet/kata-direct-vol-002/directvol002/mountInfo.json
{
  "device": "/tmp/stor/rawdisk01.20g", 
  "volume_type": "directvol", 
  "fs_type": "ext4", 
  "metadata":"{}", 
  "options": []
}
$ sudo ctr run -t --rm --runtime io.containerd.kata.v2 --mount type=directvol,src=/kubelet/kata-direct-vol-002/directvol002,dst=/disk002,options=rbind:rw "$image" kata-direct-vol-xx05302045 /bin/bash

However, there are certain differences between ordinary containers and Kata Containers when using CSI to access storage. The Mount Type in the OCI Spec provided by K8S/Containerd is by default the bind type. Ordinary containers will use the bind type by default, but Kata/DirectVolume requires a specific type, such as "directvol" (or other types like vfiovol or spdkvol). Therefore, in the K8S/CSI scenario, Kata Containers wants the Mount in its OCI Spec to contain a specific volume type instead of the bind type when creating a container that uses a Direct Volume. In other words, it is necessary to change the default bind type to the specific volume type at the right time, and then pass it to the Kata runtime-rs.

Although CDI can edit the OCI Spec of Container third-party devices in K8S/DevicePlugin using Container Annotation/CDIDevice information, it can also be extended to CSI to increase the editing ability of Container Mount.

In the K8S/CSI/Kata-runtime-rs scenario using Direct Volume, we hope that CDI can automatically edit a Mount in the OCI Spec according to the Direct Volume type, that is, to change the original "bind" to the "directvol" of the corresponding volume_type in the mountInfo.json instance.

According to the idea, I have already made the initial design and implementation of it, the scheme is as follows.

CSI (1) Do kata-ctl direct-volume add DoAddDirectVolume(mount.HostPath) (2) Do cdi config generation in /var/run/cdi/xxx.json DoGenerateCDIConfig(mount.HostPath, mount.ContainerPath)

containerd Set the special vend/class for direct volume to direct.volume/direct-volume. (1) Add the parameter CDIMounts []*runtime.Mount to the WithCDI function. (2) Use the base64-encoded string of Mount.HostPath as the deviceName to build a mount device that conforms to the CDI specification.

encodedMntpath := DoEncodeBase64(mount.HostPath)
mountDevice := "direct.volume/direct-volume=" + encodedMntpath

The cdi config for Mount as below:


# cat /var/run/cdi/kata-cdi-L3Zhci9saWIv.json
{
"cdiVersion": "0.6.0",
"kind": "direct.volume/direct-volume",
"devices": [
{
  "name": "L3Zhci9saWIva3ViZWxldC9wb2RzLzBhNmY5OGQ4LWM4YjgtNDgwOS05ZTZlLTEwMWJlNTA0MTM2MS92b2x1bWVzL2t1YmVybmV0ZXMuaW9+ZW1wdHktZGlyL2NkaS1kaXJlY3QteHZvbHg5MDAxMA==",
  "containerEdits": {
    "mounts": [
      {
        "hostPath": "/var/lib/kubelet/pods/0a6f98d8-c8b8-4809-9e6e-101be5041361/volumes/kubernetes.io~csi/cdi-direct-xvolx90010",
        "containerPath": "/disk01",
        "options": [
          "rbind",
          "rw"
        ],
        "type": "directvol"
      }
    ]
  }
}
]
}


And the draft code as below:

// WithCDI updates OCI spec with CDI content

func WithCDI(annotations map[string]string, CDIDevices []*runtime.CDIDevice) oci.SpecOpts {
func WithCDI(annotations map[string]string, CDIDevices []runtime.CDIDevice, CDIMounts []runtime.Mount) oci.SpecOpts { return func(ctx context.Context, client oci.Client, c containers.Container, s oci.Spec) error { seen := make(map[string]bool) // Add devices from CDIDevices CRI field var devices []string var err error ... +for _, mount := range CDIMounts {
- // encoded mount path
- encodedMntpath := DoEncodeBase64(mount.HostPath)
- mntDevice := "direct.volume/direct-volume=" + encodedMntpath
- if seen[mntDevice] {
- log.G(ctx).Debugf("Skipping duplicated CDI device %s", mntDevice)
- continue
- }
- devices = append(devices, mntDevice)
- seen[mntDevice] = true +} ...
  
  return oci.WithCDIDevices(devices...)(ctx, client, c, s) } }
CDI

(1) To better support using the base64-URLencoded string of the Mount HostPath as the DeviceName, it is necessary to modify the CDI DeviceName naming check rules to support special symbol =.

(2) As the runtime.Mount object in containerd does not have any additional information to help accurately filter out which HostPath is associated with the direct volume, all Mount.Hostpath objects are allowed to be used to construct DeviceNames and passed to CDI. Then it uses DeviceNames to match cdi Cache database devices. However, the final result is that there is only one Mount that needs to be modified for the direct volume. Therefore, the CDI InjectDevice function needs to be modified to not return an error if there is an unresolved device name.

$git diff container-device-interface/pkg/cdi/cache.go
diff --git a/container-device-interface/pkg/cdi/cache.go b/container-device-interface/pkg/cdi/cache.go
index cb495ebb3..01af15deb 100644
--- a/container-device-interface/pkg/cdi/cache.go
+++ b/container-device-interface/pkg/cdi/cache.go
@@ -243,15 +243,14 @@ func (c *Cache) InjectDevices(ociSpec *oci.Spec, devices ...string) ([]string, e
        }

        if unresolved != nil {
-               return unresolved, fmt.Errorf("unresolvable CDI devices %s",
-                       strings.Join(devices, ", "))
+               fmt.Printf("unresolvable CDI devices %s", strings.Join(devices, ", "))
        }

        if err := edits.Apply(ociSpec); err != nil {
                return nil, fmt.Errorf("failed to inject devices: %w", err)
        }

-       return nil, nil
+       return unresolved, nil
 }

Apokleos commented 1 year ago

@elezar @zvonkok @bart0sh

zvonkok commented 1 year ago

I need to read this carefully and check if this would also have some implications for the KEP: https://github.com/kubernetes/enhancements/pull/4113

Apokleos commented 1 year ago

@zvonkok Thx for your reply. I am looking forward to discussing this with you.

elezar commented 1 year ago

cc @pohly

pohly commented 1 year ago

There's also https://github.com/kubernetes/enhancements/pull/2893 which didn't get merged.

Apokleos commented 1 year ago

There's also kubernetes/enhancements#2893 which didn't get merged.

Thx @pohly I would like to confirm the relationship between KEP-2893 and the issue I raised. And I am concerned that the KEP-2893 does not seem to include support for kata direct-volume. Is my understanding correct?

pohly commented 1 year ago

I don't know (or remember).

Apokleos commented 1 year ago

@elezar @pohly @zvonkok I would appreciate your feedback on supporting the modification of kata direct volume types through CDI. Do you have any suggestions for optimizing and improving this proposal?

elezar commented 1 year ago

For clarification @Apokleos. Is the use of the device name to encode data a requirement, or is this just how you are implementing this in our current setup?

One issue with the code implemented above is that an unresolvable device is not treated as an error. This breaks assumptions around such devices. How would you propose missing / incorrect device specifications be passed to the user if this is not considered an error?

Apokleos commented 1 year ago

Thx @elezar for your feedback.

For clarification @Apokleos. Is the use of the device name to encode data a requirement, or is this just how you are implementing this in our current setup?

In this scenario, the Mount HostPath will be used as the device name. However, the HostPath may contain special characters, which are not allowed in the CDI naming specification. To comply with the CDI device name specification, the HostPath was encrypted using base64, even though this may result in special characters after encryption. This was done to maintain consistency with the encryption method used for volume paths in kata direct volumes.

One issue with the code implemented above is that an unresolvable device is not treated as an error. This breaks assumptions around such devices. How would you propose missing / incorrect device specifications be passed to the user if this is not considered an error?

Yes, this is indeed a problem. However, I have a different understanding of unresolved devices after deep diving the code. Do devices not found in the CDI cache belong to unresolved devices? (As the code of InjectDevices indicates that devices not found in the CDI cache belong to unresolved devices. ) IMO, devices not found in the CDI cache are filtered devices, while devices hit cache but failed to be applied are the unresolved devices.

I propose to split InjectDevices into two steps: filtering and injecting.

filtering The filtering step will be handled by a separate method called FilterDevices . In this step, if devices hit the CDI Cache, they will be the ones we want.
injecting The devices from filtering step will be passed to the injecting method InjectDevices. If a device fails to be applied, it will be considered as an unresolved device.

And the FilterDevices can be placed in RegistryResolver as below:

type RegistryResolver interface {
    InjectDevices(spec *oci.Spec, device ...string) (unresolved []string, err error)
+   FilterDevices(device ...string) (mached []string, err error)
}

Apokleos commented 1 year ago

Hi @elezar I'm interested in hearing your thoughts on the updated proposal. I'd like to discuss it further with you and look forward to hearing from you.

Thx.

Apokleos commented 1 year ago

Hi @klihub, Could I hear your suggestions for this proposal? Thx

klihub commented 1 year ago

Hi @klihub, Could I hear your suggestions for this proposal? Thx

My biggest problem with this, in its current form, is that we end up treating unresolvable CDI devices during injection as something to be merely logged instead of flagged as a fatal error. I don't think that is something we'd find an acceptable compromise.

Looking at your (very nice and detailed) description, I gather that the basic problem is roughly

there is a CSI driver that tries to create/serve/handle storage volumes for Kata containers
naturally we can't inject storage into a Kata container by simply bind-mounting a directory
yet, with the current infrastructure (K8s, CRI, runtime) there is no way to convey enough information, so we end up with a simple bind mount

And this proposal (and I guess your earlier hack attempt to the NRI device injector sample plugin to modify mount options) tries to come up with a way to 'patch up' the incorrectly/insufficiently created mount for the container after the fact.

@Apokleos Is this summary correct ?

Apokleos commented 1 year ago

Thank you @klihub for your valuable time. I appreciate your understanding of my proposal. I would like to discuss the unresolvable CDI devices with you further.

Hi @klihub, Could I hear your suggestions for this proposal? Thx

My biggest problem with this, in its current form, is that we end up treating unresolvable CDI devices during injection as something to be merely logged instead of flagged as a fatal error. I don't think that is something we'd find an acceptable compromise.

@elezar has the same concern about the unresolvable CDI devices in this proposal, and I updated the proposal What do you think of the new updates?

Looking at your (very nice and detailed) description, I gather that the basic problem is roughly

there is a CSI driver that tries to create/serve/handle storage volumes for Kata containers

naturally we can't inject storage into a Kata container by simply bind-mounting a directory

yet, with the current infrastructure (K8s, CRI, runtime) there is no way to convey enough information, so we end up with a simple bind mount

And this proposal (and I guess your earlier hack attempt to the NRI device injector sample plugin to modify mount options)

Yes, you're right. I try to do "patch up" the Mount's type with NRI annotation when I do some tests for kata/direct-volume with NRI. And

tries to come up with a way to 'patch up' the incorrectly/insufficiently created mount for the container after the fact.

@Apokleos Is this summary correct ?

Apokleos commented 1 year ago

Thx @elezar @klihub I appreciate your time to discuss this proposal with me. I have revised it based on your concerns and submitted a PR to clarify it further. To avoid violating the original design while supporting Mount, I added a dedicated method UpdateMounts to handle Mounts. This allows for unresolved behavior to be allowed in Mount scenarios without errors.

zvonkok commented 1 year ago

@Apokleos, @marquiz and I are working on this KEP: https://github.com/kubernetes/enhancements/pull/4113, especially for the Kata use-case. To summarize your issue, we need additional meta information passed down the CRI. Rather then "hacking" something into CDI I suggest we tackle the issue at the source if possible. @Apokleos Can you please review the KEP and add a comment about your use-case and what is missing?

Apokleos commented 1 year ago

@Apokleos, @marquiz and I are working on this KEP: kubernetes/enhancements#4113, especially for the Kata use-case. To summarize your issue, we need additional meta information passed down the CRI. Rather then "hacking" something into CDI I suggest we tackle the issue at the source if possible. @Apokleos Can you please review the KEP and add a comment about your use-case and what is missing?

Hi @zvonkok Thx for your suggestion for the KEP-4113. I have added some comments about my concern. IMO, the best way to fix this is to provide a way to pass down the Mount Type corresponding to the direct volume, used to build the oci Spec Mount, instead of the default specified type bind.

github-actions[bot] commented 6 months ago

This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.

github-actions[bot] commented 5 months ago

This issue was automatically closed due to inactivity.

cncf-tags / container-device-interface

Add support Edit Container Mount for kata/runtime-rs using direct volume in K8S/CSI scenario #162