Add CDI support to peer pods

inatatsu commented 1 month ago

How can we enable Dynamic Resource Allocation (DRA) based on Container Device Interface (CDI) for peer pods?

K8s v1.26 introduced DRA, and the Kata agent has recently started enabling CDI. In my understanding, when we want to use GPUs in a peer pod, we need to manually specify an instance profile with GPUs. The webhook simply removes nvidia.com/gpu device requests and records them in an annotation kata.peerpods.io.gpus, but that annotation does not seem to be used to select an instance profile.

Any suggestions?

bpradipt commented 1 month ago

@inatatsu commenting on the webhook aspect. There is some plumbing work pending, and I'm open to PRs :-)

* Enable a gpu annotation in Kata Containers (similar to default_cpus, default_mem, machine_type, etc.) for the remote hypervisor so that it's available in CAA before the CreateVM request.
* Enable selection logic in CAA. I have started something here - https://github.com/bpradipt/cloud-api-adaptor/commit/d38f3d995348ea23ad4932c782ba23f2ea148d78

inatatsu commented 1 month ago

@bpradipt Thank you for your responses. Can we extend the current webhook-based approach to support DRA and generate a CDI spec in a peer pod VM?

stevenhorsman commented 1 month ago

@zvonkok - you might also be interested and extremely helpful here?

yoheiueda commented 1 month ago

According to this comment, it looks like CDI needs to be enabled in both runtime and kata-agent.

https://github.com/kata-containers/kata-containers/issues/9543#issuecomment-2366800482

CDI support for the runtime has only been enabled in runtime-rs, not in the Go version of the kata-shim runtime.

https://github.com/kata-containers/kata-containers/issues/10145

I don't think runtime-rs supports the remote hypervisor for peer pods. Do we need to enable CDI in the Go version of kata-shim runtime?

yoheiueda commented 1 month ago

In the Go version of the kata-shim runtime, the remote hypervisor just ignores devices for now. I think we also need to fix this when CDI support is enabled in the kata-shim runtime.

https://github.com/kata-containers/kata-containers/blob/ca416d883729c7888287a89de836d67bc0975528/src/runtime/virtcontainers/remote.go#L203-L215

func (rh *remoteHypervisor) AddDevice(ctx context.Context, devInfo interface{}, devType DeviceType) error {
    // TODO should we return notImplemented("AddDevice"), rather than nil and ignoring it?
    logrus.Printf("addDevice: deviceType=%v devInfo=%#v", devType, devInfo)
    return nil
}

func (rh *remoteHypervisor) HotplugAddDevice(ctx context.Context, devInfo interface{}, devType DeviceType) (interface{}, error) {
    return nil, notImplemented("HotplugAddDevice")
}

func (rh *remoteHypervisor) HotplugRemoveDevice(ctx context.Context, devInfo interface{}, devType DeviceType) (interface{}, error) {
    return nil, notImplemented("HotplugRemoveDevice")
}

yoheiueda commented 1 month ago

I think another possible workaround to support CDI in peer pods is to manipulate Devices in CreateContainerRequest by cloud-api-adaptor.

https://github.com/confidential-containers/cloud-api-adaptor/blob/aab207c82de836587bfa62e192bfd18a4af6d19a/src/cloud-api-adaptor/pkg/adaptor/proxy/service.go#L77-L82
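
To make the idea concrete, here is a minimal sketch of rewriting the device list of a request before it is forwarded to the agent. The `Device` and `CreateContainerRequest` types below are simplified, hypothetical stand-ins and not the real agent protocol definitions; `injectDevices` is likewise an illustrative name, not an existing function in cloud-api-adaptor.

```go
package main

import "fmt"

// Device is a simplified, hypothetical stand-in for a device entry
// carried in a container creation request.
type Device struct {
	Type          string
	ContainerPath string
}

// CreateContainerRequest is a simplified, hypothetical stand-in for the
// request that a proxy would forward to the agent.
type CreateContainerRequest struct {
	ContainerID string
	Devices     []Device
}

// injectDevices returns a copy of req with the extra devices appended.
// Devices whose container path is already present are skipped, and the
// original request is left untouched.
func injectDevices(req CreateContainerRequest, extra []Device) CreateContainerRequest {
	out := req
	out.Devices = append([]Device(nil), req.Devices...)
	seen := make(map[string]bool, len(req.Devices))
	for _, d := range req.Devices {
		seen[d.ContainerPath] = true
	}
	for _, d := range extra {
		if seen[d.ContainerPath] {
			continue
		}
		out.Devices = append(out.Devices, d)
		seen[d.ContainerPath] = true
	}
	return out
}

func main() {
	req := CreateContainerRequest{ContainerID: "c1"}
	// Illustrative device; the path and type are assumptions for the example.
	gpu := Device{Type: "c", ContainerPath: "/dev/nvidia0"}
	patched := injectDevices(req, []Device{gpu, gpu})
	fmt.Println(len(req.Devices), len(patched.Devices))
}
```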

bpradipt commented 1 month ago

> @bpradipt Thank you for your responses. Can we extend the current webhook-based approach to support DRA and generate a CDI spec in a peer pod VM?

I think @yoheiueda's proposal to do it in the CreateContainerRequest may be easier. We can just keep the webhook to handle removing resources from the spec that don't apply to peer-pods. Also, I'm unclear how DRA will impact peer-pods resource management. Can we do away with the webhook completely and rely on DRA for peer-pods resource management?

zvonkok commented 4 weeks ago

There are several parts to the story. I am ramping up on peer-pods, so excuse my ignorance on some parts.

Enabling CDI in the kata-agent is completely independent of whether we use peer-pods or a local VMM. This is enabled here: https://github.com/kata-containers/kata-containers/pull/9584. @bpradipt This will eliminate the prestart-hook.

I do not understand the complete webhook thing in peer-pods, but let's try to keep it simple and stupid.

We've built DRA to request special features of a GPU, such as a GPU with 40G, a MIG slice, vGPU, or a specific architecture. I am still unsure how exactly we're going to map this to peer-pods, since we do not know what the CSP pool is capable of.

Do we need some advertisement system (NFD) for CSP-like infrastructure?

Peer pods add a new layer of complexity. I need to think about how to enable DRA and CDI.

zvonkok commented 4 weeks ago

@bpradipt We need to think about how to enable DRA properly. The logic you have is a good start but ignores MIG and vGPU.

Apokleos commented 4 weeks ago

> According to this comment, it looks like CDI needs to be enabled in both runtime and kata-agent.
>
> kata-containers/kata-containers#9543 (comment)
>
> CDI support for the runtime has only been enabled in runtime-rs, not in the Go version of the kata-shim runtime.
>
> kata-containers/kata-containers#10145
>
> I don't think runtime-rs supports the remote hypervisor for peer pods. Do we need to enable CDI in the Go version of kata-shim runtime?

Yes, both the runtime and kata-agent need to integrate with CDI. Currently, AFAIK, both the kata runtime and runtime-rs support CDI for GPU scenarios. Another thing: the remote hypervisor in runtime-rs is also under review; it is a Summer of Code project.

zvonkok commented 4 weeks ago

Hmm, since the mapping is Pod per CSP VM we need to make sure that DRA in the case of peer-pods only allows creation of GPUs that map to CSP instance types or have the Pod pending until the CSP implements the proper instance type :)

zvonkok commented 4 weeks ago

All the management and configuration of devices is now pushed into DRA, whereas with device plugins you consume what the infrastructure offers. We have a conflict here with peer-pods. In the bare-metal use case we can request a full-passthrough GPU (vGPU): DRA would bind a proper GPU to VFIO or MDEV and create the CDI spec with the vfio device, and the CRI sends this to Kata, which then passes through the GPU. In the VM we use CDI to create the proper device nodes in the OCI spec to be mounted into the container.

In the case of peer-pods, DRA would just act as a proxy that passes the wanted type through to peer-pods, which would in the end choose the proper instance type and do the CSP magic.
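
The proxy idea, combined with the earlier point that a Pod should stay pending when no CSP instance type fits, could be sketched roughly as follows. This is only an illustration under stated assumptions: `GPURequest`, `Offering`, `resolve`, the mode strings, and the catalog entries are all hypothetical and do not correspond to any existing DRA or peer-pods API.

```go
package main

import "fmt"

// GPURequest is a hypothetical, simplified view of what a claim asks for.
type GPURequest struct {
	Mode         string // illustrative values: "passthrough", "mig", "vgpu"
	MinMemoryGiB int
}

// Offering is a hypothetical catalog entry describing one CSP instance type.
type Offering struct {
	InstanceType string
	Mode         string
	MemoryGiB    int
}

// resolve returns the first offering that satisfies the request.
// ok is false when nothing matches; the caller would then leave the
// Pod pending instead of creating a VM.
func resolve(req GPURequest, catalog []Offering) (instanceType string, ok bool) {
	for _, o := range catalog {
		if o.Mode == req.Mode && o.MemoryGiB >= req.MinMemoryGiB {
			return o.InstanceType, true
		}
	}
	return "", false
}

func main() {
	catalog := []Offering{
		{InstanceType: "gpu-a", Mode: "passthrough", MemoryGiB: 24},
		{InstanceType: "gpu-b", Mode: "passthrough", MemoryGiB: 80},
	}
	requests := []GPURequest{
		{Mode: "passthrough", MinMemoryGiB: 40},
		{Mode: "mig", MinMemoryGiB: 10},
	}
	for _, req := range requests {
		if it, ok := resolve(req, catalog); ok {
			fmt.Printf("%+v -> %s\n", req, it)
		} else {
			fmt.Printf("%+v -> pending (no matching instance type)\n", req)
		}
	}
}
```

The open question from the thread remains where the catalog comes from, since the cluster does not know in advance what the CSP pool is capable of.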

yoheiueda commented 4 weeks ago

@zvonkok Thank you very much for the explanation of how CDI works with DRA.

> Another thing: https://github.com/kata-containers/kata-containers/pull/10225 is also under review; it is a Summer of Code project.

@Apokleos That sounds great! I have a basic question regarding runtime-rs. At some point in the future, will the Go version of kata-shim runtime be deprecated and replaced with runtime-rs?

Apokleos commented 4 weeks ago

> @zvonkok Thank you very much for the explanation of how CDI works with DRA.
>
> > Another thing: kata-containers/kata-containers#10225 is also under review; it is a Summer of Code project.
>
> @Apokleos That sounds great! I have a basic question regarding runtime-rs. At some point in the future, will the Go version of kata-shim runtime be deprecated and replaced with runtime-rs?

Hah, yeah, good point. I think I should invite AC members @stevenhorsman @fupanli @zvonkok etc. to help answer this question.

stevenhorsman commented 4 weeks ago

> @Apokleos That sounds great! I have a basic question regarding runtime-rs. At some point in the future, will the Go version of kata-shim runtime be deprecated and replaced with runtime-rs?

The short answer here is yes. The more nuanced version is yes, but we are not sure of the timeframe. The current plan is for Kata Containers 4.0 to ship with runtime-rs as the default shim. The go runtime won't be removed at that point, but it might receive security fixes only, or best-effort feature support, with all new features targeted at the rust runtime first. In Kata Containers 5.0 I guess there is a reasonable chance that the go runtime will be removed entirely, but that is unlikely to be decided for a long time.

4.0 is planned for some time in 2025, but there is still quite a bit of work required to close the gap, as listed in https://github.com/kata-containers/kata-containers/issues/8702, including the remote hypervisor support that @Apokleos mentioned.

inatatsu commented 3 weeks ago

@bpradipt @stevenhorsman @yoheiueda @zvonkok @Apokleos Thank you very much for your helpful comments. Let me summarize the discussions and suggestions (and my understanding😃). Feel free to correct or add anything:

* A user can run a Pod that refers to a ResourceClaim as a peer pod. The user can pass a structured parameter to define the allocated resource.
* The requested resource is actually allocated when a peer pod VM is created.
* The resource request is reflected in a pod VM instance profile and in a CDI spec used by the kata agent inside the pod VM.
* The worker node must advertise the available ResourceSlice in advance (this is somewhat similar to what the peerpod webhook currently does using the kata.peerpods.io/vm extended resources).
* The container runtime on the worker node must also enable CDI.
* We need some custom component (a kubelet plugin?) to pass the resource allocation request through to the cloud provider and pod VM while creating a dummy CDI spec for the container runtime on the worker node.

bpradipt commented 3 weeks ago

@inatatsu thanks for summarising it. A few inline questions for my understanding:

> @bpradipt @stevenhorsman @yoheiueda @zvonkok @Apokleos Thank you very much for your helpful comments. Let me summarize the discussions and suggestions (and my understanding😃). Feel free to correct or add anything:
>
> A user can run a Pod that refers to a ResourceClaim as a peer pod. The user can pass a structured parameter to define the allocated resource.
>
> The requested resource is actually allocated when a peer pod VM is created.
>
> The resource request is reflected in a pod VM instance profile and in a CDI spec used by the kata agent inside the pod VM.
>
> The worker node must advertise the available ResourceSlice in advance (this is somewhat similar to what the peerpod webhook currently does using the kata.peerpods.io/vm extended resources).

Is this about advertising external VMs as resources instead of the current per node extended resources?

> The container runtime on the worker node must also enable CDI.
>
> We need some custom component (a kubelet plugin?) to pass the resource allocation request through to the cloud provider and pod VM while creating a dummy CDI spec for the container runtime on the worker node.

How is CDI useful for the peer-pods case? The availability of the GPU resource is taken care of by the cloud infra provider, and all GPUs available in the VM get allocated to the pod, as there is a 1-1 mapping between VM and pod.

inatatsu commented 3 weeks ago

@bpradipt Thank you for your questions.

> Is this about advertising external VMs as resources instead of the current per node extended resources?

While I did not imagine such a use case😅, it is interesting and may simplify the VM management. ResourceSlice can also be per node.

> How is CDI useful for the peer-pods case? The availability of the GPU resource is taken care of by the cloud infra provider, and all GPUs available in the VM get allocated to the pod, as there is a 1-1 mapping between VM and pod.

In my understanding, CDI allows flexible device mapping and is runtime-agnostic. But as you point out, peer pods primarily rely on selecting an appropriate instance profile (or flavor) to allocate resources, and CDI just provides a mapping between the resources and containers.

bpradipt commented 3 weeks ago

> @bpradipt Thank you for your questions.
>
> > Is this about advertising external VMs as resources instead of the current per node extended resources?
>
> While I did not imagine such a use case😅, it is interesting and may simplify the VM management. ResourceSlice can also be per node.
>
> > How is CDI useful for the peer-pods case? The availability of the GPU resource is taken care of by the cloud infra provider, and all GPUs available in the VM get allocated to the pod, as there is a 1-1 mapping between VM and pod.
>
> In my understanding, CDI allows flexible device mapping and is runtime-agnostic. But as you point out, peer pods primarily rely on selecting an appropriate instance profile (or flavor) to allocate resources, and CDI just provides a mapping between the resources and containers.

So CDI will be helpful on the kata-agent side to assign the GPU (or other devices) to the container, while also reusing the same building blocks (CDI). Is my understanding correct?

inatatsu commented 3 weeks ago

> So CDI will be helpful on the kata-agent side to assign the GPU (or other devices) to the container, while also reusing the same building blocks (CDI). Is my understanding correct?

@bpradipt Yes. That's my current understanding.
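
As a small illustration of the "mapping between resources and containers" role, CDI identifies a device by a fully qualified name of the form `vendor/class=name`, which a consumer splits before looking up the matching spec entry. The parser below is a minimal sketch of that split only; `QualifiedName` and `parseQualifiedName` are hypothetical names, the validation is deliberately looser than the real CDI rules, and this is not the kata-agent's actual code.

```go
package main

import (
	"fmt"
	"strings"
)

// QualifiedName holds the three parts of a CDI device name such as
// "nvidia.com/gpu=0".
type QualifiedName struct {
	Vendor string
	Class  string
	Name   string
}

// parseQualifiedName splits "vendor/class=name" into its parts.
// It only checks that all three parts are present and non-empty.
func parseQualifiedName(s string) (QualifiedName, error) {
	kind, name, ok := strings.Cut(s, "=")
	if !ok || name == "" {
		return QualifiedName{}, fmt.Errorf("%q: expected vendor/class=name", s)
	}
	vendor, class, ok := strings.Cut(kind, "/")
	if !ok || vendor == "" || class == "" {
		return QualifiedName{}, fmt.Errorf("%q: expected vendor/class=name", s)
	}
	return QualifiedName{Vendor: vendor, Class: class, Name: name}, nil
}

func main() {
	for _, s := range []string{"nvidia.com/gpu=0", "gpu0"} {
		q, err := parseQualifiedName(s)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("vendor=%s class=%s name=%s\n", q.Vendor, q.Class, q.Name)
	}
}
```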

inatatsu commented 2 weeks ago

> * The container runtime on the worker node must also enable CDI.

The go runtime merged PRs to enable CDI:

@zvonkok Does this mean the go runtime (except for the remote hypervisor) already supports CDI?

confidential-containers / cloud-api-adaptor

Add CDI support to peer pods #2126