Open inatatsu opened 1 month ago
@inatatsu commenting on the webhook aspect. There is still some plumbing work pending, and I'm open to PRs :-)
@bpradipt Thank you for your responses. Can we extend the current webhook-based approach to support DRA and generate a CDI spec in a peer pod VM?
@zvonkok you might also be interested, and your input would be extremely helpful here.
According to this comment, it looks like CDI needs to be enabled in both the runtime and the kata-agent.
https://github.com/kata-containers/kata-containers/issues/9543#issuecomment-2366800482
CDI support has so far only been enabled in runtime-rs, not in the Go version of the kata-shim runtime.
https://github.com/kata-containers/kata-containers/issues/10145
I don't think runtime-rs supports the remote hypervisor used for peer pods. Do we need to enable CDI in the Go version of the kata-shim runtime?
In the Go version of the kata-shim runtime, the remote hypervisor currently just ignores devices. I think we also need to fix this when CDI support is enabled in the kata-shim runtime.
```go
func (rh *remoteHypervisor) AddDevice(ctx context.Context, devInfo interface{}, devType DeviceType) error {
	// TODO should we return notImplemented("AddDevice"), rather than nil and ignoring it?
	logrus.Printf("addDevice: deviceType=%v devInfo=%#v", devType, devInfo)
	return nil
}

func (rh *remoteHypervisor) HotplugAddDevice(ctx context.Context, devInfo interface{}, devType DeviceType) (interface{}, error) {
	return nil, notImplemented("HotplugAddDevice")
}

func (rh *remoteHypervisor) HotplugRemoveDevice(ctx context.Context, devInfo interface{}, devType DeviceType) (interface{}, error) {
	return nil, notImplemented("HotplugRemoveDevice")
}
```
I think another possible workaround to support CDI in peer pods is to have cloud-api-adaptor manipulate the Devices field in CreateContainerRequest.
@bpradipt Thank you for your responses. Can we extend the current webhook-based approach to support DRA and generate a CDI spec in a peer pod VM?
I think @yoheiueda's proposal to do it in the CreateContainerRequest may be easier. We can keep the webhook just to remove resources from the spec that don't apply to peer pods. Also, I'm unclear how DRA will impact peer-pods resource management. Can we do away with the webhook completely and rely on DRA for peer-pods resource management?
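To make the CreateContainerRequest idea above concrete, here is a minimal sketch. The types are hypothetical, simplified stand-ins for the agent-protocol definitions in kata-containers, and `injectGPUDevices` is an invented helper name; the sketch only illustrates cloud-api-adaptor rewriting the Devices list that the remote hypervisor currently drops.

```go
package main

import "fmt"

// Hypothetical, simplified mirrors of the agent-protocol types; the real
// definitions live in the kata-containers agent protobufs.
type Device struct {
	ContainerPath string
	VMPath        string
	Type          string
}

type CreateContainerRequest struct {
	ContainerID string
	Devices     []Device
}

// injectGPUDevices sketches what cloud-api-adaptor could do: rewrite the
// (currently ignored) Devices list so that the kata-agent in the pod VM
// sees the GPUs that the chosen CSP instance type provides.
func injectGPUDevices(req *CreateContainerRequest, gpuCount int) {
	for i := 0; i < gpuCount; i++ {
		req.Devices = append(req.Devices, Device{
			ContainerPath: fmt.Sprintf("/dev/nvidia%d", i),
			VMPath:        fmt.Sprintf("/dev/nvidia%d", i),
			Type:          "c", // character device
		})
	}
}

func main() {
	req := &CreateContainerRequest{ContainerID: "gpu-workload"}
	injectGPUDevices(req, 2)
	for _, d := range req.Devices {
		fmt.Println(d.ContainerPath) // /dev/nvidia0, /dev/nvidia1
	}
}
```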
There are several parts to the story. I am ramping up on peer-pods, so excuse my ignorance on some parts.
Enable CDI in the kata-agent, which is completely independent of peer-pods or a local VMM. This is enabled here: https://github.com/kata-containers/kata-containers/pull/9584. @bpradipt this will eliminate the prestart hook.
I do not understand the complete webhook setup in peer-pods yet, but let's try to keep it simple and stupid.
We've built DRA to request special features of a GPU, like "give me a GPU with 40G", a MIG slice, a vGPU, or a specific architecture. I am still unsure how we're going to map this exactly to peer-pods, since we do not know what the CSP pool is capable of.
Do we need some advertisement system (NFD) for CSP-like infrastructure?
Peer pods add a new layer of complexity. I need to think about how to enable DRA and CDI.
@bpradipt We need to think about how to enable DRA properly. The logic you have is a good start, but it ignores MIG and vGPU.
According to this comment, it looks like CDI needs to be enabled in both the runtime and the kata-agent.
kata-containers/kata-containers#9543 (comment)
CDI support has so far only been enabled in runtime-rs, not in the Go version of the kata-shim runtime.
kata-containers/kata-containers#10145
I don't think runtime-rs supports the remote hypervisor used for peer pods. Do we need to enable CDI in the Go version of the kata-shim runtime?
Yes, both the runtime and the kata-agent need to integrate with CDI. Currently, AFAIK, both the kata runtime and runtime-rs support CDI for GPU scenarios. Another thing: the remote hypervisor in runtime-rs is also under review, as a Summer of Code project.
Hmm, since the mapping is one Pod per CSP VM, we need to make sure that DRA in the peer-pods case only allows creation of GPUs that map to CSP instance types, or keeps the Pod pending until the CSP offers the proper instance type :)
All the management and configuration of devices is now pushed into DRA, whereas with device plugins you consume what the infrastructure offers. We have a conflict here with peer-pods. In the bare-metal use case we can request a full-passthrough GPU (or vGPU), where DRA binds the proper GPU to VFIO or MDEV and creates the CDI spec with the VFIO device. The CRI sends this to Kata, which then passes the GPU through, and in the VM we use CDI to create the proper device nodes in the OCI spec to be mounted into the container.
In the case of peer-pods, DRA would just act as a proxy that passes the wanted type through to peer-pods, which would then choose the proper instance type and do the CSP magic.
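For reference, the CDI spec mentioned in the bare-metal flow above is plain JSON. Here is a minimal, hand-rolled sketch of generating one; the structs only mirror a subset of the schema (the authoritative schema lives in the container-device-interface project), and the `kind`, device name, and VFIO group number are illustrative.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Minimal hand-rolled structs for a CDI spec; the canonical schema is
// defined by the container-device-interface project.
type DeviceNode struct {
	Path string `json:"path"`
}

type ContainerEdits struct {
	DeviceNodes []DeviceNode `json:"deviceNodes"`
}

type CDIDevice struct {
	Name           string         `json:"name"`
	ContainerEdits ContainerEdits `json:"containerEdits"`
}

type CDISpec struct {
	Version string      `json:"cdiVersion"`
	Kind    string      `json:"kind"`
	Devices []CDIDevice `json:"devices"`
}

// vfioSpec builds a spec exposing one passed-through GPU via its VFIO group,
// roughly what DRA would produce for the full-passthrough case.
func vfioSpec(group int) CDISpec {
	return CDISpec{
		Version: "0.6.0",
		Kind:    "nvidia.com/gpu",
		Devices: []CDIDevice{{
			Name: "gpu0",
			ContainerEdits: ContainerEdits{
				DeviceNodes: []DeviceNode{{Path: fmt.Sprintf("/dev/vfio/%d", group)}},
			},
		}},
	}
}

func main() {
	out, _ := json.MarshalIndent(vfioSpec(71), "", "  ")
	fmt.Println(string(out))
}
```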
@zvonkok Thank you very much for the explanation of how CDI works with DRA.
And another thing: https://github.com/kata-containers/kata-containers/pull/10225 is also under review, as a Summer of Code project.
@Apokleos That sounds great! I have a basic question regarding runtime-rs: at some point in the future, will the Go version of the kata-shim runtime be deprecated and replaced with runtime-rs?
Hah, yeah, good point. I think I should invite AC members @stevenhorsman @fupanli @zvonkok etc. to help answer this question.
@Apokleos That sounds great! I have a basic question regarding runtime-rs: at some point in the future, will the Go version of the kata-shim runtime be deprecated and replaced with runtime-rs?
The short answer here is yes. The more nuanced version is yes, but we are not sure of the timeframe. The current plan is for Kata Containers 4.0 to ship with runtime-rs as the default shim. The Go runtime won't be removed at that point, but it might receive security fixes only, or best-effort feature support, with all new features targeted primarily at the Rust runtime first. In Kata Containers 5.0 I guess there is a reasonable chance that the Go runtime will be removed entirely, but that is unlikely to be decided for a long time.
4.0 is planned for some time in 2025, but there is still quite a bit of work required to close the gap, as listed in https://github.com/kata-containers/kata-containers/issues/8702, including the remote hypervisor support that @Apokleos mentioned.
@bpradipt @stevenhorsman @yoheiueda @zvonkok @Apokleos Thank you very much for your helpful comments. Let me summarize the discussions and suggestions (and my understanding 😃). Feel free to correct or add anything:

- A user can run a Pod which refers to a ResourceClaim as a peer pod. The user can pass a structured parameter to define the allocated resource.
- The requested resource will actually be allocated when the peer pod VM is created.
- The resource request is reflected in the pod VM instance profile and in a CDI spec used by the kata-agent inside the pod VM.
- The worker node must advertise the available ResourceSlice in advance (this is somewhat similar to what the peer-pods webhook currently does using the kata.peerpods.io/vm extended resource).
- The container runtime on the worker node must also enable CDI.
- We need some custom component (a kubelet plugin?) to pass the resource allocation request through to the cloud provider and the pod VM, while creating a dummy CDI spec for the container runtime on the worker node.

@inatatsu thanks for summarising it. Few inline questions for my understanding:
@bpradipt @stevenhorsman @yoheiueda @zvonkok @Apokleos Thank you very much for your helpful comments. Let me summarize the discussions and suggestions (and my understanding 😃). Feel free to correct or add anything:

- A user can run a Pod which refers to a ResourceClaim as a peer pod. The user can pass a structured parameter to define the allocated resource.
- The requested resource will actually be allocated when the peer pod VM is created.
- The resource request is reflected in the pod VM instance profile and in a CDI spec used by the kata-agent inside the pod VM.
- The worker node must advertise the available ResourceSlice in advance (this is somewhat similar to what the peer-pods webhook currently does using the kata.peerpods.io/vm extended resource).
Is this about advertising external VMs as resources instead of the current per-node extended resources?
- The container runtime on the worker node must also enable CDI.
- We need some custom component (a kubelet plugin?) to pass the resource allocation request through to the cloud provider and the pod VM, while creating a dummy CDI spec for the container runtime on the worker node.
How is CDI useful in the peer-pods case? The availability of the GPU resource is taken care of by the cloud infra provider, and all GPUs available in the VM get allocated to the pod, since there is a 1:1 mapping between the VM and the pod.
@bpradipt Thank you for your questions.
Is this about advertising external VMs as resources instead of the current per node extended resources?
While I did not imagine such a use case 😅, it is interesting and may simplify the VM management. A ResourceSlice can also be per node.
How is CDI useful in the peer-pods case? The availability of the GPU resource is taken care of by the cloud infra provider, and all GPUs available in the VM get allocated to the pod, since there is a 1:1 mapping between the VM and the pod.
In my understanding, CDI allows flexible device mapping and is runtime-agnostic. But as you point out, peer pods primarily rely on selecting an appropriate instance profile (or flavor) to allocate resources, and CDI just provides the mapping between those resources and containers.
So, CDI will be helpful on the kata-agent side to assign the GPU (or other devices) to the container, additionally reusing the same building blocks (CDI). Is my understanding correct?
So, CDI will be helpful on the kata-agent side to assign the GPU (or other devices) to the container, additionally reusing the same building blocks (CDI). Is my understanding correct?
@bpradipt Yes. That's my current understanding.
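As a rough illustration of that kata-agent-side role: CDI's containerEdits ultimately become entries in the container's OCI spec. The types below are simplified stand-ins for the OCI runtime-spec structures (specs.Spec / specs.LinuxDevice in the real code), and `applyCDIDevice` is an invented helper name.

```go
package main

import "fmt"

// Simplified stand-ins for the OCI runtime-spec types.
type LinuxDevice struct {
	Path string
	Type string
}

type Spec struct {
	Devices []LinuxDevice
}

// applyCDIDevice sketches what happens on the kata-agent side with a
// resolved CDI device: merge its containerEdits (here just device nodes)
// into the container's OCI spec before the container starts.
func applyCDIDevice(spec *Spec, deviceNodePaths []string) {
	for _, p := range deviceNodePaths {
		spec.Devices = append(spec.Devices, LinuxDevice{Path: p, Type: "c"})
	}
}

func main() {
	spec := &Spec{}
	applyCDIDevice(spec, []string{"/dev/nvidia0", "/dev/nvidiactl"})
	for _, d := range spec.Devices {
		fmt.Println(d.Path)
	}
}
```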
- The container runtime on the worker node must also enable CDI.

The Go runtime merged PRs to enable CDI:

@zvonkok Does this mean the Go runtime (except for the remote hypervisor) already supports CDI?
How can we enable Dynamic Resource Allocation (DRA) based on Container Device Interface (CDI) for peer pods?
K8s v1.26 introduced DRA, and the Kata agent has recently been enabling CDI. In my understanding, when we want to use GPUs in a peer pod, we need to manually specify an instance profile with GPUs. The webhook simply removes nvidia.com/gpu device requests and records them in the annotation kata.peerpods.io.gpus, but that annotation does not seem to be used to select an instance profile. Any suggestions?