Ensure peer-pods deployment is complete w.r.to webhook and VM resource limits

bpradipt commented 3 months ago

Currently, the default operator-based deployment doesn't deploy the complete stack:

Mutating webhook is not deployed: This affects resource management of peer pod VMs
Node extended resources are not advertised: This affects resource accounting and management of peer pod VMs

The following diagram shows the high level resource accounting and management for peer-pods

Ref: old deck on the resource accounting and management for peer-pods - https://docs.google.com/presentation/d/1GWNgQdRC5WxrXz_0XCW3DGIfzQHkO4MaN-8BlRPuTDc/edit#slide=id.g13a9839f269_0_0

The node extended resources are advertised by the peerpodconfig-ctrl. The earlier intention was to use peerpodconfig-ctrl to deploy all the required components for cloud-api-adaptor, but we are not yet there. This delay in implementation also gives us an opportunity to re-think the right approach.
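For context, advertising an extended resource on a node generally comes down to a JSON Patch against the node's status subresource, as described in the Kubernetes documentation on extended resources. Below is a minimal sketch of how such a patch body could be built. The resource name `kata.peerpods.io/vm` and the limit of 10 are assumptions for illustration, and the helper names are hypothetical; this is not necessarily what peerpodconfig-ctrl sends.

```python
import json


def escape_json_pointer(token: str) -> str:
    """Escape a JSON Pointer reference token (RFC 6901): '~' -> '~0', '/' -> '~1'."""
    return token.replace("~", "~0").replace("/", "~1")


def build_capacity_patch(resource_name: str, quantity: int) -> list:
    """Build a JSON Patch that sets status.capacity[resource_name] on a node."""
    if quantity < 0:
        raise ValueError("quantity must be non-negative")
    return [
        {
            "op": "add",
            "path": "/status/capacity/" + escape_json_pointer(resource_name),
            "value": str(quantity),
        }
    ]


# Assumed resource name and limit, for illustration only.
patch = build_capacity_patch("kata.peerpods.io/vm", 10)
patch_body = json.dumps(patch)
```

The resulting body would be sent as a PATCH request to `/api/v1/nodes/<node-name>/status` with the content type `application/json-patch+json`. Whichever component owns this call (peerpodconfig-ctrl today, possibly cloud-api-adaptor later) needs RBAC permission to patch node status.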

A few questions that come to mind:

Remove peerpodconfig-ctrl and advertise the node resources via cloud-api-adaptor itself?
Should we focus on the new design for managing the VMs - https://github.com/confidential-containers/cloud-api-adaptor/issues/1534 ?

Additionally, there is the issue of deploying all the components via the operator. Some initial work has happened, but it has created issues in the past with the release and test workflow, so this needs to be revisited as well.

I'm opening this issue to kickstart the discussion so that we can address this important gap for the 0.10.0 release.

cc @yoheiueda @mkulke @stevenhorsman @snir911 @huoqifeng

mkulke commented 3 months ago

A few questions that come to mind:

Remove peerpodconfig-ctrl and advertise the node resources via cloud-api-adaptor itself?

Should we focus on the new design for managing the VMs - https://github.com/confidential-containers/cloud-api-adaptor/issues/1534 ?

Can we look at those questions individually, or are they inherently coupled? I think that, with regard to cloud resource management, any robust solution will have to keep state outside of the daemonset pod's memory. It could be a k8s controller-based solution like the one in the linked RFC. An alternative would be to use a persistent database and a control loop in the daemonset (afaiu that's what the GARM server does to manage resources). I'd be leaning towards the controller-based solution.
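To make the "state outside of memory plus a control loop" idea concrete, here is a minimal sketch of a single reconcile step. It assumes the desired set of peer pods is read from some persistent store (a k8s object or a database) and that the cloud provider can list instances together with the pod that owns them. All names are hypothetical, and this is not the design from the linked RFC.

```python
from dataclasses import dataclass, field


@dataclass
class ReconcilePlan:
    to_create: list = field(default_factory=list)  # pods that still need a VM
    to_delete: list = field(default_factory=list)  # VMs whose pod no longer exists


def reconcile(desired_pods: set, cloud_instances: dict) -> ReconcilePlan:
    """Compare the desired state with what actually exists in the cloud.

    desired_pods: names of peer pods recorded in persistent state.
    cloud_instances: maps instance ID -> owning pod name, as listed from the provider.
    """
    owned = set(cloud_instances.values())
    plan = ReconcilePlan()
    plan.to_create = sorted(desired_pods - owned)
    plan.to_delete = sorted(
        instance_id
        for instance_id, pod in cloud_instances.items()
        if pod not in desired_pods
    )
    return plan


# Hypothetical state after a daemonset restart: one pod lost its VM record,
# and one VM is left over from a pod that has since been deleted.
plan = reconcile(
    desired_pods={"pod-a", "pod-b"},
    cloud_instances={"vm-1": "pod-a", "vm-2": "pod-gone"},
)
```

Because the plan is computed only from persisted state and the provider's listing, a restarted daemonset pod can pick up where it left off and clean up orphaned instances, which is not possible when the mapping lives only in process memory.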

bpradipt commented 2 months ago

A few questions that come to mind:

Remove peerpodconfig-ctrl and advertise the node resources via cloud-api-adaptor itself?

Should we focus on the new design for managing the VMs - RFC: Simpler management of VM instances and PeerPod objects #1534 ?

Can we look at those questions individually or are they inherently coupled?

We can look at them individually.

bpradipt commented 1 month ago

Raised a PR to remove peerpodconfig-ctrl: https://github.com/confidential-containers/cloud-api-adaptor/pull/2027

This PR in isolation is not of much use unless the webhook is also deployed as part of the install to ensure there is a max limit to the number of cloud instances that can be created by cloud-api-adaptor.
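To illustrate why the webhook and the advertised node resource only work as a pair, here is a minimal sketch: the mutation step adds a request for one VM extended resource to each peer pod, and the scheduler-side accounting then rejects pods once the node's advertised capacity is used up. The runtime class `kata-remote`, the resource name `kata.peerpods.io/vm`, and both function names are assumptions for illustration; the sketch does not attempt to reproduce everything the actual webhook does.

```python
import copy

# Assumed names, for illustration only.
TARGET_RUNTIME_CLASS = "kata-remote"
VM_EXTENDED_RESOURCE = "kata.peerpods.io/vm"


def mutate_pod(pod: dict) -> dict:
    """Return a copy of the pod that requests one VM extended resource.

    Pods that use a different runtime class are returned unchanged.
    """
    if pod.get("spec", {}).get("runtimeClassName") != TARGET_RUNTIME_CLASS:
        return pod
    mutated = copy.deepcopy(pod)
    containers = mutated["spec"].get("containers", [])
    if containers:
        # One VM per pod, so the request is attached to the first container only.
        resources = containers[0].setdefault("resources", {})
        resources.setdefault("requests", {})[VM_EXTENDED_RESOURCE] = "1"
        resources.setdefault("limits", {})[VM_EXTENDED_RESOURCE] = "1"
    return mutated


def fits(node_capacity: int, pods_on_node: list) -> bool:
    """Check whether the VM requests of the given pods stay within node capacity."""
    requested = sum(
        int(c.get("resources", {}).get("requests", {}).get(VM_EXTENDED_RESOURCE, 0))
        for pod in pods_on_node
        for c in pod["spec"].get("containers", [])
    )
    return requested <= node_capacity
```

Without the mutation, pods never request the resource and the capacity is never consumed; without the advertised capacity, mutated pods have nothing to be scheduled against. Either half missing means there is no effective cap on the number of cloud instances.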

confidential-containers / cloud-api-adaptor

Ensure peer-pods deployment is complete w.r.to webhook and VM resource limits #1976