Open bpradipt opened 3 months ago
A few questions come to mind:
- Remove peerpodconfig-ctrl and advertise the node resources via cloud-api-adaptor itself?
- Should we focus on the new design for managing the VMs - https://github.com/confidential-containers/cloud-api-adaptor/issues/1534 ?
Can we look at those questions individually or are they inherently coupled? I think that, with regard to cloud resource management, any robust solution will have to keep its state outside of the daemonset pod's memory. It could be a k8s controller-based solution like in the linked RFC. An alternative would be a persistent database and a control loop in the daemonset (AFAIU that's what the GARM server does to manage resources). I'd lean towards the controller-based solution.
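To make the control-loop idea concrete, here is a minimal sketch of one reconcile pass, assuming the desired state (which pod owns which instance) is persisted somewhere outside the pod, whether in a CRD or a database. The `Plan` type, the `reconcile` function and the pod/instance names are hypothetical and only illustrate the comparison step; they are not taken from the RFC or from cloud-api-adaptor.

```go
package main

import (
	"fmt"
	"sort"
)

// Plan lists the cloud operations needed to converge on the desired state.
// Hypothetical type, for illustration only.
type Plan struct {
	Create []string // pods that still need a cloud instance
	Delete []string // instance IDs that no pod owns (leaked instances)
}

// reconcile compares the persisted desired state (pod name -> instance ID,
// "" if no instance has been assigned yet) with the instance IDs that the
// cloud provider currently reports.
func reconcile(desired map[string]string, actual []string) Plan {
	running := make(map[string]bool, len(actual))
	for _, id := range actual {
		running[id] = true
	}

	owned := make(map[string]bool, len(desired))
	var plan Plan
	for pod, id := range desired {
		if id == "" || !running[id] {
			// No instance recorded, or the recorded one is gone.
			plan.Create = append(plan.Create, pod)
			continue
		}
		owned[id] = true
	}
	for _, id := range actual {
		if !owned[id] {
			plan.Delete = append(plan.Delete, id)
		}
	}

	// Map iteration order is random; sort for stable output.
	sort.Strings(plan.Create)
	sort.Strings(plan.Delete)
	return plan
}

func main() {
	desired := map[string]string{"pod-a": "i-1", "pod-b": ""}
	actual := []string{"i-1", "i-9"}
	fmt.Printf("%+v\n", reconcile(desired, actual))
}
```

Because the inputs come from persisted state and from the cloud provider rather than from process memory, the same pass can run after a daemonset restart and still find leaked instances.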
Can we look at those questions individually or are they inherently coupled?
We can look at them individually.
Raised a PR to remove peerpodconfig-ctrl https://github.com/confidential-containers/cloud-api-adaptor/pull/2027
This PR is of little use in isolation unless the webhook is also deployed as part of the install, so that there is an upper limit on the number of cloud instances that cloud-api-adaptor can create.
Currently the default operator-based deployment doesn't deploy the complete stack.
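For context on why the webhook matters for the limit, here is a rough sketch of the accounting, assuming the webhook makes each peer pod request one unit of an extended resource so that the scheduler stops placing pods once the node's advertised capacity is used up. The `Pod` type, `addVMRequest`, `fits` and the resource name `kata.peerpods.io/vm` are assumptions for illustration and do not mirror the actual webhook code.

```go
package main

import "fmt"

// vmResource is the extended resource name assumed for this sketch.
const vmResource = "kata.peerpods.io/vm"

// Pod is a hypothetical, trimmed-down stand-in for a pod spec.
type Pod struct {
	Name     string
	Requests map[string]int
}

// addVMRequest mimics what a mutating webhook would do: make the pod
// request exactly one peer-pod VM.
func addVMRequest(p *Pod) {
	if p.Requests == nil {
		p.Requests = make(map[string]int)
	}
	p.Requests[vmResource] = 1
}

// fits mimics the scheduler check: the next pod is admitted only if the
// node's advertised capacity is not exceeded.
func fits(capacity int, running []Pod, next Pod) bool {
	used := 0
	for _, p := range running {
		used += p.Requests[vmResource]
	}
	return used+next.Requests[vmResource] <= capacity
}

func main() {
	const capacity = 2 // advertised limit on cloud instances
	var running []Pod
	for _, name := range []string{"pod-a", "pod-b", "pod-c"} {
		p := Pod{Name: name}
		addVMRequest(&p)
		if fits(capacity, running, p) {
			running = append(running, p)
			fmt.Println(name, "scheduled")
		} else {
			fmt.Println(name, "pending: no peer-pod capacity left")
		}
	}
}
```

Without the mutation step a pod carries no request for the resource, so the capacity check always passes and nothing caps the number of instances.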
The following diagram shows the high-level resource accounting and management for peer-pods.
Ref: old deck on the resource accounting and management for peer-pods - https://docs.google.com/presentation/d/1GWNgQdRC5WxrXz_0XCW3DGIfzQHkO4MaN-8BlRPuTDc/edit#slide=id.g13a9839f269_0_0
The node extended resources are advertised by the peerpodconfig-ctrl. The original intention was to use peerpodconfig-ctrl to deploy all the required components for cloud-api-adaptor, but we are not there yet. This delay in implementation also gives us an opportunity to rethink the right approach.
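As a reference for what "advertising" involves, here is a minimal sketch of the JSON Patch body that Kubernetes accepts on a node's `status` subresource to add an extended resource. The resource name `kata.peerpods.io/vm`, the count and the `capacityPatch` helper are assumptions for illustration; how peerpodconfig-ctrl or cloud-api-adaptor would actually send the patch is not shown.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
	"strings"
)

// patchOp is a single JSON Patch (RFC 6902) operation.
type patchOp struct {
	Op    string `json:"op"`
	Path  string `json:"path"`
	Value string `json:"value"`
}

// capacityPatch builds the body for a PATCH request against
// /api/v1/nodes/<node>/status, sent with the content type
// application/json-patch+json.
func capacityPatch(resource string, count int) ([]byte, error) {
	// JSON Pointer (RFC 6901) escaping in a single pass:
	// "~" becomes "~0" and "/" becomes "~1".
	escaped := strings.NewReplacer("~", "~0", "/", "~1").Replace(resource)
	ops := []patchOp{{
		Op:    "add",
		Path:  "/status/capacity/" + escaped,
		Value: strconv.Itoa(count),
	}}
	return json.Marshal(ops)
}

func main() {
	// Resource name and limit are assumed values for this sketch.
	body, err := capacityPatch("kata.peerpods.io/vm", 10)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body))
}
```

Whichever component ends up owning this call, the advertised count becomes the node capacity that the scheduler enforces.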
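A few questions come to mind:
- Remove peerpodconfig-ctrl and advertise the node resources via cloud-api-adaptor itself?
- Should we focus on the new design for managing the VMs - https://github.com/confidential-containers/cloud-api-adaptor/issues/1534 ?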
Additionally, there is the issue of deploying all the components via the operator. Some initial work has happened, but it has caused problems in the past with the release and test workflow, so this needs to be revisited as well.
I'm starting this issue to kickstart the discussion so that we can address this important issue for the 0.10.0 release.
cc @yoheiueda @mkulke @stevenhorsman @snir911 @huoqifeng