Open mythi opened 6 months ago
peerpod-ctrl is meant to reap orphan VMs. In addition, using node extended resources for capacity management, together with a webhook that mutates the pod spec to add the peerpod node extended resource, will avoid resource misuse. Ref: https://github.com/confidential-containers/cloud-api-adaptor/tree/main/src/webhook
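To illustrate the capacity-management idea, here is a minimal Python sketch of the kind of mutation such a webhook could apply. It is not the actual webhook code: the runtime class name `kata-remote` and the resource name `kata.peerpods.io/vm` are assumptions, and `mutate_pod` is a hypothetical helper. Once every peer pod requests one unit of an extended resource that the node advertises in limited quantity, the scheduler stops admitting peer pods when the VM budget is used up.

```python
import copy

# Assumed names for illustration; check the webhook README for the real values.
PEERPOD_RUNTIME_CLASS = "kata-remote"      # assumption
PEERPOD_RESOURCE = "kata.peerpods.io/vm"   # assumption


def mutate_pod(pod: dict) -> dict:
    """Return a copy of `pod` that requests one peer-pod VM slot.

    Pods that do not use the peer-pod runtime class are returned unchanged.
    """
    spec = pod.get("spec", {})
    if spec.get("runtimeClassName") != PEERPOD_RUNTIME_CLASS:
        return pod
    containers = spec.get("containers") or []
    if not containers:
        return pod

    mutated = copy.deepcopy(pod)
    # One VM backs the whole pod, so only the first container carries the resource.
    resources = mutated["spec"]["containers"][0].setdefault("resources", {})
    for key in ("requests", "limits"):
        resources.setdefault(key, {})[PEERPOD_RESOURCE] = "1"
    return mutated
```

With a node advertising, say, 10 units of that resource, at most 10 peer pods are scheduled on it at once, so a crash-looping pod cannot keep consuming cloud quota beyond that cap.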
This needs to be added to the website instructions. cc @surajssd
Even without peerpod-ctrl, the VMs should be garbage collected (unless the CAA daemonset is in a crash loop too and loses its state). It would be interesting to know how CAA ends up in a state where this happens. If peerpod-ctrl is the only way to get VMs and resources removed reliably, we need to make it part of the normal installation routine, IMO.
There is a similar scenario on AWS as well. If the peervm instance doesn't start within the timeout period, successive instances are launched without removing the previous one. (As an aside, is this timeout configurable somewhere? It is the time it waits for the connection to the agent to be made.)
By default, peerpod-ctrl is deployed (ref: https://github.com/confidential-containers/cloud-api-adaptor/blob/main/src/cloud-api-adaptor/Makefile#L27C1-L27C14), which should terminate the failed instances. @EmmEff, do you see instances in the running state that should otherwise have been terminated?
As for timeouts, there are the ones which affects
The scenario that I am seeing regularly is where it times out when launching a pod and a replacement podvm instance is launched without terminating the first instance.
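The behavior one would expect instead can be sketched as follows, in Python for brevity. This is only an illustration of the terminate-before-retry pattern, not CAA's implementation: `create_instance`, `wait_for_agent`, and `terminate_instance` are hypothetical callables standing in for the cloud provider calls and the agent connection check.

```python
from typing import Callable


def launch_with_cleanup(
    create_instance: Callable[[], str],
    wait_for_agent: Callable[[str, float], bool],
    terminate_instance: Callable[[str], None],
    timeout_s: float,
    max_attempts: int = 3,
) -> str:
    """Start a pod VM, terminating any instance whose agent never answers."""
    for _ in range(max_attempts):
        instance_id = create_instance()
        try:
            ready = wait_for_agent(instance_id, timeout_s)
        except Exception:
            # Do not leak the instance if the readiness check itself fails.
            terminate_instance(instance_id)
            raise
        if ready:
            return instance_id
        # Timed out: remove this instance before launching a replacement.
        terminate_instance(instance_id)
    raise TimeoutError(f"no pod VM became ready after {max_attempts} attempts")
```

The point of the design is that every instance created by the loop is either returned to the caller or terminated, so a timeout can never leave an extra VM running.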
~I just submitted the PR https://github.com/confidential-containers/cloud-api-adaptor/pull/2007 which should mitigate the proxy timeout issue. Still working on the clean up/retry workflow to ensure peerpods VMs aren't being orphaned.~
The issue is the `remote_hypervisor_timeout`, though there was previously no timeout associated with the proxy dialer. I will rework the PR.
I had set up CAA on Azure following the website instructions. The nginx deployment worked fine for some time, but something makes the pod crash/restart, and eventually it gets stuck in `ContainerCreating`.
Taking a closer look, I can see `Current Limit: 350, Current Usage: 350, Additional Required`: the Azure VM resource quota limit is exceeded and lots of peer-pods VMs are running.
Checking the peer-pods daemonset logs, I can see