confidential-containers / cloud-api-adaptor

Ability to create Kata pods using cloud provider APIs aka the peer-pods approach
Apache License 2.0
44 stars 71 forks

orphan VM resources #1847

Open mythi opened 1 month ago

mythi commented 1 month ago

I had set up CAA on Azure following the website instructions. The nginx deployment worked fine for some time, but something causes the pod to crash/restart, and eventually it gets stuck in ContainerCreating.

Taking a closer look, I can see "Current Limit: 350, Current Usage: 350, Additional Required" (the Azure VM resource quota limit is exceeded) and lots of peer-pods VMs still running.

Checking the peer-pods daemonset logs, I can see

2024/05/28 00:50:22 [adaptor/proxy] shutting down socket forwarder
2024/05/28 00:50:22 [adaptor/cloud/azure] finding VM name using regexp:[]
2024/05/28 00:50:22 [adaptor/cloud] Error deleting an instance : VM name not found
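
One possible reading of these log lines, sketched under assumptions: if the adaptor derives the VM name by matching a regular expression against the stored instance ID, then an empty or lost instance ID gives an empty match (the `[]` in the log), the delete call is never issued, and the VM stays behind. The pattern and the function name `vm_name_from_instance_id` below are hypothetical and are not taken from the CAA source; only the Azure resource ID layout is standard.

```python
import re

# Standard Azure resource ID layout for a VM. The pattern itself is an
# assumption made for illustration, not the one CAA uses.
VM_ID_PATTERN = re.compile(
    r"^/subscriptions/[^/]+/resourceGroups/[^/]+"
    r"/providers/Microsoft\.Compute/virtualMachines/(?P<name>[^/]+)$"
)


def vm_name_from_instance_id(instance_id: str) -> str:
    """Extract the VM name from an Azure resource ID (hypothetical helper)."""
    match = VM_ID_PATTERN.match(instance_id)
    if match is None:
        # Mirrors the error text in the log above: without a name there is
        # nothing to delete, so the VM would be left running.
        raise LookupError("VM name not found")
    return match.group("name")
```

With a well-formed ID the helper returns the last path segment; with an empty string it raises, which would match the behaviour seen in the log if the instance ID had been lost.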

bpradipt commented 1 month ago

peerpod-ctrl is meant to reap orphan VMs. Also, using node extended resources for capacity management, together with a webhook that mutates the pod spec to add the peerpod node extended resource, will avoid resource misuse. Ref: https://github.com/confidential-containers/cloud-api-adaptor/tree/main/src/webhook
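
A minimal sketch of what such a pod-spec mutation could look like, written in Python for brevity rather than as the webhook's actual code. The runtime class name `kata-remote`, the resource name `kata.peerpods.io/vm`, and the function `mutate_pod` are assumptions made for illustration; the webhook README linked above is the place to check the real defaults.

```python
import copy

# Assumed names, not confirmed in this thread.
RUNTIME_CLASS = "kata-remote"
VM_RESOURCE = "kata.peerpods.io/vm"


def mutate_pod(pod: dict) -> dict:
    """Return a pod manifest that requests one peer-pod VM (hypothetical)."""
    spec = pod.get("spec", {})
    if spec.get("runtimeClassName") != RUNTIME_CLASS:
        return pod  # not a peer pod, leave it untouched

    mutated = copy.deepcopy(pod)
    containers = mutated["spec"].get("containers", [])
    if not containers:
        return mutated

    # One VM backs the whole pod, so only the first container carries the
    # extended resource request.
    resources = containers[0].setdefault("resources", {})
    resources.setdefault("requests", {})[VM_RESOURCE] = "1"
    resources.setdefault("limits", {})[VM_RESOURCE] = "1"
    return mutated
```

If each node advertises a fixed count of this extended resource, the scheduler stops placing peer pods once that count is in use, so pods stay Pending instead of creating VMs beyond the intended capacity.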

This needs to be added to the website instructions. cc @surajssd

mkulke commented 2 weeks ago

Even without peerpod-ctrl, the VMs should be garbage collected (unless the CAA daemonset is in a crash loop too and loses its state). It would be interesting to know how CAA ends up in a state where this happens. If peerpod-ctrl is the only way to get VMs and their resources removed reliably, we need to make it part of the normal installation routine, IMO.
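
To make the idea of reaping concrete, here is a rough sketch of orphan detection as a set difference between the VMs that exist in the cloud and the VMs still backing a live pod. This is not how peerpod-ctrl or CAA is implemented; the `CloudVM` type, the `created_by_caa` marker, and `find_orphan_vms` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CloudVM:
    name: str
    created_by_caa: bool  # hypothetical marker, e.g. a tag set at creation


def find_orphan_vms(cloud_vms, vms_in_use):
    """Return names of CAA-created VMs that no live pod refers to."""
    in_use = set(vms_in_use)
    return sorted(
        vm.name
        for vm in cloud_vms
        if vm.created_by_caa and vm.name not in in_use
    )
```

The point of the marker is that a reaper relying only on in-memory state loses track of VMs after a restart, whereas one that can list VMs by a durable marker in the cloud can still find them.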