confidential-containers / cloud-api-adaptor

Ability to create Kata pods using cloud provider APIs aka the peer-pods approach
Apache License 2.0

orphan VM resources #1847

Open mythi opened 6 months ago

mythi commented 6 months ago

I set up CAA on Azure following the website instructions. The nginx deployment worked fine for some time, but something makes the pod crash and restart, and eventually it gets stuck in ContainerCreating.

Taking a closer look, I can see that the Azure VM resource quota limit has been exceeded (Current Limit: 350, Current Usage: 350, Additional Required) and that lots of peer-pods VMs are running.

Checking the peer-pods daemonset logs, I can see:

2024/05/28 00:50:22 [adaptor/proxy] shutting down socket forwarder
2024/05/28 00:50:22 [adaptor/cloud/azure] finding VM name using regexp:[]
2024/05/28 00:50:22 [adaptor/cloud] Error deleting an instance : VM name not found

bpradipt commented 6 months ago

peerpod-ctrl is meant to reap orphan VMs. Also, using node extended resources for capacity management, together with a webhook that mutates the pod spec to add the peerpod node extended resource, will avoid resource misuse. Ref: https://github.com/confidential-containers/cloud-api-adaptor/tree/main/src/webhook

This needs to be added to the website instructions. cc @surajssd
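
To make the webhook idea above concrete, here is a minimal sketch of the mutation step. It is not the code from src/webhook: the `Pod` and `Container` types are simplified stand-ins for the Kubernetes API types, and the runtime class name `kata-remote` and resource name `kata.peerpods.io/vm` are assumptions used for illustration.

```go
package main

import "fmt"

// Container and Pod are simplified stand-ins for the Kubernetes pod types.
// A real webhook would operate on k8s.io/api/core/v1 objects instead.
type Container struct {
	Name     string
	Requests map[string]string
	Limits   map[string]string
}

type Pod struct {
	RuntimeClassName string
	Containers       []Container
}

// Assumed names, for illustration only.
const (
	peerPodRuntimeClass = "kata-remote"
	peerPodVMResource   = "kata.peerpods.io/vm"
)

// mutatePod adds a request and limit of one peer-pod VM to the first
// container of pods that use the peer-pod runtime class. The scheduler then
// counts every peer pod against the VM capacity advertised by the node.
// It reports whether the pod was changed.
func mutatePod(p *Pod) bool {
	if p.RuntimeClassName != peerPodRuntimeClass || len(p.Containers) == 0 {
		return false
	}
	c := &p.Containers[0]
	if c.Requests == nil {
		c.Requests = map[string]string{}
	}
	if c.Limits == nil {
		c.Limits = map[string]string{}
	}
	c.Requests[peerPodVMResource] = "1"
	c.Limits[peerPodVMResource] = "1"
	return true
}

func main() {
	pod := &Pod{
		RuntimeClassName: peerPodRuntimeClass,
		Containers:       []Container{{Name: "nginx"}},
	}
	fmt.Println(mutatePod(pod), pod.Containers[0].Limits)
}
```

With the node advertising a fixed number of such VM resources, pods beyond that number stay Pending instead of triggering further VM creation.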

mkulke commented 5 months ago

Even without peerpod-ctrl, the VMs should be garbage collected (unless the CAA daemonset is in a crash-loop too and loses its state). It would be interesting to know how CAA ends up in a state where this happens. If peerpod-ctrl is the only way to get VMs and resources removed reliably, we need to make it part of the normal installation routine, IMO.
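
The point about lost state can be illustrated with a small reconciliation sketch: if orphans are computed from what the cloud reports rather than from in-memory state, a restarted process can still find them. This is a hypothetical illustration, not how peerpod-ctrl or CAA is implemented; `findOrphans` and its inputs are made-up names.

```go
package main

import (
	"fmt"
	"sort"
)

// findOrphans returns the IDs of cloud instances that no live pod refers to.
// cloudVMs is what the cloud provider reports (for example, all instances
// carrying a peer-pod tag); podToVM maps live pod names to their instance ID.
func findOrphans(cloudVMs []string, podToVM map[string]string) []string {
	inUse := make(map[string]bool, len(podToVM))
	for _, vm := range podToVM {
		inUse[vm] = true
	}
	var orphans []string
	for _, vm := range cloudVMs {
		if !inUse[vm] {
			orphans = append(orphans, vm)
		}
	}
	sort.Strings(orphans) // stable order for logging and testing
	return orphans
}

func main() {
	cloudVMs := []string{"podvm-a", "podvm-b", "podvm-c"}
	podToVM := map[string]string{"nginx-1": "podvm-b"}
	fmt.Println(findOrphans(cloudVMs, podToVM))
}
```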

EmmEff commented 3 months ago

There is a similar scenario on AWS as well. If the peer VM instance doesn't start within the timeout period, successive instances are launched without removing the previous one. (As an aside, is this timeout configurable somewhere? I mean the time it waits for the connection to the agent.)

bpradipt commented 3 months ago

By default peerpod-ctrl is deployed (ref: https://github.com/confidential-containers/cloud-api-adaptor/blob/main/src/cloud-api-adaptor/Makefile#L27C1-L27C14), which should terminate the failed instances. @EmmEff, do you see instances in the running state that should have been terminated?

As for timeouts, the ones that affect this are:

  1. PROXY_TIMEOUT in peer-pods-cm: Default is 5m.
  2. remote_hypervisor_timeout in configuration-remote.toml: Default is 3min.

EmmEff commented 3 months ago

The scenario that I am seeing regularly is where it times out when launching a pod and a replacement podvm instance is launched without terminating the first instance.

EmmEff commented 3 months ago

~I just submitted the PR https://github.com/confidential-containers/cloud-api-adaptor/pull/2007 which should mitigate the proxy timeout issue. Still working on the clean up/retry workflow to ensure peerpods VMs aren't being orphaned.~

The issue is the remote_hypervisor_timeout, though there was previously no timeout associated with the proxy dialer. I will rework the PR.