Optimising Pod VM size and startup time

bpradipt commented 2 years ago

To make peer-pods approach a viable solution for serverless computing, startup times matter. Even otherwise the startup time should be comparable to using Kata with local hypervisor (qemu/clh).

This is a tracker issue to document findings and identify scope of improvement.

[ ] Identify bare minimal requirement for a pod VM image (kernel, minimal user-space)
[ ] Identify improvement areas to speed up container startup time

tumberino commented 1 year ago

Do we have a baseline for what "startup time ... using Kata with local hypervisor" is?

bpradipt commented 1 year ago

@jtumber-ibm I'll get some baseline pod startup time data and update this issue.

abhbaner commented 1 year ago

@bpradipt @jtumber-ibm Tested on AWS Openshift cluster

Following startup time was observed For peer pod t3.small instance - 1 minute 41 second

Name:     hello-openshift
Namespace:  default
Priority:   0
Node:     ip-10-0-187-226.us-east-2.compute.internal/10.0.187.226
Start Time:  Wed, 08 Feb 2023 18:11:16 +0530
Labels:    app=hello-openshift
Annotations: [k8s.v1.cni.cncf.io/network-status](http://k8s.v1.cni.cncf.io/network-status):
        [{
          "name": "openshift-sdn",
          "interface": "eth0",
          "ips": [
            "10.128.2.22"
          ],
          "default": true,
          "dns": {}
        }]
       [k8s.v1.cni.cncf.io/networks-status](http://k8s.v1.cni.cncf.io/networks-status):
        [{
          "name": "openshift-sdn",
          "interface": "eth0",
          "ips": [
            "10.128.2.22"
          ],
          "default": true,
          "dns": {}
        }]
       [kata.peerpods.io/instance_type](http://kata.peerpods.io/instance_type): t2.small
Status:    Running
IP:      10.128.2.22
IPs:
 IP: 10.128.2.22
Containers:
 hello-openshift:
  Container ID:  cri-o://07062a709864f4e32723adfe1fd9c9d51464d76e9ca938dc286373c1d5f5fb85
  Image:     [quay.io/openshift/origin-hello-openshift](http://quay.io/openshift/origin-hello-openshift)
  Image ID:    [quay.io/openshift/origin-hello-openshift@sha256:369c8324bea625fdb94c1e8cc992ae9d2b0fe780acace7b8c8080349792f3b96](http://quay.io/openshift/origin-hello-openshift@sha256:369c8324bea625fdb94c1e8cc992ae9d2b0fe780acace7b8c8080349792f3b96)
  Port:      8888/TCP
  Host Port:   0/TCP
  State:     Running
   Started:   Wed, 08 Feb 2023 18:12:47 +0530
  Ready:     True
  Restart Count: 0
  Limits:
   [kata.peerpods.io/vm](http://kata.peerpods.io/vm): 1
  Requests:
   [kata.peerpods.io/vm](http://kata.peerpods.io/vm): 1
  Environment:      <none>
  Mounts:
   /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-ntdth (ro)
Conditions:
 Type       Status
 Initialized    True 
 Ready       True 
 ContainersReady  True 
 PodScheduled   True 
Volumes:
 kube-api-access-ntdth:
  Type:          Projected (a volume that contains injected data from multiple sources)
  TokenExpirationSeconds: 3607
  ConfigMapName:      kube-root-ca.crt
  ConfigMapOptional:    <nil>
  DownwardAPI:       true
  ConfigMapName:      openshift-service-ca.crt
  ConfigMapOptional:    <nil>
QoS Class:          BestEffort
Node-Selectors:       <none>
Tolerations:         [node.kubernetes.io/not-ready:NoExecute](http://node.kubernetes.io/not-ready:NoExecute) op=Exists for 300s
               [node.kubernetes.io/unreachable:NoExecute](http://node.kubernetes.io/unreachable:NoExecute) op=Exists for 300s
Events:
 Type  Reason     Age  From   Message
 ----  ------     ---- ----   -------
 Normal AddedInterface 100s multus  Add eth0 [10.128.2.22/23] from openshift-sdn
 Normal Pulling     14s  kubelet Pulling image "[quay.io/openshift/origin-hello-openshift](http://quay.io/openshift/origin-hello-openshift)"
 Normal Pulled     14s  kubelet Successfully pulled image "[quay.io/openshift/origin-hello-openshift](http://quay.io/openshift/origin-hello-openshift)" in 609.587941ms
 Normal Created     11s  kubelet Created container hello-openshift
 Normal Started     11s  kubelet Started container hello-openshift

For peer pod c5.xlarge instance - 2 minute 45 second

Name:     hello-openshift
Namespace:  default
Priority:   0
Node:     ip-10-0-187-226.us-east-2.compute.internal/10.0.187.226
Start Time:  Wed, 08 Feb 2023 18:00:31 +0530
Labels:    app=hello-openshift
Annotations: [k8s.v1.cni.cncf.io/network-status](http://k8s.v1.cni.cncf.io/network-status):
        [{
          "name": "openshift-sdn",
          "interface": "eth0",
          "ips": [
            "10.128.2.21"
          ],
          "default": true,
          "dns": {}
        }]
       [k8s.v1.cni.cncf.io/networks-status](http://k8s.v1.cni.cncf.io/networks-status):
        [{
          "name": "openshift-sdn",
          "interface": "eth0",
          "ips": [
            "10.128.2.21"
          ],
          "default": true,
          "dns": {}
        }]
       [kata.peerpods.io/instance_type](http://kata.peerpods.io/instance_type): t2.small
Status:    Running
IP:      10.128.2.21
IPs:
 IP: 10.128.2.21
Containers:
 hello-openshift:
  Container ID:  cri-o://cbd66cd5255be1d61a0fa4440b8480ffe8a9b89b71dc288a2fef561f81980b90
  Image:     [quay.io/openshift/origin-hello-openshift](http://quay.io/openshift/origin-hello-openshift)
  Image ID:    [quay.io/openshift/origin-hello-openshift@sha256:369c8324bea625fdb94c1e8cc992ae9d2b0fe780acace7b8c8080349792f3b96](http://quay.io/openshift/origin-hello-openshift@sha256:369c8324bea625fdb94c1e8cc992ae9d2b0fe780acace7b8c8080349792f3b96)
  Port:      8888/TCP
  Host Port:   0/TCP
  State:     Running
   Started:   Wed, 08 Feb 2023 18:01:56 +0530
  Ready:     True
  Restart Count: 0
  Limits:
   [kata.peerpods.io/vm](http://kata.peerpods.io/vm): 1
  Requests:
   [kata.peerpods.io/vm](http://kata.peerpods.io/vm): 1
  Environment:      <none>
  Mounts:
   /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-dwdsx (ro)
Conditions:
 Type       Status
 Initialized    True 
 Ready       True 
 ContainersReady  True 
 PodScheduled   True 
Volumes:
 kube-api-access-dwdsx:
  Type:          Projected (a volume that contains injected data from multiple sources)
  TokenExpirationSeconds: 3607
  ConfigMapName:      kube-root-ca.crt
  ConfigMapOptional:    <nil>
  DownwardAPI:       true
  ConfigMapName:      openshift-service-ca.crt
  ConfigMapOptional:    <nil>
QoS Class:          BestEffort
Node-Selectors:       <none>
Tolerations:         [node.kubernetes.io/not-ready:NoExecute](http://node.kubernetes.io/not-ready:NoExecute) op=Exists for 300s
               [node.kubernetes.io/unreachable:NoExecute](http://node.kubernetes.io/unreachable:NoExecute) op=Exists for 300s
Events:
 Type  Reason     Age  From   Message
 ----  ------     ----  ----   -------
 Normal AddedInterface 2m10s multus  Add eth0 [10.128.2.21/23] from openshift-sdn
 Normal Pulling     50s  kubelet Pulling image "[quay.io/openshift/origin-hello-openshift](http://quay.io/openshift/origin-hello-openshift)"
 Normal Pulled     49s  kubelet Successfully pulled image "[quay.io/openshift/origin-hello-openshift](http://quay.io/openshift/origin-hello-openshift)" in 549.836917ms
 Normal Created     47s  kubelet Created container hello-openshift
 Normal Started     47s  kubelet Started container hello-openshift

For runc pod (runtime default) - 7 seconds

Name:     nginx
Namespace:  default
Priority:   0
Node:     ip-10-0-219-218.us-east-2.compute.internal/10.0.219.218
Start Time:  Wed, 08 Feb 2023 17:47:28 +0530
Labels:    run=nginx
Annotations: [k8s.v1.cni.cncf.io/network-status](http://k8s.v1.cni.cncf.io/network-status):
        [{
          "name": "openshift-sdn",
          "interface": "eth0",
          "ips": [
            "10.131.0.18"
          ],
          "default": true,
          "dns": {}
        }]
       [k8s.v1.cni.cncf.io/networks-status](http://k8s.v1.cni.cncf.io/networks-status):
        [{
          "name": "openshift-sdn",
          "interface": "eth0",
          "ips": [
            "10.131.0.18"
          ],
          "default": true,
          "dns": {}
        }]
Status:    Running
IP:      10.131.0.18
IPs:
 IP: 10.131.0.18
Containers:
 nginx:
  Container ID: cri-o://17418aed26b8c36b4aca40f3724c620ee4022440109593ef0dcf2449939f9994
  Image:     [quay.io/bpradipt/http-echo:nonroot](http://quay.io/bpradipt/http-echo:nonroot)
  Image ID:   [quay.io/bpradipt/http-echo@sha256:ec3c0642ff30050ea3f462c8f662ffa53036f6a3e089085714c7ff13501002dc](http://quay.io/bpradipt/http-echo@sha256:ec3c0642ff30050ea3f462c8f662ffa53036f6a3e089085714c7ff13501002dc)
  Port:     <none>
  Host Port:   <none>
  Args:
   -text
   Hello from Kata container
   -listen
   :8080
  State:     Running
   Started:   Wed, 08 Feb 2023 17:47:33 +0530
  Ready:     True
  Restart Count: 0
  Environment:  <none>
  Mounts:
   /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-77wkg (ro)
Conditions:
 Type       Status
 Initialized    True 
 Ready       True 
 ContainersReady  True 
 PodScheduled   True 
Volumes:
 kube-api-access-77wkg:
  Type:          Projected (a volume that contains injected data from multiple sources)
  TokenExpirationSeconds: 3607
  ConfigMapName:      kube-root-ca.crt
  ConfigMapOptional:    <nil>
  DownwardAPI:       true
  ConfigMapName:      openshift-service-ca.crt
  ConfigMapOptional:    <nil>
QoS Class:          BestEffort
Node-Selectors:       <none>
Tolerations:         [node.kubernetes.io/not-ready:NoExecute](http://node.kubernetes.io/not-ready:NoExecute) op=Exists for 300s
               [node.kubernetes.io/unreachable:NoExecute](http://node.kubernetes.io/unreachable:NoExecute) op=Exists for 300s
Events:
 Type  Reason     Age  From        Message
 ----  ------     ---- ----        -------
 Normal Scheduled    12s  default-scheduler Successfully assigned default/nginx to ip-10-0-219-218.us-east-2.compute.internal
 Normal AddedInterface 10s  multus       Add eth0 [10.131.0.18/23] from openshift-sdn
 Normal Pulling     9s  kubelet      Pulling image "[quay.io/bpradipt/http-echo:nonroot](http://quay.io/bpradipt/http-echo:nonroot)"
 Normal Pulled     7s  kubelet      Successfully pulled image "[quay.io/bpradipt/http-echo:nonroot](http://quay.io/bpradipt/http-echo:nonroot)" in 2.274906876s
 Normal Created     7s  kubelet      Created container nginx
 Normal Started     7s  kubelet      Started container nginx

bpradipt commented 1 year ago

Thanks a lot @abhbaner. Can you please confirm the pod VM image distro - CentOS, RHEL or Ubuntu ?

abhbaner commented 1 year ago

The podVM image distro is CentOs.

bpradipt commented 1 year ago

Putting the timing data for easier understanding

runc

 Normal Scheduled    12s  default-scheduler Successfully assigned default/nginx to ip-10-0-219-218.us-east-2.compute.internal
 Normal AddedInterface 10s  multus       Add eth0 [10.131.0.18/23] from openshift-sdn
 Normal Pulling     9s  kubelet      Pulling image "[quay.io/bpradipt/http-echo:nonroot](http://quay.io/bpradipt/http-echo:nonroot)"
 Normal Pulled     7s  kubelet      Successfully pulled image "[quay.io/bpradipt/http-echo:nonroot](http://quay.io/bpradipt/http-echo:nonroot)" in 2.274906876s
 Normal Created     7s  kubelet      Created container nginx
 Normal Started     7s  kubelet      Started container nginx

peer-pod (c5.xlarge)

Normal AddedInterface 2m10s multus  Add eth0 [10.128.2.21/23] from openshift-sdn
 Normal Pulling     50s  kubelet Pulling image "[quay.io/openshift/origin-hello-openshift](http://quay.io/openshift/origin-hello-openshift)"
 Normal Pulled     49s  kubelet Successfully pulled image "[quay.io/openshift/origin-hello-openshift](http://quay.io/openshift/origin-hello-openshift)" in 549.836917ms
 Normal Created     47s  kubelet Created container hello-openshift
 Normal Started     47s  kubelet Started container hello-openshift

peer-pod(t3.small)

Normal AddedInterface 100s multus  Add eth0 [10.128.2.22/23] from openshift-sdn
 Normal Pulling     14s  kubelet Pulling image "[quay.io/openshift/origin-hello-openshift](http://quay.io/openshift/origin-hello-openshift)"
 Normal Pulled     14s  kubelet Successfully pulled image "[quay.io/openshift/origin-hello-openshift](http://quay.io/openshift/origin-hello-openshift)" in 609.587941ms
 Normal Created     11s  kubelet Created container hello-openshift
 Normal Started     11s  kubelet Started container hello-openshift

bpradipt commented 1 year ago

A rough sequence diagram of the Pod VM startup process

bpradipt commented 1 year ago

@jtumber-ibm @stevenhorsman @surajssd @yoheiueda @huoqifeng

yoheiueda commented 1 year ago

I did the same experiments on my cluster in IBM Cloud VPC.

runc

  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  4s    default-scheduler  Successfully assigned default/nginx to k8s-worker
  Normal  Pulling    4s    kubelet            Pulling image "nginx"
  Normal  Pulled     3s    kubelet            Successfully pulled image "nginx" in 1.173574561s
  Normal  Created    3s    kubelet            Created container nginx
  Normal  Started    3s    kubelet            Started container nginx

peer pod

  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  44s   default-scheduler  Successfully assigned default/nginx to k8s-worker
  Normal  Pulling    7s    kubelet            Pulling image "nginx"
  Normal  Pulled     3s    kubelet            Successfully pulled image "nginx" in 4.218585385s
  Normal  Created    3s    kubelet            Created container nginx
  Normal  Started    1s    kubelet            Started container nginx

I calculated an elapsed time in each state transition as follows. I think the container image is already cached in the case of runc.

State Transition	runc	peer-pod
Scheduled→Pulling	0s	37s
Pulling→Pulled	1s	4s
Pulled→Created	0s	0s
Created→Started	0s	2s

mkulke commented 1 year ago

I suspect getting close to a baseline with local kata on kvm/qemu will be tough. I don't have concrete measurements at the moment, but time-to-pod on local kata was a matter of a few seconds (~3-10s), even for an SEV-SNP enabled guest, which has a bit of an init overhead.

Below is a measurement from a CAA test, it's not broken down by the individual CAA steps, but you can kind of extrapolate from the k8s logs:

The biggest chunk of time is the c(vm) provisioning. I think this will apply to most CSPs.
According to our tests, the size of the os-disk (we compared 5GB and 30GB) doesn't make a dent in the overall startup time. However, that depends on the CSP and a lot of other factors like provisioning method, storage class, etc...
The time it takes from a successful VM creation to pod startup can be broken down into boot (time it takes for agent-protocol-forwarder to be responsive and kata-agent's pod provisioning (pulling images, etc).
boot time should be deterministic and in a stripped down os-disks can be limited to a few seconds (~4s in our tests). Running systemd-analyze plot on a peer-pod vm can yield possible opportunities for improvement.
pod-provisoning depends on the workload (number of containers, size, image registry, etc.. For a simple nginx pod with sidecar this amounted to 2s per container, so ~4s). That number is most likely comparable to local kata, it's open how future image sharing/retrieval schemes will impact this number, since the "host" is remote in this case. Improvements could still be conceivable for peer pods (e.g. remote, block based lazy pulling)

kubectl get events -o custom-columns=Timestamp:.firstTimestamp,Message:.message --field-selector involvedObject.name=nginx-7df44f78b6-tgxnx                                                    
Timestamp              Message                                                     
2023-01-27T08:54:06Z   Successfully assigned default/nginx-7df44f78b6-tgxnx to k8s-ci                                                                             
2023-01-27T08:54:51Z   Container image "nginx:1.14.2" already present on machine   
2023-01-27T08:54:51Z   Created container nginx                                     
2023-01-27T08:54:53Z   Started container nginx                                     
2023-01-27T08:54:53Z   Container image "*" already present on machine                                                      
2023-01-27T08:54:53Z   Created container sidecar                            
2023-01-27T08:54:55Z   Started container sidecar

bpradipt commented 1 year ago

Some additional data purely from the pod VM image standpoint. I used both CentOS Stream 8 and Ubuntu 20.04 cloud images for relative comparison.

CentOS Stream8 Cloud image

$ oc get pods
NAME                         READY   STATUS    RESTARTS   AGE
echo-kata-67ccb48d7b-wpwnq   1/1     Running   0          69s

[root@podvm-echo-kata-67ccb48d7b-wpwnq-6e216995 peerpod]# systemd-analyze time
Startup finished in 1.266s (kernel) + 3.378s (initrd) + 47.053s (userspace) = 51.698s
multi-user.target reached after 29.136s in userspace

[root@podvm-echo-kata-67ccb48d7b-wpwnq-6e216995 peerpod]# systemd-analyze blame
         24.677s cloud-init.service
         18.057s kdump.service
          3.278s NetworkManager-wait-online.service
          2.283s cloud-init-local.service
          1.062s initrd-switch-root.service
           773ms dracut-initqueue.service
[snip]

Disabling kdump.service from the pod VM image gave the following results:

[kcli@virtlab714 peerpods-files]$ oc get pods
NAME                         READY   STATUS    RESTARTS   AGE
echo-kata-67ccb48d7b-gt9nv   1/1     Running   0          59s

[peerpod@podvm-echo-kata-67ccb48d7b-gt9nv-4b58c292 peerpod]# systemd-analyze time
Startup finished in 1.237s (kernel) + 3.378s (initrd) + 30.510s (userspace) = 35.126s
multi-user.target reached after 29.231s in userspace
[root@podvm-echo-kata-67ccb48d7b-gt9nv-4b58c292 peerpod]# systemd-analyze blame
         24.898s cloud-init.service
          3.361s NetworkManager-wait-online.service
          2.264s cloud-init-local.service
          1.001s initrd-switch-root.service
           776ms dracut-initqueue.service

Ubuntu 20.04 cloud image

$ oc get pods
NAME                         READY   STATUS    RESTARTS   AGE
echo-kata-67ccb48d7b-r4bcb   1/1     Running   0          70s

root@podvm-echo-kata-67ccb48d7b-r4bcb-15a38e4a:/home/peerpod# systemd-analyze time
Startup finished in 4.150s (kernel) + 14.571s (userspace) = 18.721s
graphical.target reached after 13.728s in userspace
root@podvm-echo-kata-67ccb48d7b-r4bcb-15a38e4a:/home/peerpod# systemd-analyze blame
41.390s apt-daily.service
 3.082s snapd.service
 3.029s systemd-networkd-wait-online.service
 2.803s snap.lxd.activate.service
 2.624s cloud-init-local.service
 1.549s cloud-init.service
 1.414s cloud-config.service
 1.383s apt-daily-upgrade.service
 1.234s dev-sda1.device
 1.222s ua-timer.service

Disabling apt-daily.service gave the following results:

$ oc get pods
NAME                         READY   STATUS    RESTARTS   AGE
echo-kata-67ccb48d7b-q4q8r   1/1     Running   0          39s

root@podvm-echo-kata-67ccb48d7b-q4q8r-faf29ce7:/home/peerpod# systemd-analyze time
Startup finished in 4.162s (kernel) + 14.206s (userspace) = 18.369s
graphical.target reached after 13.457s in userspace
root@podvm-echo-kata-67ccb48d7b-q4q8r-faf29ce7:/home/peerpod# systemd-analyze blame
3.025s systemd-networkd-wait-online.service
2.683s snap.lxd.activate.service
2.363s cloud-init-local.service
2.269s snapd.service
1.848s cloud-init.service

bpradipt commented 1 year ago

@yoheiueda @mkulke

confidential-containers / cloud-api-adaptor

Optimising Pod VM size and startup time #343

runc

peer pod