confidential-containers / cloud-api-adaptor

Ability to create Kata pods using cloud provider APIs aka the peer-pods approach
Apache License 2.0

ibmcloud: Peer pods fail during CreateContainer #1882

Open stevenhorsman opened 4 days ago

stevenhorsman commented 4 days ago

When creating an ibmcloud setup with a self-managed cluster that has both s390x and amd64 architectures, the tests fail.

The pod describe looks like:

Events:
  Type     Reason   Age                     From     Message
  ----     ------   ----                    ----     -------
  Normal   Pulling  49m (x19 over 134m)     kubelet  Pulling image "quay.io/prometheus/busybox:latest"
  Warning  Failed   23m (x23 over 133m)     kubelet  Error: failed to create containerd task: failed to create shim task: context deadline exceeded
  Warning  BackOff  4m25s (x482 over 132m)  kubelet  Back-off restarting failed container busybox in pod simple-test_coco-pp-e2e-test-94b410d8(171202c1-07d7-4f95-b541-b9dadc10dbbe)

and the CAA log shows an error during CreateContainer (which includes the pull image step):

2024/06/24 16:26:45 [adaptor/proxy]     storages:
2024/06/24 16:26:45 [adaptor/proxy]         mount_point:/run/kata-containers/702a9450e5570a71633834ec4c5f6f407100921862a9015d96038de8518df2f2/rootfs source:quay.io/prometheus/busybox:latest fstype:overlay driver:image_guest_pull
2024/06/24 16:26:49 [adaptor/proxy] CreateContainer fails: context deadline exceeded
time="2024-06-24T16:26:49Z" level=error msg="ttrpc: received message on inactive stream" stream=3603

I need to dig into the kata-agent logs and see if there is any more information about this.

stevenhorsman commented 3 days ago

Looking in the kata-agent log, it has the info message:

{"msg":"pull image \"docker.io/library/nginx@sha256:9700d098d545f9d2ee0660dfb155fe64f4447720a0a763a93f2cf08997227279\", bundle path \"/run/kata-containers/3a9d18335128ca98c7d1f9d86aaad6922c063eeff135ab977ea164fa5ff60dcf/images\"","level":"INFO","ts":"2024-06-26T13:08:03.03219308Z","name":"kata-agent","subsystem":"image","source":"agent","pid":"810","version":"0.1.0"}

from https://github.com/kata-containers/kata-containers/blob/893fd2b59cc31518f8a127c9611e3e8265d9bdfd/src/agent/src/image.rs#L160

But we never get anything back from image-rs's pull image, and then after 60s the container fails with context deadline exceeded. Unfortunately image-rs doesn't seem to have any logging, so I'm not sure how to get more information on what is going wrong 😞
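
One thing that might help is raising the kata-agent log level in the podvm image so that anything the agent does emit around the pull gets through. A minimal sketch of an agent config, assuming the podvm image reads /etc/agent-config.toml (the path and the debug-console option are assumptions and depend on how the podvm image was built):

# /etc/agent-config.toml (path assumed; depends on the podvm image build)
# Raise the kata-agent log level from the default.
log_level = "debug"

# Optionally enable the debug console so the guest can be inspected
# interactively (only works if the image was built with debug support).
debug_console = true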

bpradipt commented 3 days ago

> Looking in the kata-agent log, it has the info message:
>
> {"msg":"pull image \"docker.io/library/nginx@sha256:9700d098d545f9d2ee0660dfb155fe64f4447720a0a763a93f2cf08997227279\", bundle path \"/run/kata-containers/3a9d18335128ca98c7d1f9d86aaad6922c063eeff135ab977ea164fa5ff60dcf/images\"","level":"INFO","ts":"2024-06-26T13:08:03.03219308Z","name":"kata-agent","subsystem":"image","source":"agent","pid":"810","version":"0.1.0"}
>
> from https://github.com/kata-containers/kata-containers/blob/893fd2b59cc31518f8a127c9611e3e8265d9bdfd/src/agent/src/image.rs#L160
>
> But we never get anything back from image-rs's pull image, and then after 60s the container fails with context deadline exceeded. Unfortunately image-rs doesn't seem to have any logging, so I'm not sure how to get more information on what is going wrong 😞

If it's using in-guest image pull, then can you try increasing the remote hypervisor timeout and the create container timeout - https://github.com/kata-containers/kata-containers/blob/main/src/runtime/config/configuration-remote.toml.in#L298 ?
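
For reference, a rough sketch of those two settings (the section placement and the values here are my assumptions; the linked configuration-remote.toml.in is the authoritative source):

[hypervisor.remote]
# How long the shim waits for the remote hypervisor (cloud-api-adaptor) to
# complete requests such as starting the peer-pod VM. Illustrative value only.
remote_hypervisor_timeout = 600

[runtime]
# How long a CreateContainer request (which, with guest image pull, includes
# pulling the container image inside the podvm) may run before it is
# cancelled with "context deadline exceeded". Illustrative value only.
create_container_timeout = 300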

stevenhorsman commented 3 days ago

> If it's using in-guest image pull, then can you try increasing the remote hypervisor timeout and the create container timeout - https://github.com/kata-containers/kata-containers/blob/main/src/runtime/config/configuration-remote.toml.in#L298 ?

Yeah, that's a good idea, but just pulling nginx shouldn't take more than 60s, and in the past when I've seen the timeout it's only been on the containerd side, so the kata-agent has still come back from the image pull afterwards, which doesn't seem to be happening here.

stevenhorsman commented 2 days ago

Okay - I stand corrected. It appears that the nginx pull took over 2mins:

Jun 26 13:51:30 podvm-nginx-55954c7c66-vptr5-bc08413b kata-agent[811]: {"msg":"pull image \"docker.io/library/nginx@sha256:9700d098d545f9d2ee0660dfb155fe64f4447720a0a763a93f2cf08997227279\", bundle path \"/run/kata-containers/3c0fc9e0c3634183117f4078d7be48cd3fbb70a8ecc0ea4243cf7cbdf5613aff/images\"","level":"INFO","ts":"2024-06-26T13:51:30.399304775Z","version":"0.1.0","name":"kata-agent","pid":"811","source":"agent","subsystem":"image"}
...
Jun 26 13:53:44 podvm-nginx-55954c7c66-vptr5-bc08413b kata-agent[811]: {"msg":"pull and unpack image \"sha256:dd6c8d4a8748039368f97fd52156d3fadf0ee481dc97d3063d74d9bc38681757\", cid: \"3c0fc9e0c3634183117f4078d7be48cd3fbb70a8ecc0ea4243cf7cbdf5613aff\" succeeded.","level":"INFO","ts":"2024-06-26T13:53:44.042495884Z","name":"kata-agent","version":"0.1.0","pid":"811","source":"agent","subsystem":"image"}

So I might not have waited long enough, or the containerd request cancelled it, or something? That pull ran from 13:51:30 to 13:53:44, roughly 2 minutes 14 seconds, which is well over the 60s deadline. So we have an ibmcloud performance issue, rather than a functional one. Thanks for nudging me into trying the timeout, Pradipta!