confidential-containers / cloud-api-adaptor

Ability to create Kata pods using cloud provider APIs aka the peer-pods approach
Apache License 2.0

Error following Azure quickstart #1146

Open octaviansima opened 1 year ago

octaviansima commented 1 year ago

Hi, I've been working on getting an instance of the CoCo operator up and running on AKS. I'm now running into an issue when trying to run the sample application: the cloud-api-adaptor (CAA) hits a gRPC error, visible in the logs of the cloud-api-adaptor-daemonset pod:

(.venv) os@cdev:~/code/opaque/ai/scripts/operator/cloud$ k logs pod/cloud-api-adaptor-daemonset-t4qpr -n confidential-containers-system
+ exec cloud-api-adaptor azure -subscriptionid f4b97aef-ef10-4d81-88da-0f6544771ed2 -region westus -instance-size Standard_DC2as_v5 -resourcegroup coco-testing-rg -vxlan-port 8472 -subnetid /subscriptions/f4b97aef-ef10-4d81-88da-0f6544771ed2/resourceGroups/coco-testing-aks-rg1/providers/Microsoft.Network/virtualNetworks/aks-vnet-30557922/subnets/aks-subnet -securitygroupid '' -imageid /subscriptions/f4b97aef-ef10-4d81-88da-0f6544771ed2/resourceGroups/coco-testing-rg/providers/Microsoft.Compute/galleries/caaubntcvmsGallery/images/cc-image/versions/0.0.1
cloud-api-adaptor version v0.6.0-dev
  commit: b5b55f2f4f5fb1a725d378677dec0adb2361bc26
  go: go1.20.5
cloud-api-adaptor: starting Cloud API Adaptor daemon for "azure"
2023/07/05 19:20:54 [adaptor/cloud/azure] azure config {SubscriptionId:f4b97aef-ef10-4d81-88da-0f6544771ed2 ClientId:********** ClientSecret:********** TenantId:********** ResourceGroupName:coco-testing-rg Zone: Region:westus SubnetId:/subscriptions/f4b97aef-ef10-4d81-88da-0f6544771ed2/resourceGroups/coco-testing-aks-rg1/providers/Microsoft.Network/virtualNetworks/aks-vnet-30557922/subnets/aks-subnet SecurityGroupName: SecurityGroupId: Size:Standard_DC2as_v5 ImageId:/subscriptions/f4b97aef-ef10-4d81-88da-0f6544771ed2/resourceGroups/coco-testing-rg/providers/Microsoft.Compute/galleries/caaubntcvmsGallery/images/cc-image/versions/0.0.1 SSHKeyPath:$HOME/.ssh/id_rsa.pub SSHUserName:peerpod DisableCVM:false InstanceSizes:[] InstanceSizeSpecList:[]}
2023/07/05 19:20:55 [adaptor/cloud/azure] instanceSizeSpecList ([{Standard_DC2as_v5 2 8192  0}])
2023/07/05 19:20:55 [adaptor] server config: &adaptor.ServerConfig{TLSConfig:(*tlsutil.TLSConfig)(0xc000591b80), SocketPath:"/run/peerpod/hypervisor.sock", CriSocketPath:"", PauseImage:"", PodsDir:"/run/peerpod/pods", ForwarderPort:"15150", ProxyTimeout:300000000000}
2023/07/05 19:20:55 [util/k8sops] initialized PeerPodService
2023/07/05 19:20:55 [adaptor] server started
2023/07/05 19:27:32 [podnetwork] routes on netns /var/run/netns/cni-d814137d-01fc-ee42-fa58-152bd4cd7ce5
2023/07/05 19:27:32 [podnetwork]     default via 10.244.0.1 dev eth0
2023/07/05 19:27:32 [adaptor/cloud] stored /run/peerpod/pods/9bf252c2d522a965576720ee9e74f0350d2dcbc6dd555d551a329f607199b218/daemon.json
2023/07/05 19:27:32 [adaptor/cloud] create a sandbox 9bf252c2d522a965576720ee9e74f0350d2dcbc6dd555d551a329f607199b218 for pod nginx-6c48b665d4-t27qq in namespace default (netns: /var/run/netns/cni-d814137d-01fc-ee42-fa58-152bd4cd7ce5)
Using default instance type ("Standard_DC2as_v5")
2023/07/05 19:27:33 [adaptor/cloud/azure] CreateInstance: name: "podvm-nginx-6c48b665d4-t27qq-9bf252c2"
2023/07/05 19:28:19 [adaptor/cloud/azure] created VM successfully: /subscriptions/f4b97aef-ef10-4d81-88da-0f6544771ed2/resourceGroups/coco-testing-rg/providers/Microsoft.Compute/virtualMachines/podvm-nginx-6c48b665d4-t27qq-9bf252c2
2023/07/05 19:28:19 [adaptor/cloud/azure] podNodeIP[0]=10.224.0.5
2023/07/05 19:28:19 [adaptor/cloud] failed to create PeerPod: the server could not find the requested resource (post peerPods.confidentialcontainers.org)
2023/07/05 19:28:19 [adaptor/cloud] created an instance podvm-nginx-6c48b665d4-t27qq-9bf252c2 for sandbox 9bf252c2d522a965576720ee9e74f0350d2dcbc6dd555d551a329f607199b218
2023/07/05 19:28:19 [tunneler/vxlan] vxlan ppvxlan1 (remote 10.224.0.5:8472, id: 555000) created at /proc/1/task/10/ns/net
2023/07/05 19:28:19 [tunneler/vxlan] vxlan ppvxlan1 created at /proc/1/task/10/ns/net
2023/07/05 19:28:19 [tunneler/vxlan] vxlan ppvxlan1 is moved to /var/run/netns/cni-d814137d-01fc-ee42-fa58-152bd4cd7ce5
2023/07/05 19:28:19 [tunneler/vxlan] Add tc redirect filters between eth0 and vxlan1 on pod network namespace /var/run/netns/cni-d814137d-01fc-ee42-fa58-152bd4cd7ce5
2023/07/05 19:28:19 [adaptor/proxy] Listening on /run/peerpod/pods/9bf252c2d522a965576720ee9e74f0350d2dcbc6dd555d551a329f607199b218/agent.ttrpc
2023/07/05 19:28:19 [adaptor/proxy] failed to init cri client, the err: cri runtime endpoint is not specified, it is used to get the image name from image digest
2023/07/05 19:28:19 [adaptor/proxy] Trying to establish agent proxy connection to 10.224.0.5:15150
2023/07/05 19:28:19 [adaptor/proxy] established agent proxy connection to 10.224.0.5:15150
2023/07/05 19:28:19 [adaptor/cloud] agent proxy is ready
2023/07/05 19:28:19 [adaptor/proxy] CreateSandbox: hostname:nginx-6c48b665d4-t27qq sandboxId:9bf252c2d522a965576720ee9e74f0350d2dcbc6dd555d551a329f607199b218
2023/07/05 19:28:19 [adaptor/proxy]     storages:
2023/07/05 19:28:19 [adaptor/proxy]         mountpoint:/run/kata-containers/sandbox/shm source:shm fstype:tmpfs driver:ephemeral
2023/07/05 19:28:19 [adaptor/proxy] CreateContainer: containerID:9bf252c2d522a965576720ee9e74f0350d2dcbc6dd555d551a329f607199b218
2023/07/05 19:28:19 [adaptor/proxy]     mounts:
2023/07/05 19:28:19 [adaptor/proxy]         destination:/proc source:proc type:proc
2023/07/05 19:28:19 [adaptor/proxy]         destination:/dev source:tmpfs type:tmpfs
2023/07/05 19:28:19 [adaptor/proxy]         destination:/dev/pts source:devpts type:devpts
2023/07/05 19:28:19 [adaptor/proxy]         destination:/dev/shm source:/run/kata-containers/sandbox/shm type:bind
2023/07/05 19:28:19 [adaptor/proxy]         destination:/dev/mqueue source:mqueue type:mqueue
2023/07/05 19:28:19 [adaptor/proxy]         destination:/sys source:sysfs type:sysfs
2023/07/05 19:28:19 [adaptor/proxy]         destination:/dev/shm source:/run/kata-containers/sandbox/shm type:bind
2023/07/05 19:28:19 [adaptor/proxy]         destination:/etc/resolv.conf source:/run/kata-containers/shared/containers/9bf252c2d522a965576720ee9e74f0350d2dcbc6dd555d551a329f607199b218-bf92b69ebdd1bae2-resolv.conf type:bind
2023/07/05 19:28:19 [adaptor/proxy]     annotations:
2023/07/05 19:28:19 [adaptor/proxy]         io.kubernetes.cri.sandbox-log-directory: /var/log/pods/default_nginx-6c48b665d4-t27qq_523fd2b9-f840-459a-89b0-162ddf7439a2
2023/07/05 19:28:19 [adaptor/proxy]         io.katacontainers.pkg.oci.container_type: pod_sandbox
2023/07/05 19:28:19 [adaptor/proxy]         io.katacontainers.pkg.oci.bundle_path: /run/containerd/io.containerd.runtime.v2.task/k8s.io/9bf252c2d522a965576720ee9e74f0350d2dcbc6dd555d551a329f607199b218
2023/07/05 19:28:19 [adaptor/proxy]         io.kubernetes.cri.sandbox-id: 9bf252c2d522a965576720ee9e74f0350d2dcbc6dd555d551a329f607199b218
2023/07/05 19:28:19 [adaptor/proxy]         io.kubernetes.cri.sandbox-cpu-shares: 2
2023/07/05 19:28:19 [adaptor/proxy]         io.kubernetes.cri.sandbox-cpu-quota: 0
2023/07/05 19:28:19 [adaptor/proxy]         io.kubernetes.cri.sandbox-namespace: default
2023/07/05 19:28:19 [adaptor/proxy]         io.kubernetes.cri.sandbox-memory: 0
2023/07/05 19:28:19 [adaptor/proxy]         nerdctl/network-namespace: /var/run/netns/cni-d814137d-01fc-ee42-fa58-152bd4cd7ce5
2023/07/05 19:28:19 [adaptor/proxy]         io.kubernetes.cri.sandbox-name: nginx-6c48b665d4-t27qq
2023/07/05 19:28:19 [adaptor/proxy]         io.kubernetes.cri.container-type: sandbox
2023/07/05 19:28:19 [adaptor/proxy]         io.kubernetes.cri.sandbox-cpu-period: 100000
2023/07/05 19:28:19 [adaptor/proxy] getImageName: no pause image specified uses default pause image: registry.k8s.io/pause:3.7
2023/07/05 19:28:19 [adaptor/proxy] CreateContainer: calling PullImage for "registry.k8s.io/pause:3.7" before CreateContainer (cid: "9bf252c2d522a965576720ee9e74f0350d2dcbc6dd555d551a329f607199b218")
2023/07/05 19:28:19 [adaptor/proxy] CreateContainer: failed to call PullImage, probably because the image has already been pulled. ignored: rpc error: code = InvalidArgument desc = grpc.Image service does not exist
2023/07/05 19:28:19 [adaptor/proxy] CreateContainer: failed to call PullImage, probably because the image has already been pulled. ignored: rpc error: code = InvalidArgument desc = grpc.Image service does not exist
2023/07/05 19:28:20 [adaptor/proxy] CreateContainer: failed to call PullImage, probably because the image has already been pulled. ignored: rpc error: code = InvalidArgument desc = grpc.Image service does not exist
2023/07/05 19:28:20 [adaptor/proxy] CreateContainer: failed to call PullImage, probably because the image has already been pulled. ignored: rpc error: code = InvalidArgument desc = grpc.Image service does not exist
2023/07/05 19:28:21 [adaptor/proxy] CreateContainer: failed to call PullImage, probably because the image has already been pulled. ignored: rpc error: code = InvalidArgument desc = grpc.Image service does not exist
2023/07/05 19:28:23 [adaptor/proxy] CreateContainer: failed to call PullImage, probably because the image has already been pulled. ignored: rpc error: code = InvalidArgument desc = grpc.Image service does not exist
2023/07/05 19:28:26 [adaptor/proxy] CreateContainer: failed to call PullImage, probably because the image has already been pulled. ignored: rpc error: code = InvalidArgument desc = grpc.Image service does not exist
2023/07/05 19:28:32 [adaptor/proxy] CreateContainer: failed to call PullImage, probably because the image has already been pulled. ignored: rpc error: code = InvalidArgument desc = grpc.Image service does not exist
2023/07/05 19:28:45 [adaptor/proxy] CreateContainer: failed to call PullImage, probably because the image has already been pulled. ignored: rpc error: code = InvalidArgument desc = grpc.Image service does not exist
2023/07/05 19:29:11 [adaptor/proxy] CreateContainer: failed to call PullImage, probably because the image has already been pulled. ignored: rpc error: code = InvalidArgument desc = grpc.Image service does not exist
2023/07/05 19:29:11 [adaptor/proxy] PullImage fails: All attempts fail:
#1: rpc error: code = InvalidArgument desc = grpc.Image service does not exist
#2: rpc error: code = InvalidArgument desc = grpc.Image service does not exist
#3: rpc error: code = InvalidArgument desc = grpc.Image service does not exist
#4: rpc error: code = InvalidArgument desc = grpc.Image service does not exist
#5: rpc error: code = InvalidArgument desc = grpc.Image service does not exist
#6: rpc error: code = InvalidArgument desc = grpc.Image service does not exist
#7: rpc error: code = InvalidArgument desc = grpc.Image service does not exist
#8: rpc error: code = InvalidArgument desc = grpc.Image service does not exist
#9: rpc error: code = InvalidArgument desc = grpc.Image service does not exist
#10: rpc error: code = InvalidArgument desc = grpc.Image service does not exist
2023/07/05 19:29:11 [adaptor/proxy] DestroySandbox
2023/07/05 19:29:11 [adaptor/proxy] shutting down socket forwarder
2023/07/05 19:29:51 [adaptor/cloud/azure] deleted VM successfully: podvm-nginx-6c48b665d4-t27qq-9bf252c2
2023/07/05 19:30:21 [adaptor/cloud/azure] deleted disk successfully: podvm-nginx-6c48b665d4-t27qq-9bf252c2-disk
2023/07/05 19:30:26 [adaptor/cloud/azure] deleted network interface successfully: podvm-nginx-6c48b665d4-t27qq-9bf252c2-net

I confirmed that both the webhook and CoCo runtime are installed on the cluster just fine. Any idea what could be going on here?
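
For context, the checks I ran look roughly like this (standard kubectl commands; the namespace is the one from the log command above, and the CRD name comes from the "post peerPods.confidentialcontainers.org" message in the log):

kubectl get pods -n confidential-containers-system
kubectl get runtimeclass
kubectl get crd peerpods.confidentialcontainers.org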

yoheiueda commented 1 year ago

I just encountered the same issue in IBM Cloud. I am investigating it now.

yoheiueda commented 1 year ago

In my case, I had built the pod VM image from the wrong kata-containers branch, so the kata-agent included in my pod VM image did not support peer pods. After rebuilding the pod VM image from the Kata Containers CCv0 branch, the problem was gone.

@octaviansima How did you prepare your Pod VM image and set AZURE_IMAGE_ID? https://github.com/confidential-containers/cloud-api-adaptor/tree/main/azure#build-pod-vm-image
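
For reference, AZURE_IMAGE_ID should be the full gallery image version resource ID, i.e. the same form as the -imageid value in your daemonset log (placeholders below):

export AZURE_IMAGE_ID="/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.Compute/galleries/<gallery>/images/<image-definition>/versions/<version>"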

bpradipt commented 1 year ago

The following error typically indicates a kata-agent built without the image-pull code, so it looks like the kata-agent was not built from the CCv0 branch:

code = InvalidArgument desc = grpc.Image service does not exist

octaviansima commented 1 year ago

Thanks for the help; using that branch fixed that error, but now I'm running into:

 Warning  FailedCreatePodSandBox  22m                   kubelet            
Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create containerd task: failed to create shim task: 
remote hypervisor call failed: rpc error: code = Unknown desc = creating an instance : parsing pod node IP "10.224.0.5": %!w(<nil>): unknown

This repeats indefinitely as Kubernetes keeps attempting to launch another pod, until I hit a VM quota limit on Azure.
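
In the meantime, the orphaned pod VMs can be listed and removed with the Azure CLI to stay under quota, roughly like this (resource group and name prefix taken from the log above):

az vm list -g coco-testing-rg --query "[?starts_with(name, 'podvm-')].name" -o tsv
az vm delete -g coco-testing-rg -n <podvm-name> --yes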

octaviansima commented 1 year ago

Looks like the code responsible for that error message was changed recently, @yoheiueda: https://github.com/confidential-containers/cloud-api-adaptor/commit/2af7abc010a6e8305013e16fc8b7c6617b965d64

yoheiueda commented 1 year ago

Looks like I introduced a bug in the Azure provider with #1018. Sorry for the inconvenience.

I raised PR #1151 to fix the problem.

yoheiueda commented 1 year ago

@octaviansima The fix has been merged. Could you please try it again?

octaviansima commented 1 year ago

Deploying unencrypted images works, thanks! However, I'm running into the following error in the pod logs for an encrypted image (following the same workflow from the quickstart):

Failed to pull image "docker.io/osima/encrypt-testing:encrypted_busybox":
rpc error: code = Unknown desc = failed to pull and unpack image 
"docker.io/osima/encrypt-testing:encrypted_busybox": failed to extract layer sha256:feb4513d4fb7052bcff38021fc9ef82fd409f4e016f3dff5c20ff5645cde4c02: 
failed to get stream processor for application/vnd.oci.image.layer.v1.tar+gzip+encrypted: exec: "ctd-decoder":
executable file not found in $PATH: unknown

I was also running into an error about being unable to fetch decryption keys, and I'm now trying to test whether that was solved by setting AA_KBC and KBC_URI before creating the pod VM image with make image. However, this new error seems to be higher up the stack...

mkulke commented 1 year ago

@octaviansima

I looked at the image with skopeo inspect docker://docker.io/osima/encrypt-testing:encrypted_busybox. It looks like it's a different sha than the one you posted above. It seems to have the correct CoCo annotations.

For an image to work, it needs to be encrypted with coco_keyprovider, which is accessed by ocicrypt-rs via a gRPC API (here is a snippet from a workflow doing this). Similarly, at image pull time, image-rs will call out to attestation-agent via gRPC.

$ skopeo inspect docker://docker.io/osima/encrypt-testing:encrypted_busybox | jq '.LayersData[0].Annotations."org.opencontainers.image.enc.keys.provider.attestation-agent" | @base64d | fromjson'
{
  "kid": "kbs:///default/image-kek/17d19f0e-6447-4a96-8e9e-cd22761934b5",
  "wrapped_data": "yrBr+zk/iMbdZBK9v82wf61rANPvHANHBSQ3D3pJ0mtXk3lEbQzOAJ5wfXyDztlgHKFBl4wowWrX7VpS6xFG5J3zo2CR0noDB9E+cOm13JeouKaELbvB4xnwQf60gt6EUyfPdqI0rCnMbVON/aPpcuPZzZHIUv9xLhp69zmHswxCHtdWObOMzKuE77DnGasJvmezukSP1OTHlLZ8sMx+nFokF/8bxzCEVgCUL5dzQcb2s0Pgp1c1yg8YF4SB4A68tZ58pYvfdy6t9YomCuxQ4lU=",
  "iv": "8g4fUKE7z/mF6YwB",
  "wrap_type": "A256GCM"
}
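
For reference, the encryption step itself looks roughly like this (source image, target tag, and KBS address are illustrative; the keyprovider config is the ocicrypt.conf discussed below):

export OCICRYPT_KEYPROVIDER_CONFIG=./ocicrypt.conf
skopeo copy --encryption-key provider:attestation-agent:cc_kbc::http://127.0.0.1:8080 \
  docker://docker.io/library/busybox:latest \
  docker://docker.io/osima/encrypt-testing:encrypted_busybox
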
octaviansima commented 1 year ago

@mkulke I believe I am encrypting with coco_keyprovider -- my ocicrypt.conf is as follows:

{
    "key-providers": {
        "attestation-agent": {
            "grpc": "127.0.0.1:50000"
        }
    }
}

and I do notice this output from the KBS cluster when I call skopeo copy ... --encryption-key provider:attestation-agent:cc_kbc::http://127.0.0.1:8080 ...:

infra-kbs-1          | [2023-07-11T22:14:30Z INFO  actix_web::middleware::logger] 172.19.0.2 "POST /kbs/v0/resource/default/image-kek/83a09951-16c9-4ea8-aee7-d56b9be149f6 HTTP/1.1" 200 0 "-" "-" 0.000599
infra-keyprovider-1  | [2023-07-11T22:14:30Z INFO  coco_keyprovider::enc_mods] register KEK succeeded.

I wonder if it has anything to do with the root error in the logs: exec: "ctd-decoder": executable file not found in $PATH: unknown

mkulke commented 1 year ago

Hmm, it looks like it's falling back to some default decryption option. I think you do need to set AA_KBC and KBS_URI. The image annotation itself does not say which KBS to talk to (it has kbs:///default set), and without setting the AA_KBC params, attestation-agent might not start.
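
Roughly, the build-time settings I mean look like this (a sketch only; the exact variable names can differ between releases, and the values are placeholders):

export AA_KBC=cc_kbc
export KBS_URI=http://<kbs-address>:8080
make image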

The respective kata-agent code is here

It's not trivial to test the image-decryption functionality. The ttrpc/gRPC endpoint in attestation-agent is an implementation of the ocicrypt plugin to decrypt layers, but you can check whether the attestation-agent process has been started with the correct configuration options.

So I'd recommend starting a peer-pod with an unencrypted image and shelling into the podvm. The attestation-agent process should be running on the podvm, even for unencrypted images.
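
A minimal sketch of such a pod (the runtimeClassName here is an assumption: it is kata-remote or kata-remote-cc depending on the operator release, so check kubectl get runtimeclass on your cluster):

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: unencrypted-test
spec:
  runtimeClassName: kata-remote
  containers:
  - name: nginx
    image: nginx:stable
EOF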

octaviansima commented 1 year ago

When I create a VM from the generated image (I can't figure out how to SSH into a pod VM), the attestation-agent process isn't running at all. Running the ExecStart command found in /etc/systemd/system/attestation-agent.service (/usr/local/bin/attestation-agent --getresource_sock 127.0.0.1:50001) yields

/usr/local/bin/attestation-agent: error while loading shared libraries: libtdx_attest.so.1: cannot open shared object file: No such file or directory

and when I download those libraries:

thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: clean previous getresource socket file

Caused by:
    socket address scheme is not expected', app/src/main.rs:29:39
mkulke commented 1 year ago

I can't figure out how to SSH into a pod VM

In the CAA ConfigMap you specify a public key for the podvm instances. What you can do is start a pod with an unencrypted image (say "nginx:stable", so the pod starts up properly and the VM isn't killed by CAA). Since there is networking between the k8s node and the podvm, you should be able to use a node as an intermediate jump host (using agent forwarding or similar) to reach the podvm.
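
For example, something like this should work (the pod VM user is SSHUserName:peerpod per the CAA config in the log above; node user, node IP, and pod VM IP are placeholders):

ssh -A -J <node-user>@<node-ip> peerpod@<podvm-ip>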

libtdx_attest.so.1: cannot open shared object file: No such file or directory

This is a bug in the podvm image build process: the package should be installed, otherwise attestation-agent cannot run. Are you using a tagged release (v0.6)? If you add the required package to misc-settings.sh, that might fix your issue.
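
Once the package is in the image, you can verify on the VM that nothing is still unresolved (standard tooling, nothing CoCo-specific):

ldd /usr/local/bin/attestation-agent | grep "not found"
ldconfig -p | grep libtdx_attest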

attestation-agent process isn't running at all.

That's expected: attestation-agent for image decryption is spawned on demand by kata-agent when pulling images. The systemd unit is for secret release at runtime, which is currently WIP and needs manual tweaking to work: attestation-agent is compiled with ttrpc, which only allows Unix domain socket communication.

You can still test it by providing a filename as a socket: /usr/local/bin/attestation-agent --help
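
For example (the flag name is taken from the unit file above; the socket path is a placeholder, and the panic message suggests the ttrpc build expects a unix:// address):

/usr/local/bin/attestation-agent --getresource_sock unix:///run/attestation-agent/getresource.sock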