Azure / AKS — Azure Kubernetes Service
https://azure.github.io/AKS/

[BUG] AKS Confidential Computing: pod creation fails when using my images #4558

Open blossomin opened 2 months ago

blossomin commented 2 months ago

Describe the bug
I want to use AKS confidential computing for my tasks, and I found that when I create pods using my own images, pod creation fails; if I replace the image in the k8s YAML file with another one, the pod launches. I collected the Kata and containerd debug information here, which you can use to debug: https://github.com/blossomin/akslog
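For reference, a minimal sketch of the image-swap test described above might look like the manifest below. The reporter's actual manifest is not shown, so the pod/container names are hypothetical, and `kata-cc-isolation` is an assumption based on the runtime class that AKS Confidential Containers documents.

```yaml
# Hypothetical minimal pod spec for the image-swap test described above.
# runtimeClassName is an assumption (the class documented for AKS
# Confidential Containers); the reporter's real manifest is not shown.
apiVersion: v1
kind: Pod
metadata:
  name: cc-image-test
spec:
  runtimeClassName: kata-cc-isolation
  containers:
    - name: worker
      # Swap this image between the internal worker image (fails) and the
      # Azure Linux PyTorch base image (works) to run the comparison.
      image: mcr.microsoft.com/azurelinux/base/pytorch:2.2.2-1-azl3.0.20240824-amd64
      command: ["sleep", "infinity"]
```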

Inside these logs:

Log file name suffixes: `myworker` means using my own worker image, and `mcr-pytorch` means using `mcr.microsoft.com/azurelinux/base/pytorch:2.2.2-1-azl3.0.20240824-amd64`.

By simply comparing the Kata logs, I found one statement that only exists in worker_error_kata_myworker.log:

`cloud-hypervisor: 11.942990s: ERROR:virtio-devices/src/block.rs:814 -- failed to create new AsyncIo: Failed creating a new AsyncIo: Resource temporarily unavailable (os error 11)`

It seems the kata-agent dies after this, since this statement is followed by many `KILL_EVENT received, stopping epoll loop` entries. I am not sure about this, because the Kata log from mcr-pytorch also contains this KILL_EVENT.
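A quick way to verify this difference between the two logs is a grep pass like the sketch below; the myworker file name is taken from the report, while the mcr-pytorch file name is an assumption following the suffix convention above.

```sh
# Show where the cloud-hypervisor AsyncIo error appears (expected: only
# in the myworker log). The mcr-pytorch file name is assumed from the
# suffix convention described above.
grep -n "failed to create new AsyncIo" \
  worker_error_kata_myworker.log worker_error_kata_mcr-pytorch.log

# Count KILL_EVENT lines in both logs, since the report says they
# appear in both.
grep -c "KILL_EVENT received, stopping epoll loop" \
  worker_error_kata_myworker.log worker_error_kata_mcr-pytorch.log
```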

To Reproduce
Steps to reproduce the behavior: currently, my image is internal, so this is hard to reproduce.

Expected behavior
The pod can be launched without any problems using any image.

microsoft-github-policy-service[bot] commented 2 months ago

@agowdamsft would you be able to assist?

blossomin commented 1 month ago

One follow-up: my worker image has 40 layers, while `mcr.microsoft.com/azurelinux/base/pytorch:2.2.2-1-azl3.0.20240824-amd64` has about 13 layers.

I guess the issue is: 1) each image layer is mapped to one virtio-pci device, and 2) there are only 31 PCI slots per (confidential) VM.

This causes the resource contention/shortage. A quick check of the layer counts is sketched below.
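A minimal sketch for checking the layer-count hypothesis, assuming docker or skopeo is available locally (neither tool is mentioned in the original report):

```sh
# Count image layers locally with docker (requires the image to be pulled).
docker image inspect \
  --format '{{len .RootFS.Layers}}' \
  mcr.microsoft.com/azurelinux/base/pytorch:2.2.2-1-azl3.0.20240824-amd64

# Or inspect the registry manifest without pulling, using skopeo + jq.
skopeo inspect \
  docker://mcr.microsoft.com/azurelinux/base/pytorch:2.2.2-1-azl3.0.20240824-amd64 \
  | jq '.Layers | length'
```

If the layer count is indeed the trigger, one way to test it would be to squash the 40-layer worker image (for example, with a final multi-stage build stage that copies the whole filesystem in one `COPY --from=...` step) and see whether the flattened image launches.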