blossomin opened this issue 2 months ago
@agowdamsft would you be able to assist?
One follow-up: my worker image has 40 layers, while "mcr.microsoft.com/azurelinux/base/pytorch:2.2.2-1-azl3.0.20240824-amd64" has only about 13 layers.
My guess is that the issue is: 1) each image layer is mapped to one virtio-pci device, and 2) there are only 31 PCI slots per (confidential) VM,
so the extra layers cause the resource contention/shortage.
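If it helps, here is a rough sketch of how I compare the layer counts against that 31-slot budget (assuming the docker CLI is available and both images are pulled locally; the 31-slot figure is my hypothesis above, not an authoritative number, and the worker image name is a placeholder since mine is internal):

```python
import json
import subprocess

PCI_SLOT_BUDGET = 31  # per confidential-VM slot limit as hypothesized above; not an authoritative number

def layer_count(image: str) -> int:
    """Return the number of filesystem layers reported by `docker inspect`."""
    out = subprocess.run(
        ["docker", "inspect", image],
        check=True, capture_output=True, text=True,
    ).stdout
    return len(json.loads(out)[0]["RootFS"]["Layers"])

for image in (
    "mcr.microsoft.com/azurelinux/base/pytorch:2.2.2-1-azl3.0.20240824-amd64",
    "my-registry.example.com/my-worker:latest",  # hypothetical tag standing in for my internal worker image
):
    n = layer_count(image)
    flag = " <-- exceeds the assumed 31-slot budget" if n > PCI_SLOT_BUDGET else ""
    print(f"{image}: {n} layers{flag}")
```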
Describe the bug
I want to use AKS confidential computing for my tasks. When I create a pod using my own image, the pod fails to start; if I replace the image in the k8s YAML file with another image, the pod launches fine. I collected the kata and containerd debug information here, which you can use to debug: https://github.com/blossomin/akslog
In these logs, the file-name suffix "myworker" means my own worker image was used, and "mcr-pytorch" means "mcr.microsoft.com/azurelinux/base/pytorch:2.2.2-1-azl3.0.20240824-amd64" was used.
By comparing the kata logs, I found one statement that appears only in worker_error_kata_myworker.log:
cloud-hypervisor: 11.942990s: ERROR:virtio-devices/src/block.rs:814 -- failed to create new AsyncIo: Failed creating a new AsyncIo: Resource temporarily unavailable (os error 11)
It seems the kata-agent dies after this, since this statement is followed by many "KILL_EVENT received, stopping epoll loop" messages.
I'm not sure about this, because the kata log from mcr-pytorch also contains this KILL_EVENT.
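For reference, this is roughly how I compared the two logs from https://github.com/blossomin/akslog (a small sketch; only worker_error_kata_myworker.log is named above, the mcr-pytorch file name below is my guess and may need adjusting):

```python
from pathlib import Path

PATTERNS = [
    "failed to create new AsyncIo",               # only seen in the myworker run
    "KILL_EVENT received, stopping epoll loop",   # seen in both runs
]

LOGS = [
    "worker_error_kata_myworker.log",
    "worker_error_kata_mcr-pytorch.log",  # assumed name for the mcr-pytorch run
]

for name in LOGS:
    path = Path(name)
    if not path.exists():
        print(f"{name}: not found (adjust the file name)")
        continue
    lines = path.read_text(errors="replace").splitlines()
    for pattern in PATTERNS:
        hits = sum(pattern in line for line in lines)
        print(f"{name}: {hits} line(s) containing {pattern!r}")
```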
To Reproduce
Steps to reproduce the behavior: my image is internal, so this is hard to reproduce externally; a possible stand-in is sketched below.
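One possible stand-in (a sketch of the idea based on my layer-count hypothesis, not verified against this exact failure): build a throwaway image with many extra layers on top of the same base, push it, and use it as the pod image.

```python
import subprocess
from pathlib import Path

BASE_IMAGE = "mcr.microsoft.com/azurelinux/base/pytorch:2.2.2-1-azl3.0.20240824-amd64"
EXTRA_LAYERS = 40  # roughly the layer count of my worker image; well past the assumed 31-slot budget

# Each RUN instruction adds one layer on top of the base image's ~13 layers.
dockerfile = [f"FROM {BASE_IMAGE}"]
for i in range(EXTRA_LAYERS):
    dockerfile.append(f"RUN echo layer-{i} > /layer-{i}.txt")

Path("Dockerfile.manylayers").write_text("\n".join(dockerfile) + "\n")
subprocess.run(
    ["docker", "build", "-f", "Dockerfile.manylayers", "-t", "many-layers-repro:latest", "."],
    check=True,
)
```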
Expected behavior
The pod can be launched without any problems using any image.