confidential-containers / cloud-api-adaptor

Ability to create Kata pods using cloud provider APIs aka the peer-pods approach

csi-wrapper: azuredisk-csi-driver support #2122

Closed: daniel-weisse closed this 2 weeks ago

daniel-weisse commented 1 month ago

As a follow-up to https://github.com/confidential-containers/cloud-api-adaptor/pull/2108 and https://github.com/confidential-containers/cloud-api-adaptor/pull/2106, this PR adds the required changes to enable the csi-wrapper for the azuredisk-csi-driver. Changes are required in two places:

It also includes examples for using the azuredisk-csi-driver to create a Pod consuming a dynamically provisioned PVC or a statically provisioned PVC.
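
For orientation, exercising the dynamic provisioning example follows the usual PVC flow; the manifest names below are placeholders for illustration only, see the README added in this PR for the actual files:

# hypothetical manifest names, for illustration only
kubectl apply -f examples/azuredisk-dynamic-pvc.yaml
kubectl apply -f examples/azuredisk-pod.yaml
kubectl get pvc   # the PVC should reach the Bound state
kubectl get pod   # the Pod should become Running with the volume mounted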

daniel-weisse commented 1 month ago

> do you have any specific "how-to-test" instructions?

Following the instructions from the README should be enough

edit: I just double-checked against the official CAA setup for Azure, which deploys Peer Pod VMs in their own separate resource group. This will cause issues because the CSI driver cannot attach disks it created in the AKS resource group to the VMs in the other resource group. To fix this, change all references to $AZURE_RESOURCE_GROUP in the Azure deployment guide to $AKS_RG.

I'll check whether it's easily possible to configure the CSI driver to create the disks outside the AKS resource group.
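
For reference, a sketch of resolving the node resource group, assuming $AKS_RG refers to the cluster's managed node resource group and $AZURE_RESOURCE_GROUP/$CLUSTER_NAME are set as in the deployment guide:

# look up the managed node resource group of the AKS cluster
export AKS_RG=$(az aks show \
  --resource-group "$AZURE_RESOURCE_GROUP" \
  --name "$CLUSTER_NAME" \
  --query nodeResourceGroup -o tsv)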

mkulke commented 1 month ago

Thanks for the heads-up. The $AKS_RG is sort of managed, i.e. if a cluster is removed, the RG is discarded too. So we avoid putting resources in that RG if that's feasible; especially for disks we might not want this.
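
For context, the azuredisk-csi-driver supports placing provisioned disks in an explicit resource group via the StorageClass resourceGroup parameter; a sketch follows (class name and SKU are illustrative, and this is not necessarily the approach this PR takes):

# create disks in a dedicated RG instead of the managed AKS RG
cat <<EOF | kubectl apply -f -
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: azuredisk-peerpod
provisioner: disk.csi.azure.com
parameters:
  skuName: StandardSSD_LRS
  resourceGroup: ${AZURE_RESOURCE_GROUP}
reclaimPolicy: Delete
EOF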

daniel-weisse commented 1 month ago

Managed to come up with a somewhat clean solution that allows following the documented way of setting up CAA on AKS. The PR should now be fully functional when following the instructions from the Azure docs and the README from this PR.

mkulke commented 4 weeks ago

I'm currently following the guide with a fresh AKS + CAA installation from main, and I don't seem to get the azuredisk-csi-driver to work ("Deploy azuredisk-csi-driver on the cluster" + Option A). Somehow the mount is not set up properly:

for the azurefile pv:

kubectl exec nginx-pv -c nginx -- mount | grep mount-path
//....file.core.windows.net/pvc-40dd3dcf-603c-4877-a3e4-b489482bfc44 on /mount-path type cifs (rw,relatime,vers=3.1.1,cache=strict,username=...,uid=0,noforceuid,gid=0,noforcegid,addr=...,file_mode=0777,dir_mode=0777,soft,persistenthandles,nounix,serverino,mapposix,mfsymlinks,reparse=nfs,rsize=1048576,wsize=1048576,bsize=1048576,retrans=1,echo_interval=60,nosharesock,actimeo=30,closetimeo=1)

for azuredisk pv:

$ kubectl exec nginx-pv-disk -c nginx -- mount | grep mount-path
tmpfs on /mount-path type tmpfs (rw,relatime,size=1912876k,nr_inodes=1048576,mode=755,inode64)

the disk is attached to the podvm:

az disk show -n pvc-e1355511-b329-47e2-b539-1a8600eb5930 -g mgns | jq -c '[.diskState, .managedBy]'
["Attached","/subscriptions/..../resourceGroups/mgns/providers/Microsoft.Compute/virtualMachines/podvm-nginx-pv-disk-f6e92b43"]

daniel-weisse commented 4 weeks ago

The mount path being a tmpfs seems very strange to me, and it's a problem I haven't seen before while trying to get the Azure driver running. I used a script to set up my cluster, so it's possible I missed something. I'll try to investigate this more tomorrow.

mkulke commented 4 weeks ago

I'll leave my cluster in this state; feel free to reach out on Slack for debugging.

daniel-weisse commented 3 weeks ago

Squashed the last 3 commits; should be good to merge once the tests have run again.

bpradipt commented 2 weeks ago

The csi-wrapper failure should get fixed with a rebase, as the base image has changed. I'm rebasing and merging this.