Hi pragyaaneja, AKS bot here :wave: Thank you for posting on the AKS Repo, I'll do my best to get a kind human from the AKS team to assist you.
I might be just a bot, but I'm told my suggestions are normally quite good, as such:
1) If this case is urgent, please open a Support Request so that our 24/7 support team may help you faster.
2) Please abide by the AKS repo Guidelines and Code of Conduct.
3) If you're having an issue, could it be described on the AKS Troubleshooting guides or AKS Diagnostics?
4) Make sure you're subscribed to the AKS Release Notes to keep up to date with all that's new on AKS.
5) Make sure there isn't a duplicate of this issue already reported. If there is, feel free to close this one and '+1' the existing issue.
6) If you have a question, do take a look at our AKS FAQ. We place the most common ones there!
@palma21 @robbiezhang
I'm not particularly familiar with SGX, so apologies for the possibly ignorant question. My understanding is that any SGX functionality should be limited to the DCsv2 confidential compute SKUs, but we're seeing the sgx-webhook prevent pod creations/updates as mentioned in this issue, and we are not leveraging confidential compute in any way (currently using all D4as nodes). Is that expected? I was caught by surprise when that new pod suddenly showed up on our clusters yesterday and started blocking workloads.
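In case it helps anyone else diagnose the same thing, the webhook can be spotted by listing the cluster's validating webhook configurations (assuming the configuration object carries the sgx-validate-config name seen in the denial message):
kubectl get validatingwebhookconfigurations
kubectl describe validatingwebhookconfiguration sgx-validate-config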
@whitlaaa thanks for bringing this to our notice. Can you confirm that you have the confidential computing addon enabled on your AKS cluster, but no DCsv2 node pools, and are still seeing issues deploying your workloads? CC @Pengpeng-Microsoft and @pragyaaneja
@agowdamsft our clusters have the accsgxdeviceplugin addon added, but I don't see the confcom addon explicitly mentioned anywhere. The relevant output of az aks show is below.
...
"addonProfiles": {
"accsgxdeviceplugin": {
"config": {
"accsgxquotehelperenabled": "true"
},
"enabled": true,
"identity": null
},
"kubedashboard": {
"config": {},
"enabled": false,
"identity": null
}
},
"agentPoolProfiles": [
{
"availabilityZones": null,
"count": 3,
"enableAutoScaling": true,
"enableNodePublicIp": false,
"maxCount": 6,
"maxPods": 110,
"minCount": 3,
"mode": "System",
"name": "np3",
"nodeLabels": {},
"nodeTaints": null,
"orchestratorVersion": "1.18.10",
"osDiskSizeGb": 80,
"osType": "Linux",
"provisioningState": "Succeeded",
"scaleSetEvictionPolicy": null,
"scaleSetPriority": null,
"spotMaxPrice": null,
"tags": null,
"type": "VirtualMachineScaleSets",
"vmSize": "Standard_D4as_v4"
}
]
...
@whitlaaa would you be open to sharing the AKS cluster ID and Azure subscription ID privately so we can review internal logs and see what's going on? Please email acconaks@microsoft.com.
A hotfix has been in place since 2/25. Existing AKS clusters should have received the update automatically. This issue will be closed. If you see any further issues, please open a new issue.
What happened: As part of the confidential computing addon "ACCSGXDevicePlugin", a validating admission webhook was also deployed. This validating webhook configuration, "sgx-validate-config", has a bug in how it validates customer workloads, i.e., pod configurations. If a pod is deployed in any non-control-plane namespace with no requested resources in any of the pod's containers, the validating webhook denies the pod creation request and the pod fails to be created.
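For illustration, a minimal pod shaped like this (a hypothetical manifest; the name, namespace, and image are placeholders) would be denied on an affected cluster, since no container declares any resource requests:
apiVersion: v1
kind: Pod
metadata:
  name: no-resources-demo   # placeholder name
  namespace: default        # any non-control-plane namespace
spec:
  containers:
    - name: app
      image: nginx          # placeholder image
      # no resources block at all -- this is what trips the buggy webhook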
What you expected to happen: Expected the validating webhook to simply validate pod configurations and ensure they match the requirements for confidential computing workloads.
How to reproduce it (as minimally and precisely as possible): First, confirm the confidential computing addon version:
kubectl describe ds sgx-plugin -n kube-system
The container image for the pod should be "mcr.microsoft.com/aks/acc/sgx-plugin:0.2".
Example workload: attested-tls-client.yaml
kubectl apply -f attested-tls-client.yaml
The deployment then fails with an event like:
Warning FailedCreate 0s job-controller Error creating: admission webhook "sgx-validate-config.kube-system.svc" denied the request without explanation
Anything else we need to know?: Customer workaround: A straightforward workaround is to add a resource request and limit to every container in the workload yaml file, with one of those containers requesting EPC memory. As long as one container in the yaml file requests the EPC resource and the rest request any resource, the workload deployment should succeed.
Requesting EPC resource for one of the containers:
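A sketch of what this could look like, assuming the addon exposes EPC memory as the kubernetes.azure.com/sgx_epc_mem_in_MiB resource (the name used in the AKS confidential computing docs; the value is a placeholder):
resources:
  limits:
    kubernetes.azure.com/sgx_epc_mem_in_MiB: 10   # assumed EPC resource name; for extended resources the request defaults to the limit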
Recommended resources for other containers:
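Per the workaround above, any ordinary request/limit should satisfy the webhook for the remaining containers; these CPU/memory values are placeholders:
resources:
  requests:
    cpu: 100m
    memory: 128Mi
  limits:
    cpu: 200m
    memory: 256Mi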
Another workaround would be to disable the ACC (Azure confidential computing) addon before deploying non-confidential workloads on clusters.
az aks disable-addons --addons confcom --name $CLUSTER_NAME --resource-group $RESOURCE_GROUP
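When confidential workloads are needed again, the addon can be re-enabled with the matching command (same placeholder variables as above):
az aks enable-addons --addons confcom --name $CLUSTER_NAME --resource-group $RESOURCE_GROUP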
Environment:
Kubernetes version (use kubectl version):
Server Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.14", GitCommit:"5de7fd1f9555368a86eb0f8f664dc58055c17269", GitTreeState:"clean", BuildDate:"2021-01-18T09:31:01Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}
kubectl describe ds sgx-plugin -n kube-system
The container image for the pod should be "mcr.microsoft.com/aks/acc/sgx-plugin:0.2". If the above command returns no result or the image version is different, this issue does not apply.