Azure / AKS

Azure Kubernetes Service
https://azure.github.io/AKS/

Validating webhook configuration "sgx-validate-config" denies pod creation request for customer workloads with no requested resources #2214

Closed pragyaaneja closed 3 years ago

pragyaaneja commented 3 years ago

What happened: As part of the confidential computing addon "ACCSGXDevicePlugin", a validating admission webhook was also deployed. This validating webhook configuration "sgx-validate-config" has a bug in validating customer workloads, i.e., pod configurations. If a pod is deployed in any non-control-plane namespace with no requested resources in any of its containers, the validating webhook denies the pod creation request and the pod fails to create.

What you expected to happen: Expected the validating webhook configuration to simply validate the pod configurations to ensure that it matches the requirements for the confidential computing workloads.

How to reproduce it (as minimally and precisely as possible): First confirm the confidential computing addon version:

```
kubectl describe ds sgx-plugin -n kube-system
```

The container image for the pod should be `mcr.microsoft.com/aks/acc/sgx-plugin:0.2`.

Example workload: attested-tls-client.yaml

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: attested-tls-client
  namespace: default
spec:
  template:
    metadata:
      labels:
        app: attested-tls-client
    spec:
      containers:
      - name: attested-tls-client
        image: image_name:image_tag
        command: ["./non_enc_client/tls_non_enc_client"]
        args:
        - -server:IP_ADDRESS
        - -port:12341
      restartPolicy: OnFailure
  backoffLimit: 0
```

```
kubectl apply -f attested-tls-client.yaml
```

```
Warning  FailedCreate  0s  job-controller  Error creating: admission webhook "sgx-validate-config.kube-system.svc" denied the request without explanation
```

Anything else we need to know?: Customer workaround: A straightforward workaround for customers is to add a resource request and limit to every container in their workload YAML, with one container requesting EPC memory. As long as one of the containers in the YAML file requests the EPC resource, and the rest request any resource, the workload deployment should succeed.

Requesting EPC resource for one of the containers:

```yaml
resources:
  requests:
    kubernetes.azure.com/sgx_epc_mem_in_MiB: $value
  limits:
    kubernetes.azure.com/sgx_epc_mem_in_MiB: $value
```

Recommended resources for other containers:

```yaml
resources:
  requests:
    memory: $value  # or cpu: $value
  limits:
    memory: $value  # or cpu: $value
```
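Putting the two snippets together, the example Job from above might look like the following once the workaround is applied. This is a sketch: the image name, EPC value, and args are placeholders carried over from the original example, not values validated against the addon.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: attested-tls-client
  namespace: default
spec:
  template:
    metadata:
      labels:
        app: attested-tls-client
    spec:
      containers:
      - name: attested-tls-client
        image: image_name:image_tag
        command: ["./non_enc_client/tls_non_enc_client"]
        args:
        - -server:IP_ADDRESS
        - -port:12341
        resources:
          requests:
            kubernetes.azure.com/sgx_epc_mem_in_MiB: 10  # placeholder value
          limits:
            kubernetes.azure.com/sgx_epc_mem_in_MiB: 10  # placeholder value
      restartPolicy: OnFailure
  backoffLimit: 0
```

With the EPC request on one container (and ordinary memory/cpu requests on any other containers in a multi-container pod), the webhook should admit the workload.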

Another workaround would be to disable the ACC (Azure confidential computing) addon before deploying non-confidential workloads on clusters:

```
az aks disable-addons --addons confcom --name $CLUSTER_NAME --resource-group $RESOURCE_GROUP
```

Environment:

ghost commented 3 years ago

Hi pragyaaneja, AKS bot here :wave: Thank you for posting on the AKS Repo, I'll do my best to get a kind human from the AKS team to assist you.

I might be just a bot, but I'm told my suggestions are normally quite good, as such:

1. If this case is urgent, please open a Support Request so that our 24/7 support team may help you faster.
2. Please abide by the AKS repo Guidelines and Code of Conduct.
3. If you're having an issue, could it be described on the AKS Troubleshooting guides or AKS Diagnostics?
4. Make sure you're subscribed to the AKS Release Notes to keep up to date with all that's new on AKS.
5. Make sure there isn't a duplicate of this issue already reported. If there is, feel free to close this one and '+1' the existing issue.
6. If you have a question, do take a look at our AKS FAQ. We place the most common ones there!

pragyaaneja commented 3 years ago

@palma21 @robbiezhang

mattcollier commented 3 years ago

related: https://github.com/Azure/AKS/issues/2219

whitlaaa commented 3 years ago

I'm not particularly familiar with SGX, so apologies for the possibly ignorant question. My understanding is that any SGX functionality should be limited to the DCS confidential compute SKUs, but we're seeing the sgx-webhook prevent pod creations/updates as mentioned in this issue and are not leveraging confidential compute in any way (using all D4as nodes currently). Is that expected? I was caught by surprise when that new pod suddenly showed up on our clusters yesterday and started blocking workloads.

agowdamsft commented 3 years ago

@whitlaaa thanks for bringing this to our notice. Can you confirm that you have the confidential computing addon enabled on your AKS cluster but no DCsv2 node pools, and are still seeing issues deploying your workloads? CC @Pengpeng-Microsoft and @pragyaaneja

whitlaaa commented 3 years ago

@agowdamsft our clusters have the accsgxdeviceplugin added, but I don't see the confcom addon explicitly mentioned anywhere. The related output of `az aks show` is below.

```
...
"addonProfiles": {
  "accsgxdeviceplugin": {
    "config": {
      "accsgxquotehelperenabled": "true"
    },
    "enabled": true,
    "identity": null
  },
  "kubedashboard": {
    "config": {},
    "enabled": false,
    "identity": null
  }
},
"agentPoolProfiles": [
  {
    "availabilityZones": null,
    "count": 3,
    "enableAutoScaling": true,
    "enableNodePublicIp": false,
    "maxCount": 6,
    "maxPods": 110,
    "minCount": 3,
    "mode": "System",
    "name": "np3",
    "nodeLabels": {},
    "nodeTaints": null,
    "orchestratorVersion": "1.18.10",
    "osDiskSizeGb": 80,
    "osType": "Linux",
    "provisioningState": "Succeeded",
    "scaleSetEvictionPolicy": null,
    "scaleSetPriority": null,
    "spotMaxPrice": null,
    "tags": null,
    "type": "VirtualMachineScaleSets",
    "vmSize": "Standard_D4as_v4"
  }
]
...
```
agowdamsft commented 3 years ago

@whitlaaa would you be open to sharing the AKS cluster ID and Azure subscription ID privately so we can review internal logs to see what's going on? Please email acconaks@microsoft.com.

agowdamsft commented 3 years ago

The hotfix has been in place since 2/25. Existing AKS clusters should have received the update automatically. This issue will be closed. If you see any further problems, please open a new issue.