Azure / azure-workload-identity

Azure AD Workload Identity uses Kubernetes primitives to associate managed identities for Azure resources and identities in Azure Active Directory (AAD) with pods.
https://azure.github.io/azure-workload-identity
MIT License

Errors for azure-wi-webhook-controller-manager pods on install: User \"system:serviceaccount:azure-workload-identity-system:azure-wi-webhook-admin\" cannot update resource \"mutatingwebhookconfigurations\" in API group \"admissionregistration.k8s.io\" at the cluster scope: Azure does not have opinion for this user. #856

Status: Closed (VioletHynes closed 1 year ago)

VioletHynes commented 1 year ago

Describe the bug

Hi there! I'm trying to set up a WIF enabled cluster. Here are all of the steps I've done so far:

The above completed without issue (though the documentation was fairly scattered). The only step I think I'm missing is installing the webhook.

When I install the webhook through either of the approaches described here: https://azure.github.io/azure-workload-identity/docs/installation/mutating-admission-webhook.html - the two pods in the azure-workload-identity-system namespace fail with the error in the title, and do not seem to be injecting anything into my annotated pods.

This has been a clean install each time, and I've made sure to clean up each time.

I found a similar issue here: #777 but reinstallation doesn't fix it for me. I've tried many times to reinstall and always get this issue.

There could be something I'm missing, but this is a fresh workload-identity-webhook install on a fairly fresh (created last week) AKS cluster, so I kind of would expect this to 'just work' since this is meant to be the new way to do things. If there is something I'm missing, do let me know!

Steps To Reproduce

Install WIF admissions webhook using either of the approaches outlined here: https://azure.github.io/azure-workload-identity/docs/installation/mutating-admission-webhook.html

I'm using the latest Helm chart (I've done helm repo update) on an AKS cluster I made last week.

Expected behavior

I shouldn't get errors when installing WIF into an AKS environment.

Logs

{"level":"error","timestamp":"2023-04-19T18:57:18.292265Z","caller":"/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.13.1/pkg/internal/controller/controller.go:326$controller.(*Controller).reconcileHandler","message":"Reconciler error","controller":"cert-rotator","object":{"name":"azure-wi-webhook-server-cert","namespace":"azure-workload-identity-system"},"namespace":"azure-workload-identity-system","name":"azure-wi-webhook-server-cert","reconcileID":"b1c518a5-15f6-4e17-a16c-89809d2645ac","error":"mutatingwebhookconfigurations.admissionregistration.k8s.io \"azure-wi-webhook-mutating-webhook-configuration\" is forbidden: User \"system:serviceaccount:azure-workload-identity-system:azure-wi-webhook-admin\" cannot update resource \"mutatingwebhookconfigurations\" in API group \"admissionregistration.k8s.io\" at the cluster scope: Azure does not have opinion for this user."}
{"level":"info","timestamp":"2023-04-19T19:00:02.133676Z","logger":"cert-rotation","caller":"/go/pkg/mod/github.com/open-policy-agent/cert-controller@v0.5.0/pkg/rotator/rotator.go:722$rotator.(*ReconcileWH).ensureCerts","message":"Ensuring CA cert","name":"azure-wi-webhook-mutating-webhook-configuration","gvk":"admissionregistration.k8s.io/v1, Kind=MutatingWebhookConfiguration","name":"azure-wi-webhook-mutating-webhook-configuration","gvk":"admissionregistration.k8s.io/v1, Kind=MutatingWebhookConfiguration"}
{"level":"error","timestamp":"2023-04-19T19:00:02.209346Z","logger":"cert-rotation","caller":"/go/pkg/mod/github.com/open-policy-agent/cert-controller@v0.5.0/pkg/rotator/rotator.go:729$rotator.(*ReconcileWH).ensureCerts","message":"Error updating webhook with certificate","name":"azure-wi-webhook-mutating-webhook-configuration","gvk":"admissionregistration.k8s.io/v1, Kind=MutatingWebhookConfiguration","error":"mutatingwebhookconfigurations.admissionregistration.k8s.io \"azure-wi-webhook-mutating-webhook-configuration\" is forbidden: User \"system:serviceaccount:azure-workload-identity-system:azure-wi-webhook-admin\" cannot update resource \"mutatingwebhookconfigurations\" in API group \"admissionregistration.k8s.io\" at the cluster scope: Azure does not have opinion for this user."}
{"level":"error","timestamp":"2023-04-19T19:00:02.209888Z","caller":"/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.13.1/pkg/internal/controller/controller.go:326$controller.(*Controller).reconcileHandler","message":"Reconciler error","controller":"cert-rotator","object":{"name":"azure-wi-webhook-server-cert","namespace":"azure-workload-identity-system"},"namespace":"azure-workload-identity-system","name":"azure-wi-webhook-server-cert","reconcileID":"ba47bcab-db12-484a-8021-4a81848eaa09","error":"mutatingwebhookconfigurations.admissionregistration.k8s.io \"azure-wi-webhook-mutating-webhook-configuration\" is forbidden: User \"system:serviceaccount:azure-workload-identity-system:azure-wi-webhook-admin\" cannot update resource \"mutatingwebhookconfigurations\" in API group \"admissionregistration.k8s.io\" at the cluster scope: Azure does not have opinion for this user."}
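
Each log line above is a single JSON object, so the permission failures can be filtered out with a few lines of Python. This is only a minimal sketch: the function name `forbidden_errors` is made up for illustration, and the sample input is a shortened copy of two entries from the logs above, not the full lines.

```python
import json


def forbidden_errors(lines):
    """Return (timestamp, error) pairs for error-level entries that report an RBAC denial."""
    hits = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between entries
        entry = json.loads(line)
        error = entry.get("error", "")
        if entry.get("level") == "error" and "is forbidden" in error:
            hits.append((entry["timestamp"], error))
    return hits


# Sample input: shortened copies of two entries from the logs above.
sample = [
    '{"level":"info","timestamp":"2023-04-19T19:00:02.133676Z","message":"Ensuring CA cert"}',
    '{"level":"error","timestamp":"2023-04-19T19:00:02.209346Z",'
    '"message":"Error updating webhook with certificate",'
    '"error":"mutatingwebhookconfigurations.admissionregistration.k8s.io is forbidden: '
    'Azure does not have opinion for this user."}',
]

for timestamp, error in forbidden_errors(sample):
    print(timestamp, error)
```

In practice the input would be the pod's log output read line by line; only the error-level entries that mention "is forbidden" are kept.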

Environment

AKS

Additional context

I'm looking to get my environment working so I can test a change to Vault Agent to support WIF authentication for Vault Agent.

aramase commented 1 year ago

> There could be something I'm missing, but this is a fresh workload-identity-webhook install on a fairly fresh (created last week) AKS cluster, so I kind of would expect this to 'just work' since this is meant to be the new way to do things. If there is something I'm missing, do let me know!

If you're enabling the add-on with --enable-workload-identity, you don't have to install the webhook again from this repo. The add-on is a managed version of this project, and when you run with --enable-workload-identity, AKS deploys the webhook in the kube-system namespace.

VioletHynes commented 1 year ago

Ah. The WIF troubleshooting documentation suggested I debug in the azure-workload-identity-system namespace, which wasn't populated at all: https://azure.github.io/azure-workload-identity/docs/troubleshooting.html - the other documentation (e.g. https://learn.microsoft.com/en-us/azure/aks/workload-identity-deploy-cluster) doesn't really mention troubleshooting steps or that it installs these resources or where.

I'm not entirely convinced that the one that was installed by default is working right now either, but now that I know where it is, I can at the very least look at the logs a bit and understand why.

Would you suggest that I uninstall the helm chart, and should the one installed by --enable-workload-identity be good enough?

aramase commented 1 year ago

> Ah. The WIF troubleshooting documentation suggested I debug in the azure-workload-identity-system namespace, which wasn't populated at all: https://azure.github.io/azure-workload-identity/docs/troubleshooting.html - the other documentation (e.g. https://learn.microsoft.com/en-us/azure/aks/workload-identity-deploy-cluster) doesn't really mention troubleshooting steps or that it installs these resources or where.

The troubleshooting docs in this repo are specific to the helm chart installation. Thanks for pointing out the missing section in the AKS docs. There is room for improvement here.

> Would you suggest that I uninstall the helm chart, and should the one installed by --enable-workload-identity be good enough?

Yes, you can uninstall the helm chart. Only a single instance of the webhook is required.

VioletHynes commented 1 year ago

It has been quite confusing to troubleshoot WIF and those were the only troubleshooting docs I could find (e.g. they're the top google result for "workload identity federation troubleshooting azure") - it might be helpful to also note in the troubleshooting docs where to find the troubleshooting docs for WIF that doesn't use the helm chart, or some information that's not specific to the helm chart information. Nothing in the docs that I can see would indicate that they don't apply to AKS.

The only other docs I've seen (as part of the error message I get when the request to http://169.254.169.254/metadata/identity/oauth2/token fails) is this one, which doesn't mention WIF at all: https://aka.ms/azsdk/go/identity/troubleshoot#managed-id

Thanks for the help, though! I appreciate the help and information greatly. Feel free to close this. I'm still surprised the resource errored in the way it did on a fresh install, but ultimately I'm not blocked by it any more.

aramase commented 1 year ago

> It has been quite confusing to troubleshoot WIF and those were the only troubleshooting docs I could find (e.g. they're the top google result for "workload identity federation troubleshooting azure") - it might be helpful to also note in the troubleshooting docs where to find the troubleshooting docs for WIF that doesn't use the helm chart, or some information that's not specific to the helm chart information.

There should be a separate troubleshooting guide in the AKS docs. @miwithro @karataliu could you track this?

> The only other docs I've seen (as part of the error message I get when the request to http://169.254.169.254/metadata/identity/oauth2/token fails) is this one, which doesn't mention WIF at all: https://aka.ms/azsdk/go/identity/troubleshoot#managed-id

This means the workload is using an old version of the SDK, which still relies on IMDS to get a managed identity token. Here are the minimum required SDK versions for workload identity: https://azure.github.io/azure-workload-identity/docs/topics/language-specific-examples/azure-identity-sdk.html.
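
A quick way to see whether a pod was mutated at all is to look at its environment. As far as I know, the webhook injects `AZURE_CLIENT_ID`, `AZURE_TENANT_ID`, `AZURE_FEDERATED_TOKEN_FILE` and `AZURE_AUTHORITY_HOST`, and SDK versions with workload identity support read those instead of calling IMDS. Treat that list as an assumption to verify against the docs for your version. A minimal sketch to run inside the pod (the helper name `missing_workload_identity_env` is hypothetical):

```python
import os

# Variables the webhook is expected to inject into mutated pods.
# Assumption: verify this list against the project docs for your version.
EXPECTED_VARS = (
    "AZURE_CLIENT_ID",
    "AZURE_TENANT_ID",
    "AZURE_FEDERATED_TOKEN_FILE",
    "AZURE_AUTHORITY_HOST",
)


def missing_workload_identity_env(environ=None):
    """Return the expected variables that are unset or empty in `environ`."""
    env = os.environ if environ is None else environ
    return [name for name in EXPECTED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_workload_identity_env()
    if missing:
        print("Pod does not look mutated; missing:", ", ".join(missing))
    else:
        print("Workload identity variables are present.")
```

If the variables are present and the request to IMDS still happens, the SDK version is the more likely culprit; if they are missing, the webhook did not mutate the pod.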

> I'm still surprised the resource errored in the way it did on a fresh install

The service account permission error could be caused by running multiple instances of the webhook. Just enabling the add-on with --enable-workload-identity shouldn't produce any errors.

karataliu commented 1 year ago

To clarify, there are two ways you can enable workload identity on AKS:

  1. Using the AKS integration: https://aka.ms/aks/wi. This installs the webhook deployment in the kube-system namespace.
  2. Using the open source version: see https://github.com/Azure/azure-workload-identity/tree/main/charts/workload-identity-webhook. By default, this installs the webhook deployment in the azure-workload-identity-system namespace.

Using both together will result in a conflict.

The cause of the issue here is a non-namespaced resource, a ClusterRoleBinding. After you install the AKS version, it points to the service account in the kube-system namespace. When you then install the open source version, the binding is temporarily changed to point to the azure-workload-identity-system namespace, but the AKS integration keeps resetting it to the kube-system namespace. As a result, the pods in the azure-workload-identity-system namespace report errors.

The suggestion here is to choose only one of the solutions (AKS integration or open source).
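
To make the conflict described above concrete, here is a minimal sketch that models the shared ClusterRoleBinding as a plain dictionary, in the shape returned by `kubectl get clusterrolebinding <name> -o json`. The binding and role names are hypothetical placeholders, and it is an assumption that both installs use the same service account name; only `azure-wi-webhook-admin` and the two namespaces come from this thread.

```python
def binding_covers(binding, namespace, service_account):
    """True if the ClusterRoleBinding lists the given ServiceAccount as a subject."""
    for subject in binding.get("subjects") or []:
        if (
            subject.get("kind") == "ServiceAccount"
            and subject.get("namespace") == namespace
            and subject.get("name") == service_account
        ):
            return True
    return False


# State after the AKS integration has reset the binding to kube-system.
# "example-webhook-binding" and "example-webhook-role" are hypothetical names.
binding = {
    "kind": "ClusterRoleBinding",
    "metadata": {"name": "example-webhook-binding"},
    "roleRef": {"kind": "ClusterRole", "name": "example-webhook-role"},
    "subjects": [
        {
            "kind": "ServiceAccount",
            "name": "azure-wi-webhook-admin",  # assumed to be the same in both installs
            "namespace": "kube-system",
        }
    ],
}

for ns in ("kube-system", "azure-workload-identity-system"):
    print(ns, binding_covers(binding, ns, "azure-wi-webhook-admin"))
```

The check returns `True` for kube-system and `False` for azure-workload-identity-system, which corresponds to the "forbidden" errors reported by the pods installed from the helm chart.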