Azure / AKS

Azure Kubernetes Service

[BUG] PIM Activation Results in Unreliable Access Rights Update for Azure RBAC Roles #4232

Closed: RobClaySW closed this issue 3 weeks ago

RobClaySW commented 3 weeks ago

Describe the bug: We are experiencing the exact same behaviour as described in issue https://github.com/Azure/AKS/issues/3211 and have found it extremely detrimental to our attempts to limit production access whilst still being able to respond to issues in a timely manner.

We have four AKS instances set up using Azure AD authentication with Azure RBAC. On our production instance we use Microsoft Entra ID PIM Groups for just-in-time access to higher-privilege operations via the Azure Kubernetes Service RBAC Cluster Admin role, while keeping constant read access via Azure Kubernetes Service RBAC Reader. On non-production environments we have permanent access via the aforementioned admin role and everything works fine.

When we activate the PIM assignment that grants us JiT assignment of the Azure Kubernetes Service RBAC Cluster Admin role on production, more often than not there is an extended delay before we can use the rights granted by the role. The delay ranges from almost instant (as required) to anything up to an hour.

We think that this delay may be caused by accessing any AKS instance in your tenant and then activating the PIM assignment for the production instance, possibly resulting in some sort of caching on the Azure side of things. We tried the following over a few days:

Whilst the above checks we carried out aren't cut and dried, they might hopefully give some pointers. As in the ticket referenced at the start, we've also found that the Azure Portal often seems to update faster, though this isn't always the case and seems to kick in separately from access via CLI/kubectl.

To Reproduce: Steps to reproduce the behavior:

  1. Have at least two AKS instances set up using Azure AD authentication with Azure RBAC
  2. Grant yourself permanent Azure Kubernetes Service RBAC Cluster Admin on the "non-production" instance; you should be able to perform actions like kubectl get secrets --cluster NONPRODUCTION_CLUSTER or kubectl get nodes --cluster NONPRODUCTION_CLUSTER successfully on this Cluster.
  3. Grant yourself permanent Azure Kubernetes Service RBAC Reader on the "production" instance; you should be able to perform actions like kubectl get pods -n PRODUCTION_NAMESPACE --cluster PRODUCTION_CLUSTER but shouldn't be able to perform actions such as kubectl get secrets --cluster PRODUCTION_CLUSTER or kubectl get nodes --cluster PRODUCTION_CLUSTER, which instead result in a forbidden error.
  4. Set up an AAD/Entra ID Privileged Identity Management Group and assign it the Azure Kubernetes Service RBAC Cluster Admin role on the "production" AKS instance.
  5. Make your account eligible to activate the PIM Group then run something like kubectl get secrets --cluster PRODUCTION_CLUSTER to have a recent access attempt (you should still get the forbidden error).
  6. Action the activation (you can use the default 8 hours, it doesn't really matter).
  7. Once you've got confirmation of the activation, attempt to access the more restricted "production" instance resources again such as kubectl get secrets --cluster PRODUCTION_CLUSTER and you should still get the forbidden error. Keep trying this command and it should keep failing for some undetermined amount of time but may eventually allow you access.
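To put a number on the delay in step 7, the retry loop can be scripted. The following is a minimal sketch rather than anything from the original report: it assumes kubectl is on PATH and already configured for the cluster, and it treats a zero exit code from kubectl get secrets as "access granted". The function names, the 30-second interval, and the one-hour timeout are illustrative choices.

```python
import subprocess
import time
from typing import Callable, Optional


def can_list_secrets(cluster: str) -> bool:
    """Return True if `kubectl get secrets` succeeds against the given cluster."""
    result = subprocess.run(
        ["kubectl", "get", "secrets", "--cluster", cluster],
        capture_output=True,  # keep secret listings and errors off the terminal
        text=True,
    )
    return result.returncode == 0


def wait_for_access(
    probe: Callable[[], bool],
    interval_s: float = 30.0,
    timeout_s: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Optional[float]:
    """Poll until probe() returns True.

    Returns the elapsed seconds, or None if timeout_s passes first.
    sleep and clock are injectable so the loop can be tested without waiting.
    """
    start = clock()
    while True:
        if probe():
            return clock() - start
        if clock() - start >= timeout_s:
            return None
        sleep(interval_s)
```

Calling wait_for_access(lambda: can_list_secrets("PRODUCTION_CLUSTER")) immediately after the activation is confirmed gives the observed delay in seconds, or None if access has still not arrived after an hour.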

Expected behavior: Once a PIM Group that is assigned the Azure Kubernetes Service RBAC Cluster Admin role on an AKS instance has been activated, the change in access rights should be reflected in a timely manner for the user (ideally within a couple of minutes).

Screenshots: N/A, though screenshots can be generated if needed to aid troubleshooting.

Environment (please complete the following information):

Additional context: N/A, but many thanks for taking the time to read this mini-essay!

JoeyC-Dev commented 3 weeks ago

I actually wrote about this behaviour on my blog. I will quote this from my own article:

Due to the design of token lifetimes, if you grant roles to users who use CLI tools such as kubectl/kubelogin, the effective duration of a role activation (grant) made through the approval process technically cannot be lower than 60 minutes. Even if the duration is set to 0.5 hours, the actual effective time is still between 60 and 75 minutes. This is because when kubelogin requests tokens from the Microsoft identity platform, both an access_token and a refresh_token are returned for further use. The access_token is used to make requests to the API, and the refresh_token is used to get a new access_token if the original one is no longer valid. The access_token cannot be revoked once it has been generated; only the refresh_token can be revoked.
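As a rough illustration of the point about access_token lifetime (a sketch, not code from the article), the exp claim of a JWT-format access token can be read with the Python standard library to see how long an already issued token stays valid. The token below is hand-built and unsigned purely for the demo, and the helper names are hypothetical. Where kubelogin keeps its cached token is not covered here.

```python
import base64
import json
import time
from typing import Optional


def jwt_claims(token: str) -> dict:
    """Decode the payload (second segment) of a JWT without verifying the signature."""
    payload_b64 = token.split(".")[1]
    # JWT segments are base64url without padding; restore it before decoding.
    payload_b64 += "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(payload_b64))


def seconds_until_expiry(token: str, now: Optional[float] = None) -> float:
    """Seconds until the token's `exp` claim; negative if already expired."""
    now = time.time() if now is None else now
    return jwt_claims(token)["exp"] - now


def _b64url(data: dict) -> str:
    """Encode a dict as an unpadded base64url JSON segment (demo helper)."""
    raw = json.dumps(data).encode()
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()


# Hand-built, unsigned demo token that expires 65 minutes after issued_at.
issued_at = 1_700_000_000
demo_token = ".".join([
    _b64url({"alg": "none", "typ": "JWT"}),
    _b64url({"iat": issued_at, "exp": issued_at + 65 * 60}),
    "",
])

# Ten minutes after issue, the token still has 55 minutes left.
remaining = seconds_until_expiry(demo_token, now=issued_at + 10 * 60)
print(f"{remaining / 60:.0f} minutes remaining")
```

A token obtained shortly before a PIM activation would keep being presented for the rest of that window, which is consistent with the delays of up to an hour described in the report.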

Link: https://blog.joeyc.dev/posts/aks-access-control-pim/

Put simply: this is a bug as intended. As long as an OAuth-based flow is used, it will behave like this. I believe you would need to submit a feature request at https://github.com/Azure/kubelogin, for example to make it refresh the token automatically.

As a workaround, try kubelogin remove-tokens and log in again.
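If the workaround is needed often, it can be wrapped in a small helper. This is a minimal sketch under assumptions: kubelogin and kubectl are both on PATH, the retried command is the kubectl get secrets call from the reproduction steps, and the function names are made up for illustration. How the fresh login is prompted depends on the kubelogin login mode configured in your kubeconfig.

```python
import subprocess
from typing import Callable, Sequence

Runner = Callable[[Sequence[str]], int]


def run_command(argv: Sequence[str]) -> int:
    """Run a command, letting it use the terminal, and return its exit code."""
    return subprocess.run(list(argv)).returncode


def refresh_and_retry(cluster: str, run: Runner = run_command) -> bool:
    """Drop kubelogin's cached tokens, then retry a privileged kubectl call.

    Returns True if the retried command succeeds.
    """
    # Clear cached tokens so the next kubectl call has to authenticate again.
    if run(["kubelogin", "remove-tokens"]) != 0:
        return False
    # kubectl invokes kubelogin again here, which triggers a fresh login.
    return run(["kubectl", "get", "secrets", "--cluster", cluster]) == 0
```

Run refresh_and_retry("PRODUCTION_CLUSTER") after the PIM activation has been confirmed. The run parameter exists only so the sequence can be exercised without touching a real cluster.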

RobClaySW commented 3 weeks ago

@JoeyC-Dev, thanks for the brilliant response and your blog link, incredibly helpful in explaining this interaction and the workaround is spot on.

I do think it would be very beneficial if Microsoft could add some reference to this on one of the documentation pages such as https://learn.microsoft.com/en-us/azure/aks/access-control-managed-azure-ad under the troubleshooting section, as we clearly didn't have a detailed enough understanding of the auth flow interaction (which is our own downfall) and weren't able to come across an explanation like the above in our research (again, perhaps our own shortcoming).

Either way, I'll close this off, since the behaviour is as intended, as stated in the comment above.

JoeyC-Dev commented 3 weeks ago

@RobClaySW I tried (https://github.com/MicrosoftDocs/azure-docs/issues/118377), but they kept holding my tutorial back from publishing for some reason, so I published it on my own blog as I could not wait any longer.