Azure / AKS

Azure Kubernetes Service
https://azure.github.io/AKS/

[Feature] An ability to specify OIDC issuer url (guid or even user-provided-string) deterministically during cluster creation time #3982

Open ericsuhong opened 1 year ago

ericsuhong commented 1 year ago

Is your feature request related to a problem? Please describe. Right now, when an AKS cluster is created with the OIDC issuer enabled, the OIDC issuer URL is generated randomly, for example:

https://westus2.oic.prod-aks.azure.com/[tenantId]/[random-guid]/

This poses a maintenance problem when we need to delete and recreate a cluster, because we then have to ask every deployed service to update its federated credentials with the newly generated OIDC issuer URL.

Describe the solution you'd like An ability to specify an OIDC issuer GUID deterministically at cluster creation time, such as:

az aks update -g myResourceGroup -n myAKSCluster --enable-oidc-issuer --oidc-issuer-guid=[guid]

or even better,

az aks update -g myResourceGroup -n myAKSCluster --enable-oidc-issuer --oidc-issuer-guid=[user-provided-identifier]

This would allow us to keep the same OIDC issuer URL even when clusters are destroyed and recreated, making the recreation process transparent to deployed services, which would no longer need to update their federated credentials.
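To illustrate the coupling being described: today, a federated credential has to embed the cluster's randomly generated issuer URL. A minimal sketch with the az CLI (resource and identity names here are placeholders, not from the original report; the audience defaults to api://AzureADTokenExchange):

```shell
# Read the randomly generated issuer URL off the cluster.
ISSUER=$(az aks show -g myResourceGroup -n myAKSCluster \
  --query "oidcIssuerProfile.issuerUrl" -o tsv)

# Create a federated credential on a user-assigned managed identity.
# The issuer URL is baked into the credential, so a recreated cluster
# (with a new issuer URL) invalidates it.
az identity federated-credential create \
  --name myFederatedCredential \
  --identity-name myIdentity \
  --resource-group myResourceGroup \
  --issuer "$ISSUER" \
  --subject "system:serviceaccount:my-namespace:my-service-account"
```

Because `$ISSUER` changes on every cluster recreation, every such credential goes stale, which is the maintenance problem described above.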

illrill commented 1 year ago

This is a must-have for multi-tenancy, where the cluster lifecycle is typically controlled by a platform team, but the federated credentials and managed identities are controlled by users/developer teams.

Without a static/predictable/recoverable OIDC issuer URL, if the platform team needs to recreate the cluster for any reason, the OIDC issuer URL would get rotated and cause a breaking change for users' workload identity federations.

CocoWang-wql commented 12 months ago

Thanks for letting us know your feedback and user scenario. There is a security risk in a BYO (bring your own) OIDC issuer URL. We are looking into potential workarounds.

illrill commented 12 months ago

Thanks for the attention, @CocoWang-wql. I don't think we necessarily need the ability to BYO issuer URL at cluster creation, as long as all federated credentials "find their way back" after a cluster recreation. I suppose there could be a couple of angles to approach this from; here are a few:

  1. Introduce a new resource such as a Microsoft.Authorization/OIDCIssuer that we can attach to an AKS cluster (or even better, attach it simultaneously to multiple AKS clusters). Needless to say, its lifecycle would need to be decoupled from the AKS cluster's lifecycle.
  2. Make the federated credential (FC) on e.g. the user-assigned managed identity (UAMI) require only the AKS resource ID as input (not its OIDC issuer). When the AKS cluster is recreated, the FC would need to be automatically updated with the recreated cluster's OIDC issuer (Note: In our case, the AKS cluster and the UAMI/FC reside in different subscriptions, but within the same tenant).
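Option 1 could look something like the following ARM-style sketch. To be clear, everything here is hypothetical: the Microsoft.Authorization/OIDCIssuers type, its properties, and an `issuerResourceId` field on the AKS `oidcIssuerProfile` do not exist today and are shown only to make the proposal concrete:

```json
{
  "type": "Microsoft.Authorization/OIDCIssuers",
  "name": "shared-prod-issuer",
  "location": "westeurope",
  "properties": {
    "issuerUrl": "https://westeurope.oic.prod-aks.azure.com/<tenantId>/<stable-guid>/"
  }
}
```

One or more AKS clusters would then reference it instead of owning the issuer:

```json
{
  "type": "Microsoft.ContainerService/managedClusters",
  "name": "myAKSCluster",
  "properties": {
    "oidcIssuerProfile": {
      "enabled": true,
      "issuerResourceId": "/subscriptions/<subscription>/resourceGroups/<resource-group>/providers/Microsoft.Authorization/OIDCIssuers/shared-prod-issuer"
    }
  }
}
```

With the issuer's lifecycle decoupled from the cluster's, deleting and recreating a cluster (or running blue/green clusters) would leave every federated credential's `issuer` property valid.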
CocoWang-wql commented 11 months ago

Thanks for the info. I'd like to understand the details better. From your description, I understand the pain point is that you need to update the OIDC issuer URL on all services after cluster re-creation. My question is this: in the pod YAML, the only parameter introduced is the service account name. IMO, after the OIDC URL changes, you only need to re-establish the federated identity credential; you don't need to update each pod YAML file, because the service account name doesn't change.

@illrill @ericsuhong

illrill commented 11 months ago

The problem is not with the Pod or the Service Account. The problem is that the user-assigned managed identity, to which the Service Account is federated via a Microsoft.ManagedIdentity/userAssignedIdentities/federatedIdentityCredentials resource, has an issuer property that requires the cluster's OIDC issuer URL (which is unpredictable).

{
  "audiences": [
    "api://AzureADTokenExchange"
  ],
  "id": "/subscriptions/<subscription>/resourcegroups/<resource-group>/providers/Microsoft.ManagedIdentity/userAssignedIdentities/<managed-identity>/federatedIdentityCredentials/<federated-credential>",
  "issuer": "https://westeurope.oic.prod-aks.azure.com/<subscription>/<oidc-issuer-url>/",
  "name": "<federated-credential>",
  "resourceGroup": "<resource-group>",
  "subject": "system:serviceaccount:<namespace>:<service-account>",
  "systemData": null,
  "type": "Microsoft.ManagedIdentity/userAssignedIdentities/federatedIdentityCredentials"
}

Here's the scenario.

  1. The user/developer team has created a federatedIdentityCredentials resource on their userAssignedIdentities resource. They have specified the current cluster's OIDC issuer URL. This enables the Service Account to assume the identity of the User-assigned Managed Identity when interacting with Azure. Everything is fine.
  2. The platform team decides to recreate the cluster for some reason (delete old + create new with same name, and use e.g. Velero to backup & restore all of the cluster's K8s resources). The new cluster gets a new OIDC issuer URL.
  3. The federatedIdentityCredentials issuer property has now become outdated. The Service Account is no longer allowed to assume the identity of the Managed Identity, because there is no valid federatedIdentityCredentials anymore. In other words, the workload identity federation is broken.
  4. The user/developer team is forced to repair the situation by updating the federatedIdentityCredentials resource's issuer property with the new OIDC issuer URL.

The practical result of this is that an AKS cluster must be treated like a "pet" that can never be recreated. If we recreate it, we cause a breaking change for all users/developers, in the sense that all workload identity federations stop working and we have to call every developer/user and ask them to update their federatedIdentityCredentials issuer property.

ericsuhong commented 11 months ago

The pain point is having to re-establish the federated identity credential with the updated OIDC URL for all services. Imagine running 100+ services (each with a distinct MSI) in a cluster and having to update the OIDC URL for every one of them.

ceilingfish commented 10 months ago

Also feeling the pain of this issue. We have to coordinate the recreation of all managed identity federations.

If there was a way to programmatically find all federated credentials for a cluster (by tag name or something), then we could automate this, but currently we'd have to search through all managed identities to find matches.

I guess we could experiment with storing this URL behind some reverse proxy, but that seems like a lot of experimentation for something that might not work.
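The "search through all managed identities to find matches" remediation mentioned above can at least be scripted against the az CLI. A sketch, assuming the caller has read/write access to every affected identity in the subscription (the resource group, cluster name, and OLD_ISSUER value are placeholders):

```shell
#!/usr/bin/env bash
set -euo pipefail

# Issuer URL of the deleted cluster, and the recreated cluster's new one.
OLD_ISSUER="https://westeurope.oic.prod-aks.azure.com/<tenantId>/<old-guid>/"
NEW_ISSUER=$(az aks show -g myResourceGroup -n myAKSCluster \
  --query "oidcIssuerProfile.issuerUrl" -o tsv)

# Enumerate every user-assigned managed identity in the subscription...
az identity list --query "[].{name:name,rg:resourceGroup}" -o tsv |
while read -r NAME RG; do
  # ...find its federated credentials still pointing at the old issuer...
  az identity federated-credential list \
    --identity-name "$NAME" -g "$RG" \
    --query "[?issuer=='$OLD_ISSUER'].name" -o tsv |
  while read -r FC; do
    # ...and repoint each one at the new issuer.
    az identity federated-credential update \
      --identity-name "$NAME" -g "$RG" --name "$FC" --issuer "$NEW_ISSUER"
  done
done
```

This only works within subscriptions the platform team can enumerate, so it doesn't help with the multi-tenant case above where identities live in subscriptions the platform team cannot even see.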

duncan485 commented 8 months ago

+1 on this issue; this is a huge pain for platform teams that need to replace clusters. There should be a way to have a 'static' endpoint so we do not need to update the federations on all of the identities.

pranaypathik commented 7 months ago

+1 on the issue. We don't have control over the downstream configuration, which adds complexity and dependency.

ceilingfish commented 7 months ago

I have now paused our migration to workload identity, as this would make DR so much harder. We'll stick with AAD pod identity until this is resolved.

qhris commented 6 months ago

The inability to share OIDC issuer URIs between clusters is a pain point for workload identity adoption. We do blue/green Kubernetes cluster deployments to avoid potential issues during infrastructure updates, e.g.:

aks-cluster-dev-blue   <- active ingress
aks-cluster-dev-green

The problem is that with every update cycle, the new cluster gets a new issuer URI, and we have to keep track of and re-create every federation (actually keep two instances, because both clusters are online at the same time). This is something we have solved for our self-hosted clusters, where we can bring our own static issuer.

Having the issuer as a separate object in Azure would be great, along with the ability to optionally specify one when creating/recreating a cluster. In our case the two clusters would simply point at the same issuer; in that scenario it doesn't matter what the URI is, as long as it's static. The complexity of creating and rotating keys could also be abstracted away from the user.

EDIT: Reading this again I realized what I suggested above is exactly what @illrill suggested, I missed that somehow :)

jordan-owen commented 6 months ago

+1 for this

A possible workaround would be to use Terraform to destroy/create an AKS cluster, and then a Terraform apply to update the identities based on the new cluster OIDC issuer URL. It would be great not to have to do this.

mfacenet commented 4 months ago

> +1 for this
>
> A possible workaround would be to use Terraform to destroy/create an AKS cluster, and then a Terraform apply to update the identities based on the new cluster OIDC issuer URL. It would be great not to have to do this.

That's a potential workaround, but a bad one. As called out above, my team manages the "platform" and we have hundreds of services deployed on the cluster; we don't manage their identities, and in many cases we can't even see them.

We've engaged with Microsoft professional services (which is how I was linked to this thread) because we have a similar issue to what's mentioned here: our DR strategy is to replace the cluster if something goes catastrophically wrong. We also run our own on-prem clusters, which let us manage our own JWKS/OIDC endpoints; that is not possible with Azure, since we have no read/write access to the service account signing key or to cluster configuration at that level. We recently ran trials with GKE, and with their fleet (what used to be Anthos) there is a single endpoint that multiple clusters share. That was part of our request as well, since we run many clusters per environment (effectively one federated credential for "prod" instead of one per cluster).

illrill commented 1 month ago

Will this be tracked/resolved by #2861?

wolszakp commented 1 week ago

@CocoWang-wql any updates on this topic?