Azure / azure-service-operator

Azure Service Operator allows you to create Azure resources using kubectl
https://azure.github.io/azure-service-operator/
MIT License
732 stars 193 forks source link

Bug: RoleAssignment UUID clashes #3637

Closed Roman-Galeev closed 4 weeks ago

Roman-Galeev commented 9 months ago

Version of Azure Service Operator v2.4.0

Describe the bug We have development, staging and production subscriptions for deployments, and each of them has a managed kubernetes service (AKS), so the algorithm used to calculate "stable" AzureName UUID produces clashes. This is the function used currently:

        assignment.Spec.AzureName = randextensions.MakeUUIDName(
            ownerGK,
            assignment.Spec.Owner.Name,
            gk,
            assignment.Namespace,
            assignment.Name)

I believe that "uniqueness" of the UUID could be easily improved by including into the seed string subscription ID (which is known to ASO anyways). However, this will not cover the case when there are more than one cluster in the subcription, but handling that is a more complex story, e.g. ASO could generate an instance ID on install, and use it as a seed.

matthchr commented 9 months ago

but handling that is a more complex story, e.g. ASO could generate an instance ID on install, and use it as a seed.

We considered this but avoided it because one of the goals for the autogenerated UUID was that it be stable across instances of ASO to support migration stories where a resource was moved from one ASO cluster to another.

I assume you must be running multiple instances of ASO across multiple clusters (maybe 1 for prod, 1 for staging, 1 for dev?). The current workaround is documented here (in the code)

// In the rare case users have multiple ASO instances with resources in the same namespace in each cluster
// having the same name but not actually pointing to the same Azure resource (maybe in a different subscription?)
// they can avoid name conflicts by explicitly specifying AzureName for their RoleAssignment.

Maybe this case is not as rare as we anticipated.

I'm open to moving to include subscription in this too, although we'd need to think about how to do so in a way which is backwards compatible as we won't want to orphan existing RoleAssignments.

matthchr commented 9 months ago

I'm open to moving to include subscription in this too, although we'd need to think about how to do so in a way which is backwards compatible as we won't want to orphan existing RoleAssignments.

Actually we wouldn't break existing RoleAssignments because defaulting only applies when they create a new one, it would break the case where they move RoleAssignments between two instances of ASO w/ the same name. Possibly we could document our way around that.

A (likely bigger) problem is that, at the point the webhook runs we don't actually know the subscription and it's actually possible that the subscription configuration doesn't exist yet, so it might be difficult for the webhook to take it into account.

Have you considered just using something like kustomize to make your dev/test/prod env roleassignments have slightly different names?

-dev, -test and -prod suffix on the RoleAssignment would solve the problem without you needing to fully manage the UUID yourselves.

Roman-Galeev commented 9 months ago

I assume you must be running multiple instances of ASO across multiple clusters (maybe 1 for prod, 1 for staging, 1 for dev?).

Yeah, each cluster has its own instance of ASO. Currently we generate UUIDs, but this is not really convenient, and UUID clashes are totally not nice, because it works, say, in dev, and then all of a sudden breaks in the next cluster.

-dev, -test and -prod suffix on the RoleAssignment would solve the problem without you needing to fully manage the UUID yourselves.

The idea is that deployment of ASO resources should be the same across environments, so we can promote changes between them in gitops way by making merge requests, so adding environment-specific suffixes to names breaks the assumption. It's possible to override the UUID or resource name on deployment, but IMO this is what operator should do, and it does, kind of.

we don't actually know the subscription

I think we do, as it's a configuration parameter of ASO itself. Actually, azureClientID used by ASO is the unique instance ID, so if we could use it as UUID5 namespace, then all UUIDs would be unique up to the instance.

Roman-Galeev commented 9 months ago

I would rather prefer dealing with errors like "assignment exists already" in case of reinstalling ASO with the new azureClientID and redeploying everything, than with RoleAssignmentUpdateNotPermitted, as the latter breaks the deployment, and the former is kind of expected.

Roman-Galeev commented 9 months ago

...and another idea is including principalID into role UUID seed, instead of owner.

related: https://github.com/Azure/azure-service-operator/issues/3318

theunrepentantgeek commented 9 months ago

There's an interesting conflict here between the two scenarios. In one case, generating a stable UUID for the role identity is crucial to allowing resources to be handed off between ASO instances; in the other, having a different UUID for each installation is needed to keep dev/test/prod environments separate.

I suspect there's no one-size-fits-all solution here; maybe we need to add some configuration to RoleAssignment to control how the stable UUIDs are generated. I'll drop this issue into our next milestone and tag as high-priority so we don't lose sight of it.