akuity / kargo

Application lifecycle orchestration

https://kargo.akuity.io/

Apache License 2.0

1.79k stars 146 forks source link

ACR short-lived token support #2372

Open blakepettersson opened 3 months ago

blakepettersson commented 3 months ago

Checklist

[x] I've searched the issue queue to verify this is not a duplicate feature request.

Proposed Feature

Adding the ability for "password-less", short-lived token support for freight (OCI Helm charts + Docker images) in ACR.

Motivation

This already exists for ECR and GCR repositories, so IMO it would make sense to also have the same for ACR.

Suggested Implementation

There are some differences between how Azure does IAM and how other clouds handle IAM. Whereas with AWS IAM we derive the IAM role name from the Kargo project name, and with GCP the service account name is also derived from the Kargo project name, doing this in Azure becomes a whole lot trickier.

Everything in Azure is done on the basis of a ClientID, so in order to do what we really want to do (= get a short-lived token for ACR) we would first need to lookup the correct ClientID by first finding a Managed Identity matching the correct name. Before we get to that point though, we would need to first retrieve a managed identity client that can retrieve said managed identity, and for that we would to get the client by looking it up with a subscription id and resource group name. While all of this could be done, I'd like to propose an alternative.

Instead of doing the steps above, I'd like to instead propose the federation of Kubernetes service account tokens to Azure Managed Identities. It is already the case that every Kargo project corresponds to a Kubernetes namespace, and each of those namespaces currently has two service accounts - kargo-viewer and kargo-admin. For this example I'm using kargo-viewer as the example service account to federate with, but this could potentially be kargo-admin or another serviceaccount altogether custom made for token creation.

For a user to take advantage of this, the steps would be to

Create a user-managed-identity
Federate the managed identity to the relevant Kargo namespace and Kargo service account.
Give the relevant permissions (AcrPull) to the managed identity in the relevant Azure repositories

We still have the issue of assigning the ClientID (as well as the TenantID) to the correct Kargo project. There are a few ways in which we could do this, but I think the one which would require least work is to retrieve those values from the Kargo project directly. For that I'd like to propose the addition of two annotations to be used for this purpose:

azure.workload.identity/client-id (mandatory for this to work)
azure.workload.identity/tenant-id (optional, fallback would be whatever is set on the Kargo controller pod)

The Kargo controller gets the ClientID and TenantID from the Kargo project CR and if everything is good, starts the OIDC dance:

It looks up the relevant Kargo serviceaccount (kargo-viewer for this example)
Uses the TokenRequest API to create a temporary token for said serviceaccount
Uses that token to exchange it for an Azure AD Token
Exchanges the Azure AD token to retrieve an ACR token scoped for the relevant repo, which is valid for 3h.

There are a few implications in regards to K8s RBAC for the kargo-controller:

We'd need to grant read access for serviceaccounts (as well as list and watch due to the serviceaccount informers requiring that access)
We'd also need to grant access to the TokenRequest API, with create privileges for serviceaccounts/token.

We'd need to ensure that this is something that's only allowed within kargo-project namespaces.

krancour commented 3 months ago

I've spent hours and hours pouring over Azure and Entra ID docs and have consulted a fair amount with Copilot as a sanity check in case I am missing something.

I believe there is a much more fundamental issue at play here than anything mentioned above. The challenges of working with numeric client IDs instead of predictable identity or role names set aside for a moment, Azure simply does not have any service that is equivalent to AWS STS -- that is to say there is no possible way for the controller's managed identity to assume a project-specific role to obtain a token narrowed to project-specific permissions.

The challenge becomes one of somehow emulating what STS does, but within the considerable constraints of Azure.

Speaking directly to your proposed method of doing this, a managed identity for every project that is federated to a project-specific service account is an interesting idea, however, I see those pesky "???" in your step 4. It's a dead end, because there is no way to make the Kargo controller dynamically use that service account, and even if there were, it still wouldn't work, because the way managed identity works for a pod in the first place is that it is decorated at creation time with credentials for the managed identity that is federated to the single service account referenced by the pod spec.

The bottom line is that what we've done for ECR and GAR is unachievable for ACR.

With ECR, we did implement something for the cases where a project-specific token could not be obtained or had insufficient permissions -- we fall back on the controller's own permissions. We haven't done the same for GAR yet, but should. It is quite possible that the very best we can do for ACR is to set aside the notion of using project-specific permissions and jump immediately to what we do as a fallback with ECR -- directly using the controller's own permissions.

This is weakens Kargo's tenancy model slightly when running on Azure and opting to use managed identity, but I am pretty convinced that this is about the best we can reasonably do.

blakepettersson commented 3 months ago

Thanks for taking the time looking into this! I do have to say that the Azure documentation, to put it mildly, is not great. I took my inspiration from external-secrets, which pretty much does what I'm proposing.

, however, I see those pesky "???" in your step 4.

That was referring to the fact that it's not clear how this would work with the client id and tenant id - I'll remove those question marks since it just adds confusion.

it still wouldn't work, because the way managed identity works for a pod in the first place is that it is decorated at creation time with credentials for the managed identity that is federated to the single service account referenced by the pod spec.

I'm not sure I'm following here, can you clarify?

I have a very rough PoC branch, and with this I managed to get freight from an ACR repository. I might be missing something fundamental here which prevents this from being a thing though, please let me know 😄

krancour commented 3 months ago

That was referring to the fact that it's not clear how this would work with the client id and tenant id - I'll remove those question marks since it just adds confusion.

Ok. No worries. As I said, my worry is that Azure lacks a fundamental capability that we would need and I mis-interpreted the ??? as, "idk how to get from here to here," which may have been true, but was perhaps not in reference to what I was concerned with.

it still wouldn't work, because the way managed identity works for a pod in the first place is that it is decorated at creation time with credentials for the managed identity that is federated to the single service account referenced by the pod spec.

I'm not sure I'm following here, can you clarify?

This is admittedly an area where I did less reading, but I assume Azure's workload identity solution works somewhat similarly to AWS's and GCP's here. The pod is "decorated" at creation time, probably via a mutating/defaulting webhook, and its responsibility is to look at what SA the pod uses, see what managed identity that SA is federated to, obtain credentials for that managed identity, and inject them into the pod. This leaves the workload running inside the pod in a state where is can effectively snatch managed identity credentials out of thin air using some Azure SDK.

What I had been trying to say was that although I found the idea of a managed identity per Kargo project to be an interesting one, what I explained above seems as if it would get in the way. Kargo can look up a SA. I see your PoC does that. But finding that SA and even using the token request API (which your PoC also does) leaves you with a token suitable for building a k8s client and accessing Kubernetes resources. All the "magic" described above of obtaining Azure credentials belonging to the managed identity federated to that SA remains undone...

As far as I can tell, it is in the exchangeForEntraIDToken() func that your PoC is trying to emulate that little bit of magic, but unless I am reading the code incorrectly (possible) or unless I missed some important bits of Azure docs (very possible, given how the information is so spread out), it looks like you are building an EntraID client using a Kubernetes token. Is that a mistaken, or does it somehow actually work?

Correction:

As far as I can tell, it is in the exchangeForEntraIDToken() func that your PoC is trying to emulate that little bit of magic

It's not the whole thing, but the first step of it, rather -- Getting an EntraID client so you can then look up the appropriate managed identity. My question of how an EntraID client is being constructed using a k8s token as a credential remains.

krancour commented 3 months ago

This is admittedly an area where I did less reading, but I assume Azure's workload identity solution works somewhat similarly to AWS's and GCP's here...

Ok... there are some implementation differences, but this was essentially correct, that workload identity in both EKS and AKS involve a webhook that injects information into pods at creation time.

So a call to config.LoadDefaultConfig(ctx) in AWS and a call to azidentity.NewDefaultAzureCredential(nil) in Azure more or less do the same thing -- the "plucking creds from thin air" that I referenced earlier. Both of these use information the pod was injected with at creation time to get credentials for the SA that the workload is running as.

For AWS, that is fine, because we can start from the controller's own permissions, use them to assume a project-specific role and we're in business.

In Azure, I saw this as a limitation, because we basically have no real interested in Azure creds for the SA that the controller is running as. We want to get Azure creds for the managed identity that some project-specific SA is federated to. i.e. We don't want to call azidentity.NewDefaultAzureCredential(nil). What we do want to do, which you have mostly worked out already @blakepettersson, is use lower-level APIs to do what azidentity.NewDefaultAzureCredential(nil) does, but for a project-specific SA instead of the SA that the controller is running as.

I mistakenly believed this was, by design, not possible, but I was wrong. So, ugly as I think this process looks in Azure, I've developed more confidence that the Azure parts of your solution are on the right track. My apologies that it took me so long to gain clarity on this.

The things that I still want to resolve here:

This strategy requires the controller to have very broad permissions to obtain tokens for any SA in the cluster. I'm not totally crazy about that for obvious reasons, but it wouldn't be the first time we gave the controller some pretty broad permissions out of sheer necessity and took some solace in controllers being inherently less vulnerable to attack since they are not user-facing. I could live with this if we had to settle for it, but if we can constrain this somehow, it would be preferable.
The bit about requiring Projects to be annotated with clientIDs feels, at best, inconvenient, and at worst, like a vulnerability.
1. Do we need it at all? If the SA is federated to a managed identity and EntraID can validate a SA token, then EntraID should know what the clientID is. We shouldn't have to tell it. When I inquired with Copilot about this, it claimed that, for this scenario, we could pass an empty string as the clientID. I'm dubious of this claim, but it would be wonderful if it were true. We need to put it to the test. This is the ideal scenario.
2. If we do need it, does it create a vulnerability? What happens if Project A is annotated with Project B's clientID? Does Project A gain access to everything Project B has access to? OR... referring back to the question above re: whether we need clientID at all: If AAD knows the clientID of the managed ID that is federated to the SA, will it notice the mismatch and prevent the Project A SA from being issued a token for accessing Project B resources. I would assume so, but also think this calls for more research and experimentation.
I don't want to federate Project-specific managed identities to any of the existing SAs that are created in each Project namespace. Those are for "use" by users. I would prefer that we have a new SA in every project namespace that is dedicated for this purpose.

All-in-all, I do think you're on the right track and that these issues seem as if they should be resolvable.

blakepettersson commented 3 months ago

Ok... there are some implementation differences, but this was essentially correct, that workload identity in both EKS and AKS involve a webhook that injects information into pods at creation time.

So a call to config.LoadDefaultConfig(ctx) in AWS and a call to azidentity.NewDefaultAzureCredential(nil) in Azure more or less do the same thing -- the "plucking creds from thin air" that I referenced earlier. Both of these use information the pod was injected with at creation time to get credentials for the SA that the workload is running as.

👍

I mistakenly believed this was, by design, not possible, but I was wrong. So, ugly as I think this process looks in Azure, I've developed more confidence that the Azure parts of your solution are on the right track. My apologies that it took me so long to gain clarity on this.

No worries, it took me a few days to figure out how this all fits together - if it wasn't for the existing external-secrets implementation I don't think I'd have pieced this together.

This strategy requires the controller to have very broad permissions to obtain tokens for any SA in the cluster. I'm not totally crazy about that for obvious reasons, but it wouldn't be the first time we gave the controller some pretty broad permissions out of sheer necessity and took some solace in controllers being inherently less vulnerable to attack since they are not user-facing. I could live with this if we had to settle for it, but if we can constrain this somehow, it would be preferable.

Agreed, this is far from ideal - I'll look into how this can be constrained.

The bit about requiring Projects to be annotated with clientIDs feels, at best, inconvenient, and at worst, like a vulnerability.

Do we need it at all? If the SA is federated to a managed identity and EntraID can validate a SA token, then EntraID should know what the clientID is. We shouldn't have to tell it. When I inquired with Copilot about this, it claimed that, for this scenario, we could pass an empty string as the clientID. I'm dubious of this claim, but it would be wonderful if it were true. We need to put it to the test. This is the ideal scenario.

I'll check it, but I also have doubts that this'll work without explicitly setting a ClientID.

If we do need it, does it create a vulnerability? What happens if Project A is annotated with Project B's clientID? Does Project A gain access to everything Project B has access to? OR... referring back to the question above re: whether we need clientID at all: If AAD knows the clientID of the managed ID that is federated to the SA, will it notice the mismatch and prevent the Project A SA from being issued a token for accessing Project B resources. I would assume so, but also think this calls for more research and experimentation.

I don't think it does, since for any of this to work in the first place we need to federate a service account to a managed identity. If there's a mismatch between the service account and the managed identity (whether we need to explicitly specify a ClientID or not) this should fail. What can happen is that multiple service accounts can be federated to the same managed identity (from what I've understood a single managed identity can federate up to 20 service accounts), which can lead to the scenario which you describe.

I think it's a matter of telling users that they would be "holding it wrong" in that scenario and being explicit that the recommendation is to use a separate managed identity per Kargo project. This isn't too dissimilar to the scenario with stacking excessive permissions on the Kargo controller role with AWS IAM, with the same caveats.

I don't want to federate Project-specific managed identities to any of the existing SAs that are created in each Project namespace. Those are for "use" by users. I would prefer that we have a new SA in every project namespace that is dedicated for this purpose.

Yup, makes total sense to me!

All-in-all, I do think you're on the right track and that these issues seem as if they should be resolvable.

🫶

blakepettersson commented 3 months ago

I'll check it, but I also have doubts that this'll work without explicitly setting a ClientID.

ClientID is mandatory - I get this if not specified:

error discovering artifacts: error discovering charts: error obtaining credentials for chart repository "oci://blake.azurecr.io/helm-guestbook": FromAssertion(): http call(https://login.microsoftonline.com/235c0747-63b4-4d89-98ee-382854c65b67/oauth2/v2.0/token)(POST) error: reply status code was 400: {"error":"unauthorized_client","error_description":"AADSTS700016: Application with identifier 'https://eastus.oic.prod-aks.azure.com/235c0747-63b4-4d89-98ee-382854c65b67/b5dba200-1680-4a23-ac63-a93ea34d2989/' was not found in the directory 'Default Directory'. This can happen if the application has not been installed by the administrator of the tenant or consented to by any user in the tenant. You may have sent your authentication request to the wrong tenant.

And if I do not specify TenantID I get this (although we could in theory fallback to using the tenant id of whatever has been applied to the kargo-controller pod, assuming the auth webhook is used):

Tenant 'v2.0' not found. Check to make sure you have the correct tenant ID and are signing into the correct cloud. Check with your subscription administrator, this may happen if there are no active subscriptions for the tenant.

krancour commented 3 months ago

ClientID is mandatory

That is a bummer. But since you've reminded me that a single managed identity can be federated to up to 20 service accounts... does it work in the other direction as well? A single service account federated to multiple managed identities? I'm pretty sure it's possible -- and it would explain why clientID is mandatory. It disambiguates.

And if I do not specify TenantID...

TenantID is less worrisome. We can set that globally in the chart and not require it to be set on a per-project basis.

juliusl commented 3 months ago

Hello, I'm from ACR and I work on container runtime stuff. I'm not entirely familiar with Kargo but is it a component running on the node itself?

And to paraphrase the issue, currently you need to get a client-id to start the authn flow with acr to grab an access token, but you are having issues resolving that client-id with the managed identity?

Edit: Oh and also is this scenario specifically targeting AKS? Or is the scope larger than that?

dtzar commented 3 months ago

Kargo is the OSS project which technically could get installed on any K8s cloud provider (e.g. GKE, EKS) that would need access to ACR, although I think the primary use case would be installed on an AKS cluster. e.g. A person using Kargo would want to roll out artifacts stored in ACR to just AKS clusters from an AKS cluster where Kargo is installed.

Kargo is sitting there monitoring ACR for changes to specified artifact(s).

I have no idea how to answer the auth parts, but hopefully that answers the general "what is Kargo trying to do / context for it being used".

juliusl commented 3 months ago

Thanks @dtzar for the clarification.

@blakepettersson @krancour for the use case, is using the kubelet identity (MSI) that is created with the node pool sufficient to satisfy this scenario?

krancour commented 3 months ago

Hi @dtzar and @juliusl!

I don't think we're actually experiencing any real issues at the moment, per se. We're just doing the leg work right now to get this working as similarly as possible to the way it works for EKS and GKE.

Some background on what we're trying to do, regardless of which public cloud Kargo is running in: Kargo itself is multi-tenant, with every tenant having its own namespace and some default ServiceAccounts, permissions, etc. that are all created automatically when a Project CRD is reconciled.

Naturally, each Kargo Project may have access to different ACR registries and repositories therein.

On both EKS and GKE, we set up workload identity for the Kargo controller(s) itself. We recommend those identities have no permissions except to assume other, Project-specific roles. This constrains the controller, when acting on behalf of a given Project, to Project-specific permissions.

In AKS, this has been more challenging, to say the least, because Azure lacks anything equivalent to AWS STS. i.e. The ability to assume a different role with specific permissions does not exist.

What we have settled on (and please do feel free to correct us if you think this is not the best approach) is, since every Project already has certain ServiceAccounts that we know will always exist, every Project will have one of its ServiceAccounts (a new one, dedicated for this purpose, actually) federated to its own managed identity. Kargo controllers use the Kubernetes token request API to acquire a token that can be exchanged for an Entra ID token belonging to the corresponding managed identity. We use that to obtain the temporary token for ACR access.

Most of the work that remains here is a matter of how best to streamline the above process. "Best" here meaning:

Best UX for the admin who installs Kargo (it's a Helm chart). Ideally, they wouldn't have to specify any Azure specific details. Things would just magically work.
Best UX for the Project owners/admins. Ideally, when setting up a Project, they would not have to specify any Azure specific details. Again -- things would just magically work.

This concern is the one we have had the hardest time addressing. Two pieces of information are required -- EntraID tenant ID and the Project-specific managed identity's client ID.

We expect all Kargo Projects' managed identities to reside in the same Entra ID tenant, so we can make the task of the Project owner/admin slightly less onerous by pushing it back on the admin who installs Kargo to configure tenant ID globally. This goes against what I said previously in no.1, but so be it if that's what we must do.

As far as client ID is concerned, since the Kubernetes service account in question should already be federated to a managed identity, it's somewhat strange that we need to provide the client ID. Entra ID already knows the client ID of the managed identity. But I suppose the requirement that we provide client ID ourselves is probably a matter of disambiguation since it is possible for a single Kubernetes ServiceAccount to be federated with multiple managed identities? If that's the case, we get it, but it's still inconvenient.
Cleanest code possible. Ideally, we would not need to resort to directly building and executing HTTP requests to access any Azure APIs.

At the moment, I believe IMDS is the main thing where we are forced to do this because, as far as we've been able to determine, there is no Go SDK for IMDS.

So... as I said, we would welcome any guidance you have on approaching this better/differently than we currently are, but also, the following improvements to Azure would help not just ourselves, but many others including External Secrets.

(Having formerly worked in the Azure org, myself, I know these are huge asks and don't realistically expect any movement on them.)

An equivalent to AWS STS so that one identity can (with appropriate permissions) assume a specific role or impersonate another identity. (Huge, I know.)
A way to exchange a k8s ServiceAccount token for an Entra ID token without needing to explicitly specify client ID. (I actually doubt this is possible, but one can dream.)
An SDK for IMDS. Just a nice to have. (fwiw, I'm pretty sure this doesn't exist because all the language bindings for Azure APIs are generated from Open API specs and IMDS is very different from other services because it doesn't involve requests out to a remote service endpoint; rather the requests are resolved more locally, without ever leaving the host.)

juliusl commented 3 months ago

Thanks for the thorough summary of what you're trying to accomplish. A couple of thoughts or things to consider in case you might not have been aware of it,

1) If the security boundary was the node-pool, i.e. Kargo Project is 1:1 with a NodePool - than in each node belonging to the node pool you would have access to clientid/tenantid config in /etc/kubernetes/azure.json (schema here) or in the resource tags on the VM "aks-managed-kubeletIdentityClientId` available via IMDS, ex:

"tagsList": [
      {
        "name": "aks-managed-kubeletIdentityClientID",
        "value": "<client-id-guid>"
      },
      {
        "name": "aks-managed-enable-imds-restriction",
        "value": "false"
      },
...

Or, from the labels in the K8's Node, kubernetes.azure.com/kubelet-identity-client-id:<client-id>.

This is how service components today are able to access the MSI for authenticating with the registry when a registry has been attached to a specific node-pool kubelet identity.

An equivalent to AWS STS so that one identity can (with appropriate permissions) assume a specific role or impersonate another identity. (Huge, I know.)

Actually, I think this should be possible with the OAuth OBO flow however in this case is Kargo focused on multi-tenant scenarios wrt to the cluster admin? i.e. in this supported scenario can the ACR registries belong to different Entra tenants?

Either way it should work, but in the multi-tenant scenario, Tenant A would need a service principal in Tenant B.
If all registries belong to the same Entra Tenant but different subscriptions, then the MSI of the kublet can be granted a role to that regsitry and should be valid across all subscriptions within the Tenant. https://learn.microsoft.com/en-us/entra/identity-platform/v2-oauth2-on-behalf-of-flow

A way to exchange a k8s ServiceAccount token for an Entra ID token without needing to explicitly specify client ID. (I actually doubt this is possible, but one can dream.)

Yeah, I agree, that's probably not going to happen, at least not in any type of opaque way.

An SDK for IMDS. Just a nice to have. (fwiw, I'm pretty sure this doesn't exist because all the language bindings for Azure APIs are generated from Open API specs and IMDS is very different from other services because it doesn't involve requests out to a remote service endpoint; rather the requests are resolved more locally, without ever leaving the host.)

So, if it's just the IMDS token endpoint, then the Go SDK does have bindings via the managed_identity_client.go, but if you mean an IMDS client for the IMDS endpoint for the actual instance metadata, than yeah I'm not aware of one existing.

That being said, that IMDS endpoint is pretty much the same as the Cloud-Init IMDS endpoint which means it's a fairly safe API to write against.

I don't think we're actually experiencing any real issues at the moment, per se. We're just doing the leg work right now to get this working as similarly as possible to the way it works for EKS and GKE.

Sure, no worries let me know if you need any assistance and I'll try to help as best as I can. Thanks for doing this work to support ACR it is greatly appreciated.

krancour commented 3 months ago

Thanks @juliusl!

You've given us some things to think about here. We won't be shy about getting in touch if we have follow-up questions.

krancour commented 3 months ago

See discussion in #2399. The solution proposed here doesn't hold up after all. The ask is, of course, a valid one, so I will leave the issue open.

krancour commented 3 months ago

I've realized the originally proposed strategy could work if we somehow establish the project-specific SAs in question in the cluster(s) in which the controllers run, rather than in the control plane.

That's not too hard, as it turns out.

Today, Projects are reconciled only by the management controller, which runs in the control plane, but we can add a Project reconciler to the main (sharded) controller as well. It would ensure the existence dedicated SA for each Project in Kargo's own namespace on every shard. These SAs would be the ones to federated with managed identities.

With that particular issue resolvable, here's another:

We really cannot make assumptions about all sharded controllers running in the same Azure account / using the same EntraID tenant. This means the correct tenantID + clientID for a given Project may vary by shard. Even assuming tenantID is constant per-shard, and can therefore be configured at install-time, we will still require a more complex Project --> clientID mapping scheme than was originally proposed...

Instead of a single annotation on a Project specifying a single clientID, I propose using annotations of the following form: azure.kargo.akuity.io/<shard-name>/client-id: <client ID>.

This isn't an enormous amount of work, nor is it even remotely trivial. I do foresee certain complexities in the new reconciler that aren't worth diving into at the moment. One thing I question is whether the UX around maintaining Project + shard --> client ID mappings isn't so onerous as to discourage anyone from using this feature if it were built... i.e. I'm grappling with whether the effort is justifiable.

So, what I propose:

For now, let's abandon the notion that we need to achieve perfect parity with what we can do on AWS or GCP. This means, at least temporarily, abandoning the notion of a managed identity per Project. In the interim, controllers running in Azure could attempt to access ACR repos using the permissions of a managed identity that is federated to the controller's own SA.

fwiw, in AWS, if the controller is unable to assume a Project-specific role or the Project-specific role lacks necessary permissions to access the ECR repository in question, then the controller falls back on using its own permissions directly. So we'd at least be making the behavior on Azure match the behavior that we fall back on in AWS.

With at least that much in place, I believe we could wait and see what appetite users have for managed identity per Project + shard and whether they want it badly enough to look past the mappings they'd need to maintain. If it ends up being something people want, the above outlines how to do it, and we could re-open #2399 since that PR's got "good bones."

juliusl commented 3 months ago

We really cannot make assumptions about all sharded controllers running in the same Azure account / using the same EntraID tenant. This means the correct tenantID + clientID for a given Project may vary by shard. Even assuming tenantID is constant per-shard, and can therefore be configured at install-time, we will still require a more complex Project --> clientID mapping scheme than was originally proposed...

Apologies if I missed something, but when you're using the http://169.254.169.254/metadata/identity/oauth2/token endpoint to authenticate, today it only requires clientId=<>, when it's a user msi. The token that comes back should have a tid= set in the JWT payload which is the tenant id that the client id belongs to. As long as the VMSS has the managed identity assigned to it, you should only need clientId to resolve the tenantId.

So for example, on a VM to get a token for exchange with ACR you should only need to do (in the system-assigned MSI case for example),

curl -H 'Metadata: true' 'http://169.254.169.254/metadata/identity/oauth2/token?api-version=2018-02-01&resource=&resource=https://management.azure.com/` # Trailing-slash intentional

You would get back a JWT access token in the response which you can decode to get then get the tid (although in this case this is actually good enough to exchange with ACR for an access_token grant, but I'm assuming you need the tid for the workload identity?)

juliusl commented 3 months ago

Also, out of curiosity what happens when you use the /common tenant id? https://learn.microsoft.com/en-us/entra/identity-platform/howto-convert-app-to-be-multi-tenant#update-your-code-to-send-requests-to-common

krancour commented 3 months ago

As long as the VMSS has the managed identity assigned to it, you should only need clientId to resolve the tenantId.

@juliusl I don't think resolving the tenantID is our main challenge.

Since Kargo's purpose is to deliver artifacts from environment to environment, often interacting with multiple Argo CD control planes, which often live in different clusters, or even different subscriptions, its topology allows for its controllers to be distributed 1:1 with Argo CD control planes, typically deployed right alongside them. They all "phone home" to the Kargo control plane.

It is at the controllers that ACR credentials may be needed, and as I mentioned, these may be situated in different clusters, different subscriptions, or use different EntraID tenants.

Resolving the tenantID isn't especially problematic. In the best case scenario where we can pluck it from thin air, the thing that is more problematic is knowing that SA for Project A in shard X is federated to clientID-1 (in some tenant), while SA for Project A in shard Y (different cluster, different tenant) is federated to clientID-2. Maintaining those mappings (e.g. in Project annotations) feels like it's potentially a lot for a Project admin to have to deal with. That poor UX makes me question how far we want to go down this rabbit hole.

juliusl commented 3 months ago

I see so the data can potentially move across Entra tenants. Today for ACR, we have customer's use this to support that scenario where they need to shift resources from tenant to tenant: https://learn.microsoft.com/en-us/azure/lighthouse/overview, however I'm not sure if that would fit into Kargo's model.

juliusl commented 3 months ago

So to summarize, the golden path would be if we could do,

K8's SA -> ARM -> MSI -> ACR

But today, we have to do an indirect mapping via annotation to go from SA/Project -> MSI. And so, the UX for administrators is creating a mapping for each SA.

juliusl commented 3 months ago

It looks like this project might have run into some of the same scenarios here, https://github.com/Azure/kubelogin, specifically it solves the SSO problem I think you're trying to solve, i.e. the SA attached to a Project is the same Identity that should be used to authenticate with the Registry. However, the focus of kubelogin is on the cluster user and not service accounts, but I think there might be some overlap in the approaches.

So, I guess the question is,

A) Is there a way for the Service Account identity be the Service Principal identity itself, in the same way a Cluster User is also the Entra User? Without actually testing it and just from researching documentation alone, this seems like it would be possible. If so, it would mean the UX turns into Project Administrators just needing to add a Service Account per Service Principal. And then manage that Service Principals access to registries.

B) Likewise, can the identity of a Service Account be the MSI itself? However, since MSI's are not cross-tenant, Kargo would need to handle the federation in the Kargo control plane in order to find the respective cluster/SA that has registry access. Not sure what mechanism is in place today to facilitate this, but I think it would have some of the same UX benefits.

For B), does Kargo facilitate data movement between completely different registries/clouds? i.e. A cluster from EKS or GKS would be able to ingest artifacts brought in from an AKS cluster?

Edit: I think it's probably more on this side, https://learn.microsoft.com/en-us/entra/identity-platform/v2-oauth2-client-creds-grant-flow#third-case-access-token-request-with-a-federated-credential https://learn.microsoft.com/en-us/entra/workload-id/workload-identity-federation-create-trust?pivots=identity-wif-apps-methods-azp

Following the example, In the portal you can also add a service account as a credential of your app registration, which lets you do the client assertion federation against Entra to get the ARM token which you then exchange for the ACR token.

There was also this announcement today which I think further enables this scenario, https://github.com/Azure/acr/blob/main/docs/blog/abac-repo-permissions.md

krancour commented 3 months ago

@juliusl are we overcomplicating things a bit here?

All we want to do is this, which looks pretty straightforward. We have the minor twist that we won't be using the ServiceAccount a controller runs as. Instead, the controller will use the k8s token request API to get a token for a Project(+Shard)-specific SA and then exchange that for the credentials of a managed identity to which that SA is federated.

Am I wrong to think that whatever access that managed ID requires is purely a matter of configuration on the Azure side? If it needs access to a registry in another sub, for instance, I don't think we see that as Kargo's problem to solve. I would assume this is solvable on the Azure end.

What we really don't want here is for our implementation to get super bogged-down in Azure details.

juliusl commented 3 months ago

@krancour it's possible, but let us regroup a bit.

All we want to do is this, which looks pretty straightforward. We have the minor twist that we won't be using the ServiceAccount a controller runs as. Instead, the controller will use the k8s token request API to get a token for a Project(+Shard)-specific SA and then exchange that for the credentials of a managed identity to which that SA is federated.

That is basically what my last recommendation above is pointing at. Using OIDC to allow the cluster to generate a token on behalf of the specific SA which Entra will accept. The only azure specific detail here would be assigning the Project(+Shard)-specific SA as a credential for the workload identity (in this case the workload identity is an app registration).

If it needs access to a registry in another sub, for instance, I don't think we see that as Kargo's problem to solve. I would assume this is solvable on the Azure end.

The question is if these two subscriptions live in the same Tenant. If they do, then it becomes fairly straight-forward because when you create managed identities, they will end up sharing the same directory and then it becomes fairly straightforward.

If they are not in the same Tenant, this becomes trickier and now extends beyond just configuration. In order to authenticate an identity you need a secret. Today with a Managed Identity (MI), this detail is managed by the machine itself, which means it ends up boiling down to just configuration as you suggested since now the secret handling is managed by IMDS via the token api.

However, say I have two Tenants, A and B. If my cluster is running in a subscription that belongs to Tenant A, there is no way for me to assign a MI that belongs to Tenant B to a machine running in Tenant A.

So, say I have a registry in Tenant B. In order for you to access that registry, you'll need an identity to grant registry access to, and that Tenant B recognizes. To authenticate you'll need a secret that belongs to that identity which Tenant B can validate. This is how it becomes more complicated, because if you are on Tenant A and you want to communicate with B, you'll need to store a secret that belongs to B in A, or you create a bridge between A and B (hand waving implementation details here).

This is why my last recommendation was for the OIDC approach. With that approach, you handle the multi-tenancy by saying that instead of purely relying on an MI created by Azure, you're relying on your workload identity (app registration) and authenticating it with your service account via the k8's token api.

I think we are aligned in this aspect, but my goal is just to point out a UX that doesn't involve having to manage additional labels, where the only action required by the admin is assigning the Project(+Shard)-specific SA as a credential of the app registration, so that the app registration can be assigned a role able to access the registry. This would then enable Kargo to use the client_assertion flow in order to grab a token generated by the cluster to authenticate with Entra to ultimately exchange with the registry.

Apologies, if I am misunderstanding the scenario and let me know if I'm not articulating this clearly enough.

github-actions[bot] commented 2 weeks ago

This issue has been automatically marked as stale because it had no activity for 90 days. It will be closed if no activity occurs in the next 30 days but can be reopened if it becomes relevant again.

dtzar commented 2 weeks ago

@krancour - where did you land with the implementation on this? Anything Microsoft could do to help here?

blakepettersson commented 1 week ago

@dtzar we have had other priorities (e.g stabilizing Kargo to 1.0). What I initially proposed in #2399 does not hold up for our SaaS environment, so we would need to figure out a way to allow for this to work for both Kargo SaaS as well as Kargo OSS (one way could be to do something like @krancour suggested).

The dream would be to have something similar to "assume role" in GKE/AWS, which would make all of the work above unnecessary (since we could then just use the Managed Identity of the controller, and when needed temporarily elevate our access to get a project-specific Managed Identity).

PIM seems kind of what we want, although it is unclear to me what the status of it is, both in terms of stability and the ability for this to be universally used in Azure environments as well as the ability to use it programmatically. I'm no Azure expert so for me any guidance is welcome.