fluxcd / source-controller

The GitOps Toolkit source management component
https://fluxcd.io
Apache License 2.0

Using kubelet identity to access ACR OCI charts #1071

Closed gldraphael closed 1 year ago

gldraphael commented 1 year ago

I created a test cluster exp-aks-02 with the following configuration:

Kubernetes Version: 1.25.6
Authentication and Authorization: Azure AD authentication with Kubernetes RBAC
Network Plugin: Azure CNI

(The cluster does not use the ACR integration.)

I then bootstrapped flux and assigned ACR Pull and Reader permissions on an ACR instance to the User Assigned Managed Identity exp-aks-02-agentpool.

At this point, I expected it to just work, but flux get sources would show this error:

unknown build error: failed to get credential from azure: DefaultAzureCredential: failed to acquire a token.
Attempted credentials:
        EnvironmentCredential: missing environment variable AZURE_TENANT_ID
        ManagedIdentityCredential: no default identity is assigned to this resource
        AzureCLICredential: Azure CLI not found on path

Ideas?


Other Observations

Fetching token by specifying the UAI to use

I followed the thread at https://github.com/fluxcd/source-controller/issues/898 and concluded that this happens because I have two UAIs (User Assigned Managed Identities) attached to this cluster (exp-aks-02-agentpool and aciconnectorlinux-exp-aks-02).

So I tried patching the flux-system kustomization to add AZURE_CLIENT_ID:

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - gotk-components.yaml
  - gotk-sync.yaml
labels:
  - pairs:
      toolkit.fluxcd.io/tenant: sre-team
patches:
  - patch: |
      - op: add
        path: /spec/template/spec/containers/0/args/-
        value: --concurrent=20
      - op: add
        path: /spec/template/spec/containers/0/args/-
        value: --requeue-dependency=5s
    target:
      kind: Deployment
      name: "(kustomize-controller|helm-controller|source-controller)"
  - patch: |
      - op: add
        path: /spec/template/spec/containers/0/env/-
        value:
          name: AZURE_CLIENT_ID
          value: --client-id--
    target:
      kind: Deployment
      name: "(helm-controller|source-controller)"

But I now see this error (which almost feels like a bug):

unknown build error: failed to get credential from azure: error exchanging token: failed to decode the response: invalid character '<' looking for beginning of value

However hitting the token API directly works as long as I include the client_id parameter:

$ kubectl exec -it source-controller-59b5c97495-htrtb -n flux-system -- /bin/sh
$ wget -q -O - "http://169.254.169.254/metadata/identity/oauth2/token?api-version=2018-02-01&resource=https://management.azure.com/&client_id=$AZURE_CLIENT_ID" --header "Metadata: true"
{"access_token":"--redacted--","client_id":"--client-id--","expires_in":"84928","expires_on":"1681412609","ext_expires_in":"86399","not_before":"1681325909","resource":"https://management.azure.com/","token_type":"Bearer"}

akv2k8s works ok

I am able to consume secrets from Azure Key Vault using the akv2k8s project, which appears to use the userAssignedIdentityID value from /etc/kubernetes/azure.json:

apiVersion: v1
kind: Namespace
metadata:
  name: akv2k8s
  labels:
    toolkit.fluxcd.io/tenant: sre-team
---
apiVersion: source.toolkit.fluxcd.io/v1beta2
kind: HelmRepository
metadata:
  name: akv2k8s
  namespace: akv2k8s
spec:
  interval: 60m0s
  url: https://charts.spvapi.no
---
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: akv2k8s
  namespace: akv2k8s
spec:
  interval: 60m
  chart:
    spec:
      chart: akv2k8s
      version: "2.*"
      sourceRef:
        kind: HelmRepository
        name: akv2k8s
      interval: 12h
  values:
    global:
      metrics:
        enabled: true
---
apiVersion: spv.no/v1alpha1
kind: AzureKeyVaultSecret
metadata:
  name: test-credentials
  namespace: monitoring
spec:
  vault:
    name: vault-name
    object:
      type: multi-key-value-secret
      name: test-credentials
      contentType: application/x-json
  output:
    secret:
      name: test-credentials

somtochiama commented 1 year ago

Which version of flux are you on? You can run flux version to check. I want to see if I can reproduce this error on my end, so any more details on how you set up the cluster would be great. We have e2e tests for kubelet identity (but the cluster uses system identity).

gldraphael commented 1 year ago

Flux version returns this:

~ ❯  flux version
flux: v0.41.2
helm-controller: v0.31.2
kustomize-controller: v0.35.1
notification-controller: v0.33.0
source-controller: v0.36.1

I created the cluster from the Azure portal, but I'm happy to put together a Terraform script if that will help.

Edit:

I think to reproduce this, the cluster should use Azure AD, and the cluster should have more than one User Assigned Managed Identity. I'm validating this assumption right now.

gldraphael commented 1 year ago

I created a new cluster with a single User Assigned Managed Identity (UAI):

Node pools
Node pools: 1
Enable virtual nodes: Disabled

Access
Resource identity: System-assigned managed identity
Local accounts: Disabled
Authentication and Authorization: Azure AD authentication with Kubernetes RBAC
Cluster admin group: Cluster Admin
Encryption type: (Default) Encryption at-rest with a platform-managed key

Networking
Network configuration: Kubenet
Load balancer: Standard
Private cluster: Disabled
Authorized IP ranges: Disabled
Network policy: None

Integrations
Container registry: None
Microsoft Defender for Cloud: Free
Enable Container Logs: Disabled
Alerts: Not enabled
Azure Policy: Disabled

And I see the following error (which is similar to what I saw when I set AZURE_CLIENT_ID in the previous cluster with more than one UAI):

failed to get credential from azure: error exchanging token: failed to decode the response: invalid character '<' looking for beginning of value

Seems like this truly is a bug. Let me know if you have trouble reproducing this.

somtochiama commented 1 year ago

Hey, sorry for the long wait. I just tested this on the latest version and it worked okay:

fleet-infra git:(main) flux -v
flux version 2.0.0-rc.2

I created an AKS cluster with the following properties (as stated in the previous comment)

(Screenshot of the cluster properties, 2023-05-11)

I assigned an AcrPull role to the cluster's managed identity and it reconciled successfully. Next, I added a second managed identity to the cluster and it failed to reconcile (which is expected):

► annotating OCIRepository podinfo in flux-system namespace
✔ OCIRepository annotated
◎ waiting for OCIRepository reconciliation
✗ OCIRepository reconciliation failed: 'failed to get credential from azure: DefaultAzureCredential: failed to acquire a token.
Attempted credentials:
        EnvironmentCredential: missing environment variable AZURE_TENANT_ID
        WorkloadIdentityCredential: missing environment variables for workload identity. Check webhook and pod configuration
        ManagedIdentityCredential: no default identity is assigned to this resource
        AzureCLICredential: Azure CLI not found on path

Then I added the AZURE_CLIENT_ID env variable to the source-controller pod and it reconciled successfully.
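
For reference, a patch scoped to source-controller alone could look like the sketch below. It is written as a strategic merge patch rather than a JSON patch, so it does not depend on the position of the container or on an existing env list. The container name manager and the --client-id-- placeholder are assumptions; check them against the generated gotk-components.yaml before using this.

```yaml
# flux-system/kustomization.yaml (sketch, not a tested configuration)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - gotk-components.yaml
  - gotk-sync.yaml
patches:
  - patch: |
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: source-controller
        namespace: flux-system
      spec:
        template:
          spec:
            containers:
              # Assumed container name; env entries are merged by name.
              - name: manager
                env:
                  - name: AZURE_CLIENT_ID
                    value: --client-id--  # client ID of the identity holding AcrPull
    target:
      kind: Deployment
      name: source-controller
```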

Can you try upgrading to 2.0.0-rc.2?

gldraphael commented 1 year ago

Thanks for testing this out @somtochiama

I just tested it with v2.0.0-rc.3 but still see the same error unfortunately:

failed to get credential from azure: error exchanging token: failed to decode the response: invalid character '<' looking for beginning of value

I will try again on Monday just to be certain.

gldraphael commented 1 year ago

I am still seeing the same error. I see it when I add the following source:

apiVersion: source.toolkit.fluxcd.io/v1beta2
kind: HelmRepository
metadata:
  name: myacr
  namespace: experiments
spec:
  type: oci
  provider: azure
  url: oci://myacr.azurecr.io
  interval: 5m

Are you able to reproduce this?

somtochiama commented 1 year ago

I was testing using OCIRepository instead of HelmRepository. I will try again today
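
For comparison, an OCIRepository pointing at ACR might look like the sketch below. The name podinfo and the flux-system namespace come from the reconcile output above; the registry host is borrowed from the HelmRepository example, and the repository path and tag are hypothetical. Unlike a HelmRepository of type oci, the URL here always names one artifact repository rather than a registry root.

```yaml
apiVersion: source.toolkit.fluxcd.io/v1beta2
kind: OCIRepository
metadata:
  name: podinfo
  namespace: flux-system
spec:
  interval: 5m
  provider: azure                        # use the Azure identity available to the pod
  url: oci://myacr.azurecr.io/podinfo    # hypothetical repository path
  ref:
    tag: latest                          # hypothetical tag
```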

somtochiama commented 1 year ago

Hey @gldraphael ,

I have been able to reproduce this. Can you try specifying the repository in the URL, i.e.:

spec:
  type: oci
  provider: azure
  url: oci://myacr.azurecr.io/<repo-name>

gldraphael commented 1 year ago

Well, that kinda works, but not quite. My chart is at oci://myacr.azurecr.io/clippy. Not at oci://myacr.azurecr.io/charts/clippy.

Earlier, I tried:

apiVersion: source.toolkit.fluxcd.io/v1beta2
kind: HelmRepository
metadata:
  name: myacr
  namespace: experiments
spec:
  type: oci
  provider: azure
  url: oci://myacr.azurecr.io
  interval: 5m
---
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: clippy
  namespace: experiments
spec:
  releaseName: clippy
  chart:
    spec:
      chart: clippy
      sourceRef:
        kind: HelmRepository
        name: myacr
      version: 1.0.1
  interval: 50m
  install:
    remediation:
      retries: 3
  values: {}

And that shows the error I reported earlier.

Now, I tried the following:

apiVersion: source.toolkit.fluxcd.io/v1beta2
kind: HelmRepository
metadata:
  name: clippy
  namespace: experiments
spec:
  type: oci
  provider: azure
  url: oci://myacr.azurecr.io/clippy
  interval: 5m
---
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: clippy
  namespace: experiments
spec:
  releaseName: clippy
  chart:
    spec:
      chart: clippy
      sourceRef:
        kind: HelmRepository
        name: clippy
        namespace: experiments
      version: 1.0.1
  interval: 50m
  install:
    remediation:
      retries: 3
  values: {}

I see no errors on the HelmRepository anymore, but the HelmChart shows the following error:

 ~/flux/flux get source chart experiments-clippy -n experiments
NAME                    REVISION        SUSPENDED       READY   MESSAGE
experiments-clippy                      False           False   chart pull error: failed to download chart for remote reference: failed to get 'oci://myacr.azurecr.io/clippy/clippy:1.0.1': myacr.azurecr.io/clippy/clippy:1.0.1: not found

It appears to be trying to get the chart from the wrong place: myacr.azurecr.io/clippy/clippy:1.0.1 instead of myacr.azurecr.io/clippy:1.0.1

I think a possible workaround may be to move my chart to myacr.azurecr.io/charts/clippy:1.0.1.

What I do not understand is why I no longer see any error on the HelmRepository when I use oci://myacr.azurecr.io/clippy as opposed to oci://myacr.azurecr.io. Does that URL always expect a base path after the origin?

somtochiama commented 1 year ago

> I think a possible workaround may be to move my chart to myacr.azurecr.io/charts/clippy:1.0.1.

Yes, you would have to use this as a workaround while I get this fixed.

The HelmRepository should work with the repository root address but right now there's a bug that prevents it from doing so. When exchanging the token, it makes a request to index.docker.io due to some defaulting in a library we use. Thanks for reporting this!
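
A minimal sketch of that workaround, assuming the chart has been re-pushed to myacr.azurecr.io/charts/clippy:1.0.1. The HelmRepository URL now carries a base path, and the HelmRelease chart name is appended to it, so the chart resolves to myacr.azurecr.io/charts/clippy:1.0.1.

```yaml
apiVersion: source.toolkit.fluxcd.io/v1beta2
kind: HelmRepository
metadata:
  name: myacr
  namespace: experiments
spec:
  type: oci
  provider: azure
  url: oci://myacr.azurecr.io/charts   # base path instead of the registry root
  interval: 5m
---
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: clippy
  namespace: experiments
spec:
  releaseName: clippy
  chart:
    spec:
      chart: clippy                    # appended to the repository URL
      sourceRef:
        kind: HelmRepository
        name: myacr
      version: 1.0.1
  interval: 50m
```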

gldraphael commented 1 year ago

Ah! Feel free to let me know if you'd like me to test anything. Appreciate your patience here!

somtochiama commented 1 year ago

@gldraphael This issue will be fixed in the latest release of flux

gldraphael commented 1 year ago

@somtochiama - I tested this out, it works! Thanks!

joshuadmatthews commented 7 months ago

@gldraphael any advice for this when using the flux extension? I can't get this working either without setting the ClientID somehow, but because I'm using the flux extension there doesn't seem to be a way to cleanly patch the source controller manifests.

gldraphael commented 7 months ago

@joshuadmatthews - I have never used the Azure Flux extensions. I think the best thing to do would be to ask Azure Support if you haven't already. Their extensions should be covered, if I'm not mistaken. Let us know what they say here!

But since you asked for my advice, I'd say avoid the extensions as far as you can!

joshuadmatthews commented 7 months ago

I was able to get it working by deploying a patch with kubectl, which allows me to target a resource rather than a manifest. It would be nice if flux had a way to apply patches directly, instead of having to patch a YAML file that is also in source control.

stefanprodan commented 7 months ago

@joshuadmatthews Flux can patch existing objects in-cluster, but being a GitOps tool, the patch must be specified in source control. Here is an example: https://fluxcd.io/flux/faq/#how-to-patch-coredns-and-other-pre-installed-addons
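
Adapted to this case, the approach from that FAQ entry might look like the sketch below: a partial Deployment committed to Git and applied by a Flux Kustomization, merged into the existing object through server-side apply. The flux-system namespace, the container name manager, and the client ID placeholder are assumptions, and an extension that manages the Deployment may overwrite the change.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: source-controller
  namespace: flux-system               # assumed install namespace
  annotations:
    # Do not delete the Deployment if this file is removed from Git.
    kustomize.toolkit.fluxcd.io/prune: disabled
    # Merge these fields into the existing object instead of replacing it.
    kustomize.toolkit.fluxcd.io/ssa: merge
spec:
  template:
    spec:
      containers:
        - name: manager                # assumed container name
          env:
            - name: AZURE_CLIENT_ID
              value: --client-id--     # placeholder client ID
```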

Also, please note that we don't offer support for Azure extensions; you need to raise the ACR auth issue with Microsoft support. When installing Flux using flux bootstrap, here is how you can set the ClientID: https://fluxcd.io/flux/installation/configuration/workload-identity/#azure-workload-identity

joshuadmatthews commented 7 months ago

Thanks @stefanprodan, good to know there is a method to match resources that weren’t originally added by flux.

With the Azure extensions, I did eventually find a document that described how to configure the extensions to set up workload identity.

gxy12280421 commented 5 months ago

@joshuadmatthews Did you get it working with the Azure flux-extension? Can you share the document about configuring the extension to set up workload identity?

I am using the Azure flux-extension and having an issue authenticating to ACR with the kubelet identity.

"error":"failed to get credential from 'azure': DefaultAzureCredential: failed to acquire a token.\nAttempted credentials:\n\tEnvironmentCredential: missing environment variable AZURE_TENANT_ID\n\tWorkloadIdentityCredential: no client ID specified. Check pod configuration or set ClientID in the options\n\tManagedIdentityCredential: failed to authenticate a system assigned identity. The endpoint responded with {\"error\":\"invalid_request\",\"error_description\":\"Multiple user assigned identities exist, please specify the clientId / resourceId of the identity in the token request\"}\n\tAzureCLICredential: Azure CLI not found on path\n\tAzureDeveloperCLICredential: Azure Developer CLI not found on path"

joshuadmatthews commented 5 months ago

@gxy12280421 see the Workload Identity section here

https://learn.microsoft.com/en-us/azure/azure-arc/kubernetes/tutorial-use-gitops-flux2?tabs=azure-cli

az k8s-extension create --resource-group <resource_group> --cluster-name <cluster_name> --cluster-type managedClusters --name flux --extension-type microsoft.flux --config workloadIdentity.enable=true workloadIdentity.azureClientId=<user_assigned_client_id>

You can do an update instead of a create if you already installed flux with Bicep/ARM.

gxy12280421 commented 5 months ago

@joshuadmatthews Thank you very much for the quick info, which pointed me in the right direction. I got it working by adding useKubeletIdentity = "true" to the Azure flux extension configuration, since I had assigned the ACRPull permission to the kubelet identity.