Azure / azure-workload-identity

Azure AD Workload Identity uses Kubernetes primitives to associate managed identities for Azure resources and identities in Azure Active Directory (AAD) with pods.
https://azure.github.io/azure-workload-identity

azure.workload.identity/inject-proxy-sidecar blocks jobs in Kubernetes #773

Open dozer75 opened 1 year ago

dozer75 commented 1 year ago

Describe the bug

We have some jobs and cronjobs running in AKS that connect to an Azure SQL database using ODBC. We are planning to use managed identity and workload identity for the authentication in the ODBC driver, and for this we need the injected proxy sidecar (for some reason).

But by doing this, the job won't end after the job container has completed successfully, because the proxy sidecar is still alive after our container is done.

The job pod is stuck in state NotReady because the proxy container is still running.

NAME                                READY   STATUS     RESTARTS   AGE
pod/job-onetime-42mx8   1/2     NotReady   0          26m

Here is the dump of the pod:

Name:             job-onetime-42mx8
Namespace:        job-jobs
Priority:         0
Service Account:  job-serviceaccount
Node:             aks-systempool-85002938-vmss000002/10.1.0.4
Start Time:       Wed, 01 Mar 2023 16:30:24 +0100
Labels:           azure.workload.identity/use=true
                  controller-uid=9f307dd8-31d6-4c03-b482-563d7fece75e
                  job-name=job-onetime
Annotations:      azure.workload.identity/inject-proxy-sidecar: true
Status:           Running
IP:               10.2.0.19
IPs:
  IP:           10.2.0.19
Controlled By:  Job/job-onetime
Init Containers:
  azwi-proxy-init:
    Container ID:   containerd://148eba569dabca978fb85499303ec4b8de859b6300957962ad5ba9fbf2773008
    Image:          mcr.microsoft.com/oss/azure/workload-identity/proxy-init:v0.15.0
    Image ID:       mcr.microsoft.com/oss/azure/workload-identity/proxy-init@sha256:e8064cf26147bb98efe33c5bc823eb3b32c6b0cbf93619fa6b5d72f4f7a7c068
    Port:           <none>
    Host Port:      <none>
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Wed, 01 Mar 2023 16:30:25 +0100
      Finished:     Wed, 01 Mar 2023 16:30:25 +0100
    Ready:          True
    Restart Count:  0
    Environment:
      PROXY_PORT:                  8000
      AZURE_CLIENT_ID:             
      AZURE_TENANT_ID:             
      AZURE_FEDERATED_TOKEN_FILE:  /var/run/secrets/azure/tokens/azure-identity-token
      AZURE_AUTHORITY_HOST:        https://login.microsoftonline.com/
    Mounts:
      /var/run/secrets/azure/tokens from azure-identity-token (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-nwvjg (ro)
Containers:
  job-onetime:
    Container ID:   containerd://0b3d851e1618602fb09b0d5ca0d49ae265fea720d19d7ef45608face3495fd85
    Image:          registry/job-onetime:local-20230301.01
    Image ID:       registry/job-onetime@sha256:6da19c0f646ae6d245c9f2c342d9da961540f92af3968bfcf5d58fb4015da501
    Port:           <none>
    Host Port:      <none>
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Wed, 01 Mar 2023 16:30:26 +0100
      Finished:     Wed, 01 Mar 2023 16:30:29 +0100
    Ready:          False
    Restart Count:  0
    Environment:
      CONNECTION_STRING:           <set to the key 'connection-string' in secret 'job-secrets'>  Optional: false
      AZURE_CLIENT_ID:             
      AZURE_TENANT_ID:             
      AZURE_FEDERATED_TOKEN_FILE:  /var/run/secrets/azure/tokens/azure-identity-token
      AZURE_AUTHORITY_HOST:        https://login.microsoftonline.com/
    Mounts:
      /mnt/secrets-store from job-secrets-store (ro)
      /var/run/secrets/azure/tokens from azure-identity-token (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-nwvjg (ro)
  azwi-proxy:
    Container ID:  containerd://d444c6294a2cf71bb7564e81f3b2d4677e29a6cd59f54fb5edc415f55890512c
    Image:         mcr.microsoft.com/oss/azure/workload-identity/proxy:v0.15.0
    Image ID:      mcr.microsoft.com/oss/azure/workload-identity/proxy@sha256:809dea7d3099c640a7d0b87f63092c97177992cb47abb141b6a6203feb32d071
    Port:          8000/TCP
    Host Port:     0/TCP
    Args:
      --proxy-port=8000
    State:          Running
      Started:      Wed, 01 Mar 2023 16:30:27 +0100
    Ready:          True
    Restart Count:  0
    Environment:
      AZURE_CLIENT_ID:             
      AZURE_TENANT_ID:             
      AZURE_FEDERATED_TOKEN_FILE:  /var/run/secrets/azure/tokens/azure-identity-token
      AZURE_AUTHORITY_HOST:        https://login.microsoftonline.com/
    Mounts:
      /var/run/secrets/azure/tokens from azure-identity-token (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-nwvjg (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  job-secrets-store:
    Type:              CSI (a Container Storage Interface (CSI) volume source)
    Driver:            secrets-store.csi.k8s.io
    FSType:
    ReadOnly:          true
    VolumeAttributes:      secretProviderClass=job-kv-secrets
  kube-api-access-nwvjg:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
  azure-identity-token:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3600
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason                  Age    From                        Message
  ----    ------                  ----   ----                        -------
  Normal  Scheduled               3m12s  default-scheduler           Successfully assigned job-jobs/job-onetime-42mx8 to aks-systempool-85002938-vmss000002
  Normal  Pulled                  3m12s  kubelet                     Container image "mcr.microsoft.com/oss/azure/workload-identity/proxy-init:v0.15.0" already present on machine
  Normal  Created                 3m12s  kubelet                     Created container azwi-proxy-init
  Normal  Started                 3m12s  kubelet                     Started container azwi-proxy-init
  Normal  Pulling                 3m11s  kubelet                     Pulling image "registry/job-onetime:local-20230301.01"
  Normal  Pulled                  3m11s  kubelet                     Successfully pulled image "registry/job-onetime:local-20230301.01" in 228.98641ms
  Normal  Created                 3m11s  kubelet                     Created container job-init
  Normal  Started                 3m11s  kubelet                     Started container job-init
  Normal  Pulled                  3m11s  kubelet                     Container image "mcr.microsoft.com/oss/azure/workload-identity/proxy:v0.15.0" already present on machine
  Normal  Created                 3m10s  kubelet                     Created container azwi-proxy
  Normal  Started                 3m10s  kubelet                     Started container azwi-proxy

Steps to reproduce
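A minimal Job manifest that reproduces the setup, reconstructed from the pod dump above (the image and names are ours; the service account is assumed to carry the managed identity's client-id annotation):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: job-onetime
  namespace: job-jobs
spec:
  template:
    metadata:
      labels:
        azure.workload.identity/use: "true"
      annotations:
        # This triggers the proxy sidecar injection that blocks job completion.
        azure.workload.identity/inject-proxy-sidecar: "true"
    spec:
      serviceAccountName: job-serviceaccount
      restartPolicy: Never
      containers:
        - name: job-onetime
          image: registry/job-onetime:local-20230301.01
```

Apply it and watch the pod: the main container completes, but the azwi-proxy container keeps running and the Job never finishes.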

Expected behavior

The best would of course be that ODBC works with the default flow, but somehow it doesn't, so we need the sidecar.

Failing that, the sidecar should be stopped once the other container(s) in the pod have completed, enabling the job to complete.

Logs

Environment

Additional context

san7hos commented 1 year ago

This also happens if you use Argo Workflows and inject the proxy into workflow pods. The proxy keeps running after all the other containers have exited. The Argo documentation describes how they handle injected sidecars: they try to send a kill signal using kubectl exec, and that fails here. There is a way to customize this, but azwi-proxy is based on a very thin Linux distroless base image that has no shell.

One option to resolve this, at least from my point of view, would be to compile the proxy with an option to terminate itself.
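For illustration, a minimal sketch of that idea in Python (the real proxy is written in Go, and the /shutdown endpoint is hypothetical): the proxy serves as usual, and the job container POSTs to it as its last step so the sidecar exits and the pod can complete.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class ProxyHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path == "/shutdown":
            self.send_response(200)
            self.end_headers()
            # Shut down from another thread so this response can be sent first.
            threading.Thread(target=server.shutdown).start()
        else:
            # ... the normal token-proxying behavior would live here ...
            self.send_response(404)
            self.end_headers()

server = HTTPServer(("0.0.0.0", 8000), ProxyHandler)
server.serve_forever()  # returns once /shutdown is hit, letting the pod finish
```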

leongyh commented 1 year ago

Took me days to figure this one out. As @san7hos mentions, issuing a pkill to the sidecar proxy would do it, but the base image is barebones distroless. The azwi webhook Helm chart also doesn't really give you many options to change the proxy image, even if you decided to build your own.

I considered other options to gracefully kill the sidecar. One possibility was the OpenKruise Job Sidecar Terminator, but for that to work, the proxy container needs an environment variable injected, and again, the azwi webhook Helm chart doesn't give you any way to do so.

As always, I had to scour the depths of the internet for the bits and pieces of poorly written Azure documentation sprawled here and there and put them together to figure out a solution. The solution was actually to use azwi as intended: rather than using the proxy sidecar to intercept the IMDS endpoint when ODBC tries to authenticate via the Msi method, just use the projected service account token to authenticate.

I used msal to get the token, followed this half-baked solution, and authenticated to the database via an access token.

Here is sample code I tested on a pod without a sidecar. Disclaimer: I only tested with pyodbc.
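For reference, a minimal sketch of that flow (the connection string, server, and database names are placeholders; the environment variables are the ones the webhook already injects, as seen in the pod dump above):

```python
import os
import struct

import msal
import pyodbc

# Read the federated token projected into the pod by workload identity
# (the path comes from AZURE_FEDERATED_TOKEN_FILE, set by the webhook).
with open(os.environ["AZURE_FEDERATED_TOKEN_FILE"]) as f:
    federated_token = f.read()

# Exchange the federated token for an AAD access token scoped to Azure SQL.
app = msal.ConfidentialClientApplication(
    client_id=os.environ["AZURE_CLIENT_ID"],
    authority=os.environ["AZURE_AUTHORITY_HOST"] + os.environ["AZURE_TENANT_ID"],
    client_credential={"client_assertion": federated_token},
)
result = app.acquire_token_for_client(scopes=["https://database.windows.net/.default"])
if "access_token" not in result:
    raise RuntimeError(f"Token request failed: {result.get('error_description')}")

# pyodbc expects the token as a length-prefixed UTF-16-LE byte string passed
# through the SQL_COPT_SS_ACCESS_TOKEN (1256) pre-connect attribute.
token_bytes = result["access_token"].encode("utf-16-le")
token_struct = struct.pack(f"<I{len(token_bytes)}s", len(token_bytes), token_bytes)

SQL_COPT_SS_ACCESS_TOKEN = 1256
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=myserver.database.windows.net;DATABASE=mydb",  # placeholders
    attrs_before={SQL_COPT_SS_ACCESS_TOKEN: token_struct},
)
```

No IMDS call is involved, so the pod needs neither the proxy sidecar nor the inject-proxy-sidecar annotation, and the Job completes normally.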

devjoes commented 2 months ago

If the mutating webhook controller used native sidecar containers, I think it would resolve this (plus some annoying issues with the proxy being the default container). I might raise a PR.
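For context, native sidecars (Kubernetes 1.28+) are init containers with restartPolicy: Always: the kubelet starts them before the main containers and terminates them once the main containers complete, so Jobs can finish. Roughly, the webhook would inject something like this (a sketch, not what the webhook currently emits):

```yaml
initContainers:
  - name: azwi-proxy
    image: mcr.microsoft.com/oss/azure/workload-identity/proxy:v0.15.0
    args:
      - --proxy-port=8000
    # restartPolicy: Always on an init container marks it as a native
    # sidecar: it runs alongside the main containers and is stopped
    # automatically once they complete.
    restartPolicy: Always
```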