ahuffman opened this issue 1 year ago (status: Open)
I am experiencing the same issue. Our cluster is a Kops-managed cluster. We deployed a service with two replicas and noticed that one of the pods was able to access the S3 bucket but the other wasn't. On investigation, the pod that could not access the bucket was missing the AWS_ROLE_ARN and AWS_WEB_IDENTITY_TOKEN_FILE environment variables; the pod that could access the bucket had them. Both pods used the same service account. When I deleted the pod with the missing environment variables, the replacement pod came up with both variables present.
My Kubernetes version is v1.22.5. The amazon-eks-pod-identity-webhook version is as follows:
Image: amazon/amazon-eks-pod-identity-webhook:latest
Image ID: docker-pullable://amazon/amazon-eks-pod-identity-webhook@sha256:4a3ff337b6549dd29a06451945e40ba3385729c879f09f264d3e37d06cbf001a
Any information will be highly appreciated.
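For comparison, the pod that could reach S3 had roughly the following webhook-injected fields in its spec (the role ARN below is a placeholder, and the token settings may differ slightly depending on the webhook configuration):

```yaml
# Fields the pod-identity webhook normally injects (placeholder role ARN)
spec:
  containers:
    - name: app
      env:
        - name: AWS_ROLE_ARN
          value: arn:aws:iam::111122223333:role/s3-reader   # placeholder
        - name: AWS_WEB_IDENTITY_TOKEN_FILE
          value: /var/run/secrets/eks.amazonaws.com/serviceaccount/token
      volumeMounts:
        - name: aws-iam-token
          mountPath: /var/run/secrets/eks.amazonaws.com/serviceaccount
          readOnly: true
  volumes:
    - name: aws-iam-token
      projected:
        sources:
          - serviceAccountToken:
              audience: sts.amazonaws.com
              expirationSeconds: 86400
              path: token
```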
This looks very similar to an issue we are hitting. Here was our conclusion (courtesy of @strongmonkey):
When diving deep, we found this is caused by a race condition in the aws-identity pod, where the ServiceAccount and the Pod are created at the same time. The webhook uses a cache to fetch the ServiceAccount, and that cache might not yet contain the ServiceAccount when the Pod is created.
@jsilverio22 should be able to share a WIP PR soon.
In our case, we are probably exacerbating the problem by:
In my case, I'm running on EKS, but I perform my deployments via Helm charts, where the ServiceAccounts and their related annotations are deployed from chart values at the same time as the workloads.
I also tried pre-provisioning the ServiceAccount with the annotations, but it did not change the behavior, which led me to believe it was something else altogether.
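For reference, the pre-provisioned ServiceAccount carried the standard IRSA annotation, roughly like this (the name, namespace, and role ARN below are placeholders):

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: my-app              # placeholder
  namespace: my-namespace   # placeholder
  annotations:
    # annotation the pod-identity webhook looks for
    eks.amazonaws.com/role-arn: arn:aws:iam::111122223333:role/my-app-irsa
```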
In a similar situation to @cjellick, my entire cluster provisioning process is done programmatically via Crossplane. Just to reiterate, I do not believe it's a problem with my configuration, because when I delete the pods after their first instantiation everything works fine, but the initial deployment of the pods does not pick up their IRSA privileges.
I'm seeing this problem on simple re-deployments as well: when a Deployment recreates its pods (for example, when they move to other nodes), the environment variables simply aren't injected. I'm running the pod identity webhook on multiple nodes, so there should always be one instance online to respond.
This is a blocker for us to adopt pod identity (as a switch from IRSA). We create the IAM Role (using the ACK controller), PodIdentityAssociation, ServiceAccount, and Deployment in the same Helm chart. On initial install, pods come up without the identity mutation, winning a race we want them to lose. This occurs even when hardcoding the roleARN in the PodIdentityAssociation, i.e. we are not waiting on the Role's status.
Installing the PodIdentityAssociation in a Helm pre-install hook does not help: the resource comes up and is ready very quickly, but our pods still come up before the mutation is ready.
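For context, the PodIdentityAssociation we template (shown here with the pre-install hook annotation from the second attempt) looks roughly like the sketch below; the API version and spec field names reflect my understanding of the ACK EKS controller's CRD, and all names and the ARN are placeholders:

```yaml
apiVersion: eks.services.k8s.aws/v1alpha1   # assumed ACK EKS API version
kind: PodIdentityAssociation
metadata:
  name: my-service
  annotations:
    helm.sh/hook: pre-install        # the pre-install attempt; does not fix the race
    helm.sh/hook-weight: "-5"
spec:
  clusterName: my-cluster                              # placeholder
  namespace: my-namespace                              # placeholder
  serviceAccount: my-service                           # must match the pods' service account
  roleARN: arn:aws:iam::111122223333:role/my-service   # hardcoded role ARN (placeholder)
```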
Waiting a minute and restarting Deployment gives us new pods with correct mutations.
We are using Argocd to deploy our services and are also looking at pod identity. Our services are deployed using a Helm chart, and we added a PodIdentityAssociation custom resource to our Helm chart so the ACK EKS controller can create the association for us (the IAM role is created with Terraform before the service is deployed).
To try and solve the race condition, we added an Argocd preSync hook on the PodIdentityAssociation custom resource (source). But even with the preSync hook, the race condition still exists, as @rlister mentioned.
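The hook itself is just an annotation on the custom resource, roughly like this (spec fields per my understanding of the ACK CRD; names and the ARN are placeholders):

```yaml
apiVersion: eks.services.k8s.aws/v1alpha1
kind: PodIdentityAssociation
metadata:
  name: my-service
  annotations:
    argocd.argoproj.io/hook: PreSync   # create the association before the rest of the app syncs
spec:
  clusterName: my-cluster
  namespace: my-namespace
  serviceAccount: my-service
  roleARN: arn:aws:iam::111122223333:role/my-service
```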
In an effort to solve this issue, we added a customisation to Argocd to only mark the PodIdentityAssociation custom resource as Healthy when an associationId is filled out in the status field of the PodIdentityAssociation custom resource (source). With this in place, the pod identity association is created before the Deployment is created, but the necessary environment variables are still not injected into the pods, and we still have to restart our pods for it to work.
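The customisation is a Lua health check in the argocd-cm ConfigMap, roughly like the sketch below; the exact status field name (associationId here) depends on the ACK CRD:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
data:
  resource.customizations.health.eks.services.k8s.aws_PodIdentityAssociation: |
    hs = {}
    hs.status = "Progressing"
    hs.message = "Waiting for associationId in status"
    if obj.status ~= nil and obj.status.associationId ~= nil then
      hs.status = "Healthy"
      hs.message = "Pod identity association created"
    end
    return hs
```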
We verified that the Argocd customisation works by downscaling the ACK EKS controller to zero before creating the Argocd application for the service. When the application is added to Argocd, all the resources are out of sync and Argocd waits for the PodIdentityAssociation to become Healthy. Argocd shows the message waiting for completion of hook eks.services.k8s.aws/PodIdentityAssociation/<service-name> until the pod identity association is created. When the ACK EKS controller is scaled back up, the pod identity association is created and the rest of the Argocd application is synced, but the pods still don't have the necessary environment variables injected until they are restarted.
What happened: I am deploying multiple services on my cluster, such as cluster-autoscaler, external-dns, and the ebs-csi-driver. On initial deployment the pods do not receive the expected environment variables, volumes, and volumeMounts.
When I manually delete the affected pods after automated deployment, I get everything as expected from the webhook.
I've followed every possible AWS document on troubleshooting IRSA. I initially thought it could be a race condition post cluster instantiation, but I tested delaying the deployments as long as 10 minutes and the results are the same.
What you expected to happen: Environment vars, volumes, and volumeMounts are injected into the deployment's pod specs without need for manual deletion of the pods.
How to reproduce it (as minimally and precisely as possible):
1. Create an EKS cluster.
2. Create an IAM OIDC provider for the cluster.
3. Create an IAM policy.
4. Create an IAM role and attach the policy to it.
5. Add a trust relationship to the role referring to the OIDC provider and the Kubernetes service account.
6. Do a Helm release with values specifying the corresponding namespace, service account name, and the annotation carrying the role ARN to tie it all together (see the example values below).
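As an illustration of the last step, the chart values look roughly like this for a chart that templates its own ServiceAccount (key names vary per chart; the role ARN is a placeholder):

```yaml
# values.yaml fragment (key names vary per chart)
serviceAccount:
  create: true
  name: cluster-autoscaler
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::111122223333:role/cluster-autoscaler-irsa   # placeholder
```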
Anything else we need to know?: Not that I can think of, but feel free to ask for more :).
Environment:
- EKS platform version (aws eks describe-cluster --name <name> --query cluster.platformVersion): eks.3
- Kubernetes version (aws eks describe-cluster --name <name> --query cluster.version): 1.24 (also tested 1.23 with the same result)