aws / amazon-eks-pod-identity-webhook

Amazon EKS Pod Identity Webhook
Apache License 2.0
608 stars 175 forks source link

Standard Label for Excluding Pod from Identity Mounting #215

Closed DerekTBrown closed 2 weeks ago

DerekTBrown commented 6 months ago

What would you like to be added?

It would be useful to have a standard label that excludes pods from being processed by the identity webhook.

Proposed Change: https://github.com/aws/amazon-eks-pod-identity-webhook/pull/216

Why is this needed?

  1. Removing Cyclical Dependencies - When installing a CNI and the Pod Identity Webhook onto our clusters, we identified a cyclical dependency; the identity webhook needs the network/CNI to be successfully operating to be able to communicate with the backend service, and the CNI needs the identity webhook to be functioning in order to schedule pods. The CNI doesn't rely on Amazon identities, so having a mechanism to skip the identity webhook for some pods (including the CNI) would remove this cyclical dependency.

  2. Performance/Reliability - Having the ability to exclude pods that don't need Amazon identities can reduce load on the webhook and improve pod launching performance. This also removes a critical dependency for services that need to be resilient to failure.

  3. Standardization is key to compatibility- Having a canonical tag for skipping the identity webhook will make it easier for downstream projects to exclude critical components (such as CNIs) from the webhook, leading to a better out-of-the-box experience.

jeevanions commented 5 months ago

Can't we simply avoid this by not using the Service Account that is enabled for Pod Identity webhook?

DerekTBrown commented 5 months ago

Can't we simply avoid this by not using the Service Account that is enabled for Pod Identity webhook?

I am not sure I understand your suggestion. How do Service Accounts fit into the mutating webhook flow?

fungusakafungus commented 3 months ago

What exactly prevents CNI from coming up? This webhook sets failurePolicy: Ignore https://github.com/aws/amazon-eks-pod-identity-webhook/blob/master/deploy/mutatingwebhook.yaml#L10 So I'd expect that it will not prevent other pods form starting if it is itself down.

DerekTBrown commented 3 months ago

What exactly prevents CNI from coming up? This webhook sets failurePolicy: Ignore

At least in our case, we rely on kops to install the webhook, which sets this to Fail:

https://github.com/kubernetes/kops/blob/5befa3fb96f35692411772152e2d1c22850dbc90/upup/models/cloudup/resources/addons/eks-pod-identity-webhook.addons.k8s.io/k8s-1.16.yaml.template#L157

Isn't this a kops issue then?

IMO, eks-pod-identity-webhook should expect downstream consumers to change the webhook to use a Fail policy, and support features to enable this usecase.

Having a webhook with the failurePolicy: Ignore shifts the failure mode from an infra-level failure (where infrastructure/platform teams can detect and remediate the actual issue) to a service-level failure (where now individual services are responsible for detecting that they don't have adequate information, and then escalating this failure to the infrastructure/platform team actually responsible for the webhook). This is an undesirable behavior for many (if not most) users.

DerekTBrown commented 3 months ago

What exactly prevents CNI from coming up? This webhook sets failurePolicy: Ignore

One other thing I realized I should add: I could see a case as well where a customer wants to disable the Pod Identity Webhook as a performance optimization, independent of the failure mode. For example, if leveraging DoEKS, they may not need IAM roles on Spark worker pods.

modulitos commented 2 weeks ago

Looks like this issue was resolved in the proposed PR linked above (https://github.com/aws/amazon-eks-pod-identity-webhook/pull/216)

I appreciate the feedback on failurePolicy: Ignore - looks like it was changed from the default value of Fail to Ignore here: https://github.com/aws/amazon-eks-pod-identity-webhook/pull/3/files We can look into the tradeoffs of changing it back, but it's a separate issue.

Marking this issue as closed.