GoogleCloudPlatform / secrets-store-csi-driver-provider-gcp

Google Secret Manager provider for the Secret Store CSI Driver.
Apache License 2.0

Client side rate limiting #383

Open rob-whittle opened 7 months ago

rob-whittle commented 7 months ago

TL;DR

When a large number of pods try to mount secrets concurrently, we run into client-side rate limit issues with the error: `unable to obtain workload identity auth: unable to fetch SA info: client rate limiter Wait returned an error: context canceled`

I think the error is actually down to rate limiting in the Go client used to query the kube-apiserver. The error seems to come from the part of the process where the secrets-store-gcp plugin queries the GCP service account annotation on the Kubernetes service account used by the pod: https://github.com/GoogleCloudPlatform/secrets-store-csi-driver-provider-gcp/blob/3ba36fc53b0a6f559558ba4d93cb7946ff82bfc3/auth/auth.go#L136
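
For context, here is a minimal sketch of the kind of service account lookup involved, assuming in-cluster config and the GKE Workload Identity annotation key `iam.gke.io/gcp-service-account`. client-go falls back to roughly QPS=5 and Burst=10 when these fields are left unset, and that client-side limiter is what emits the error above; the override values below are illustrative only and are not what the provider currently sets.

```go
package example

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// fetchGSAAnnotation reads the GCP service account annotation from a
// Kubernetes service account, the same kind of read auth.go performs.
// Raising QPS/Burst relaxes client-go's default client-side rate limiter,
// which is what returns "client rate limiter Wait returned an error"
// under heavy concurrent mounts.
func fetchGSAAnnotation(ctx context.Context, namespace, ksaName string) (string, error) {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		return "", err
	}
	// Illustrative values only; not the provider's current settings.
	cfg.QPS = 50
	cfg.Burst = 100

	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		return "", err
	}

	sa, err := clientset.CoreV1().ServiceAccounts(namespace).Get(ctx, ksaName, metav1.GetOptions{})
	if err != nil {
		return "", fmt.Errorf("unable to fetch SA info: %w", err)
	}
	// Annotation key used by GKE Workload Identity.
	return sa.Annotations["iam.gke.io/gcp-service-account"], nil
}
```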

We have observed this error in two scenarios:

  1. a large number of cron jobs starting at the same time
  2. when upgrading the kubernetes cluster and pods are migrated to the new nodes

Expected behavior

If rate limiting occurs, there should be some retry logic.
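
As an illustration, a rough sketch of the kind of retry wrapper that could surround the service account lookup, using `retry.OnError` from client-go; `lookup` stands in for a helper like the hypothetical `fetchGSAAnnotation` above, and retrying on every error is a simplification.

```go
package example

import (
	"k8s.io/client-go/util/retry"
)

// fetchWithRetry wraps a service account lookup in client-go's exponential
// backoff so a transient rate-limiter or apiserver error does not
// immediately fail the volume mount.
func fetchWithRetry(lookup func() (string, error)) (string, error) {
	var gsa string
	err := retry.OnError(retry.DefaultBackoff,
		// Retrying on any error is a simplification; a real implementation
		// would only retry errors that are actually transient.
		func(error) bool { return true },
		func() error {
			var innerErr error
			gsa, innerErr = lookup()
			return innerErr
		})
	return gsa, err
}
```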

Observed behavior

The pods fail to start because they cannot mount the secret volume. Restarting a single pod manually still failed with the same error; restarting the secrets-store-csi-driver-provider-gcp daemonset resolved the issue. This had to be repeated multiple times until all pods successfully mounted their respective secrets. The issue manifested again the next time the cron jobs started, so we staggered the start times. During cluster upgrades we had to keep an eye on all the pods and repeatedly restart the secrets-store-csi-driver-provider-gcp daemonset to work through the issue.

Environment

- provider version: v1.2.0
- secret store version: v1.3.3

Additional information

We are using Pod Workload Identity. This is also a known issue with the AWS provider; see https://github.com/aws/secrets-store-csi-driver-provider-aws/issues/136#issuecomment-1804050518

tuusberg commented 1 month ago

What was the total number of pods in your scenario? Trying to better understand the definition of "large" :)