When a large number of pods try to mount secrets concurrently, we run into client-side rate limit issues with the error:
unable to obtain workload identity auth: unable to fetch SA info: client rate limiter Wait returned an error: context canceled
We have observed this error in two scenarios:
- a large number of cron jobs starting at the same time
- during kubernetes cluster upgrades, when pods are migrated to the new nodes
Expected behavior
If rate limiting occurs, there should be some retry logic
Observed behavior
The pods fail to start because they cannot mount the secret volume. Manually restarting a single pod fails with the same error, but restarting the secrets-store-csi-driver-provider-gcp daemonset resolves the issue. This had to be repeated multiple times until all pods successfully mounted their respective secrets. The issue recurred the next time the cron jobs started, so we staggered their start times. During cluster upgrades we had to watch all the pods and repeatedly restart the secrets-store-csi-driver-provider-gcp daemonset to work through the issue.
Environment
provider version: v1.2.0
secret store version: v1.3.3
I think the error actually comes down to rate limiting in the go client used to query the kube-apiserver. The error seems to come from the part of the process where the secrets-store-gcp plugin queries the gcp service account annotation on the kubernetes service account used by the pod: https://github.com/GoogleCloudPlatform/secrets-store-csi-driver-provider-gcp/blob/3ba36fc53b0a6f559558ba4d93cb7946ff82bfc3/auth/auth.go#L136
Additional information
We are using Pod Workload Identity. This is also a known issue with the aws provider, see https://github.com/aws/secrets-store-csi-driver-provider-aws/issues/136#issuecomment-1804050518