Closed — wmedlar closed this issue 4 years ago
Hmm - I think we might need to wrap the resolve call with the retry package.
I got the basic functionality working! Now the tough part is determining which errors are retryable. Got any advice here?
Any 5xx is safe to retry. A 409 is probably safe too.
This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
I've got a post-deploy Job running on GKE that pulls secrets from Secret Manager with berglas, and it's been failing extremely frequently with:
We're running with Workload Identity enabled, which allows a Kubernetes Service Account to act as a Google Service Account, à la kube2iam or kiam. Workload Identity retrieves the GSA tokens from a GKE Metadata Server DaemonSet; however, it seems to have difficulty with the pods from this Job:
The only thing unique about the Job is that it has `restartPolicy=Never` so we can better debug application failures. My theory is that berglas is timing out before Workload Identity can asynchronously retrieve a token, and a combination of the Job retrying in a new pod, the immediacy of the `berglas exec` call, and short client timeouts is leading to very frequent failures similar to #11. I have some ideas for workarounds, but would it be possible to have retry logic or configurable timeouts for Secret Manager access?