Timeout awaiting response headers on GKE with Workload Identity enabled

wmedlar commented 4 years ago

I have a post-deploy Job running on GKE that uses berglas to pull secrets from Secret Manager. It has been failing very frequently with:

failed to access secret /: failed to access secret: rpc error: code = Unauthenticated desc = transport: Get http://169.254.169.254/computeMetadata/v1/instance/service-accounts/default/token?scopes=https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fcloud-platform: net/http: timeout awaiting response headers

We're running with Workload Identity enabled, which allows a Kubernetes Service Account to act as a Google Service Account, a la kube2iam or kiam. Workload Identity retrieves the GSA tokens from a GKE Metadata Server DaemonSet; however, it seems to have difficulty with the pods from this Job:

2020-03-23T16:43:15.350593Z Syncing pod "aspirex/aspirex-ensure-default-member-fields-b2jzw" 
2020-03-23T16:43:15.350641Z Pod "aspirex/aspirex-ensure-default-member-fields-b2jzw" not found

The only thing unique about the Job is that it has restartPolicy=Never so we can better debug application failures. My theory is that berglas times out before Workload Identity can asynchronously retrieve a token. The combination of the Job retrying in a new pod, the immediacy of the berglas exec call, and short client timeouts then leads to very frequent failures, similar to #11.

I have some ideas for workarounds, but would it be possible to have retry logic or configurable headers for Secret Manager access?

sethvargo commented 4 years ago

Hmm - I think we might need to wrap the resolve call with the retry package.

wmedlar commented 4 years ago

I got the basic functionality working! Now the tough part is determining which errors are retryable. Got any advice here?

sethvargo commented 4 years ago

Any 5xx is safe. 409 is probably safe too.
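
As a sketch, that rule of thumb could be written as a small predicate over HTTP status codes. The function name `isRetryableStatus` is hypothetical. The error in the original report surfaces as a gRPC status rather than an HTTP code, so a real implementation would also need to map those; this sketch does not cover that.

```go
package main

import "net/http"

// isRetryableStatus (hypothetical helper) reports whether a request that
// failed with the given HTTP status code is worth retrying: any 5xx,
// plus 409 Conflict.
func isRetryableStatus(code int) bool {
	switch {
	case code >= 500 && code <= 599:
		return true
	case code == http.StatusConflict:
		return true
	default:
		return false
	}
}
```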
