buildkite / elastic-ci-stack-for-aws

An auto-scaling cluster of build agents running in your own AWS VPC
https://buildkite.com/docs/quickstart/elastic-ci-stack-aws
MIT License
417 stars 268 forks source link

Retry s3.GetBucketRegion API #937

Open keithduncan opened 2 years ago

keithduncan commented 2 years ago

It has been reported in Slack that this API call can be an intermittent source of job failure.

This API call was added in v5.5.0 as part of supporting cross-region secrets buckets in the elastic-ci-stack-s3-secrets-hooks submodule.

keithduncan commented 2 years ago

Some guidance from S3 that all 5xx errors are retryable, recommending exponential backoff and up to 10 retries https://aws.amazon.com/premiumsupport/knowledge-center/http-5xx-errors-s3/

keithduncan commented 2 years ago

I took a look at aws-sdk-go-v2 to see if there had been any improvements in this area. It looks like some 5xx errors are retried by default.

Considering this affects artifact upload/download in the agent too I’m going to trial a migration in the s3secrets-helper to see how it goes.

pda commented 2 years ago

Yeah nice. I used aws-sdk-go-v2 a few times before it was GA, and it seemed nicer than v1. It was in beta/preview status for ages, I think I missed, or forgot, that it went GA with v1.0.0 on 2021-01-19 👍🏼

keithduncan commented 2 years ago

I’ve found docs on the AWS CLI v1 and v2 that says they already retry 503 errors so I’ve ticked the git-credentials-s3-secrets todo.

glittershark commented 2 years ago

Even at the current default of 3 retries, I've lately been seeing this fail an incredible amount (I just had to rerun a build 5 times); error message looks something like:

~~~ :llama: Setting up elastic stack environment (v5.7.2)

Checking docker

CONTAINER ID   IMAGE     COMMAND   CREATED   STATUS    PORTS     NAMES

Checking disk space

Disk space free: 217G

Inodes free: 125M

Configuring built-in plugins

Secrets plugin enabled

ECR plugin enabled

Docker-login plugin enabled

Discovered current region as "us-east-2"

fatal error: Could not discover the region for bucket "<my-secrets-bucket>": (operation error S3: HeadBucket, exceeded maximum number of attempts, 3, https response error StatusCode: 503, RequestID: <snip>, HostID: <snip>, api error ServiceUnavailable: Service Unavailable)
keithduncan commented 2 years ago

:sob: sorry this has proven so unreliable @glittershark, can you confirm whether your case falls into the category of a bucket in the same region as the stack?

While this was added in service of cross-region bucket support, we can’t afford to have it impair same-region support like this, and I apologise.

We can revise this to support a final fallback to the discovered "current" region if dynamic discovery fails. While that won’t be functional in the cross region case, if the API is this flakey the only other option I can think of is requiring a stack parameter for the secret bucket’s region which bypasses dynamic discovery completely.

glittershark commented 2 years ago

@keithduncan oh, to be clear, I don't view this as your fault at all - from what I can tell this is entirely AWS being super flaky (unless I'm misunderstanding the issue) - regardless, super grateful for the awesome responsiveness you've had on all the issues in this repo :)

can you confirm whether your case falls into the category of a bucket in the same region as the stack?

Yeah, confirmed both the stack and the bucket are in us-east-2

We can revise this to support a final fallback to the discovered "current" region if dynamic discovery fails. While that won’t be functional in the cross region case, if the API is this flakey the only other option I can think of is requiring a stack parameter for the secret bucket’s region which bypasses dynamic discovery completely.

Personally I'd be totally happy with either of these options - passing an extra parameter is super straightforward, especially since all our stacks are in terraform.

geoffharcourt commented 2 years ago

We've started to see the frequency of this failure escalate quite a bit. We are also in us-east-2 and our secrets bucket is same-region.

keithduncan commented 2 years ago

We can revise this to support a final fallback to the discovered "current" region if dynamic discovery fails. While that won’t be functional in the cross region case, if the API is this flakey the only other option I can think of is requiring a stack parameter for the secret bucket’s region which bypasses dynamic discovery completely.

I’ve added both of these and they have landed in https://github.com/buildkite/elastic-ci-stack-for-aws/pull/962 which will be released in v5.7.3 shortly. The changes are minor beyond that change if any one is interested in using it pre-release by updating your stacks to use the master branch built template at https://s3.amazonaws.com/buildkite-aws-stack/master/aws-stack.yml