Open keithduncan opened 2 years ago
Some guidance from S3 that all 5xx errors are retryable, recommending exponential backoff and up to 10 retries https://aws.amazon.com/premiumsupport/knowledge-center/http-5xx-errors-s3/
I took a look at aws-sdk-go-v2 to see if there had been any improvements in this area. It looks like some 5xx errors are retried by default.
Considering this affects artifact upload/download in the agent too I’m going to trial a migration in the s3secrets-helper to see how it goes.
Yeah nice. I used aws-sdk-go-v2 a few times before it was GA, and it seemed nicer than v1. It was in beta/preview status for ages, I think I missed, or forgot, that it went GA with v1.0.0 on 2021-01-19 👍🏼
I’ve found docs on the AWS CLI v1 and v2 that says they already retry 503 errors so I’ve ticked the git-credentials-s3-secrets
todo.
Even at the current default of 3 retries, I've lately been seeing this fail an incredible amount (I just had to rerun a build 5 times); error message looks something like:
~~~ :llama: Setting up elastic stack environment (v5.7.2)
Checking docker
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
Checking disk space
Disk space free: 217G
Inodes free: 125M
Configuring built-in plugins
Secrets plugin enabled
ECR plugin enabled
Docker-login plugin enabled
Discovered current region as "us-east-2"
fatal error: Could not discover the region for bucket "<my-secrets-bucket>": (operation error S3: HeadBucket, exceeded maximum number of attempts, 3, https response error StatusCode: 503, RequestID: <snip>, HostID: <snip>, api error ServiceUnavailable: Service Unavailable)
:sob: sorry this has proven so unreliable @glittershark, can you confirm whether your case falls into the category of a bucket in the same region as the stack?
While this was added in service of cross-region bucket support, we can’t afford to have it impair same-region support like this, and I apologise.
We can revise this to support a final fallback to the discovered "current" region if dynamic discovery fails. While that won’t be functional in the cross region case, if the API is this flakey the only other option I can think of is requiring a stack parameter for the secret bucket’s region which bypasses dynamic discovery completely.
@keithduncan oh, to be clear, I don't view this as your fault at all - from what I can tell this is entirely AWS being super flaky (unless I'm misunderstanding the issue) - regardless, super grateful for the awesome responsiveness you've had on all the issues in this repo :)
can you confirm whether your case falls into the category of a bucket in the same region as the stack?
Yeah, confirmed both the stack and the bucket are in us-east-2
We can revise this to support a final fallback to the discovered "current" region if dynamic discovery fails. While that won’t be functional in the cross region case, if the API is this flakey the only other option I can think of is requiring a stack parameter for the secret bucket’s region which bypasses dynamic discovery completely.
Personally I'd be totally happy with either of these options - passing an extra parameter is super straightforward, especially since all our stacks are in terraform.
We've started to see the frequency of this failure escalate quite a bit. We are also in us-east-2 and our secrets bucket is same-region.
We can revise this to support a final fallback to the discovered "current" region if dynamic discovery fails. While that won’t be functional in the cross region case, if the API is this flakey the only other option I can think of is requiring a stack parameter for the secret bucket’s region which bypasses dynamic discovery completely.
I’ve added both of these and they have landed in https://github.com/buildkite/elastic-ci-stack-for-aws/pull/962 which will be released in v5.7.3 shortly. The changes are minor beyond that change if any one is interested in using it pre-release by updating your stacks to use the master branch built template at https://s3.amazonaws.com/buildkite-aws-stack/master/aws-stack.yml
It has been reported in Slack that this API call can be an intermittent source of job failure.
This API call was added in v5.5.0 as part of supporting cross-region secrets buckets in the elastic-ci-stack-s3-secrets-hooks submodule.