fog / fog-aws

Module for the 'fog' gem to support Amazon Web Services http://aws.amazon.com/

Fog::AWS::Storage default retry behaviour guarantees 6s delay for 4xx class responses #690

rahim closed this issue 9 months ago

rahim commented 9 months ago

The default retry configuration for Fog::AWS::Storage https://github.com/fog/fog-aws/blob/b073ba81288af57bcce9144bdec4a77f5c6317b4/lib/fog/aws/storage.rb#L549

combined with Excon's defaults means that a GET of an S3 object that may or may not exist takes about 6 seconds to answer "no, the object was not found".

That retry configuration was added in https://github.com/fog/fog-aws/issues/674 to solve for a batch upload scenario.

This can be reproduced with something like:

require 'fog/aws'
require 'benchmark'

s3 = Fog::Storage.new(provider: "AWS")
bucket = s3.directories.new(key: 'some-bucket')
puts Benchmark.measure { file = bucket.files.get('some-non-existent-object') }

I have a script that runs this GET under various retry configurations (a sketch of how the options are passed follows the results). Example results, where the columns are user, system, and total CPU time, with real/wall-clock seconds in parentheses:

Fog/Excon defaults (retry_limit: 5, retry_interval: 1)

  0.227765   0.018269   0.246034 (  6.307515)

Using retry_limit: 5, retry_interval: 0.1

  0.242438   0.018477   0.260915 (  2.534994)

Using retry_limit: 1, retry_interval: 1

  0.046131   0.003609   0.049740 (  0.457492)

Using retry_errors: [Excon::Error::Timeout, Excon::Error::Socket, Excon::Error::Server]

  0.049035   0.003061   0.052096 (  0.422489)
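
The script itself isn't included here, but each of those configurations can be passed through fog's connection_options, which fog hands to Excon. A rough sketch of the harness (the bucket name is a placeholder, and exactly how these options merge with the defaults in storage.rb may vary):

require 'benchmark'
require 'fog/aws'

# Each run builds a connection with different Excon retry options and times a
# GET for a key that does not exist.
configs = [
  { retry_limit: 5, retry_interval: 1 },    # the Fog/Excon defaults
  { retry_limit: 5, retry_interval: 0.1 },
  { retry_limit: 1, retry_interval: 1 },
  { retry_errors: [Excon::Error::Timeout, Excon::Error::Socket, Excon::Error::Server] }
]

configs.each do |connection_options|
  s3     = Fog::Storage.new(provider: 'AWS', connection_options: connection_options)
  bucket = s3.directories.new(key: 'some-bucket')
  puts connection_options.inspect
  puts Benchmark.measure { bucket.files.get('some-non-existent-object') }
end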

I noticed this because we were using an old version of fog that predated the changes to the retry configuration. We backported just that configuration and observed a significant performance regression on a couple of endpoints where the probability of a GET for a non-existent S3 object was high.

Excon sets the following for retry_errors https://github.com/excon/excon/blob/85556aeb4af10e94f876a8cbdb764f0377fa0d3a/lib/excon/constants.rb#L19-L23

  DEFAULT_RETRY_ERRORS = [
    Excon::Error::Timeout,
    Excon::Error::Socket,
    Excon::Error::HTTPStatus
  ]

Somewhat weirdly, the inheritance tree for successful responses has them as descendants of Excon::Error. https://github.com/excon/excon/blob/85556aeb4af10e94f876a8cbdb764f0377fa0d3a/lib/excon/error.rb#L68-L125

Incomplete tree for some additional context:

StandardError
└──Excon::Error
   ├──Warning
   ├──Socket
   │  └──Certificate
   ├──InvalidHeaderKey
   ├──InvalidHeaderValue
   ├──Timeout
   ├──ResponseParse
   ├──ProxyConnectionError
   ├──ProxyParse
   ├──TooManyRedirects
   └──HTTPStatus
      ├──Informational
      ├──Success
      │  ├──OK
      │  ├──Created
      │  ├──Accepted
      │  └──[...]
      ├──Redirection
      ├──Client
      │  ├──BadRequest
      │  ├──Unauthorized
      │  ├──PaymentRequired
      │  ├──Forbidden
      │  ├──NotFound
      │  ├──MethodNotAllowed
      │  ├──NotAcceptable
      │  ├──ProxyAuthenticationRequired
      │  ├──RequestTimeout
      │  ├──Conflict
      │  ├──Gone
      │  └──[...]
      └──Server
         ├──InternalServerError
         ├──BadGateway
         ├──ServiceUnavailable
         └──GatewayTimeout
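
That tree is why a plain 404 ends up retried: Excon::Error::NotFound descends (via Excon::Error::Client) from Excon::Error::HTTPStatus, which is in DEFAULT_RETRY_ERRORS. A quick console check of the ancestry:

require 'excon'

# Class comparisons against Excon's error hierarchy
Excon::Error::NotFound < Excon::Error::Client      # => true
Excon::Error::Client   < Excon::Error::HTTPStatus  # => true
Excon::Error::NotFound < Excon::Error::HTTPStatus  # => true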

It's the inclusion of Excon::Error::Client here that I question: almost all of this class of errors are problems that a retry won't help with (RequestTimeout and TooManyRequests are perhaps exceptions), and all of them will behave like NotFound, causing performance to regress from <20ms to >6000ms.

I'll open up a PR here to patch the default configuration of retry_errors in this project, but it's certainly debatable whether the defaults being inherited are something that should be changed upstream in Excon itself.
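
The shape of that patch would be roughly to keep the existing retry_limit / retry_interval and add a retry_errors default that drops client errors, along these lines (illustrative sketch only, not the actual diff; the constant name is made up):

# Keep retrying connection-level and 5xx failures, but let 4xx responses
# (NotFound etc.) fail fast instead of burning through 5 attempts.
ILLUSTRATIVE_CONNECTION_OPTIONS = {
  retry_limit:    5,
  retry_interval: 1,
  retry_errors:   [Excon::Error::Timeout, Excon::Error::Socket, Excon::Error::Server]
}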

I do also wonder, on reflection, whether retry_limit: 5, retry_interval: 1 solved too narrowly for one use case: 1 second is an awfully long time in some contexts, particularly when some S3 calls can respond in only a few milliseconds.

geemus commented 9 months ago

Thanks for the detailed description. It's been some time, but if I recall correctly, this was intended to provide ease of use around the eventually consistent nature of S3, i.e. for a use case like:

s3 = Fog::Storage.new(provider: "AWS")
bucket = s3.directories.new(key: 'some-bucket')
file = bucket.files.create(key: 'some-new-object', body: 'some-new-value')
puts bucket.files.get('some-new-object')

The idea being that the retry helps avoid raising an error when you have reason to believe the object is already there. That doesn't really consider the case where the same call is used to check for existence, though, which as you suggest is made much worse by the retries.

In any event, I think 404 is really the only client error that we intended to catch/retry (for the eventual consistency case), and this may also be why the retries are somewhat long (as this tends to take somewhat longer to resolve).

It seems like we may want this to be more tunable and/or different for different calls, but I'm not entirely sure of the best approach that maintains existing behavior while fixing this case. I wonder if maybe we should use HEAD for existence checks instead, which would give us a place to apply different settings. It also has the benefit of making existence checks faster/cheaper, even for rather large objects. What do you think? I'm happy to discuss, and will look at the PRs presently.
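
Roughly what I have in mind for the existence-check path, using the existing head call (untested sketch; the helper is just for illustration):

# head issues a HEAD request and returns nil when the object is missing,
# so it avoids transferring the body and gives us a separate place to tune
# (or drop) retries for existence checks.
def object_exists?(bucket, key)
  !bucket.files.head(key).nil?
end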

rahim commented 9 months ago

I think this was intended to provide ease of use around the eventually consistent nature of S3

💡

That's not something that had occurred to me.

I'll have a think; like you say, it's not immediately clear what the best approach is.

rajyan commented 9 months ago

First of all, thank you for opening this issue. I wasn't fully aware of this consequence of the change in https://github.com/fog/fog-aws/pull/675, because our service mainly uploads batches of files and we never request keys that 404.

For the problems that @geemus is mentioning, AWS S3 now supports strong read-after-write consistency, and I believe that 404 scenario won't happen any more (I will test it later).

Related articles:
https://aws.amazon.com/jp/blogs/aws/amazon-s3-update-strong-read-after-write-consistency/
https://markdboyd.medium.com/how-to-handle-eventual-consistency-with-s3-5cfbe97d1f18
https://saturncloud.io/blog/understanding-the-amazon-s3-data-consistency-model/
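
A rough way to check would be a read immediately after a write, with the Excon-level retries effectively disabled (sketch; the bucket name is a placeholder):

require 'fog/aws'

# With retries turned down, a read straight after a write should still find
# the object if S3 is strongly consistent.
s3     = Fog::Storage.new(provider: 'AWS', connection_options: { retry_limit: 1 })
bucket = s3.directories.new(key: 'some-bucket')

bucket.files.create(key: 'consistency-check', body: 'x')
puts bucket.files.get('consistency-check') ? 'found' : 'missing'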

In any event, I think 404 is really the only client error that we intended to catch/retry (for the eventual consistency case)

So I now vote for configuring the defaults to retry only on non-client errors.

rajyan commented 9 months ago

For the hierarchy HTTPStatus > Success, I found an explanation here: https://github.com/excon/excon/issues/694

geemus commented 9 months ago

Oh, interesting. I definitely missed the memo on strong consistency. That does seem to remove the need for client error retries (or at least I don't recall any other real reason for it).

geemus commented 9 months ago

Although, that being said, I think other AWS services have historically had similar issues at times, i.e. I think EC2 wouldn't always return objects immediately after create. Does anyone know whether that is true or not? I wasn't able to find an easy reference with a web search.

rajyan commented 9 months ago

@geemus

Found this: https://docs.aws.amazon.com/AWSEC2/latest/APIReference/query-api-troubleshooting.html#eventual-consistency and it seems that the EC2 API follows an eventually consistent model.

I think EC2 wouldn't always return objects after create.

so this is still true for EC2.
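
On the compute side, fog already gives a model-level way to cope with that: wait_for polls until the created resource is visible and ready, e.g. (sketch; the AMI id is a placeholder):

require 'fog/aws'

# Poll at the model layer instead of retrying at the HTTP layer.
compute = Fog::Compute.new(provider: 'AWS')
server  = compute.servers.create(image_id: 'ami-12345678')
server.wait_for { ready? }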

Although, isn't the config in https://github.com/fog/fog-aws/pull/691/files only related to S3?

rajyan commented 9 months ago

For additional information, boto's retry handler
https://github.com/boto/botocore/blob/d95d441ef9b81ac4d0163fa97e736edf63a8b4b3/botocore/retryhandler.py#L105-L122

retries on general errors https://github.com/boto/botocore/blob/d95d441ef9b81ac4d0163fa97e736edf63a8b4b3/botocore/data/_retry.json#L100-L111

and on some specific S3 client errors https://github.com/boto/botocore/blob/d95d441ef9b81ac4d0163fa97e736edf63a8b4b3/botocore/data/_retry.json#L220-L235

geemus commented 9 months ago

@rajyan Great find on those docs, good point on the change only impacting S3, and thanks for the boto examples. Your references have been super helpful in thinking through this and figuring out a way forward, I really appreciate the help.

It seems like it's becoming clearer and clearer that this would be a good, safe change, at least for S3 (and quite possibly for Excon itself, though we might then want to add client errors back for the eventual-consistency cases in EC2). In the EC2 case, even though there could be similar performance issues to what we've been discussing, the use cases are so different that it probably still makes sense.

I will try to circle back to this, hopefully tomorrow if not in the next couple days, and make a call about getting some of this in. It's just getting late in the day and I have a one month old in the house, so I want to make sure I can come to this a bit fresher with more focus/energy.

rahim commented 9 months ago

@rajyan thank you for the links (both on S3 consistency and boto's behaviour) - both really useful.

boto's declarative approach is very nice; it's notable that:

It feels like exponential backoff support would be a nice addition to Excon, possibly a good default when users of Excon choose retry behaviour.
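
In the meantime something like this could be hand-rolled around a fog call; a rough sketch (the helper and its parameters are made up, it reuses the bucket object from the earlier snippets, and it assumes Excon-level retries are dialled down so the backoff owns the retrying):

# Retry connection-level and 5xx failures with exponentially growing sleeps;
# client errors such as 404 are not rescued and so fail fast.
def with_backoff(attempts: 4, base: 0.1)
  tries = 0
  begin
    yield
  rescue Excon::Error::Timeout, Excon::Error::Socket, Excon::Error::Server
    tries += 1
    raise if tries >= attempts
    sleep(base * (2**tries))
    retry
  end
end

with_backoff { bucket.files.get('some-object') }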

I think EC2 wouldn't always return objects after create. Anyone know if that is true or not? I wasn't able to find an easy references with web search.

@geemus https://docs.aws.amazon.com/AWSEC2/latest/APIReference/query-api-troubleshooting.html#eventual-consistency was the most relevant thing I found; it certainly reads that way.

It's worth noting the default retry behaviour as it currently stands:

It's just getting late in the day and I have a one month old in the house

Congratulations! :smile:

That's far more important than this, I hope you get some sleep!

geemus commented 9 months ago

@rahim thanks for pulling together that overview of the current defaults.

I've merged the suggested changes, at least for S3 in particular. Thanks to everyone for contributing to the discussion and talking it through.