ansible / proposals

Repository for sharing and tracking progress on enhancement proposals for Ansible.

Proposal: Implicit Retry for Boto Resources and Clients in AWS Module Utils #128

Closed orthanc closed 6 years ago

orthanc commented 6 years ago

Proposal: Implicit Retry for Boto Resources and Clients in AWS Module Utils

Author: Ed Costello <@orthanc> IRC: orthanc

Date: 2018/07/14

Attn: AWS Working Group (@shertel, @ryansb, @mikedlr, @willthames)

Motivation

Most AWS APIs are subject to some form of throttling. The current best practice for handling this in AWS modules is to use the AWSRetry decorator to add backoff-and-retry behaviour to AWS calls.

As this is a decorator it must be applied to methods, and importantly it must be applied inside any error handling, since the decorator needs to catch and process the error before the module's own handling does.

This effectively means that, in order to use the AWSRetry functionality, each AWS API call must be wrapped in its own method, but the encapsulation is incomplete since the error handling must still be done outside the decorated method.
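For illustration, a minimal sketch of the current pattern, assuming the existing AWSRetry.jittered_backoff decorator and fail_json_aws helper; the _describe_instances helper and filter values are invented for this example:

# Hypothetical example of the current pattern: the API call is wrapped in its
# own decorated helper, while error handling stays outside the helper.
from ansible.module_utils.ec2 import AWSRetry
from botocore.exceptions import BotoCoreError, ClientError

@AWSRetry.jittered_backoff(retries=10)
def _describe_instances(client, **params):
    return client.describe_instances(**params)

def get_instances(module, client):
    try:
        filters = [{'Name': 'instance-state-name', 'Values': ['running']}]
        return _describe_instances(client, Filters=filters)
    except (BotoCoreError, ClientError) as e:
        module.fail_json_aws(e, msg="Failed to describe instances")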

Part of ansible/ansible#36623 was adding implicit retry support for uses of the boto3 client variant of the API (details described in API Throttling and Pagination). The same problem exists for the higher level boto3 resource API but was not addressed as part of 36623.
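For comparison, a sketch of the existing implicit retry mechanism on clients, mirroring the boto3_conn examples used later in this proposal; the exact connection parameters are illustrative:

# Creating a client with a retry decorator attached
ec2 = boto3_conn(None, conn_type='client', resource='ec2', region='us-east-1', retry_decorator=AWSRetry.jittered_backoff())

# Each call opts in to the implicit retry with aws_retry=True
ec2.describe_instances(aws_retry=True)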

The goal of this proposal is to get feedback on possible options for extending this implicit retry concept to boto3 resources, so the focus is on the UX for developers of modules rather than on how to implement it.

Problems

The key problems that the implicit retries are solving are:

The developer UX introduced for implicit retries on boto3 clients doesn't work for resources because:

So the main question is how to indicate that a particular set of resource interactions should be retried.

It may be that extending implicit retries to resources creates more confusion than it's worth, and we'd be better off limiting this functionality to the low level clients and using the method decorators for resource-based retries.

Solution proposal

Use python context managers to indicate retry is required.

Example:

# Creating Resource
s3 = boto3_conn(None, conn_type='resource', resource='s3', region='us-east-1', retry_decorator=AWSRetry.jittered_backoff())

# With Retry
with AWSRetry.implicit:
    s3.Bucket('mybucket').upload_file('/tmp/hello.txt', 'hello.txt')

Retrying would actually be done on the client calls that back the resource. Essentially this would be equivalent to every call the resource makes on the low level client passing the aws_retry=True argument.

We'd achieve this by wrapping the underlying client methods with the retry decorator as is done now for implicit retries on clients. But in addition to the current aws_retry parameter the decorator would also look for a thread local that's set by the context manager.
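A rough sketch of how that thread local could work; names like implicit_retry and _retry_state are purely illustrative, not existing module_utils API:

import threading
from contextlib import contextmanager

_retry_state = threading.local()

@contextmanager
def implicit_retry():
    # Mark the current thread as wanting retries for the duration of the block.
    previous = getattr(_retry_state, 'enabled', False)
    _retry_state.enabled = True
    try:
        yield
    finally:
        _retry_state.enabled = previous

def _should_retry(aws_retry=False):
    # The wrapped client methods would consult both the explicit aws_retry
    # argument and the thread-local flag set by the context manager.
    return aws_retry or getattr(_retry_state, 'enabled', False)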

Pros:

Cons:

Considered and Discarded Options

A number of other options were considered before settling on the above as the best way to extend implicit retries to resources.

Decorate Resource Methods

Investigated whether we could simply decorate the boto resource so its methods accept an aws_retry parameter, as we did for clients.

Example:

# Creating Resource
s3 = boto3_conn(None, conn_type='resource', resource='s3', region='us-east-1', retry_decorator=AWSRetry.jittered_backoff())

# With Retry
s3.Bucket('mybucket').upload_file('/tmp/hello.txt', 'hello.txt', aws_retry=True)

This turned out to be very hard to use and was discarded because it is not obvious where the AWS API calls are actually made. A deep understanding of boto is required to know which methods to add aws_retry to, because much of the access is lazy and deferred until the information is needed.

This is further complicated by property access triggering lazy loads where aws_retry can't be specified, and by iterators and other such objects also triggering lazy loads while remaining undecorated.
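For example (illustrative only), both attribute access and iteration on a resource trigger API calls with nowhere to pass aws_retry:

bucket = s3.Bucket('mybucket')

# Attribute access lazily loads the bucket's attributes behind the scenes;
# there is no argument list in which to pass aws_retry=True.
created = bucket.creation_date

# Iterating a collection also makes paginated API calls that bypass any
# decorated methods.
for obj in bucket.objects.all():
    print(obj.key)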

Concluded that this would be so complex to use correctly that it would not achieve any benefit.

Inherited aws_retry

Investigated whether we could set aws_retry on a resource and have it "inherited" for all use of that resource.

Example:

# Creating Resource
s3 = boto3_conn(None, conn_type='resource', resource='s3', region='us-east-1', retry_decorator=AWSRetry.jittered_backoff())

# With Retry
s3.Bucket('mybucket', aws_retry=True).upload_file('/tmp/hello.txt', 'hello.txt')

This can be made to work, but it becomes very complex and fragile because all properties need to be overridden and iterators still need to be wrapped.

Additionally, it's not possible to ensure the state is consistently propagated to sub-resources, resulting in a very inconsistent experience. E.g. while the above example might work, the following one wouldn't.

# With Retry
s3.Bucket('mybucket', aws_retry=True).Object('hello.txt').upload_file('/tmp/hello.txt')

This also essentially removes control of where retries are and aren't applied. The bucket could be created far away from where it's used, resulting in confusion about what is and isn't retried.

Concluded that this would be so complex to use correctly that it would not achieve any benefit.

Documentation (optional)

This would eventually be documented in the AWS module guidelines, but my preference would be to add the functionality and start using it, and only document it as a best practice for other contributors once we're sure the kinks are worked out.

Anything else?

Note: The initial implementation of implicit retries for boto3 clients is based on wrapping the client object. While this is functional, boto3 actually provides extension events for this kind of modification of the behaviour of clients.

ansible/ansible#42766 shifts the implementation of implicit retries for boto clients to using this mechanism and would be a prerequisite for extending this functionality to boto3 resources.
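For reference, a minimal illustration of botocore's event system; the handler and event name here just show the hook mechanism and are not the retry implementation in ansible/ansible#42766:

import boto3

client = boto3.client('s3', region_name='us-east-1')

def on_provide_params(params, **kwargs):
    # Invoked while each operation's parameters are being assembled;
    # handlers like this can inspect or adjust client behaviour without
    # wrapping the client object itself.
    return params

# 'provide-client-params.s3.*' is one of botocore's documented event names.
client.meta.events.register('provide-client-params.s3.*', on_provide_params)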

ryansb commented 6 years ago

I'm not huge on doing very much to the .resource parts of Boto3, since they are much less consistent than the equivalent .client options. I agree that it could be complex/fragile to extend the current aws_retry=True approach to the general case.

I'm also not totally convinced that with ... will end up being all that much less boilerplate for most modules. Some API calls aren't idempotent (don't have ClientRequestToken params) so a retry might actually cause issues (developer awareness to not put that item in a retry context manager would help this). I do like that this makes it easy to customize the retries on calls inside shared code (the ELB+NLB modules, for example).

Would this mean the current aws_retry=True mechanic would go away in favor of the context manager? If so, we'd have to change existing code to not use this mechanic.

willthames commented 6 years ago

I'm inclined to agree with @ryansb - we barely use resources in any of the modules, and it's not that hard to rewrite code that does use resources to use the client instead.

All of the arguments in this proposal that make aws_retry hard with resources just seem like arguments not to use resources.

Are there any compelling use cases of resources that I'm not aware of?

orthanc commented 6 years ago

Would this mean the current aws_retry=True mechanic would go away in favor of the context manager? If so, we'd have to change existing code to not use this mechanic.

If we were to go ahead with this I'd suggest we support both mechanisms, with aws_retry=True as the recommended method when using clients and the context manager recommended only when using resources.

I find the clients are a better level of abstraction for writing modules anyway. As far as I can tell, there's only a single module currently that uses resources rather than clients (s3_website_bucket), which would suggest that the community agrees.

Overall I think extending this to resources would add more complexity than it's worth given this low usage. Half the reason for raising this is to make sure that if we do decide not to extend the implicit retry functionality to resources, we have something to refer back to to explain why.

I do like that this makes it easy to customize the retries on calls inside shared code (the ELB+NLB modules, for example).

I hadn't considered this aspect; with that goal in mind rather than resource support, there might be a better option. When I shift the client to the event approach shown in ansible/ansible#42766 I could also add a marker to the client to indicate whether implicit retries are enabled or not.

Then we could do the following in those common helpers:

# In constructor of helper:
self.connection = connection
self.aws_retry_param = {}
if connection.aws_retry_supported:
    self.aws_retry_param = {"aws_retry": True}

# At each callsite
self.connection.some_call(...., **self.aws_retry_param)

That way, if the calling module specifies a retry decorator when getting the client, the common code can retry using that decorator wherever it deems appropriate. If the calling module does not specify a retry decorator then there are no retries.

orthanc commented 6 years ago

I think we can take this as consensus that extending this functionality to resources is unwise. Closing.