Improve resiliency on auth adapter when transient errors happen

ex-aws / ex_aws

A flexible, easy to use set of clients AWS APIs for Elixir

https://hex.pm/packages/ex_aws

MIT License

1.26k stars 521 forks source link

Improve resiliency on auth adapter when transient errors happen #888

Closed juanazam closed 2 years ago

juanazam commented 2 years ago

Closes #798

As described on the issue, when an error happens when retrieving credentials, the error itself is stored on the auth cache, which results in weird errors happening on subsequent calls.

In order to not make extensive changes but still make the library more resilient, a simple retry algorithm is introduced:

30 second window till next credentials refresh is divided into 6 blocks of 5 seconds.
On each block, a node will wait from 0 to 5 seconds till it retries the auth request
If after the threshold nodes haven't been able to refresh the credentials successfully, an error is raised

This implementation doesn't include any breaking changes to the current API.

We were experiencing the issue happening on #798 every other day, and after running this patch, we haven't seen it happen once.

@bernardd I'm tagging you because you participated on the discussion on the issue and you have (at least some) context

Thanks!

bernardd commented 2 years ago

Hi @juanazam - thanks for the PR!

It looks good, I'm just a little concerned that it introduces a non-deterministic bit of behaviour in terms of its retry interval. What's the reasoning behind randomising the delay between each try?

I think I'd also like a config option for the number of retries - I'm sure 6 is a sensible default, but it's the kind of thing you can imagine people might like to have fail faster or retry more on.

Also, if you could give it a quick mix format so that it passes CI, that would be great, thanks :)

juanazam commented 2 years ago

@bernardd Thanks for your review

The reason for randomizing the delays is because the main reason we will see an error here is because of Throttling. When there are several nodes trying to get fresh credentials, Amazon has a 4 per second limit. Lets say 10 node crash, and both 10 nodes try to get fresh credentials at the same time, most of them will hit the 4 per second threshold. By introducing randomization as jitter, we are reducing the chances of requests colliding on the same second, resulting in less changes of nodes getting throttling issues.

We chose 6 retries, because we have 30 seconds for the next credentials refresh, and that's why we randomize the wait to 5 seconds top (6 * 5 = 30).

Making that a configuration would make this logic harder, and given right now we are just storing an error tuple on the cache, I think this may be already a win (more resiliency and no updates on the current API).

I addressed your comments are ran mix format, thanks again 🙏

bernardd commented 2 years ago

Okay, that makes sense. It's not necessarily an ideal solution, but you're right that it's still an improvement from the current behaviour. Thanks very much for your work on it :)

juanazam commented 2 years ago

@bernardd Thanks for taking the time to take a look and maintaining this project, it's used by many :)