atlassian / escalator

Escalator is a batch or job optimized horizontal autoscaler for Kubernetes
Apache License 2.0

AWS rate limit errors cause Escalator to crash #140

Closed chrisbeach closed 5 years ago

chrisbeach commented 5 years ago

Escalator crashes due to AWS rate limits being hit:

```
failed to describe asgs [my-asg]. err: Throttling: Rate exceeded\n\tstatus code: 400, request id: 06f13e44-1ee8-11e9-...
Failed to create cloudproviderThrottling: Rate exceeded\n\tstatus code: 400, request id: 06f13e44-1ee8-11e9-a92e...
```

I believe Escalator should handle rate limit errors more gracefully, with exponential back-off.
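For illustration, a retry wrapper along those lines might look like the sketch below. This is not Escalator code: `retryWithBackoff` is a hypothetical helper, the 500ms base delay and 5-attempt budget are arbitrary, and the `"Throttling"` error code is taken from the log above.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/awserr"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/autoscaling"
)

// retryWithBackoff is a hypothetical helper, not Escalator code: it retries
// fn with exponential backoff plus jitter while AWS reports throttling, and
// returns immediately on success or on any other kind of error.
func retryWithBackoff(maxRetries int, fn func() error) error {
	backoff := 500 * time.Millisecond
	for attempt := 0; ; attempt++ {
		err := fn()
		if err == nil {
			return nil
		}
		aerr, ok := err.(awserr.Error)
		if !ok || aerr.Code() != "Throttling" || attempt >= maxRetries {
			return err // not a throttling error, or retries exhausted
		}
		// Sleep for the current backoff plus up to 50% jitter, then double it.
		time.Sleep(backoff + time.Duration(rand.Int63n(int64(backoff)/2)))
		backoff *= 2
	}
}

func main() {
	svc := autoscaling.New(session.Must(session.NewSession()))
	input := &autoscaling.DescribeAutoScalingGroupsInput{
		AutoScalingGroupNames: aws.StringSlice([]string{"my-asg"}),
	}
	err := retryWithBackoff(5, func() error {
		_, err := svc.DescribeAutoScalingGroups(input)
		return err
	})
	fmt.Println("describe result err:", err)
}
```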

awprice commented 5 years ago

Hey @chrisbeach, thanks for raising this issue!

As part of the design of Escalator, we assumed it would be running as a deployment in a Kubernetes cluster, which lets us lean on core Kubernetes functionality to restart Escalator if it crashes or exits unexpectedly. If Escalator restarts often within a short period of time, Kubernetes will put the pod into a CrashLoopBackOff state and back off exponentially before attempting to start the pod again.
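For reference, with the kubelet's default settings that restart backoff starts at roughly 10 seconds and doubles up to a 5-minute cap. A quick sketch of the schedule (my illustration of the documented behaviour, not Kubernetes source):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Approximate kubelet restart backoff with default settings:
	// 10s, 20s, 40s, ... doubling up to a 5 minute cap.
	delay := 10 * time.Second
	capDelay := 5 * time.Minute
	for restart := 1; restart <= 7; restart++ {
		fmt.Printf("restart %d: wait %v before starting pod again\n", restart, delay)
		delay *= 2
		if delay > capDelay {
			delay = capDelay
		}
	}
}
```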

The other benefit of relying on Kubernetes to restart Escalator when problems occur is that it is cloud provider agnostic: if we implement a new cloud provider, we won't have to add rate limit backoff logic for that provider as well.

Handling the rate limit errors specifically would also hide the underlying problem: that you are receiving rate limit errors from AWS at all.

I would also recommend reaching out to AWS, or investigating why you are being rate limited, as Escalator doesn't make that many ASG API calls while operating. There may be something else in your environment that is consuming the rate limit.

Let me know if you have any other questions.

chrisbeach commented 5 years ago

Thanks @awprice. That explanation makes total sense, and I understand the reasoning.

I have to admit I'm unfamiliar with the internals of Escalator, so I can only speculate here. My only concern with the crash/restart strategy is that Escalator's internal state (if it has any) may become inconsistent after a crash/restart?

awprice commented 5 years ago

You're right, Escalator does have internal state, which it retrieves from the Kubernetes API server. Its internal state consists of all of the nodes and pods in the cluster, so it could be considered effectively stateless, since that state is derived entirely from whatever is in the Kubernetes API server.

When it starts up (either after a crash/restart or for the first time), it retrieves the nodes and pods from the Kubernetes API server and then starts working based on those nodes/pods.

So to answer your question: no, it won't become inconsistent. It will simply re-synchronise with the Kubernetes API server and continue operating as it was before the crash.
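To make the resync concrete, here is a minimal client-go sketch of the kind of startup snapshot described above. It's my illustration rather than Escalator's actual startup code, which may be structured differently:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	// Escalator runs as a deployment, so in-cluster config applies.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// On startup (first boot or after a crash), snapshot the cluster state.
	// This listing is the only "state" required, so a restart simply
	// rebuilds it from whatever the API server currently holds.
	nodes, err := client.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	pods, err := client.CoreV1().Pods(metav1.NamespaceAll).List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Printf("re-synchronised: %d nodes, %d pods\n", len(nodes.Items), len(pods.Items))
}
```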

chrisbeach commented 5 years ago

@awprice good to know. Thanks for taking the time to explain this.