Closed chrisbeach closed 5 years ago
Hey @chrisbeach, thanks for raising this issue!
As part of our design of Escalator, it was assumed that Escalator would be running as a deployment in a Kubernetes cluster, which lets us rely on core Kubernetes functionality to restart Escalator if it crashes or exits unexpectedly. If Escalator restarts frequently in a short amount of time, Kubernetes will put the pod into a CrashLoopBackOff state and will back off exponentially before attempting to start the pod again.
The other benefit of relying on Kubernetes to restart Escalator when problems occur is that it is cloud provider agnostic: if we implement a new cloud provider, we won't have to add rate limit backoff logic for that cloud provider too.
Specifically handling the rate limit errors would also hide the underlying problem - that you are receiving rate limit errors from AWS in the first place.
I would also recommend reaching out to AWS or investigating why you are being rate limited, as Escalator doesn't make that many ASG API calls when operating. There may be something else in your environment that is consuming the rate limit.
Let me know if you have any other questions.
Thanks @awprice. That explanation makes total sense, and I understand the reasoning.
I have to admit I'm unfamiliar with the internals of Escalator, so I can only speculate here - my only concern with the crash/restart strategy is that Escalator's internal state (if it has any) may become inconsistent after a crash/restart.
You're right, Escalator does have internal state, which is retrieved from the Kubernetes API server. Its internal state consists of all of the nodes and pods in the cluster, so Escalator could effectively be considered stateless, as its state is derived entirely from whatever is in the Kubernetes API server.
When it starts up (either after a crash/restart or for the first time) it will retrieve the nodes and pods from the Kubernetes API server and will then start working based on those nodes/pods.
So to answer your question - no, it won't become inconsistent; it will simply re-synchronise with the Kubernetes API server and continue operating as it was before the crash.
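The rebuild-on-startup pattern described above can be sketched as follows. The `Lister` interface and `rebuild` function are hypothetical names standing in for Escalator's reads from the Kubernetes API server, not its actual API:

```go
package main

import "fmt"

// Lister abstracts the reads Escalator performs against the Kubernetes
// API server on startup. Names here are illustrative, not Escalator's.
type Lister interface {
	ListNodes() []string
	ListPods() []string
}

// state is rebuilt entirely from the API server on every startup, which
// is why a crash/restart leaves nothing inconsistent: the next start
// simply re-lists everything and works from that.
type state struct {
	nodes []string
	pods  []string
}

func rebuild(l Lister) state {
	return state{nodes: l.ListNodes(), pods: l.ListPods()}
}

// fakeAPI stands in for a real API server connection for this example.
type fakeAPI struct{}

func (fakeAPI) ListNodes() []string { return []string{"node-a", "node-b"} }
func (fakeAPI) ListPods() []string  { return []string{"pod-1"} }

func main() {
	s := rebuild(fakeAPI{})
	fmt.Println(len(s.nodes), "nodes,", len(s.pods), "pods")
}
```

Since the source of truth is the API server rather than anything persisted by Escalator itself, a restart is equivalent to a fresh start.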
@awprice good to know. Thanks for taking the time to explain this.
Escalator crashes due to AWS rate limits being hit:
I believe Escalator should handle rate limit errors more gracefully, with exponential back-off.