SlideRuleEarth / sliderule

Server and client framework for on-demand science data processing in the cloud
https://slideruleearth.io

Implement graceful ASG shutdowns #376

Open jpswinski opened 4 months ago

jpswinski commented 4 months ago

When a cluster scales in, the terminating nodes are shut down almost immediately, with no consideration for the processing requests they may be running. In-flight requests simply error out, and in the case of proxied requests, the results are lost.

The desired solution is to implement a graceful shutdown using ASG lifecycle hooks, specifically the terminate hook. The terminating node should consume the notification, stop registering with the orchestrator, and then wait 10 minutes. Ideally, if the node can determine whether any requests are still running on it, it could instead wait until all requests have completed before sending the continue command to the lifecycle hook, which would avoid the blanket 10-minute wait.
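
A rough Python sketch of the drain step described above, only to make the flow concrete. It is not SlideRule's implementation: `drain_and_continue`, `stop_registering`, and `active_requests` are hypothetical names, and the sketch assumes the node has already learned that it is terminating and knows the hook name, group name, and its own instance ID. `asg_client` is expected to behave like a boto3 `autoscaling` client, of which only `complete_lifecycle_action` is used.

```python
import time
from typing import Callable


def drain_and_continue(
    asg_client,
    hook_name: str,
    group_name: str,
    instance_id: str,
    stop_registering: Callable[[], None],   # hypothetical: stop registering with the orchestrator
    active_requests: Callable[[], int],     # hypothetical: number of requests still running here
    max_wait_s: float = 600.0,              # the 10-minute upper bound from the issue
    poll_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Drain this node, then let the ASG proceed with termination.

    Returns True if all requests finished before the deadline,
    False if the deadline was reached first.
    """
    # Step 1: stop advertising this node so no new requests arrive.
    stop_registering()

    # Step 2: wait until the node is idle or the deadline passes.
    deadline = clock() + max_wait_s
    drained = active_requests() == 0
    while not drained and clock() < deadline:
        sleep(poll_s)
        drained = active_requests() == 0

    # Step 3: tell the lifecycle hook to continue with termination.
    asg_client.complete_lifecycle_action(
        LifecycleHookName=hook_name,
        AutoScalingGroupName=group_name,
        LifecycleActionResult="CONTINUE",
        InstanceId=instance_id,
    )
    return drained
```

If the node cannot report its running requests, passing an `active_requests` that never returns 0 reduces this to the plain 10-minute wait. The heartbeat timeout configured on the hook would need to be at least `max_wait_s`, otherwise the ASG proceeds on its own before the node calls back.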

jpswinski commented 4 months ago

See https://circleci.com/blog/graceful-shutdown-using-aws