Open guzzijones opened 3 years ago
Another possibility is for the autoscaler to query whether the st2actionrunner being shut down has taken ownership of any jobs and, if so, wait until it no longer has ownership.
What is an autoscaler in this context?
An AWS dynamic scaling policy.
Do we need some way to mark a specific st2actionrunner as "unschedulable"? Otherwise, in a heavily used, dynamic st2 environment, it will pick up the next task from the queue as soon as the previous one is finished.
As for the mechanism: maybe send SIGTERM (or another signal) to the st2actionrunner process so that it stops picking up new jobs and finishes the one in progress? Or do we need something more advanced, like a new API endpoint to drain the st2actionrunner?
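To make the "stop picking up new jobs, finish the current one" idea concrete, here is a minimal sketch of a signal-driven drain. It is a toy illustration, not st2actionrunner's actual code; the `DrainingWorker` class and its method names are hypothetical.

```python
import signal
from collections import deque


class DrainingWorker:
    """Toy worker (hypothetical) that stops taking new jobs once a stop is requested."""

    def __init__(self, jobs):
        self.jobs = deque(jobs)  # pending jobs, each a zero-argument callable
        self.completed = []      # results of jobs that ran to completion
        self.stopping = False

    def request_stop(self, signum=None, frame=None):
        # Has the signature of a signal handler; it only flips a flag.
        self.stopping = True

    def install_signal_handler(self):
        # signal.signal() must be called from the main thread.
        signal.signal(signal.SIGTERM, self.request_stop)

    def run(self):
        # The flag is checked only between jobs, so a job in progress always finishes.
        while self.jobs and not self.stopping:
            job = self.jobs.popleft()
            self.completed.append(job())
        # Whatever is left was never started and can go to another worker.
        return list(self.jobs)
```

A process would call `install_signal_handler()` once at startup and then `run()`; after SIGTERM arrives, `run()` returns as soon as the current job ends.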
It looks like a SIGTERM is all that is needed: the st2actionrunner will then pop the message back for scheduling and die. The only problem is that AWS Dynamic Scaling will immediately kill the VM unless you use the boto3 record_lifecycle_action_heartbeat call to tell AWS to wait while the process is still shutting down. I see this as a supplemental Python script specific to AWS autoscaling. I don't think it should be part of the core st2 codebase, imo.
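A rough sketch of what such a supplemental script could look like follows. It assumes a termination lifecycle hook is already configured on the Auto Scaling group and that the instance is in its wait state. The function name `hold_instance_until_idle` and the `is_busy` callback are hypothetical; the two client methods are the boto3 Auto Scaling calls `record_lifecycle_action_heartbeat` and `complete_lifecycle_action`.

```python
import time


def hold_instance_until_idle(client, hook_name, group_name, instance_id,
                             is_busy, interval=30, sleep=time.sleep):
    """Send lifecycle heartbeats while is_busy() is true, then release the instance.

    `client` is expected to behave like boto3.client("autoscaling").
    `is_busy` is a hypothetical callback, e.g. "are runners still alive?".
    Returns the number of heartbeats sent.
    """
    beats = 0
    while is_busy():
        # Each heartbeat asks AWS to keep the instance in its wait state a while longer.
        client.record_lifecycle_action_heartbeat(
            LifecycleHookName=hook_name,
            AutoScalingGroupName=group_name,
            InstanceId=instance_id,
        )
        beats += 1
        sleep(interval)
    # Work is done: let the Auto Scaling group proceed with termination.
    client.complete_lifecycle_action(
        LifecycleHookName=hook_name,
        AutoScalingGroupName=group_name,
        InstanceId=instance_id,
        LifecycleActionResult="CONTINUE",
    )
    return beats
```

In real use the client would come from `boto3.client("autoscaling")`, and `interval` should be shorter than the hook's heartbeat timeout. Passing the client in keeps the loop testable without AWS credentials.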
Yeah, right.
Higher-level orchestrator logic should give st2actionrunner some time (like terminationGracePeriodSeconds) to finish its work after sending the signal.
In the context of K8s, when the pod is terminated it goes through the following lifecycle:
More:
This ticket will hold research into graceful shutdown of st2actionrunner. This is in anticipation of adding a way through OS or otherwise to allow us to scale st2actionrunners based on some factor.
My initial research led me to this section of code where the st2actionrunner takes ownership of a scheduled action: st2actionrunner takes ownership
The st2actionrunner abandon code is here: st2actionrunner abandon code
The teardown for the parent process is here: st2actionrunner teardown
We are probably going to create a custom heartbeat script that monitors the number of st2actionrunner processes on a VM and tells the autoscaler to wait until the work is done.
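The process-counting half of such a script could be as small as the sketch below. It assumes a POSIX `ps` is available and that runner processes can be recognised by the string `st2actionrunner` in their command line; both helper names are hypothetical.

```python
import subprocess


def count_runners(command_lines, name="st2actionrunner"):
    """Count command lines that mention the runner process name.

    Plain substring match: an unrelated command that merely mentions the
    name (for example a grep for it) would be counted as well.
    """
    return sum(1 for line in command_lines if name in line)


def running_command_lines():
    """Return the command line of every process, as reported by `ps`."""
    out = subprocess.run(
        ["ps", "-eo", "args="],  # "args=" prints full command lines with no header
        capture_output=True, text=True, check=True,
    )
    return out.stdout.splitlines()
```

A heartbeat loop could then treat `count_runners(running_command_lines()) > 0` as "work is still in progress" and keep asking the autoscaler to wait.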