StackStorm / community

Async conversation about ideas, planning, roadmap, issues, RFCs, etc around StackStorm
https://stackstorm.com/
Apache License 2.0

st2actionrunner graceful shutdown #86

Open guzzijones opened 2 years ago

guzzijones commented 2 years ago

This ticket will hold research into graceful shutdown of st2actionrunner. This is in anticipation of adding a way, through the OS or otherwise, to allow us to scale st2actionrunners based on some factor.

My initial research led me to this section of code where the st2actionrunner takes ownership of a scheduled action: st2actionrunner takes ownership

The st2actionrunner abandon code is here: st2actionrunner abandon code

The teardown for the parent process is here: st2actionrunner teardown

We are probably going to create a custom heartbeat script that monitors the number of st2actionrunner processes on a VM to tell the autoscaler to wait until the work is done.

import boto3

client = boto3.client('autoscaling')

# Placeholder values from the boto3 docs; replace with the real hook name,
# ASG name, and token/instance id of the instance being scaled in.
response = client.record_lifecycle_action_heartbeat(
    LifecycleHookName='string',
    AutoScalingGroupName='string',
    LifecycleActionToken='string',
    InstanceId='string'
)
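A rough sketch of what that supplemental script could look like (everything here is hypothetical, not part of st2: the 'st2-drain-hook' and 'st2-actionrunners' names and the pgrep-based process count are assumptions): keep sending lifecycle heartbeats while any st2actionrunner process is still alive on the VM, then let the scale-in proceed.

import subprocess
import time

import boto3
import requests

client = boto3.client('autoscaling')

# Instance id of this VM from the EC2 instance metadata service
# (IMDSv2-only setups need to fetch a token first).
instance_id = requests.get(
    'http://169.254.169.254/latest/meta-data/instance-id', timeout=2
).text


def running_actionrunners():
    # pgrep prints one PID per line; an empty result means no runners left.
    result = subprocess.run(
        ['pgrep', '-f', 'st2actionrunner'], capture_output=True, text=True
    )
    return len(result.stdout.split())


# Keep AWS waiting while the runners drain the work they already own.
while running_actionrunners() > 0:
    client.record_lifecycle_action_heartbeat(
        LifecycleHookName='st2-drain-hook',        # hypothetical hook name
        AutoScalingGroupName='st2-actionrunners',  # hypothetical ASG name
        InstanceId=instance_id,
    )
    time.sleep(60)

# No runners left; tell AWS it can terminate the instance.
client.complete_lifecycle_action(
    LifecycleHookName='st2-drain-hook',
    AutoScalingGroupName='st2-actionrunners',
    InstanceId=instance_id,
    LifecycleActionResult='CONTINUE',
)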
guzzijones commented 2 years ago

Another possibility is for the autoscaler system to query whether the st2actionrunner being shut down has taken ownership of any jobs. If so, wait until it no longer has ownership.
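For illustration only, that check could start from something like the sketch below. It assumes the st2 API's /api/v1/executions endpoint accepts a status filter and St2-Api-Key authentication (verify both against the deployed st2 version), and it only checks for running executions cluster-wide; narrowing it to the specific runner being drained would need additional filtering, which is left open here.

import requests

# Hypothetical check: ask the st2 API whether any executions are still running.
resp = requests.get(
    'https://st2.example.com/api/v1/executions',  # hypothetical st2 API host
    params={'status': 'running', 'limit': 1},
    headers={'St2-Api-Key': 'REPLACE_ME'},
)
resp.raise_for_status()
has_running_work = len(resp.json()) > 0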

nzlosh commented 2 years ago

What is an autoscaler in this context?

guzzijones commented 2 years ago

An AWS dynamic scaling policy (EC2 Auto Scaling).

arm4b commented 2 years ago

Do we need some kind of way to mark the specific st2actionrunner as "unschedulable"? Otherwise, in a heavily used dynamic st2 environment it'll pick up the next task from the queue once the previous one is finished.

Talking about the mechanisms: maybe sending the SIGTERM signal (or another signal) to the st2actionrunner process so it'll stop picking up new jobs and finish the old one? Or do we need something more advanced, like a new API endpoint to drain the st2actionrunner?

guzzijones commented 2 years ago

It looks like a SIGTERM is all that is needed. Then the st2actionrunner will pop the message back for scheduling and die. The only problem is that AWS Dynamic Scaling will immediately kill the VM unless you use boto3's record_lifecycle_action_heartbeat to tell AWS to wait while it is still shutting down the process. I see this as a Python script that would be supplemental and specific to AWS autoscaling. I don't even think it should be part of the core st2 codebase, imo.
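For completeness, the signal side of that supplemental script could be as small as the sketch below (hypothetical, not st2 code): find the st2actionrunner processes via pgrep and send them SIGTERM, while the lifecycle heartbeat loop sketched earlier keeps AWS from killing the VM until they have drained. On packaged installs, stopping the st2actionrunner systemd unit would achieve the same thing.

import os
import signal
import subprocess

# Find the st2actionrunner PIDs and ask them to shut down gracefully.
result = subprocess.run(
    ['pgrep', '-f', 'st2actionrunner'], capture_output=True, text=True
)
for pid in result.stdout.split():
    # On SIGTERM the runner puts its in-flight message back on the queue
    # for rescheduling and exits, per the abandon/teardown code linked above.
    os.kill(int(pid), signal.SIGTERM)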

arm4b commented 2 years ago

Yeah, right. A higher-level orchestrator/logic should give st2actionrunner some time (like terminationGracePeriodSeconds) to finish its work after sending the signal.

In the context of K8s, when the pod is terminated it goes through the following lifecycle:

More: