Closed: anrajme closed this issue 1 year ago
Somewhat similar: https://github.com/StackStorm/st2/issues/4716 It's an issue with how the StackStorm engine itself handles the sudden stop of actionrunners that were running tasks in the workflow.
Thanks @armab. I have updated the original issue https://github.com/StackStorm/st2/issues/4716. Fixing this looks like it's going to be a game-changing requirement, especially in a K8s HA environment, where node/pod kills and restarts are considerably more frequent than in the traditional deployment model.
Closing as a duplicate of StackStorm/st2#4716
Hi there -
We had a few issues recently while the underlying K8s nodes scaled down. During such an event the pods are evicted (killed and recreated on another node), which is expected. However, stackstorm-ha ran into a few problems. Initially it was with the StatefulSet, where RabbitMQ node failures caused executions to be stuck in a "Scheduled" status forever. I'm trying to get rid of that problem by moving the RabbitMQ service to a managed cloud provider.
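For anyone hitting the same RabbitMQ symptom before moving to a managed broker: a PodDisruptionBudget might at least limit how many broker pods a scale-down can take out at once. A minimal sketch, assuming the chart labels its RabbitMQ pods with `app: rabbitmq` and `release: st2` (both assumptions; verify with `kubectl get pods --show-labels`):

```yaml
# Hypothetical PodDisruptionBudget for the in-cluster RabbitMQ StatefulSet.
# Only guards voluntary disruptions (drains, autoscaler scale-down);
# it does not help with outright node failure.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: st2-rabbitmq-pdb
spec:
  minAvailable: 2          # keep enough RabbitMQ replicas alive for quorum
  selector:
    matchLabels:
      app: rabbitmq        # assumption: match these to your release's labels
      release: st2
```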
Now, the more recent problem is with st2actionrunner, where a pod got evicted while executing a workflow. The execution was marked as "abandoned" and the workflow failed. In this case we still had four other healthy actionrunners running when the one executing the workflow went down.
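A possible stopgap until StackStorm/st2#4716 is addressed is telling the cluster-autoscaler not to evict actionrunner pods when it scales a node down. A sketch of the relevant part of the Deployment manifest (the name and surrounding layout are assumptions about your release):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: st2actionrunner    # assumption: match your release's Deployment name
spec:
  # ...replicas, selector, containers, etc. unchanged...
  template:
    metadata:
      annotations:
        # The cluster-autoscaler skips pods carrying this annotation when
        # choosing a node to scale down, so in-flight workflows aren't
        # evicted. It does NOT protect against node crashes or manual drains.
        cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
```

This only mitigates autoscaler-driven scale-down; it doesn't change how the engine handles an actionrunner that dies mid-workflow.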
Wondering whether this is expected behaviour and considered acceptable for the stackstorm-ha architecture?
cheers!