StackStorm / stackstorm-k8s

K8s Helm Chart that codifies a StackStorm (aka "IFTTT for Ops", https://stackstorm.com/) Highly Available fleet as a simple-to-use, reproducible infrastructure-as-code app
https://helm.stackstorm.com/
Apache License 2.0

Issues while scaling down the nodes #317

Closed · anrajme closed this 1 year ago

anrajme commented 2 years ago

Hi there -

We have had a few issues lately when the underlying K8s nodes scale down. During this event the pods are evicted (killed and recreated on another node), which is expected. However, stackstorm-ha reported a few problems. Initially it was with the StatefulSet, where RabbitMQ node failures caused executions to get stuck in a "scheduled" status forever. I'm trying to get rid of that trouble by shifting the RabbitMQ service to a managed cloud provider.
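For reference, a minimal sketch of what pointing the chart at an external/managed RabbitMQ might look like, assuming the chart's rabbitmq.enabled toggle and its st2.config passthrough into st2.conf (the exact keys should be double-checked against the values.yaml of your chart version; the host and credentials below are placeholders):

# values.yaml override (illustrative sketch, not verbatim chart documentation)
rabbitmq:
  enabled: false              # skip deploying the in-cluster RabbitMQ
st2:
  config: |
    [messaging]
    # point StackStorm at the managed RabbitMQ endpoint
    url = amqp://st2admin:CHANGEME@rabbitmq.example.com:5672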

Now, the recent problem is with st2actionrunner, where the pod got evicted while executing a workflow. The execution was marked as "abandoned" and the workflow run failed.

# st2 execution get 62b019ba420e073fb8f432c3
id: 62b019ba420e073fb8f432c3
action.ref: jira.update_field_value
context.user: xxxxx
parameters:
  field: customfield_14297
  issue_key: xx-96233
  value: Closing Jira 
status: abandoned
start_timestamp: Mon, 20 Jun 2022 06:54:50 UTC
end_timestamp:
log:
  - status: requested
    timestamp: '2022-06-20T06:54:50.171000Z'
  - status: scheduled
    timestamp: '2022-06-20T06:54:50.348000Z'
  - status: running
    timestamp: '2022-06-20T06:54:50.408000Z'
  - status: abandoned
    timestamp: '2022-06-20T06:54:50.535000Z'
result: None
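
As a side note, an abandoned execution stays in that terminal state; assuming the standard st2 CLI is available, it can at least be re-submitted manually with the same parameters, e.g.:

# st2 execution re-run 62b019ba420e073fb8f432c3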

In this case, we still had another four healthy actionrunners running; only the one that was executing the workflow failed.

Wondering whether this is expected behaviour and acceptable for the stackstorm-ha architecture?

cheers!

arm4b commented 2 years ago

Somewhat similar: https://github.com/StackStorm/st2/issues/4716. It's an issue with the StackStorm engine itself handling the sudden stop of actionrunners that were running tasks in a workflow.
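
Until that is handled in the engine itself, one generic Kubernetes-side mitigation (a sketch only, not something this chart configures for you) is to tell the cluster-autoscaler not to evict the actionrunner pods when it scales nodes down, e.g. via the standard safe-to-evict annotation on the pod template (how to inject it depends on your chart version or patching workflow):

# pod template metadata for the st2actionrunner Deployment (illustrative)
metadata:
  annotations:
    # stops the cluster-autoscaler (only) from evicting these pods on scale-down;
    # it does not help with manual drains, spot reclaims or real node failures
    cluster-autoscaler.kubernetes.io/safe-to-evict: "false"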

anrajme commented 2 years ago

Thanks @armab. I have updated the original issue https://github.com/StackStorm/st2/issues/4716. Looks like this is going to be a game-changing requirement, especially in a K8s HA environment, where node/pod kills and restarts are comparatively more frequent than in the traditional deployment model.

cognifloyd commented 1 year ago

Closing as a duplicate of StackStorm/st2#4716