StackStorm / st2

StackStorm (aka "IFTTT for Ops") is event-driven automation for auto-remediation, incident responses, troubleshooting, deployments, and more for DevOps and SREs. Includes rules engine, workflow, 160 integration packs with 6000+ actions (see https://exchange.stackstorm.org) and ChatOps. Installer at https://docs.stackstorm.com/install/index.html
https://stackstorm.com/
Apache License 2.0

Add support for RabbitMQ heartbeats in st2.conf #4780

Open arm4b opened 5 years ago

arm4b commented 5 years ago

I'm experiencing MQ issues in an ST2 HA environment where st2 services can be rescheduled, restarted, or killed at random, which is normal in K8s. For example, after some time RabbitMQ can report more consumers than expected for some queues, which is undesirable.

This probably points to issues in the st2 code and in how clients interact with the message bus. It's not clear whether the duplicate clients are in some zombie state and actually consuming (and wasting) incoming messages.

To work better with the MQ and to monitor connections, add support for the RabbitMQ heartbeat setting in st2.conf. This way, a client that fails to respond to heartbeats within the configured interval is forcefully disconnected by the server.

See https://www.rabbitmq.com/heartbeats.html

This will improve overall ST2 HA capabilities.

arm4b commented 5 years ago

kombu supports that configuration, and we'll need to expose it in st2.conf. However, more testing with heartbeats enabled is required to make sure the way st2 uses the RabbitMQ client is correct.

In a quick dev/testing environment where the MQ heartbeat was enabled via the URI connection string (https://www.rabbitmq.com/uri-query-parameters.html), the components st2actionrunner, st2scheduler, st2workflowengine, and st2notifier (which are all part of the common https://github.com/StackStorm/st2/tree/master/st2actions group) get disconnected by the heartbeat mechanism while they are in a running state, although they respond to heartbeats fine while idle. This looks like some kind of concurrency/threading/pool starvation.
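For anyone trying to reproduce the setup above, here is a minimal sketch of adding a `heartbeat` query parameter to an AMQP connection URL using only the Python standard library. The helper name `with_heartbeat` and the example URL are illustrative assumptions, not part of st2; how the resulting URL is consumed depends on the client library.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_heartbeat(amqp_url: str, seconds: int) -> str:
    """Return amqp_url with its `heartbeat` query parameter set to `seconds`.

    Hypothetical helper for experiments; it only rewrites the URL string.
    """
    if seconds < 0:
        raise ValueError("heartbeat must be >= 0")
    parts = urlsplit(amqp_url)
    # Keep every existing query parameter except a previous heartbeat value.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "heartbeat"
    ]
    query.append(("heartbeat", str(seconds)))
    return urlunsplit(parts._replace(query=urlencode(query)))


# Example URL is a placeholder for a local test broker.
print(with_heartbeat("amqp://guest:guest@127.0.0.1:5672/", 30))
# prints amqp://guest:guest@127.0.0.1:5672/?heartbeat=30
```

The resulting string could then be tried as the messaging URL in a test environment to see whether the disconnects described above reproduce.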

igcherkaev commented 4 years ago

Sounds like, if the heartbeat is enabled and negotiated with the server, we need a thread in each st2 component that iterates over all established RabbitMQ connections and regularly calls the heartbeat_check() method on each.
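A rough sketch of that idea, under stated assumptions: the class name `HeartbeatTicker` is hypothetical, and the connections are duck-typed objects exposing a `heartbeat_check(rate=...)` method, as kombu connections do. It uses a plain OS thread; if a service runs on green threads, a different primitive may be needed, and this does not address the starvation seen while a component is busy.

```python
import threading


class HeartbeatTicker(threading.Thread):
    """Periodically call heartbeat_check() on a set of connections.

    Hypothetical helper, not st2 code. `interval` is the negotiated
    heartbeat in seconds; the check runs every interval / rate seconds.
    """

    def __init__(self, connections, interval, rate=2):
        super().__init__(daemon=True)
        self._connections = list(connections)
        self._rate = rate
        self._sleep = interval / rate
        self._stop_event = threading.Event()
        self.failed = []  # (connection, exception) pairs

    def run(self):
        # Event.wait() returns False on timeout, True once stop() is called.
        while not self._stop_event.wait(self._sleep):
            for conn in list(self._connections):
                try:
                    conn.heartbeat_check(rate=self._rate)
                except Exception as exc:
                    # Record the failure and stop checking this connection;
                    # the owner decides whether to reconnect.
                    self.failed.append((conn, exc))
                    self._connections.remove(conn)

    def stop(self):
        self._stop_event.set()
```

With kombu, the connections passed in would presumably be `kombu.Connection` objects opened with a matching heartbeat value; a component would start the ticker after connecting and call `stop()` on shutdown.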

magiceses commented 3 years ago

Is there any progress on this? I ran into a problem that is probably caused by the missing heartbeat. Because st2's kombu connections do not use a heartbeat, stale state may be left on the MQ server side, leading to errors such as "ResourceLocked: Queue.declare: (405) RESOURCE_LOCKED - cannot obtain exclusive access to locked queue 'st2.trigger.watch.St2Timer-ddeda79de6' in vhost '/'. It could be originally declared on another connection or the exclusive property value does not match that of the original declaration."

arm4b commented 3 years ago

I’m not aware of anyone working on this.

If anyone is interested in digging into this problem and helping with the implementation, code contributions are welcome!