StackStorm / st2

StackStorm (aka "IFTTT for Ops") is event-driven automation for auto-remediation, incident responses, troubleshooting, deployments, and more for DevOps and SREs. Includes rules engine, workflow, 160 integration packs with 6000+ actions (see https://exchange.stackstorm.org) and ChatOps. Installer at https://docs.stackstorm.com/install/index.html
https://stackstorm.com/
Apache License 2.0
6.08k stars 747 forks source link

WORKER TIMEOUT in logs, st2 not working #5433

Open amyasnikov opened 2 years ago

amyasnikov commented 2 years ago

Discussed in https://github.com/StackStorm/st2/discussions/5422

Originally posted by **amyasnikov** November 1, 2021 I'm trying to make HA st2 setup without k8s. My setup: - 3 "cluster" nodes with rabbitmq, redis and mongodb on board - 2 app nodes with st2 components. 1 node is active (st2 is in running state) and 1 node in standby (st2 is in stopped state). The first app node works well (st2 login, st2 run, st2 execution get - all commands work fine). The problem arises with second app node when I'm trying to make it active. I stop st2 on first node (st2ctl stop) and the I start st2 on second node(st2ctl start on it). On the second node any st2 command hangs for infinite time without any output (_st2 login_ or _st2 execution list_ or any other). If I do "st2ctl reload" on second node it works well, but st2 still hangs. Meanwhile in logs (st2api.log or st2auth.log) I see messages like this: ``` [2021-11-01 10:59:07 +0000] [9253] [CRITICAL] WORKER TIMEOUT (pid:9496) ``` If I switch back to first node, all works fine again. st2.conf is the same for 2 app nodes. - The network and DNS is fine. I can telnet to all the services (rabbit, mongo, redis) from both app nodes. I also tried mongosh from app node to connection-string from st2.conf, it works well (I tried to connect and read st2 database collections). - datastore_key.json is the same on both nodes - st2admin password is the same on both nodes - St2 installed on Centos 7 via https://docs.stackstorm.com/install/rhel7.html **What can be the reason of that behavior?** PS While preparing this text I have detected another st2auth.log strange message which didn't appear previously: ``` [2021-11-01 10:55:17 +0000] [9084] [ERROR] Socket error processing request. Traceback (most recent call last): File "/opt/stackstorm/st2/lib/python3.6/site-packages/gunicorn/workers/base_async.py", line 65, in handle util.reraise(*sys.exc_info()) File "/opt/stackstorm/st2/lib/python3.6/site-packages/gunicorn/util.py", line 626, in reraise raise value File "/opt/stackstorm/st2/lib/python3.6/site-packages/gunicorn/workers/base_async.py", line 38, in handle listener_name = listener.getsockname() OSError: [Errno 9] Bad file descriptor [2021-11-01 10:55:17 +0000] [9084] [ERROR] Socket error processing request. Traceback (most recent call last): File "/opt/stackstorm/st2/lib/python3.6/site-packages/gunicorn/workers/base_async.py", line 65, in handle util.reraise(*sys.exc_info()) File "/opt/stackstorm/st2/lib/python3.6/site-packages/gunicorn/util.py", line 626, in reraise raise value File "/opt/stackstorm/st2/lib/python3.6/site-packages/gunicorn/workers/base_async.py", line 38, in handle listener_name = listener.getsockname [st2.conf.txt](https://github.com/StackStorm/st2/files/7453224/st2.conf.txt) ( [st2auth.log](https://github.com/StackStorm/st2/files/7453222/st2auth.log) ) OSError: [Errno 9] Bad file descriptor ``` **software versions:** ``` $ st2 --version st2 3.5.0, on Python 3.6.8 $ uname -a Linux node10 3.10.0-1160.11.1.el7.x86_64 #1 SMP Fri Dec 18 16:34:56 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux ```

Logs and st2.conf: st2auth.log st2.conf.txt

stale[bot] commented 2 years ago

Thanks for contributing to this issue. As it has been 90 days since the last activity, we are automatically marking is as stale. If this issue is not relevant or applicable anymore (problem has been fixed in a new version or similar), please close the issue or let us know so we can close it. On the contrary, if the issue is still relevant, there is nothing you need to do, but if you have any additional details or context which would help us when working on this issue, please include it as a comment to this issue.