StackStorm / st2

StackStorm (aka "IFTTT for Ops") is event-driven automation for auto-remediation, incident response, troubleshooting, deployments, and more for DevOps and SREs. It includes a rules engine, workflows, 160 integration packs with 6000+ actions (see https://exchange.stackstorm.org), and ChatOps. Installer at https://docs.stackstorm.com/install/index.html
https://stackstorm.com/
Apache License 2.0

Executions stuck in requested state #4539

Closed nmaludy closed 5 years ago

nmaludy commented 5 years ago
SUMMARY

Earlier this week we noticed that our instance had a bunch of executions stuck in the requested state, and none of them were ever executed. Restarting the host resolved the problem.

ISSUE TYPE
Bug Report
STACKSTORM VERSION
$ st2 --version
st2 2.10.1, on Python 2.7.5
OS / ENVIRONMENT / INSTALL METHOD
$ cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.6 (Maipo)

# install method: puppet-st2

# single host installed "all in one" with NO MongoDB clustering/replica-set
# the MongoDB instance is a single instance that lives on the same host as StackStorm
STEPS TO REPRODUCE

We've been unable to come up with steps that reproduce the issue consistently. It was usually noticed overnight that cron rules were not producing the expected output (emails and tickets); those executions were stuck in the requested state.

EXPECTED RESULTS

Executions run when invoked.

Logs

As stated above, restarting the host fixed the issue. Unfortunately, the engineer who found the problem didn't debug any further to try restarting individual services or pinpoint which service was at fault.

Post-mortem, we found the following in /var/log/st2/st2scheduler.log:


2019-02-04 14:48:45,229 140516293681552 INFO scheduler [-] (PID=8536) Scheduler started.
2019-02-04 14:48:45,418 140516293681552 INFO consumers [-] Starting SchedulerEntrypoint...
2019-02-04 14:48:45,429 140516109106224 INFO mixins [-] Connected to amqp://xxx:**@127.0.0.1:5672//
2019-02-04 14:51:05,599 140516198248688 ERROR traceback [-] Traceback (most recent call last):
2019-02-04 14:51:05,599 140516198248688 ERROR traceback [-]   File "/opt/stackstorm/st2/lib/python2.7/site-packages/eventlet/hubs/poll.py", line 109, in wait
2019-02-04 14:51:05,600 140516198248688 ERROR traceback [-]     listener.cb(fileno)
2019-02-04 14:51:05,600 140516198248688 ERROR traceback [-]   File "/opt/stackstorm/st2/lib/python2.7/site-packages/eventlet/greenthread.py", line 219, in main
2019-02-04 14:51:05,600 140516198248688 ERROR traceback [-]     result = function(*args, **kwargs)
2019-02-04 14:51:05,600 140516198248688 ERROR traceback [-]   File "/opt/stackstorm/st2/lib/python2.7/site-packages/st2actions/scheduler/handler.py", line 67, in run
2019-02-04 14:51:05,601 140516198248688 ERROR traceback [-]     execution_queue_item_db = self._get_next_execution()
2019-02-04 14:51:05,601 140516198248688 ERROR traceback [-]   File "/opt/stackstorm/st2/lib/python2.7/site-packages/st2actions/scheduler/handler.py", line 125, in _get_next_execution
2019-02-04 14:51:05,601 140516198248688 ERROR traceback [-]     execution_queue_item_db = ActionExecutionSchedulingQueue.query(**query).first()
2019-02-04 14:51:05,601 140516198248688 ERROR traceback [-]   File "/opt/stackstorm/st2/lib/python2.7/site-packages/mongoengine/queryset/base.py", line 294, in first
2019-02-04 14:51:05,602 140516198248688 ERROR traceback [-]     result = queryset[0]
2019-02-04 14:51:05,602 140516198248688 ERROR traceback [-]   File "/opt/stackstorm/st2/lib/python2.7/site-packages/mongoengine/queryset/base.py", line 194, in __getitem__
2019-02-04 14:51:05,602 140516198248688 ERROR traceback [-]     queryset._cursor[key],
2019-02-04 14:51:05,602 140516198248688 ERROR traceback [-]   File "/opt/stackstorm/st2/lib/python2.7/site-packages/pymongo/cursor.py", line 610, in __getitem__
2019-02-04 14:51:05,603 140516198248688 ERROR traceback [-]     for doc in clone:
2019-02-04 14:51:05,603 140516198248688 ERROR traceback [-]   File "/opt/stackstorm/st2/lib/python2.7/site-packages/pymongo/cursor.py", line 1189, in next
2019-02-04 14:51:05,603 140516198248688 ERROR traceback [-]     if len(self.__data) or self._refresh():
2019-02-04 14:51:05,603 140516198248688 ERROR traceback [-]   File "/opt/stackstorm/st2/lib/python2.7/site-packages/pymongo/cursor.py", line 1104, in _refresh
2019-02-04 14:51:05,603 140516198248688 ERROR traceback [-]     self.__send_message(q)
2019-02-04 14:51:05,604 140516198248688 ERROR traceback [-]   File "/opt/stackstorm/st2/lib/python2.7/site-packages/pymongo/cursor.py", line 982, in __send_message
2019-02-04 14:51:05,604 140516198248688 ERROR traceback [-]     helpers._check_command_response(first)
2019-02-04 14:51:05,604 140516198248688 ERROR traceback [-]   File "/opt/stackstorm/st2/lib/python2.7/site-packages/pymongo/helpers.py", line 132, in _check_command_response
2019-02-04 14:51:05,604 140516198248688 ERROR traceback [-]     raise NotMasterError(errmsg, response)
2019-02-04 14:51:05,604 140516198248688 ERROR traceback [-] NotMasterError: interrupted at shutdown
2019-02-04 14:51:05,605 140516198248688 ERROR hub [-] Removing descriptor: 9
2019-02-05 03:28:30,413 140516198248688 INFO misc [-] Re-opening log file "/var/log/st2/st2scheduler.log" with mode "a"
2019-02-05 03:28:30,422 140516198248688 INFO misc [-] Re-opening log file "/var/log/st2/st2scheduler.audit.log" with mode "a"

It appears that st2scheduler encountered an exception, the greenthread died, and the service stopped functioning.

In the next log file (after rotation) you can see when we fixed the issue:

2019-02-05 03:28:31,993 140516198248688 INFO misc [-] Re-opening log file "/var/log/st2/st2scheduler.log" with mode "a"
2019-02-05 03:28:31,996 140516198248688 INFO misc [-] Re-opening log file "/var/log/st2/st2scheduler.audit.log" with mode "a"
2019-02-05 10:29:53,767 140516293681552 INFO scheduler [-] (PID=8536) Scheduler stopped.
2019-02-05 10:29:53,767 140516293681552 INFO consumers [-] Shutting down SchedulerEntrypoint...
2019-02-05 10:31:30,168 139946447700368 INFO __init__ [-] Connecting to database "st2" @ "127.0.0.1:27017" as user "xxx".
2019-02-05 10:31:30,769 139946447700368 INFO __init__ [-] Successfully connected to database "st2" @ "127.0.0.1:27017" as user "xxx".
2019-02-05 10:31:46,988 139946447700368 AUDIT runnersregistrar [-] RunnerType updated. RunnerType RunnerTypeDB(d

nmaludy commented 5 years ago

Reproduction steps:

To fix:

Observations:

trstruth commented 5 years ago

Just wanted to echo this issue and say that we've seen the same thing; we also noticed that st2scheduler in particular seems to be the culprit. We bounced each of the containers in our 1ppc setup, and recreating st2scheduler seemed to make the issue disappear.

nmaludy commented 5 years ago

@trstruth can you confirm if you're seeing the same stack trace in the log file for st2scheduler?

trstruth commented 5 years ago

@nmaludy The trace I see in /var/log/st2/st2scheduler.log is

2019-02-07 01:28:50,730 140455903837072 ERROR traceback [-] Traceback (most recent call last):
2019-02-07 01:28:50,730 140455903837072 ERROR traceback [-]   File "/opt/stackstorm/st2/local/lib/python2.7/site-packages/eventlet/hubs/poll.py", line 109, in wait
2019-02-07 01:28:50,731 140455903837072 ERROR traceback [-]     listener.cb(fileno)
2019-02-07 01:28:50,731 140455903837072 ERROR traceback [-]   File "/opt/stackstorm/st2/local/lib/python2.7/site-packages/eventlet/greenthread.py", line 219, in main
2019-02-07 01:28:50,733 140455903837072 ERROR traceback [-]     result = function(*args, **kwargs)
2019-02-07 01:28:50,734 140455903837072 ERROR traceback [-]   File "/opt/stackstorm/st2/local/lib/python2.7/site-packages/st2actions/scheduler/handler.py", line 77, in cleanup
2019-02-07 01:28:50,740 140455903837072 ERROR traceback [-]     self._handle_garbage_collection()
2019-02-07 01:28:50,740 140455903837072 ERROR traceback [-]   File "/opt/stackstorm/st2/local/lib/python2.7/site-packages/st2actions/scheduler/handler.py", line 94, in _handle_garbage_collection
2019-02-07 01:28:50,741 140455903837072 ERROR traceback [-]     execution_queue_item_dbs = ActionExecutionSchedulingQueue.query(**query) or []
2019-02-07 01:28:50,741 140455903837072 ERROR traceback [-]   File "/opt/stackstorm/st2/local/lib/python2.7/site-packages/mongoengine/queryset/base.py", line 211, in __bool__
2019-02-07 01:28:50,742 140455903837072 ERROR traceback [-]     return self._has_data()
2019-02-07 01:28:50,742 140455903837072 ERROR traceback [-]   File "/opt/stackstorm/st2/local/lib/python2.7/site-packages/mongoengine/queryset/base.py", line 207, in _has_data
2019-02-07 01:28:50,743 140455903837072 ERROR traceback [-]     return False if queryset.first() is None else True
2019-02-07 01:28:50,743 140455903837072 ERROR traceback [-]   File "/opt/stackstorm/st2/local/lib/python2.7/site-packages/mongoengine/queryset/base.py", line 294, in first
2019-02-07 01:28:50,744 140455903837072 ERROR traceback [-]     result = queryset[0]
2019-02-07 01:28:50,744 140455903837072 ERROR traceback [-]   File "/opt/stackstorm/st2/local/lib/python2.7/site-packages/mongoengine/queryset/base.py", line 194, in __getitem__
2019-02-07 01:28:50,744 140455903837072 ERROR traceback [-]     queryset._cursor[key],
2019-02-07 01:28:50,744 140455903837072 ERROR traceback [-]   File "/opt/stackstorm/st2/local/lib/python2.7/site-packages/pymongo/cursor.py", line 610, in __getitem__
2019-02-07 01:28:50,746 140455903837072 ERROR traceback [-]     for doc in clone:
2019-02-07 01:28:50,746 140455903837072 ERROR traceback [-]   File "/opt/stackstorm/st2/local/lib/python2.7/site-packages/pymongo/cursor.py", line 1189, in next
2019-02-07 01:28:50,747 140455903837072 ERROR traceback [-]     if len(self.__data) or self._refresh():
2019-02-07 01:28:50,747 140455903837072 ERROR traceback [-]   File "/opt/stackstorm/st2/local/lib/python2.7/site-packages/pymongo/cursor.py", line 1104, in _refresh
2019-02-07 01:28:50,747 140455903837072 ERROR traceback [-]     self.__send_message(q)
2019-02-07 01:28:50,748 140455903837072 ERROR traceback [-]   File "/opt/stackstorm/st2/local/lib/python2.7/site-packages/pymongo/cursor.py", line 931, in __send_message
2019-02-07 01:28:50,748 140455903837072 ERROR traceback [-]     operation, exhaust=self.__exhaust, address=self.__address)
2019-02-07 01:28:50,748 140455903837072 ERROR traceback [-]   File "/opt/stackstorm/st2/local/lib/python2.7/site-packages/pymongo/mongo_client.py", line 1145, in _send_message_with_response
2019-02-07 01:28:50,749 140455903837072 ERROR traceback [-]     exhaust)
2019-02-07 01:28:50,749 140455903837072 ERROR traceback [-]   File "/opt/stackstorm/st2/local/lib/python2.7/site-packages/pymongo/mongo_client.py", line 1156, in _reset_on_error
2019-02-07 01:28:50,750 140455903837072 ERROR traceback [-]     return func(*args, **kwargs)
2019-02-07 01:28:50,751 140455903837072 ERROR traceback [-]   File "/opt/stackstorm/st2/local/lib/python2.7/site-packages/pymongo/server.py", line 106, in send_message_with_response
2019-02-07 01:28:50,751 140455903837072 ERROR traceback [-]     reply = sock_info.receive_message(request_id)
2019-02-07 01:28:50,752 140455903837072 ERROR traceback [-]   File "/opt/stackstorm/st2/local/lib/python2.7/site-packages/pymongo/pool.py", line 612, in receive_message
2019-02-07 01:28:50,752 140455903837072 ERROR traceback [-]     self._raise_connection_failure(error)
2019-02-07 01:28:50,753 140455903837072 ERROR traceback [-]   File "/opt/stackstorm/st2/local/lib/python2.7/site-packages/pymongo/pool.py", line 745, in _raise_connection_failure
2019-02-07 01:28:50,753 140455903837072 ERROR traceback [-]     raise error
2019-02-07 01:28:50,754 140455903837072 ERROR traceback [-] AutoReconnect: connection closed
2019-02-07 01:28:50,754 140455903837072 ERROR hub [-] Removing descriptor: 16
2019-02-07 01:28:50,759 140455903837072 ERROR traceback [-] Traceback (most recent call last):
2019-02-07 01:28:50,760 140455903837072 ERROR traceback [-]   File "/opt/stackstorm/st2/local/lib/python2.7/site-packages/eventlet/hubs/poll.py", line 109, in wait
2019-02-07 01:28:50,760 140455903837072 ERROR traceback [-]     listener.cb(fileno)
2019-02-07 01:28:50,760 140455903837072 ERROR traceback [-]   File "/opt/stackstorm/st2/local/lib/python2.7/site-packages/eventlet/greenthread.py", line 219, in main
2019-02-07 01:28:50,761 140455903837072 ERROR traceback [-]     result = function(*args, **kwargs)
2019-02-07 01:28:50,761 140455903837072 ERROR traceback [-]   File "/opt/stackstorm/st2/local/lib/python2.7/site-packages/st2actions/scheduler/handler.py", line 67, in run
2019-02-07 01:28:50,761 140455903837072 ERROR traceback [-]     execution_queue_item_db = self._get_next_execution()
2019-02-07 01:28:50,761 140455903837072 ERROR traceback [-]   File "/opt/stackstorm/st2/local/lib/python2.7/site-packages/st2actions/scheduler/handler.py", line 125, in _get_next_execution
2019-02-07 01:28:50,762 140455903837072 ERROR traceback [-]     execution_queue_item_db = ActionExecutionSchedulingQueue.query(**query).first()
2019-02-07 01:28:50,762 140455903837072 ERROR traceback [-]   File "/opt/stackstorm/st2/local/lib/python2.7/site-packages/mongoengine/queryset/base.py", line 294, in first
2019-02-07 01:28:50,762 140455903837072 ERROR traceback [-]     result = queryset[0]
2019-02-07 01:28:50,762 140455903837072 ERROR traceback [-]   File "/opt/stackstorm/st2/local/lib/python2.7/site-packages/mongoengine/queryset/base.py", line 194, in __getitem__
2019-02-07 01:28:50,762 140455903837072 ERROR traceback [-]     queryset._cursor[key],
2019-02-07 01:28:50,762 140455903837072 ERROR traceback [-]   File "/opt/stackstorm/st2/local/lib/python2.7/site-packages/pymongo/cursor.py", line 610, in __getitem__
2019-02-07 01:28:50,763 140455903837072 ERROR traceback [-]     for doc in clone:
2019-02-07 01:28:50,763 140455903837072 ERROR traceback [-]   File "/opt/stackstorm/st2/local/lib/python2.7/site-packages/pymongo/cursor.py", line 1189, in next
2019-02-07 01:28:50,763 140455903837072 ERROR traceback [-]     if len(self.__data) or self._refresh():
2019-02-07 01:28:50,763 140455903837072 ERROR traceback [-]   File "/opt/stackstorm/st2/local/lib/python2.7/site-packages/pymongo/cursor.py", line 1104, in _refresh
2019-02-07 01:28:50,763 140455903837072 ERROR traceback [-]     self.__send_message(q)
2019-02-07 01:28:50,763 140455903837072 ERROR traceback [-]   File "/opt/stackstorm/st2/local/lib/python2.7/site-packages/pymongo/cursor.py", line 931, in __send_message
2019-02-07 01:28:50,764 140455903837072 ERROR traceback [-]     operation, exhaust=self.__exhaust, address=self.__address)
2019-02-07 01:28:50,764 140455903837072 ERROR traceback [-]   File "/opt/stackstorm/st2/local/lib/python2.7/site-packages/pymongo/mongo_client.py", line 1145, in _send_message_with_response
2019-02-07 01:28:50,764 140455903837072 ERROR traceback [-]     exhaust)
2019-02-07 01:28:50,764 140455903837072 ERROR traceback [-]   File "/opt/stackstorm/st2/local/lib/python2.7/site-packages/pymongo/mongo_client.py", line 1156, in _reset_on_error
2019-02-07 01:28:50,764 140455903837072 ERROR traceback [-]     return func(*args, **kwargs)
2019-02-07 01:28:50,765 140455903837072 ERROR traceback [-]   File "/opt/stackstorm/st2/local/lib/python2.7/site-packages/pymongo/server.py", line 106, in send_message_with_response
2019-02-07 01:28:50,765 140455903837072 ERROR traceback [-]     reply = sock_info.receive_message(request_id)
2019-02-07 01:28:50,765 140455903837072 ERROR traceback [-]   File "/opt/stackstorm/st2/local/lib/python2.7/site-packages/pymongo/pool.py", line 612, in receive_message
2019-02-07 01:28:50,766 140455903837072 ERROR traceback [-]     self._raise_connection_failure(error)
2019-02-07 01:28:50,766 140455903837072 ERROR traceback [-]   File "/opt/stackstorm/st2/local/lib/python2.7/site-packages/pymongo/pool.py", line 745, in _raise_connection_failure
2019-02-07 01:28:50,766 140455903837072 ERROR traceback [-]     raise error
2019-02-07 01:28:50,766 140455903837072 ERROR traceback [-] AutoReconnect: connection closed
2019-02-07 01:28:50,766 140455903837072 ERROR hub [-] Removing descriptor: 14
nmaludy commented 5 years ago

Awesome, same but different. It looks like a different MongoDB error, but when it happens it still kills the scheduler service.

bigmstone commented 5 years ago

When the scheduler is started, we start an instance of the entrypoint and the handler. The entrypoint is event driven, based on events from the bus. The handler is not: it is a loop that handles actions from the scheduler queue collection in MongoDB. It looks like if that thread falls over for any reason, it puts the scheduler into a state where the PID is still operational but the thread that processes new events is dead. Two things to note from that:

  1. We need some restart logic for that thread (see the sketch after this list).
  2. If we cannot successfully restart that thread after n attempts, we need to exit the entire process with an error message stating that the thread could not be restarted.
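
As a rough, hypothetical sketch of those two points (the names run_handler_loop, MAX_RESTARTS, and the backoff values are made up for illustration; this is not the actual st2 code), the supervision could respawn the greenthread a bounded number of times and then exit non-zero so the service manager takes over:

# Hypothetical sketch only -- names and limits are illustrative, not st2's API.
import sys
import eventlet

MAX_RESTARTS = 5


def supervise(run_handler_loop, max_restarts=MAX_RESTARTS):
    """Run the handler loop in a greenthread and restart it if it dies."""
    restarts = 0
    while True:
        gt = eventlet.spawn(run_handler_loop)
        try:
            gt.wait()  # re-raises any exception that killed the greenthread
            return     # handler loop returned cleanly (e.g. on shutdown)
        except Exception:
            restarts += 1
            if restarts > max_restarts:
                sys.stderr.write('Handler thread died %d times, giving up.\n' % restarts)
                sys.exit(1)  # fail loudly so systemd/Kubernetes restarts the service
            eventlet.sleep(2 ** restarts)  # simple exponential backoff before respawning

In practice the restart counter would probably also be reset once the thread has run cleanly for a while.
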
m4dcoder commented 5 years ago

We can add a try-catch in the while loop of the run method of the scheduler handler to catch pymongo.errors.AutoReconnect, which is the superclass of NotMasterError and other MongoDB-related connection errors. This should stop the while loop from exiting and let it keep retrying until the MongoDB connection is restored. @Kami Can you hold off on the 2.10.2 patch release until this is fixed? I'll work on this Monday.
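
A minimal sketch of that idea, assuming hypothetical helpers get_next_execution() and schedule_item() rather than the real handler API (the actual loop lives in st2actions/scheduler/handler.py):

# Illustrative only; helper names are stand-ins for the real handler methods.
import eventlet
from pymongo.errors import AutoReconnect  # parent class of NotMasterError

RETRY_DELAY = 2  # seconds to sleep after a connection error before retrying


def run_loop(get_next_execution, schedule_item, should_stop=lambda: False):
    while not should_stop():
        try:
            item = get_next_execution()
        except AutoReconnect:
            # Transient MongoDB connection errors ("NotMasterError: interrupted
            # at shutdown", "AutoReconnect: connection closed") should not kill
            # the greenthread; sleep and try again.
            eventlet.sleep(RETRY_DELAY)
            continue
        if item is None:
            eventlet.sleep(0.1)  # queue is empty, poll again shortly
            continue
        schedule_item(item)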

Kami commented 5 years ago

@m4dcoder Ack.

We should also figure out and document (in the developer docs) a consistent way of handling such errors in all the services - e.g. retry up to X times and then "fail loudly" (aka exit with a non-zero status code) so it works well with service managers and environments like Kubernetes which handle things such as service restarts and failovers.

nmaludy commented 5 years ago

Yeah, I think even systemd handles restarts similarly, so having the service exit with an error and be restarted by whatever service manager is running it should be reasonable behavior.

Kami commented 5 years ago

Another thing worth noting is that we have two types of services - "simple" services and "composite" / "container" services.

Simple services serve a single purpose, and when a fatal exception happens, we should propagate it and exit, allowing the service manager / Kubernetes to handle restart and failover.

The other case is the "container" service - currently we only have one and that is st2sensorcontainer. This service runs multiple sensors, each in a dedicated subprocess. If a single sensor process dies, the sensor container will try to restart it up to X times before giving up.

In this "container" scenario it makes sense to try to restart the sensor subprocess, and it doesn't make sense for st2sensorcontainer to treat such errors as fatal and exit itself - that would negatively affect the other running sensors. And we have a special "single sensor mode" to be used in cluster-managed deployments such as Kubernetes anyway.

I would argue that st2scheduler is a "simple" service, so we shouldn't do any retries on fatal errors; we should simply propagate them, exit, and let the service manager handle it.

You could argue that it's a composite service since it has "handler" and "entrypoint" threads, but those are tightly coupled (per @bigmstone's comment above), so I think we should simply consider any error as fatal and exit.

m4dcoder commented 5 years ago

@Kami st2scheduler is a simple service per your definition. My plan here is to retry on specific errors, like connection-related ones. If the retries are exhausted or if it's an unknown exception, we'll exit and return an error code.
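
A minimal sketch of that behavior, with a hypothetical query_with_retries() wrapper and made-up retry limits (not the code from the actual PR): connection errors are retried a bounded number of times, while unknown exceptions and an exhausted retry budget propagate so the process exits with an error.

# Hypothetical illustration; retry counts and names are made up.
import eventlet
from pymongo.errors import AutoReconnect

MAX_RETRIES = 10
RETRY_DELAY = 2  # seconds between attempts


def query_with_retries(query_fn):
    """Retry query_fn on MongoDB connection errors; anything else propagates."""
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            # Unknown exception types are not caught and propagate immediately.
            return query_fn()
        except AutoReconnect:
            if attempt == MAX_RETRIES:
                raise  # retries exhausted: propagate so the service exits non-zero
            eventlet.sleep(RETRY_DELAY)

The handler methods would then run their scheduling-queue queries through such a wrapper.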

m4dcoder commented 5 years ago

@Kami The WIP PR with the fix is at https://github.com/StackStorm/st2/pull/4543. I'm going to let you review it first before I spend time on unit tests. Basically, I added retries to the handler methods and made the handler exit properly.

m4dcoder commented 5 years ago

Also, I manually tested this. I stopped and restarted MongoDB to let the retries take care of recovery. I also waited for the retries to exhaust to confirm that the scheduler exits properly.