To reproduce:
1. systemctl restart mongod
2. Observe the same stack trace as above in /var/log/st2/st2scheduler.log.
3. st2 run core.local cmd=ls (note: this will not return, you'll need to ctrl-c out)
4. Confirm the execution is stuck in the requested state with st2 execution list.
5. systemctl restart st2scheduler
6. Restarting st2scheduler makes StackStorm usable again and it starts processing executions.

Just wanted to echo this issue and say that we've seen the same thing, and also noticed that st2scheduler in particular seems to be the culprit. We bounced each of the containers in our 1ppc setup and st2scheduler being recreated seemed to make this issue disappear.
@trstruth can you confirm if you're seeing the same stack trace in the log file for st2scheduler?
@nmaludy The trace I see in /var/log/st2/st2scheduler.log is:
2019-02-07 01:28:50,730 140455903837072 ERROR traceback [-] Traceback (most recent call last):
2019-02-07 01:28:50,730 140455903837072 ERROR traceback [-] File "/opt/stackstorm/st2/local/lib/python2.7/site-packages/eventlet/hubs/poll.py", line 109, in wait
2019-02-07 01:28:50,731 140455903837072 ERROR traceback [-] listener.cb(fileno)
2019-02-07 01:28:50,731 140455903837072 ERROR traceback [-] File "/opt/stackstorm/st2/local/lib/python2.7/site-packages/eventlet/greenthread.py", line 219, in main
2019-02-07 01:28:50,733 140455903837072 ERROR traceback [-] result = function(*args, **kwargs)
2019-02-07 01:28:50,734 140455903837072 ERROR traceback [-] File "/opt/stackstorm/st2/local/lib/python2.7/site-packages/st2actions/scheduler/handler.py", line 77, in cleanup
2019-02-07 01:28:50,740 140455903837072 ERROR traceback [-] self._handle_garbage_collection()
2019-02-07 01:28:50,740 140455903837072 ERROR traceback [-] File "/opt/stackstorm/st2/local/lib/python2.7/site-packages/st2actions/scheduler/handler.py", line 94, in _handle_garbage_collection
2019-02-07 01:28:50,741 140455903837072 ERROR traceback [-] execution_queue_item_dbs = ActionExecutionSchedulingQueue.query(**query) or []
2019-02-07 01:28:50,741 140455903837072 ERROR traceback [-] File "/opt/stackstorm/st2/local/lib/python2.7/site-packages/mongoengine/queryset/base.py", line 211, in __bool__
2019-02-07 01:28:50,742 140455903837072 ERROR traceback [-] return self._has_data()
2019-02-07 01:28:50,742 140455903837072 ERROR traceback [-] File "/opt/stackstorm/st2/local/lib/python2.7/site-packages/mongoengine/queryset/base.py", line 207, in _has_data
2019-02-07 01:28:50,743 140455903837072 ERROR traceback [-] return False if queryset.first() is None else True
2019-02-07 01:28:50,743 140455903837072 ERROR traceback [-] File "/opt/stackstorm/st2/local/lib/python2.7/site-packages/mongoengine/queryset/base.py", line 294, in first
2019-02-07 01:28:50,744 140455903837072 ERROR traceback [-] result = queryset[0]
2019-02-07 01:28:50,744 140455903837072 ERROR traceback [-] File "/opt/stackstorm/st2/local/lib/python2.7/site-packages/mongoengine/queryset/base.py", line 194, in __getitem__
2019-02-07 01:28:50,744 140455903837072 ERROR traceback [-] queryset._cursor[key],
2019-02-07 01:28:50,744 140455903837072 ERROR traceback [-] File "/opt/stackstorm/st2/local/lib/python2.7/site-packages/pymongo/cursor.py", line 610, in __getitem__
2019-02-07 01:28:50,746 140455903837072 ERROR traceback [-] for doc in clone:
2019-02-07 01:28:50,746 140455903837072 ERROR traceback [-] File "/opt/stackstorm/st2/local/lib/python2.7/site-packages/pymongo/cursor.py", line 1189, in next
2019-02-07 01:28:50,747 140455903837072 ERROR traceback [-] if len(self.__data) or self._refresh():
2019-02-07 01:28:50,747 140455903837072 ERROR traceback [-] File "/opt/stackstorm/st2/local/lib/python2.7/site-packages/pymongo/cursor.py", line 1104, in _refresh
2019-02-07 01:28:50,747 140455903837072 ERROR traceback [-] self.__send_message(q)
2019-02-07 01:28:50,748 140455903837072 ERROR traceback [-] File "/opt/stackstorm/st2/local/lib/python2.7/site-packages/pymongo/cursor.py", line 931, in __send_message
2019-02-07 01:28:50,748 140455903837072 ERROR traceback [-] operation, exhaust=self.__exhaust, address=self.__address)
2019-02-07 01:28:50,748 140455903837072 ERROR traceback [-] File "/opt/stackstorm/st2/local/lib/python2.7/site-packages/pymongo/mongo_client.py", line 1145, in _send_message_with_response
2019-02-07 01:28:50,749 140455903837072 ERROR traceback [-] exhaust)
2019-02-07 01:28:50,749 140455903837072 ERROR traceback [-] File "/opt/stackstorm/st2/local/lib/python2.7/site-packages/pymongo/mongo_client.py", line 1156, in _reset_on_error
2019-02-07 01:28:50,750 140455903837072 ERROR traceback [-] return func(*args, **kwargs)
2019-02-07 01:28:50,751 140455903837072 ERROR traceback [-] File "/opt/stackstorm/st2/local/lib/python2.7/site-packages/pymongo/server.py", line 106, in send_message_with_response
2019-02-07 01:28:50,751 140455903837072 ERROR traceback [-] reply = sock_info.receive_message(request_id)
2019-02-07 01:28:50,752 140455903837072 ERROR traceback [-] File "/opt/stackstorm/st2/local/lib/python2.7/site-packages/pymongo/pool.py", line 612, in receive_message
2019-02-07 01:28:50,752 140455903837072 ERROR traceback [-] self._raise_connection_failure(error)
2019-02-07 01:28:50,753 140455903837072 ERROR traceback [-] File "/opt/stackstorm/st2/local/lib/python2.7/site-packages/pymongo/pool.py", line 745, in _raise_connection_failure
2019-02-07 01:28:50,753 140455903837072 ERROR traceback [-] raise error
2019-02-07 01:28:50,754 140455903837072 ERROR traceback [-] AutoReconnect: connection closed
2019-02-07 01:28:50,754 140455903837072 ERROR hub [-] Removing descriptor: 16
2019-02-07 01:28:50,759 140455903837072 ERROR traceback [-] Traceback (most recent call last):
2019-02-07 01:28:50,760 140455903837072 ERROR traceback [-] File "/opt/stackstorm/st2/local/lib/python2.7/site-packages/eventlet/hubs/poll.py", line 109, in wait
2019-02-07 01:28:50,760 140455903837072 ERROR traceback [-] listener.cb(fileno)
2019-02-07 01:28:50,760 140455903837072 ERROR traceback [-] File "/opt/stackstorm/st2/local/lib/python2.7/site-packages/eventlet/greenthread.py", line 219, in main
2019-02-07 01:28:50,761 140455903837072 ERROR traceback [-] result = function(*args, **kwargs)
2019-02-07 01:28:50,761 140455903837072 ERROR traceback [-] File "/opt/stackstorm/st2/local/lib/python2.7/site-packages/st2actions/scheduler/handler.py", line 67, in run
2019-02-07 01:28:50,761 140455903837072 ERROR traceback [-] execution_queue_item_db = self._get_next_execution()
2019-02-07 01:28:50,761 140455903837072 ERROR traceback [-] File "/opt/stackstorm/st2/local/lib/python2.7/site-packages/st2actions/scheduler/handler.py", line 125, in _get_next_execution
2019-02-07 01:28:50,762 140455903837072 ERROR traceback [-] execution_queue_item_db = ActionExecutionSchedulingQueue.query(**query).first()
2019-02-07 01:28:50,762 140455903837072 ERROR traceback [-] File "/opt/stackstorm/st2/local/lib/python2.7/site-packages/mongoengine/queryset/base.py", line 294, in first
2019-02-07 01:28:50,762 140455903837072 ERROR traceback [-] result = queryset[0]
2019-02-07 01:28:50,762 140455903837072 ERROR traceback [-] File "/opt/stackstorm/st2/local/lib/python2.7/site-packages/mongoengine/queryset/base.py", line 194, in __getitem__
2019-02-07 01:28:50,762 140455903837072 ERROR traceback [-] queryset._cursor[key],
2019-02-07 01:28:50,762 140455903837072 ERROR traceback [-] File "/opt/stackstorm/st2/local/lib/python2.7/site-packages/pymongo/cursor.py", line 610, in __getitem__
2019-02-07 01:28:50,763 140455903837072 ERROR traceback [-] for doc in clone:
2019-02-07 01:28:50,763 140455903837072 ERROR traceback [-] File "/opt/stackstorm/st2/local/lib/python2.7/site-packages/pymongo/cursor.py", line 1189, in next
2019-02-07 01:28:50,763 140455903837072 ERROR traceback [-] if len(self.__data) or self._refresh():
2019-02-07 01:28:50,763 140455903837072 ERROR traceback [-] File "/opt/stackstorm/st2/local/lib/python2.7/site-packages/pymongo/cursor.py", line 1104, in _refresh
2019-02-07 01:28:50,763 140455903837072 ERROR traceback [-] self.__send_message(q)
2019-02-07 01:28:50,763 140455903837072 ERROR traceback [-] File "/opt/stackstorm/st2/local/lib/python2.7/site-packages/pymongo/cursor.py", line 931, in __send_message
2019-02-07 01:28:50,764 140455903837072 ERROR traceback [-] operation, exhaust=self.__exhaust, address=self.__address)
2019-02-07 01:28:50,764 140455903837072 ERROR traceback [-] File "/opt/stackstorm/st2/local/lib/python2.7/site-packages/pymongo/mongo_client.py", line 1145, in _send_message_with_response
2019-02-07 01:28:50,764 140455903837072 ERROR traceback [-] exhaust)
2019-02-07 01:28:50,764 140455903837072 ERROR traceback [-] File "/opt/stackstorm/st2/local/lib/python2.7/site-packages/pymongo/mongo_client.py", line 1156, in _reset_on_error
2019-02-07 01:28:50,764 140455903837072 ERROR traceback [-] return func(*args, **kwargs)
2019-02-07 01:28:50,765 140455903837072 ERROR traceback [-] File "/opt/stackstorm/st2/local/lib/python2.7/site-packages/pymongo/server.py", line 106, in send_message_with_response
2019-02-07 01:28:50,765 140455903837072 ERROR traceback [-] reply = sock_info.receive_message(request_id)
2019-02-07 01:28:50,765 140455903837072 ERROR traceback [-] File "/opt/stackstorm/st2/local/lib/python2.7/site-packages/pymongo/pool.py", line 612, in receive_message
2019-02-07 01:28:50,766 140455903837072 ERROR traceback [-] self._raise_connection_failure(error)
2019-02-07 01:28:50,766 140455903837072 ERROR traceback [-] File "/opt/stackstorm/st2/local/lib/python2.7/site-packages/pymongo/pool.py", line 745, in _raise_connection_failure
2019-02-07 01:28:50,766 140455903837072 ERROR traceback [-] raise error
2019-02-07 01:28:50,766 140455903837072 ERROR traceback [-] AutoReconnect: connection closed
2019-02-07 01:28:50,766 140455903837072 ERROR hub [-] Removing descriptor: 14
Awesome, same but different. It looks like a MongoDB error, a different one, but when it happens it kills the scheduler service.
When the scheduler is started we start an instance of entrypoint and handler. entrypoint is event driven, based on events from the bus. handler is not; it is a loop that handles actions from the scheduler queue collection in Mongo. It looks like if that thread falls over for any reason it puts the scheduler into a state where the PID is still operational, but the thread that processes new events is dead. Two things to note from that:
1. We should try to restart that thread when it dies.
2. If we can't restart it after n attempts, we need to exit the entire process with an error message stating the inability to restart the thread.

We can add a try-catch in the while loop of the run method of the scheduler handler to catch pymongo.errors.AutoReconnect, which is the superclass of NotMasterError and other MongoDB-related connection errors. This should stop the while loop from exiting, and it will keep retrying until the MongoDB connection is restored. @Kami Can you hold off on the 2.10.2 patch release until this is fixed? I'll work on this Monday.
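For illustration only, here is a minimal sketch of the kind of guard described above. The class and method names are hypothetical stand-ins, not the actual st2actions scheduler code:

```python
import time

from pymongo.errors import AutoReconnect


class SchedulerHandlerSketch(object):
    """Hypothetical stand-in for the scheduler handler loop, not the real st2 class."""

    def __init__(self, retry_delay=2.0, poll_interval=0.1):
        self._shutdown = False
        self._retry_delay = retry_delay
        self._poll_interval = poll_interval

    def run(self):
        while not self._shutdown:
            try:
                item = self._get_next_execution()
            except AutoReconnect:
                # The MongoDB connection dropped (AutoReconnect is also the parent
                # class of NotMasterError). Instead of letting the exception kill
                # the green thread, wait and loop again until Mongo is back.
                time.sleep(self._retry_delay)
                continue

            if item is None:
                time.sleep(self._poll_interval)
                continue

            self._handle_execution(item)

    def _get_next_execution(self):
        # Placeholder: the real handler queries the scheduling queue collection.
        return None

    def _handle_execution(self, item):
        # Placeholder: the real handler dispatches the queued execution.
        pass
```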
@m4dcoder Ack.
We should also figure out and document (in the developer docs) a consistent way of handling such errors in all the services - e.g. retry up to X times and then "fail loudly" (aka exit with a non-zero code) so it works well with service managers and environments like Kubernetes, which handle things such as service restarts and failovers.
Yeah, I even think systemd handles restarts similarly, so having the service exit with an error and be restarted by whatever scheduler is managing it should be reasonable behavior.
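For reference, this is roughly what that looks like with systemd; the fragment below is illustrative only (the ExecStart path and values are assumptions, not the packaged st2scheduler unit):

```ini
# Illustrative fragment only - not the packaged st2scheduler unit file.
[Service]
ExecStart=/opt/stackstorm/st2/bin/st2scheduler --config-file /etc/st2/st2.conf
# If the process exits with a non-zero code, systemd restarts it after a short delay.
Restart=on-failure
RestartSec=5
```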
Another thing worth noting is that we have two types of services - "simple" services and "composite" / "container" services.
Simple services serve a single purpose, and when a fatal exception happens we should propagate it, exit, and allow the service manager / Kubernetes to handle restart and failover.
The other case is the "container" service - currently we only have one, and that is st2sensorcontainer. This service runs multiple sensors, each in a dedicated subprocess. If a single sensor process dies, the sensor container will try to restart it X times before giving up.
In this "container" scenario it makes sense to try to restart the sensor subprocess, and it doesn't make sense for st2sensorcontainer to consider such errors as fatal and exit itself - that would negatively affect the other running sensors. And we have a special "single sensor mode" to be used in cluster-managed deployments such as Kubernetes anyway.
I would argue that st2scheduler is a "simple" service, so we shouldn't do any retries on fatal errors; we should simply propagate them, exit, and let the service manager handle it.
You could argue that it's a composite service since it has the "handler" and "entrypoint" threads, but those are tightly coupled (per @bigmstone's comment above), so I think we should simply consider any error as fatal and exit.
@Kami st2scheduler is a simple service per your definition. My plan here is to retry on specific errors, like connection-related ones. If the retries are exhausted or if it's an unknown exception, we'll exit and return an error code.
@Kami The WIP PR for the fix is at https://github.com/StackStorm/st2/pull/4543. I'm going to let you review it first before I spend time on unit tests. Basically, I added retries to the handler methods and then made the handler exit properly.
Also, I manually tested this. I stopped and restarted MongoDB to let the retries take care of recovery. I also waited for the retries to exhaust to confirm the scheduler exits properly.
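To make the intended "retry, then fail loudly" behavior concrete, here is a rough sketch of that pattern. It is not the code from the PR above; the function name and retry limits are made up for illustration:

```python
import sys
import time

from pymongo.errors import AutoReconnect

# Illustrative limits; the real values live in the PR / st2 configuration.
MAX_CONNECTION_RETRIES = 10
RETRY_DELAY_SECONDS = 5


def run_with_retries(handler_step):
    """Call handler_step() in a loop, retrying on MongoDB connection errors.

    Exits the process with a non-zero code once retries are exhausted or an
    unexpected exception is raised, so systemd / Kubernetes can restart us.
    """
    retries = 0
    while True:
        try:
            handler_step()
            retries = 0  # a successful iteration resets the retry budget
        except AutoReconnect as e:
            retries += 1
            if retries > MAX_CONNECTION_RETRIES:
                sys.stderr.write('Giving up after %d MongoDB reconnect attempts: %s\n'
                                 % (retries, e))
                sys.exit(1)
            time.sleep(RETRY_DELAY_SECONDS)
        except Exception as e:
            # Unknown error: fail loudly and let the service manager restart us.
            sys.stderr.write('Unexpected error in scheduler handler: %s\n' % e)
            sys.exit(1)
```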
SUMMARY
Earlier this week we noticed that our instance had a bunch of executions stuck in the requested state and none were ever being executed. Restarting the host resolved the problem.
ISSUE TYPE
STACKSTORM VERSION
OS / ENVIRONMENT / INSTALL METHOD
STEPS TO REPRODUCE
We've been unable to come up with steps that reproduce the issue consistently. It has been noticed, usually overnight, that cron rules were not producing the expected output (emails and tickets). These executions were stuck in the requested state.
EXPECTED RESULTS
Executions run when invoked.
Logs
As stated above, restarting the host fixed the issue. Unfortunately, the engineer who found the issue didn't debug any further to try restarting individual services or pinpoint which service was at fault.
Post-mortem, we found the following in /var/log/st2/st2scheduler.log:
It appears that st2scheduler encountered an exception and the greenthread died, which caused the service to no longer function. In the next log file (after rotation) you can see when we fixed the issue: