StackStorm / st2

StackStorm (aka "IFTTT for Ops") is event-driven automation for auto-remediation, incident responses, troubleshooting, deployments, and more for DevOps and SREs. Includes rules engine, workflow, 160 integration packs with 6000+ actions (see https://exchange.stackstorm.org) and ChatOps. Installer at https://docs.stackstorm.com/install/index.html
https://stackstorm.com/
Apache License 2.0
6.05k stars, 745 forks

Executions cannot proceed due to errors in st2actionrunner and st2scheduler #5483

Closed (yypptest closed 2 years ago)

yypptest commented 2 years ago

SUMMARY

We picked up StackStorm v3.6.0 and deployed it on Kubernetes. The pods start normally, but executions always get stuck in the requested status. The st2scheduler pod shows the log messages below, and st2actionrunner shows similar exceptions.

2021-12-06 06:42:44,326 DEBUG [-] Using cached coordinator instance: <tooz.drivers.etcd.EtcdDriver object at 0x7f68890b6b70>
2021-12-06 06:42:44,327 ERROR [-] Traceback (most recent call last):

2021-12-06 06:42:44,327 ERROR [-]
2021-12-06 06:42:44,328 ERROR [-]   File "/opt/stackstorm/st2/lib/python3.6/site-packages/eventlet/hubs/poll.py", line 111, in wait
    listener.cb(fileno)

2021-12-06 06:42:44,328 ERROR [-]
2021-12-06 06:42:44,328 ERROR [-]   File "/opt/stackstorm/st2/lib/python3.6/site-packages/eventlet/greenthread.py", line 221, in main
    result = function(*args, **kwargs)

2021-12-06 06:42:44,328 ERROR [-]
2021-12-06 06:42:44,328 ERROR [-]   File "/opt/stackstorm/st2/lib/python3.6/site-packages/st2common/metrics/base.py", line 216, in wrapper
    return func(*args, **kw)

2021-12-06 06:42:44,328 ERROR [-]
2021-12-06 06:42:44,328 ERROR [-]   File "/opt/stackstorm/st2/lib/python3.6/site-packages/st2actions/scheduler/handler.py", line 314, in _handle_execution
    self._schedule(liveaction_db, execution_queue_item_db)

2021-12-06 06:42:44,328 ERROR [-]
2021-12-06 06:42:44,328 ERROR [-]   File "/opt/stackstorm/st2/lib/python3.6/site-packages/st2actions/scheduler/handler.py", line 422, in _schedule
    self._update_to_scheduled(liveaction_db, execution_queue_item_db)

2021-12-06 06:42:44,328 ERROR [-]
2021-12-06 06:42:44,328 ERROR [-]   File "/opt/stackstorm/st2/lib/python3.6/site-packages/st2actions/scheduler/handler.py", line 481, in _update_to_scheduled
    publish=False,

2021-12-06 06:42:44,328 ERROR [-]
2021-12-06 06:42:44,328 ERROR [-]   File "/opt/stackstorm/st2/lib/python3.6/site-packages/st2common/services/action.py", line 236, in update_status
    liveaction, set_result_size=set_result_size

2021-12-06 06:42:44,328 ERROR [-]
2021-12-06 06:42:44,328 ERROR [-]   File "/opt/stackstorm/st2/lib/python3.6/site-packages/st2common/services/executions.py", line 199, in update_execution
    with coordination.get_coordinator().get_lock(liveaction_db.id):

2021-12-06 06:42:44,328 ERROR [-]
2021-12-06 06:42:44,329 ERROR [-]   File "/opt/stackstorm/st2/lib/python3.6/site-packages/tooz/drivers/etcd.py", line 255, in get_lock
    return EtcdLock(self.lock_encoder.check_and_encode(name), name,
2021-12-06 06:42:44,329 ERROR [-]
2021-12-06 06:42:44,329 ERROR [-]   File "/opt/stackstorm/st2/lib/python3.6/site-packages/tooz/utils.py", line 40, in check_and_encode
    " or binary type and not %s" % type(name))

2021-12-06 06:42:44,329 ERROR [-]
2021-12-06 06:42:44,329 ERROR [-] TypeError: Provided lock name is expected to be a string or binary type and not <class 'bson.objectid.ObjectId'>

STACKSTORM VERSION

st2: 3.6.0

OS, environment, install method

Deployed with the Helm chart to OCP 4.8
Coordinator: etcd

Steps to reproduce the problem

  1. Deploy StackStorm 3.6.0 with the Helm chart to OCP 4.8
  2. All pods start normally
  3. Run basic actions; all of them get stuck in the requested status
  4. Check the logs and find the error TypeError: Provided lock name is expected to be a string or binary type and not <class 'bson.objectid.ObjectId'>

Expected Results

Actions should proceed normally even with the etcd coordinator.

Actual Results

An error is reported when actions are processed.

Making sure to follow these steps will guarantee the quickest resolution possible.

Thanks!

yypptest commented 2 years ago

The exception seems to come from code that is new in v3.6.0, in the file /opt/stackstorm/st2/lib/python3.6/site-packages/st2common/services/executions.py, at the call with coordination.get_coordinator().get_lock(liveaction_db.id)
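
To make the failure mode concrete, here is a minimal, self-contained sketch of what the traceback suggests is happening. It does not import tooz or bson: FakeObjectId is a hypothetical stand-in for bson.objectid.ObjectId, and check_and_encode is a simplified imitation of the type check in tooz/utils.py (the real encoder does more than this). Converting the ID with str() before requesting the lock is one plausible workaround, not a confirmed fix.

```python
class FakeObjectId:
    """Hypothetical stand-in for bson.objectid.ObjectId (illustration only)."""

    def __init__(self, hex_id):
        self._hex_id = hex_id

    def __str__(self):
        return self._hex_id


def check_and_encode(name):
    """Simplified imitation of the lock-name type check seen in the traceback."""
    if not isinstance(name, (str, bytes)):
        raise TypeError(
            "Provided lock name is expected to be a string"
            " or binary type and not %s" % type(name)
        )
    # Assumption: plain UTF-8 encoding; the real tooz encoder differs.
    return name.encode("utf-8") if isinstance(name, str) else name


# Made-up ID value, used only for this example.
liveaction_id = FakeObjectId("61adb0f4a1b2c3d4e5f60718")

# Passing the raw object reproduces the TypeError from the log.
try:
    check_and_encode(liveaction_id)
    raised = False
except TypeError:
    raised = True

# Converting to a string first satisfies the type check.
encoded = check_and_encode(str(liveaction_id))
```

If this reading is right, the etcd driver rejects any lock name that is not a string or bytes, so the ID would need to be converted before the get_lock call.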

cognifloyd commented 2 years ago

I'm not in a position to test this, but I'll take a glance at the code. Here's the line you identified: https://github.com/StackStorm/st2/blob/20045dc009f9168373e73e08d3e242985c5c45ed/st2common/st2common/services/executions.py#L199 Which was modified in:

What version of mongo are you using?

yypptest commented 2 years ago

db version is v4.0.27, thanks

arm4b commented 2 years ago

@yypptest Could you please also try Redis as the coordination backend to see whether the same error persists?
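
For anyone trying this, a small sketch of how to check which coordination backend a given st2.conf points at. It assumes the backend is set through the url option in the [coordination] section and that the URL scheme determines the driver. The Redis hostname below is a made-up placeholder, and coordination_backend is a hypothetical helper, not part of st2.

```python
import configparser
from urllib.parse import urlparse

# Example st2.conf fragment; the hostname is a placeholder, not a real endpoint.
ST2_CONF = """
[coordination]
url = redis://redis.example.internal:6379
"""


def coordination_backend(conf_text):
    """Return the URL scheme of the configured coordination backend."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    url = parser.get("coordination", "url")
    return urlparse(url).scheme


backend = coordination_backend(ST2_CONF)
print(backend)
```

With the Helm chart, the equivalent setting would normally be supplied through the chart's st2 configuration values rather than by editing the file in the pod.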