StackStorm / st2

StackStorm (aka "IFTTT for Ops") is event-driven automation for auto-remediation, incident responses, troubleshooting, deployments, and more for DevOps and SREs. Includes rules engine, workflow, 160 integration packs with 6000+ actions (see https://exchange.stackstorm.org) and ChatOps. Installer at https://docs.stackstorm.com/install/index.html
https://stackstorm.com/
Apache License 2.0

Mistral workflow hanging st2api #4492

Closed poojabayana1995 closed 5 years ago

poojabayana1995 commented 5 years ago
SUMMARY

A Mistral workflow gets stuck in the running state, and the st2api instance that Mistral uses for callbacks hangs. We see this when Mistral workflows run continuously for a period of time or when multiple workflows run concurrently. The workflows hang exactly at the task that uses join: all.

Our workbook consists of multiple workflows, where a task in one workflow calls another workflow in parallel and is followed by a task that joins the results. The Mistral configuration has been altered to use MySQL as the backend database instead of the default PostgreSQL.

After st2api is restarted, the actions resume and start working. In a few cases, when multiple requests have piled up, st2api has to be restarted multiple times. The debug logs do not show any errors in any of the action runners, the Mistral server/API, or st2api.
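
Because the hang produces no error in any log, one way to notice it is to probe st2api from outside with a hard timeout. The following is a minimal sketch, not part of StackStorm: the function name `api_is_responsive` and the injectable `opener` argument are hypothetical, and the URL in the usage note assumes st2api listens on its default port 9101.

```python
import socket
import urllib.error
import urllib.request


def api_is_responsive(url, timeout=5.0, opener=urllib.request.urlopen):
    """Return True if `url` answers within `timeout` seconds, else False.

    Any HTTP status (even 401) counts as responsive: a hung process
    sends nothing at all, so only timeouts and connection errors fail.
    `opener` is a hypothetical hook so the probe can be tested offline.
    """
    try:
        with opener(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server replied with an error status, so it is not hung.
        return True
    except (urllib.error.URLError, socket.timeout, OSError):
        # No reply in time, or the connection could not be made.
        return False
```

For example, `api_is_responsive("http://127.0.0.1:9101/v1/executions?limit=1")` could be called periodically from a cron job or monitoring check; the host, port, and query string are assumptions and need to match the actual deployment.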

mistral_test_workflow-master.zip

ISSUE TYPE
STACKSTORM VERSION

st2 version: 2.8.1, Mistral: 2.8.1, OS: CentOS 7

OS / ENVIRONMENT / INSTALL METHOD

Our StackStorm deployment runs in HA across several servers.

STEPS TO REPRODUCE
version: '2.0'
name: 'mistral_test_workflow.test_workflow'
workflows:
    main:
     description: >
        Testing workflow 1
     type: direct

     input:
        - metric_list
        - count

     tasks:

        call_another_workflow:
         with-items:
                - metric in <% $.metric_list %>
         workflow: another_workflow
         input:
                metric: "<% $.metric %>"
                executionID: "<% env().get('__actions').get('st2.action').st2_context.parent.execution_id %>"
         publish:
                task_status: "Succeeded"
         publish-on-error:
                task_status: "Failed"
         on-success: display_result
         on-error: update_details

        display_result:
            join: all
            action: core.remote
            input:
                hosts: "{{ st2kv.system.haas_file_server_name}}"
                cwd: "{{ st2kv.system.cwd}}"
                cmd: "echo 'Remote shell script'"
            publish:
                task_status: "Succeeded"
            publish-on-error:
                task_status: "Failed"

            on-success:
                - update_details
            on-error:
                - update_details
            retry:
                count: 1
                delay: 5

        update_details:
           action: core.local
           input:
                cmd: "echo 'Succeeded'"
           publish:
                task_status: "Succeeded"
           publish-on-error:
                    failure_cause: "<% task(update_details).result %>"
                    task_status: "Failed"
           on-error: check_for_failure

        check_for_failure:
                action: core.local
                input:
                        cmd: "echo 'Action Chain failed with errors'"

    another_workflow:
     description: >
        Another_workflow .
     type: direct

     input:
        - metric
     tasks:
        show_result:
           # with-items:
           #     - monitoring_metric in <% $.monitoring_metric_list %>
            action: core.remote
            input:
                hosts: "{{ st2kv.system.remote_server_name}}"
                cwd: "{{ st2kv.system.cwd}}"
                cmd: "echo 'Another remote action'"
            publish:
                task_status: "Succeeded"
            publish-on-error:
                    task_status: "failed"
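
Since the hang shows up only under concurrent or sustained load, the workflow above has to be launched several times in parallel to reproduce it. The following is a rough sketch of such a launcher, not the reporter's actual test harness: `st2_run` and `run_concurrently` are hypothetical helper names, and the sketch assumes the `st2` CLI is installed and authenticated on the machine running it.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def st2_run(action_ref, params, timeout):
    """Run one action with the `st2 run` CLI.

    Returns the CLI exit code, or the string "timeout" if the command
    did not finish within `timeout` seconds (a candidate hang).
    """
    cmd = ["st2", "run", action_ref] + [f"{k}={v}" for k, v in params.items()]
    try:
        proc = subprocess.run(cmd, capture_output=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return "timeout"
    return proc.returncode


def run_concurrently(run_one, count, workers=4):
    """Call run_one(i) for i in range(count) on a thread pool.

    Results are returned in submission order, so a "timeout" entry
    identifies which launch never completed.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_one, range(count)))
```

A run might look like `run_concurrently(lambda i: st2_run("mistral_test_workflow.test_workflow", {"metric_list": "cpu,memory", "count": 1}, timeout=600), count=20)`. The parameter values and the timeout here are placeholders, not values taken from the report.
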
EXPECTED RESULTS

The expected st2api logs are as follows:

2019-01-08 08:02:58,809 140039102137200 INFO profiling [-] MongoDB query: db.runner_type_d_b.find({'name': u'remote-shell-cmd'}); (mongo_query={'name': u'remote-shell-cmd'},mongo_shell_query="db.runner_type_d_b.find({'name': u'remote-shell-cmd'});") 2019-01-08 08:02:58,812 140039102137200 INFO init [-] RESULT "[<RunnerTypeDB: RunnerTypeDB(description="A remote execution runner that executes actions as a fixed system user.", enabled=True, id=5b8d955cdffbda207949b51c, name="remote-shell-cmd", query_module=None, runner_module="remote_command_runner", runner_package="remote_runner", runner_parameters={u'username': {u'required': False, u'type': u'string', u'description': u'Username used to log-in. If not provided, default username from config is used.'}, u'private_key': {u'secret': True, u'required': False, u'type': u'string', u'description': u'Private key material or path to the private key file on disk used to log in.'}, u'sudo_password': {u'default': None, u'secret': True, u'required': False, u'type': u'string', u'description': u'Sudo password. To be used when paswordless sudo is not allowed.'}, u'env': {u'type': u'object', u'description': u'Environment variables which will be available to the command(e.g. key1=val1,key2=val2)'}, u'sudo': {u'default': False, u'type': u'boolean', u'description': u'The remote command will be executed with sudo.'}, u'kwarg_op': {u'default': u'--', u'type': u'string', u'description': u'Operator to use in front of keyword args i.e. "--" or "-".'}, u'passphrase': {u'secret': True, u'required': False, u'type': u'string', u'description': u'Passphrase for the private key, if needed.'}, u'password': {u'secret': True, u'required': False, u'type': u'string', u'description': u'Password used to log in. If not provided, private key from the config file is used.'}, u'port': {u'default': 22, u'required': False, u'type': u'integer', u'description': u'SSH port. 
Note: This parameter is used only in ParamikoSSHRunner.'}, u'cmd': {u'type': u'string', u'description': u'Arbitrary Linux command to be executed on the remote host(s).'}, u'bastion_host': {u'required': False, u'type': u'string', u'description': u'The host SSH connections will be proxied through. Note: This connection is made using the same parameters as the final connection, and is only used in ParamikoSSHRunner.'}, u'hosts': {u'required': True, u'type': u'string', u'description': u'A comma delimited string of a list of hosts where the remote command will be executed.'}, u'timeout': {u'default': 60, u'type': u'integer', u'description': u"Action timeout in seconds. Action will get killed if it doesn't finish in timeout seconds."}, u'parallel': {u'default': False, u'type': u'boolean', u'description': u'Default to parallel execution.', u'immutable': True}, u'cwd': {u'default': u'/tmp', u'type': u'string', u'description': u'Working directory where the script will be executed in'}, u'dir': {u'default': u'/tmp', u'type': u'string', u'description': u'The working directory where the script will be copied to on the remote host.', u'immutable': True}}, uid="runner_type:remote-shell-cmd")>]"------ 2019-01-08 08:02:58,828 140039102137200 DEBUG router [-] Using response spec "201" for endpoint st2api.controllers.v1.actionexecutions:action_executions_controller.post and status code 201 2019-01-08 08:02:58,829 140039102137200 DEBUG router [-] Match path: /v1/executions 2019-01-08 08:02:58,830 140039102137200 INFO logging [-] e0b22bbe-576f-40dd-a945-3889566a7239 - 201 7025 1870.121ms (content_length=7025,request_id='e0b22bbe-576f-40dd-a945-3889566a7239',runtime=1870.121,remote_addr='x.y.z.w',status=201,method='POST',path='/v1/executions')

ACTUAL RESULTS

These are the last logs before the API hung. No further logs are generated until the API server is restarted.

2019-01-08 08:03:34,304 140039102769744 INFO profiling [-] MongoDB query: db.runner_type_d_b.find({'name': u'remote-shell-cmd'}); (mongo_query={'name': u'remote-shell-cmd'},mongo_shell_query="db.runner_type_d_b.find({'name': u'remote-shell-cmd'});") 2019-01-08 08:03:34,307 140039102769744 INFO init [-] RESULT "[<RunnerTypeDB: RunnerTypeDB(description="A remote execution runner that executes actions as a fixed system user.", enabled=True, id=5b8d955cdffbda207949b51c, name="remote-shell-cmd", query_module=None, runner_module="remote_command_runner", runner_package="remote_runner", runner_parameters={u'username': {u'required': False, u'type': u'string', u'description': u'Username used to log-in. If not provided, default username from config is used.'}, u'private_key': {u'secret': True, u'required': False, u'type': u'string', u'description': u'Private key material or path to the private key file on disk used to log in.'}, u'sudo_password': {u'default': None, u'secret': True, u'required': False, u'type': u'string', u'description': u'Sudo password. To be used when paswordless sudo is not allowed.'}, u'env': {u'type': u'object', u'description': u'Environment variables which will be available to the command(e.g. key1=val1,key2=val2)'}, u'sudo': {u'default': False, u'type': u'boolean', u'description': u'The remote command will be executed with sudo.'}, u'kwarg_op': {u'default': u'--', u'type': u'string', u'description': u'Operator to use in front of keyword args i.e. "--" or "-".'}, u'passphrase': {u'secret': True, u'required': False, u'type': u'string', u'description': u'Passphrase for the private key, if needed.'}, u'password': {u'secret': True, u'required': False, u'type': u'string', u'description': u'Password used to log in. If not provided, private key from the config file is used.'}, u'port': {u'default': 22, u'required': False, u'type': u'integer', u'description': u'SSH port. 
Note: This parameter is used only in ParamikoSSHRunner.'}, u'cmd': {u'type': u'string', u'description': u'Arbitrary Linux command to be executed on the remote host(s).'}, u'bastion_host': {u'required': False, u'type': u'string', u'description': u'The host SSH connections will be proxied through. Note: This connection is made using the same parameters as the final connection, and is only used in ParamikoSSHRunner.'}, u'hosts': {u'required': True, u'type': u'string', u'description': u'A comma delimited string of a list of hosts where the remote command will be executed.'}, u'timeout': {u'default': 60, u'type': u'integer', u'description': u"Action timeout in seconds. Action will get killed if it doesn't finish in timeout seconds."}, u'parallel': {u'default': False, u'type': u'boolean', u'description': u'Default to parallel execution.', u'immutable': True}, u'cwd': {u'default': u'/tmp', u'type': u'string', u'description': u'Working directory where the script will be executed in'}, u'dir': {u'default': u'/tmp', u'type': u'string', u'description': u'The working directory where the script will be copied to on the remote host.', u'immutable': True}}, uid="runner_type:remote-shell-cmd")>]"------
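
Comparing the two samples, the expected log ends with a line from the `logging` module that reports the request ID and status code, while the hung request stops after the `init` RESULT line. A small script can look for that pattern across a larger log. This is a sketch under assumptions: the second field is taken to be a thread identifier, a `logging` line is taken to mark request completion (inferred only from the samples above), and `unfinished_requests` is a hypothetical helper name.

```python
import re

# An entry starts with "<date> <time>,<ms> <thread id> <LEVEL> ".
ENTRY_START = re.compile(
    r"(?=\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3} \d+ [A-Z]+ )"
)
ENTRY = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) (?P<tid>\d+) "
    r"(?P<level>[A-Z]+) (?P<module>\S+) \[-\] (?P<msg>.*)",
    re.DOTALL,
)


def unfinished_requests(log_text):
    """Map thread id -> timestamp of its last entry, for every thread
    whose last entry is not a `logging` (request completed) line."""
    last = {}
    # Split on entry starts so run-together entries are handled too.
    for chunk in ENTRY_START.split(log_text):
        m = ENTRY.match(chunk)
        if m:
            last[m["tid"]] = (m["ts"], m["module"])
    return {tid: ts for tid, (ts, module) in last.items() if module != "logging"}
```

Threads reported by this function are only candidates: a request that is still legitimately in progress looks the same as a hung one, so the timestamps need to be compared against the current time.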

LindsayHill commented 5 years ago

We are using mysql unlike postgresql by default .

We don't do any ongoing testing with MySQL, and do not recommend its use for the Mistral DB. We have observed problems with it in the past.

You are on your own if you want to use MySQL.

poojabayana1995 commented 5 years ago

I am facing the same issue with PostgreSQL. As suggested, I have switched back to PostgreSQL. My workflows ran concurrently without problems for some time, after which the API got stuck as explained earlier.

poojabayana1995 commented 5 years ago

@LindsayHill Can you please take a look at this Mistral issue? Despite using PostgreSQL, we are encountering the same problem.