Mistral workflows hanging in running status in st2 while ok in mistral

johandahlberg commented 8 years ago

I've observed that some of our mistral workflows get stuck with a running status in st2, but when I check the corresponding workflow in mistral it has reached some end state, success/error/etc.

Here is an example:

$ st2 execution get 5788770148625e59caf6b9b6
id: 5788770148625e59caf6b9b6
action.ref: arteria-packs.ngi_uu_workflow
parameters: 
  host: biotank14.sequencing
  runfolder: /data/biotank14/runfolders/160712_ST-E00280_0077_AHNGTLCCXX
status: running (463029s elapsed)
start_timestamp: 2016-07-15T05:39:13.623947Z
end_timestamp: None
+-----------------------------+--------------------------+--------------------------------------+--------------------------------------+-------------------------------+
| id                          | status                   | task                                 | action                               | start_timestamp               |
+-----------------------------+--------------------------+--------------------------------------+--------------------------------------+-------------------------------+
|   5788770248625e7d1bb224ee  | succeeded (0s elapsed)   | note_workflow_repo_version           | core.local                           | Fri, 15 Jul 2016 05:39:14 UTC |
|   5788770348625e7d1bb224f0  | succeeded (1s elapsed)   | get_config                           | arteria-packs.get_pack_config        | Fri, 15 Jul 2016 05:39:15 UTC |
|   5788770548625e7d1bb224f2  | succeeded (0s elapsed)   | mark_as_started                      | core.http                            | Fri, 15 Jul 2016 05:39:17 UTC |
|   5788770648625e7d1bb224f4  | succeeded (0s elapsed)   | get_runfolder_name                   | core.local                           | Fri, 15 Jul 2016 05:39:18 UTC |
|   5788770748625e7d1bb224f6  | succeeded (0s elapsed)   | get_flowcell_name                    | arteria-packs.get_flowcell_from_runf | Fri, 15 Jul 2016 05:39:19 UTC |
|                             |                          |                                      | older_name                           |                               |
|   5788770848625e7d1bb224f8  | succeeded (1s elapsed)   | download_samplesheet                 | core.http                            | Fri, 15 Jul 2016 05:39:20 UTC |
|   5788770948625e7d1bb224fa  | succeeded (1s elapsed)   | construct_bcl2fastq_body             | arteria-packs.parse_bcl2fastq_args   | Fri, 15 Jul 2016 05:39:21 UTC |
|   5788770a48625e7d1bb224fc  | succeeded (1s elapsed)   | run_bcl2fastq                        | core.http                            | Fri, 15 Jul 2016 05:39:22 UTC |
|   5788770b48625e7d1bb224fe  | succeeded                | poll_demultiplex_status              | arteria-packs.poll_status            | Fri, 15 Jul 2016 05:39:23 UTC |
|                             | (11891s elapsed)         |                                      |                                      |                               |
|   5788a57f48625e7d1bb22930  | succeeded (0s elapsed)   | download_sisyphus_config             | core.http                            | Fri, 15 Jul 2016 08:57:35 UTC |
|   5788a58048625e7d1bb22932  | succeeded (1s elapsed)   | construct_report_body                | arteria-packs.parse_siswrap_args     | Fri, 15 Jul 2016 08:57:36 UTC |
|   5788a58148625e7d1bb22934  | succeeded (1s elapsed)   | run_sisyphus_quick_report            | core.http                            | Fri, 15 Jul 2016 08:57:37 UTC |
|   5788a58348625e7d1bb22936  | succeeded                | poll_report_status                   | arteria-packs.poll_status            | Fri, 15 Jul 2016 08:57:39 UTC |
|                             | (24621s elapsed)         |                                      |                                      |                               |
|   578905b148625e7d1bb239d0  | succeeded (0s elapsed)   | download_qc_config                   | core.http                            | Fri, 15 Jul 2016 15:48:01 UTC |
|   578905b248625e7d1bb239d2  | succeeded (0s elapsed)   | construct_qc_body                    | arteria-packs.parse_siswrap_args     | Fri, 15 Jul 2016 15:48:02 UTC |
|   578905b348625e7d1bb239d4  | succeeded (1s elapsed)   | run_sisyphus_qc                      | core.http                            | Fri, 15 Jul 2016 15:48:03 UTC |
|   578905b448625e7d1bb239d6  | succeeded (1s elapsed)   | poll_qc_status                       | arteria-packs.poll_status            | Fri, 15 Jul 2016 15:48:04 UTC |
|   578905b548625e7d1bb239d8  | succeeded (1s elapsed)   | get_year                             | core.local                           | Fri, 15 Jul 2016 15:48:05 UTC |
|   578905b748625e7d1bb239da  | succeeded (49s elapsed)  | rsync_to_summary_host                | arteria-packs.rsync                  | Fri, 15 Jul 2016 15:48:06 UTC |
|   578905e848625e7d1bb239dc  | succeeded (0s elapsed)   | check_hosts_to_rsync_to              | core.http                            | Fri, 15 Jul 2016 15:48:56 UTC |
|   578905e948625e7d1bb239df  | succeeded (0s elapsed)   | make_irma_samplesheet_changes        | core.local                           | Fri, 15 Jul 2016 15:48:57 UTC |
|   578905e948625e7d1bb239e0  | succeeded (0s elapsed)   | make_irma_samplesheet_changes        | core.local                           | Fri, 15 Jul 2016 15:48:57 UTC |
| + 578905ea48625e7d1bb239e2  | failed (44556s elapsed)  | rsync_to_irma                        | arteria-packs.sync_workflow          | Fri, 15 Jul 2016 15:48:58 UTC |
|    578905eb48625e7d1bb239e4 | succeeded (1s elapsed)   | note_workflow_repo_version           | core.local                           | Fri, 15 Jul 2016 15:48:59 UTC |
|    578905ec48625e7d1bb239e6 | succeeded                | create_md5sums_for_files_to_transfer | arteria-                             | Fri, 15 Jul 2016 15:49:00 UTC |
|                             | (2334s elapsed)          |                                      | packs.create_sis_style_checksums     |                               |
|    57890f0b48625e7d1bb23a00 | failed (42217s elapsed)  | rsync_to_destination                 | arteria-packs.rsync                  | Fri, 15 Jul 2016 16:27:55 UTC |
|   5789b3fa48625e7d1bb252de  | succeeded (0s elapsed)   | oh_shit_error                        | core.sendmail                        | Sat, 16 Jul 2016 04:11:38 UTC |
|   5789b3fa48625e7d1bb252e0  | succeeded (0s elapsed)   | mark_as_failed                       | core.http                            | Sat, 16 Jul 2016 04:11:38 UTC |
+-----------------------------+--------------------------+--------------------------------------+--------------------------------------+-------------------------------+
$ mistral execution-get 850fc1a6-0d9a-4cdb-b892-95580320e922
+-------------------+--------------------------------------+
| Field             | Value                                |
+-------------------+--------------------------------------+
| ID                | 850fc1a6-0d9a-4cdb-b892-95580320e922 |
| Workflow ID       | 6dc5b938-ac8e-41da-b773-8bfafab4d88e |
| Workflow name     | arteria-packs.ngi_uu_workflow.main   |
| Description       |                                      |
| Task Execution ID | <none>                               |
| State             | ERROR                                |
| State info        | None                                 |
| Created at        | 2016-07-15 05:39:14.668164           |
| Updated at        | 2016-07-16 04:11:39.235464           |
+-------------------+--------------------------------------+

Looking for this id in st2resultstracker.log I found the following:

2016-07-15 14:19:48,821 139900353161808 ERROR v2 [-] [5788770148625e59caf6b9b5] Unable to fetch mistral workflow result and tasks. {'mistral': {'workflow_name': u'arteria-packs.ngi_uu_workflow.main', 'execution_id': '850fc1a6-0d9a-4cdb-b892-95580320e922'}}
Traceback (most recent call last):
  File "/opt/stackstorm/st2/local/lib/python2.7/site-packages/st2actions/query/mistral/v2.py", line 65, in query
    result['tasks'] = self._get_workflow_tasks(mistral_exec_id)
  File "/opt/stackstorm/st2/local/lib/python2.7/site-packages/st2actions/query/mistral/v2.py", line 105, in _get_workflow_tasks
    wf_tasks = tasks.TaskManager(self._client).list(workflow_execution_id=exec_id)
  File "/opt/stackstorm/st2/local/lib/python2.7/site-packages/mistralclient/api/v2/tasks.py", line 34, in list
    return self._list(url, response_key='tasks')
  File "/opt/stackstorm/st2/local/lib/python2.7/site-packages/mistralclient/api/base.py", line 114, in _list
    self._raise_api_exception(resp)
  File "/opt/stackstorm/st2/local/lib/python2.7/site-packages/mistralclient/api/base.py", line 143, in _raise_api_exception
    error_message=error_data)
APIException: (psycopg2.OperationalError) FATAL:  remaining connection slots are reserved for non-replication superuser connections

2016-07-15 14:19:48,821 139900353161808 WARNING mistral [-] Determining if <class 'mistralclient.api.base.APIException'> should be retried...
2016-07-15 14:19:48,821 139900353161808 ERROR base [-] Failed querying results for liveaction_id 5788770148625e59caf6b9b5.
Traceback (most recent call last):
  File "/opt/stackstorm/st2/local/lib/python2.7/site-packages/st2actions/query/base.py", line 91, in _query_and_save_results
    (status, results) = self.query(execution_id, actual_query_context)
  File "/opt/stackstorm/st2/local/lib/python2.7/site-packages/retrying.py", line 49, in wrapped_f
    return Retrying(*dargs, **dkw).call(f, *args, **kw)
  File "/opt/stackstorm/st2/local/lib/python2.7/site-packages/retrying.py", line 206, in call
    return attempt.get(self._wrap_exception)
  File "/opt/stackstorm/st2/local/lib/python2.7/site-packages/retrying.py", line 247, in get
    six.reraise(self.value[0], self.value[1], self.value[2])
  File "/opt/stackstorm/st2/local/lib/python2.7/site-packages/retrying.py", line 200, in call
    attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
  File "/opt/stackstorm/st2/local/lib/python2.7/site-packages/st2actions/query/mistral/v2.py", line 65, in query
    result['tasks'] = self._get_workflow_tasks(mistral_exec_id)
  File "/opt/stackstorm/st2/local/lib/python2.7/site-packages/st2actions/query/mistral/v2.py", line 105, in _get_workflow_tasks
    wf_tasks = tasks.TaskManager(self._client).list(workflow_execution_id=exec_id)
  File "/opt/stackstorm/st2/local/lib/python2.7/site-packages/mistralclient/api/v2/tasks.py", line 34, in list
    return self._list(url, response_key='tasks')
  File "/opt/stackstorm/st2/local/lib/python2.7/site-packages/mistralclient/api/base.py", line 114, in _list
    self._raise_api_exception(resp)
  File "/opt/stackstorm/st2/local/lib/python2.7/site-packages/mistralclient/api/base.py", line 143, in _raise_api_exception
    error_message=error_data)
APIException: (psycopg2.OperationalError) FATAL:  remaining connection slots are reserved for non-replication superuser connections

2016-07-15 14:19:48,826 139900353161808 INFO base [-] Clearing state object: ActionExecutionStateDB(execution_id=5788770148625e59caf6b9b5, id=5788770248625e596a7a4445, query_context={u'mistral': {u'workflow_name': u'arteria-packs.ngi_uu_workflow.main', u'execution_id': u'850fc1a6-0d9a-4cdb-b892-95580320e922'}}, query_module="st2actions.query.mistral.v2")

We could then confirm that we did indeed have to approximately 100 connections to postgres, with max_connections=100. All of these connections were either COMMIT or ROLLBACK statements, and seemed to be coming from mistral.

We have now increased the maximum number of connections, and we'll see if that helps.

Unfortunately I have so far not had time to create a reproducible example which causes this to happen. If I'll manage to find the time for that I'll add it here.

Versions and OS st2 version: 1.4.0-8 st2 mistral version: 0.1.2 OS: Ubuntu 14.04.4 LTS

edit 1: Added versions and OS info

lakshmi-kannan commented 8 years ago

@johandahlberg Thanks for the report!

Please include information about

st2 version
st2mistral version
OS

johandahlberg commented 8 years ago

Sorry for missing to include the info. Here it is:

st2 version: 1.4.0-8 st2 mistral version: 0.1.2 OS: Ubuntu 14.04.4 LTS

I'll edit the top issue as well to make it easier to find in the future.

m4dcoder commented 8 years ago

As always, we recommend users to run the latest version which is v1.5.1 at time of writing.

The error here means available connections on postgresql are consumed. There are a number of things that can be done to alleviate.

Increase the number of postgresql connections on server side and configure the client side configuration for SqlAlchemy appropriately. As your workload on StackStorm/Mistral increases, these values should be changed accordingly. Please consult postgresql and SqlAlchemy documentation for more details.

On a typical st2 system we have in our environment, we set the postgresql max_connections = 500 at postgresql.conf. Then in /etc/mistral/mistral.conf, we set the client settings for the database to the following.

[database]
max_pool_size = 50
max_overflow = 100
pool_recycle = 3600

Also, at /etc/mistral/mistral.conf, we enable mistral to purge database records older than 7 days. But you can change the data retention period per your organization's comfort level and policy. Having too much records in the database will affect performance.

[execution_expiration_policy]
evaluation_interval = 360
older_than = 10080

Remember to restart postgresql and mistral services after making these configuration changes.

johandahlberg commented 8 years ago

@m4dcoder We are looking at upgrading StackStorm as quickly as possible, however we have not yet been able to find a window to do so on our production systems.

Thank you for providing some input on what a typical st2/mistral configuration would look like in this regard - since we increased max_connections on postgres we have not seen this problem - so maybe this will solve it.

A question on the execution_expiration_policy part - will this still keep the records in StackStorm, so that if we e.g. look up a trace tag we will still be able to find it?

Kami commented 8 years ago

@m4dcoder Can you please make sure those two things get documented and added to st2docs (I couldn't find any existing documentation on it, but if they are already documented, please ignore my comment)?

How to set expiration policy for Mistral (also some background on how it works, why it's needed, etc.).

We already have a section on this for StackStorm, but AFAIK, there isn't one yet for Mistral.

How to tune Mistral max_connections - aka what to do when connection limit has been reached, etc.

Some background on how it works, why it occurs (long running workflows, no connection pooling, connections also used for locking, etc.) and how to bump max_connections, tune other settings, etc. to alleviate the issues.

Both of those sections should probably go in the "Troubleshooting" chapter.

I can also do 2) myself, but you have more background and context on 1) and probably also 2).

m4dcoder commented 8 years ago

Please refer to the KB article here to resolve mistral workflows that are stuck in RUNNING state.

StackStorm / st2

Mistral workflows hanging in running status in st2 while ok in mistral #2821