Closed johandahlberg closed 8 years ago
@johandahlberg Thanks for the report!
Please include information about the st2 version, mistral version, and OS you are running.
Sorry for missing to include the info. Here it is:
st2 version: 1.4.0-8
st2 mistral version: 0.1.2
OS: Ubuntu 14.04.4 LTS
I'll edit the top issue as well to make it easier to find in the future.
As always, we recommend that users run the latest version, which is v1.5.1 at the time of writing.
The error here means that all available PostgreSQL connections have been consumed. There are a number of things that can be done to alleviate this.
Increase the number of PostgreSQL connections on the server side and configure the client-side SQLAlchemy settings appropriately. As your workload on StackStorm/Mistral increases, these values should be adjusted accordingly. Please consult the PostgreSQL and SQLAlchemy documentation for more details.
On a typical st2 system in our environment, we set max_connections = 500 in postgresql.conf. Then, in /etc/mistral/mistral.conf, we set the client-side database settings to the following:
[database]
max_pool_size = 50
max_overflow = 100
pool_recycle = 3600
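To see why these numbers fit together: each Mistral server process can open up to max_pool_size + max_overflow connections against PostgreSQL, so the server-side max_connections has to leave headroom for all client processes combined. A quick back-of-the-envelope check (the per-process limits come from the settings above; the number of Mistral processes is a hypothetical example, not something from this thread):

```python
# Per-process connection ceiling for SQLAlchemy: persistent pool plus overflow.
max_pool_size = 50
max_overflow = 100
per_process_limit = max_pool_size + max_overflow  # 150 connections at most

# Hypothetical deployment: e.g. mistral api + engine + executor processes.
num_mistral_processes = 3
worst_case_clients = num_mistral_processes * per_process_limit

postgres_max_connections = 500
headroom = postgres_max_connections - worst_case_clients

print(per_process_limit, worst_case_clients, headroom)  # 150 450 50
```

If the worst case exceeds max_connections, you will hit exactly the "too many connections" class of error reported here.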
Also in /etc/mistral/mistral.conf, we configure Mistral to purge database records older than 7 days. You can change the data retention period per your organization's comfort level and policy; having too many records in the database will affect performance.
[execution_expiration_policy]
evaluation_interval = 360
older_than = 10080
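Both of these values are expressed in minutes, which is easy to misread. A small sanity check of the conversion (not part of the config itself):

```python
# execution_expiration_policy values are in minutes.
evaluation_interval = 360   # how often mistral scans for expired executions
older_than = 10080          # age threshold for purging execution records

print(evaluation_interval / 60)   # 6.0 -> purge runs every 6 hours
print(older_than / 60 / 24)       # 7.0 -> records kept for 7 days
```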
Remember to restart postgresql and mistral services after making these configuration changes.
@m4dcoder We are looking at upgrading StackStorm as quickly as possible, however we have not yet been able to find a window to do so on our production systems.
Thank you for providing some input on what a typical st2/mistral configuration looks like in this regard. Since we increased max_connections on postgres we have not seen this problem, so maybe this will solve it.
A question on the execution_expiration_policy part: will the records still be kept in StackStorm, so that if we e.g. look up a trace tag we will still be able to find it?
@m4dcoder Can you please make sure those two things get documented and added to st2docs (I couldn't find any existing documentation on it, but if they are already documented, please ignore my comment)?
We already have a section on this for StackStorm, but AFAIK, there isn't one yet for Mistral.
1) Some background on how it works and why it occurs (long-running workflows, no connection pooling, connections also used for locking, etc.).
2) How to bump max_connections, tune other settings, etc. to alleviate the issues.
Both of those sections should probably go in the "Troubleshooting" chapter.
I can also do 2) myself, but you have more background and context on 1) and probably also 2).
Please refer to the KB article here to resolve mistral workflows that are stuck in RUNNING state.
I've observed that some of our mistral workflows get stuck with a running status in st2, but when I check the corresponding workflow in mistral it has reached some end state (success, error, etc.).
Here is an example:
Looking for this id in st2resultstracker.log, I found the following: [log output not included here]

We could then confirm that we did indeed have approximately 100 connections to postgres, with max_connections = 100. All of these connections were either COMMIT or ROLLBACK statements, and seemed to be coming from mistral. We have now increased the maximum number of connections, and we'll see if that helps.
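The confirmation step above amounts to counting rows in pg_stat_activity and grouping them by their current query. A small Python sketch of that counting step, run against a made-up sample rather than a live database (the sample rows are hypothetical; on a real system you would feed it the output of something like `SELECT query FROM pg_stat_activity;`, or `current_query` on older PostgreSQL versions):

```python
from collections import Counter

# Hypothetical pg_stat_activity "query" column values. In reality these
# would come from querying the pg_stat_activity system view.
queries = ["COMMIT"] * 60 + ["ROLLBACK"] * 38 + ["SELECT 1"] * 2

counts = Counter(queries)
total = sum(counts.values())

# Total connections vs. how many are just COMMIT/ROLLBACK traffic.
print(total, counts["COMMIT"], counts["ROLLBACK"])  # 100 60 38
```

If the total is sitting at max_connections and is dominated by COMMIT/ROLLBACK entries, that matches the exhaustion pattern described in this issue.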
Unfortunately I have so far not had time to create a reproducible example that causes this to happen. If I manage to find the time for that, I'll add it here.
Versions and OS
st2 version: 1.4.0-8
st2 mistral version: 0.1.2
OS: Ubuntu 14.04.4 LTS
edit 1: Added versions and OS info