stewartshea closed this issue 5 years ago
I've adjusted SSO environment variables:
DB_MIN_POOL_SIZE=5
DB_MAX_POOL_SIZE=20
That should give you enough room (3 replicas x 20 = 60) to stay under the PostgreSQL max_connections limit,
which is currently set to 100. The remaining room (~30, since each PostgreSQL instance also uses connections for cluster management) would be available to support rolling deployments.
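The headroom arithmetic above can be sketched as follows (values taken from this thread; the ~30 figure is the 40 raw headroom minus Patroni's own cluster-management connections):

```shell
# Sketch: connection headroom math, using the values from this thread
REPLICAS=3
MAX_POOL=20
MAX_CONNECTIONS=100
APP_CONNS=$((REPLICAS * MAX_POOL))          # 3 x 20 = 60 app connections
HEADROOM=$((MAX_CONNECTIONS - APP_CONNS))   # 40 raw; ~30 after Patroni's own connections
echo "app=$APP_CONNS headroom=$HEADROOM"    # prints: app=60 headroom=40
```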
Ideally we can have a pipeline built for Keycloak to ensure our changes are always captured. Some other quick notes/ideas as placeholders:
- destroy/recreate instead of rolling for the application deployments?
- create an additional operational testing namespace for SSO, since it looks like our dev/test/prod environments are being used by project teams (to allow us to test application configuration changes, etc. without impacting project teams)
- init hooks in the application frontend to check for DB connection availability and trigger a DB connection garbage collection before launching, if required?
- test the Patroni configuration for Postgres max connections (currently it's ignoring the config and using the defaults)
from:

bootstrap:
  dcs:
    postgresql:
      use_pg_rewind: true
      max_connections: 500
      max_prepared_transactions: 500

to:

bootstrap:
  dcs:
    postgresql:
      use_pg_rewind: true
      parameters:
        max_connections: 500
        max_prepared_transactions: 500
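Once the corrected parameters block is in place (and the members restarted), a quick way to confirm Postgres actually picked it up, assuming psql access as postgres inside a pod:

```shell
# Confirm the running value; expect 500 after a restart if the config took effect
psql -U postgres -Atc "show max_connections;"
```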
Hit this again today; we still seem to be holding stale or idle connections. Not sure whether a cron will help here or not.
select max_conn, used, res_for_super, max_conn-used-res_for_super res_for_normal
from
  (select count(*) used from pg_stat_activity) t1,
  (select setting::int res_for_super from pg_settings where name=$$superuser_reserved_connections$$) t2,
  (select setting::int max_conn from pg_settings where name=$$max_connections$$) t3;
max_conn | used | res_for_super | res_for_normal
----------+------+---------------+----------------
100 | 104 | 3 | -7
(1 row)
postgres=#
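To see who is actually holding those connections, a breakdown by database, user, and state (a sketch; run as a superuser, using the standard pg_stat_activity columns) will make idle hoarders stand out:

```shell
# Group connections so stale/idle ones are easy to spot
psql -U postgres -c "
select datname, usename, state, count(*)
from pg_stat_activity
group by datname, usename, state
order by 4 desc;"
```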
Here's one method for patching the config; that said, we may want to build this into the image, or at least parameterize it better with an ENV var.
From within the pod:
curl -s -XPATCH -d \
'{"loop_wait":5,"ttl":20,"postgresql":{"parameters":{"max_connections":"200"}}}' \
http://localhost:8008/config | jq .
https://patroni.readthedocs.io/en/latest/dynamic_configuration.html
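To confirm the PATCH landed, the same Patroni REST endpoint can be read back with GET, and patronictl will flag any member that still needs a restart to apply the change:

```shell
# Read back the effective dynamic configuration
curl -s http://localhost:8008/config | jq '.postgresql.parameters'

# Members that still need a restart are flagged in the "Pending restart" column
patronictl list patroni-persistent
```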
Reload cluster
postgres@patroni-persistent-2:/home/postgres$ patronictl reload patroni-persistent
+--------------------+----------------------+---------------+--------+---------+-----------+
| Cluster | Member | Host | Role | State | Lag in MB |
+--------------------+----------------------+---------------+--------+---------+-----------+
| patroni-persistent | patroni-persistent-0 | 10.131.22.199 | | running | 0 |
| patroni-persistent | patroni-persistent-1 | 10.131.22.229 | Leader | running | 0 |
| patroni-persistent | patroni-persistent-2 | 10.131.22.231 | | running | 0 |
+--------------------+----------------------+---------------+--------+---------+-----------+
Are you sure you want to reload members patroni-persistent-0, patroni-persistent-1, patroni-persistent-2? [y/N]: y
The environment variable names for MIN/MAX pool size are incorrect. After reading the Red Hat EAP/xPaaS documentation, the variables should actually be:
DB_TX_MIN_POOL_SIZE=5
DB_TX_MAX_POOL_SIZE=20
They were missing _TX.
I've updated all 3 environments with a rolling deployment (no outage).
Now the connections are being constrained as expected: 3 replicas * 5 = 15
postgres=# SELECT datname, numbackends FROM pg_stat_database WHERE datname = 'rhsso';
datname | numbackends
---------+-------------
rhsso | 15
(1 row)
postgres=#
@cvarjao Do you think we need / want to modify anything on the DB side of things? I suspect I'd still like to see if I can use an env var in the entrypoint to dynamically set the connection limit on the db... maybe we want to do this under a separate Issue. Thoughts?
@stewartshea, @jefkel, a couple of things ...
The configuration file (/home/postgres/patroni.yml) created in the entrypoint is not correct. It is missing the "parameters" block, so it is effectively being ignored (similar to using the wrong env var name :( )
postgresql:
  use_pg_rewind: true
  parameters:
    max_connections: 500
    max_prepared_transactions: 500
Somebody else has made the same mistake: https://github.com/zalando/patroni/issues/727
It can be fixed by editing the annotation/configuration in the ConfigMap/patroni-persistent-config:
oc edit ConfigMap/patroni-persistent-config
Note that here we edit the config annotation, which is in JSON format.
After editing the ConfigMap, we will need to fetch/reload the configuration and then restart. reload itself will NOT restart, but it will mark the member as needing a restart.
After the configuration has changed, we can then restart one member at a time, starting with the followers and leaving the leader for last.
patronictl reload patroni-persistent
# followers
patronictl restart patroni-persistent patroni-persistent-0
patronictl restart patroni-persistent patroni-persistent-2
# leader
patronictl restart patroni-persistent patroni-persistent-1
Instead of reloading all members at once, we can also reload and restart one member at a time:
# followers
patronictl reload patroni-persistent patroni-persistent-0
patronictl restart patroni-persistent patroni-persistent-0
patronictl reload patroni-persistent patroni-persistent-2
patronictl restart patroni-persistent patroni-persistent-2
# leader
patronictl reload patroni-persistent patroni-persistent-1
patronictl restart patroni-persistent patroni-persistent-1
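The followers-first sequence above can also be scripted so the leader is always restarted last. This is a sketch, assuming patronictl's JSON output format (patronictl list -f json, with Member/Role keys) and jq are available in the pod; --force skips the interactive confirmation:

```shell
CLUSTER=patroni-persistent

# Restart followers first ...
for m in $(patronictl list "$CLUSTER" -f json | jq -r '.[] | select(.Role != "Leader") | .Member'); do
  patronictl restart "$CLUSTER" "$m" --force
done

# ... then the leader last
leader=$(patronictl list "$CLUSTER" -f json | jq -r '.[] | select(.Role == "Leader") | .Member')
patronictl restart "$CLUSTER" "$leader" --force
```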
@stewartshea, I believe we will also need to increase max_connections, or the cluster may get very slow under heavy load. But I guess that remains to be seen. I suspect we may still have apps doing unnecessary and overly frequent token checks.
I am keeping a troubleshooting wiki in the sso repo: https://github.com/bcgov/ocp-sso/wiki/Troubleshooting
Thanks @cvarjao ... do you want us to add a feature to configure that parameter via an ENV var?
@stewartshea, I am already working on a PR with a few changes to that image, as I am trying to get SSO under CD. I am making it more compatible with the Red Hat PostgreSQL image, so it should be easier for teams to switch.
@cvarjao sounds good... I have another issue raised to get some version management under control as well. I would prefer not to fork and just contribute back to patroni, but of course we may need to fork depending on the changes you want to include as well.
Connection issues continue to push the database past its connection limit and cause SSO to fail. This issue will be used to track the problem.