BCDevOps / platform-services

Collection of platform related tools and configurations
Apache License 2.0

Investigate Prod SSO connection issues #54

Closed stewartshea closed 5 years ago

stewartshea commented 5 years ago

Connection issues continue to push the database to its connection limit and cause SSO to fail. This issue will be used to track the problem.

cvarjao commented 5 years ago

I've adjusted SSO environment variables: DB_MIN_POOL_SIZE=5 DB_MAX_POOL_SIZE=20

That should give you enough room (3 × 20 = 60) to stay under the PostgreSQL max_connections limit, which is currently set to 100. The remaining headroom (~30, since each PostgreSQL instance also uses connections for cluster management) would be available to support rolling deployments.
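The connection-budget arithmetic above can be sketched as follows (an illustrative helper, not part of any deployed code; the replica count and settings are the ones mentioned in this thread):

```python
# Sketch of the connection budget: how many connections remain under
# max_connections once every SSO replica fills its pool. The function
# name and the hard-coded values are assumptions taken from this thread.
def connection_headroom(replicas, max_pool, max_connections):
    """Return connections left over after every replica fills its pool."""
    used_at_peak = replicas * max_pool
    return max_connections - used_at_peak

# 3 SSO replicas * DB_MAX_POOL_SIZE=20 = 60 connections at peak, leaving
# 40 under max_connections=100 (of which Patroni/cluster management also
# consumes some, hence the ~30 figure above).
headroom = connection_headroom(replicas=3, max_pool=20, max_connections=100)
print(headroom)
```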

jefkel commented 5 years ago

Ideally we'd have a pipeline built for Keycloak to ensure our changes are always captured. Some other quick notes/ideas as placeholders:

from:

bootstrap:
  dcs:
    postgresql:
      use_pg_rewind: true
      max_connections: 500
      max_prepared_transactions: 500

to:

bootstrap:
  dcs:
    postgresql:
      use_pg_rewind: true
      parameters:
        max_connections: 500
        max_prepared_transactions: 500
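The difference between the two snippets above can be made concrete with a small sketch (a hypothetical helper, not Patroni code): Patroni only applies PostgreSQL settings nested under `postgresql.parameters`, so settings placed one level too high are silently dropped.

```python
# Hypothetical illustration of why the "from" config above has no effect:
# only the contents of the "parameters" mapping reach PostgreSQL.
def effective_pg_parameters(postgresql_section):
    """Return only the settings that would be passed through to PostgreSQL."""
    return postgresql_section.get("parameters", {})

wrong = {"use_pg_rewind": True, "max_connections": 500}
right = {"use_pg_rewind": True, "parameters": {"max_connections": 500}}

print(effective_pg_parameters(wrong))   # {} -- max_connections dropped
print(effective_pg_parameters(right))   # {'max_connections': 500}
```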
stewartshea commented 5 years ago

Hit this again today; we still seem to be holding on to stale or idle connections. Not sure whether a cron job would help here or not.

select max_conn,used,res_for_super,max_conn-used-res_for_super res_for_normal
from
(select count(*) used from pg_stat_activity) t1,
(select setting::int res_for_super from pg_settings where name=$$superuser_reserved_connections$$) t2,
(select setting::int max_conn from pg_settings where name=$$max_connections$$) t3;

 max_conn | used | res_for_super | res_for_normal
----------+------+---------------+----------------
      100 |  104 |             3 |             -7
(1 row)

postgres=#
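Recomputing the query's arithmetic from the figures it reported makes the failure obvious (values below are the ones in the output above):

```python
# res_for_normal = max_conn - used - res_for_super, per the query above.
max_conn, used, res_for_super = 100, 104, 3
res_for_normal = max_conn - used - res_for_super
print(res_for_normal)  # -7: the connection limit is already over-subscribed
```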
stewartshea commented 5 years ago

Here's one method for patching the config; that said, we may want to build this into the image, or at least parameterize it better with an ENV var.

cvarjao commented 5 years ago

The environment variable names for the MIN/MAX pool size are incorrect. After reading the Red Hat EAP/xPaaS documentation, the variables should actually be:

DB_TX_MIN_POOL_SIZE=5
DB_TX_MAX_POOL_SIZE=20

The original names were missing _TX.

I've updated all 3 environments with a rolling deployment (no outage).

Now the connections are being constrained as expected: 3 replicas * 5 = 15

postgres=# SELECT datname, numbackends FROM pg_stat_database WHERE datname = 'rhsso';
 datname | numbackends 
---------+-------------
 rhsso   |          15
(1 row)

postgres=# 
stewartshea commented 5 years ago

@cvarjao Do you think we need / want to modify anything on the DB side of things? I suspect I'd still like to see if I can use an env var in the entrypoint to dynamically set the connection limit on the db... maybe we want to do this under a separate Issue. Thoughts?

cvarjao commented 5 years ago

@stewartshea, @jefkel, a couple of things ...

  1. The configuration file (/home/postgres/patroni.yml) created in the entrypoint is not correct. It is missing the "parameters" block, so it is effectively being ignored (similar to using the wrong env var name :( )

    postgresql:
      use_pg_rewind: true
      parameters:
        max_connections: 500
        max_prepared_transactions: 500

    Somebody else has made the same mistake: https://github.com/zalando/patroni/issues/727

  2. It can be fixed by editing the annotation/configuration in the ConfigMap/patroni-persistent-config.

    oc edit ConfigMap/patroni-persistent-config

    Note that here we edit the config annotation, which is in JSON format.

  3. After editing the ConfigMap, we will need to reload the configuration and then restart. Reload itself will NOT restart the member, but it will mark the member as needing a restart.

    1. After the configuration has changed, we can restart one member at a time, starting with the followers and leaving the leader for last.

      patronictl reload patroni-persistent
      
      # followers
      patronictl restart patroni-persistent patroni-persistent-0
      patronictl restart patroni-persistent patroni-persistent-2
      
      # leader
      patronictl restart patroni-persistent patroni-persistent-1
    2. Instead of reloading all members at once, we can also reload and restart one member at a time:

      # followers
      patronictl reload patroni-persistent patroni-persistent-0
      patronictl restart patroni-persistent patroni-persistent-0
      
      patronictl reload patroni-persistent patroni-persistent-2
      patronictl restart patroni-persistent patroni-persistent-2
      
      # leader
      patronictl reload patroni-persistent patroni-persistent-1
      patronictl restart patroni-persistent patroni-persistent-1
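The ConfigMap edit in step 2 can be sketched in Python. This is a hypothetical illustration only: the starting annotation content below is invented for the example, and the real edit is done interactively with `oc edit`; the point is just that the fix relocates the settings under a `parameters` key inside the JSON.

```python
import json

# Invented example of a config annotation with the settings at the wrong
# level, mirroring the broken patroni.yml described in point 1 above.
annotation = json.dumps({
    "postgresql": {
        "use_pg_rewind": True,
        "max_connections": 500,           # wrong level: silently ignored
        "max_prepared_transactions": 500,
    }
})

config = json.loads(annotation)
pg = config["postgresql"]
params = pg.setdefault("parameters", {})
for key in ("max_connections", "max_prepared_transactions"):
    if key in pg:
        params[key] = pg.pop(key)         # relocate under "parameters"

print(json.dumps(config))
```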
cvarjao commented 5 years ago

@stewartshea, I believe we will also need to increase max_connections, or the cluster may get very slow under heavy load. But I guess that remains to be seen. I suspect we may still have apps doing unnecessary and overly frequent token checks.

cvarjao commented 5 years ago

I am keeping a troubleshooting wiki in the sso repo: https://github.com/bcgov/ocp-sso/wiki/Troubleshooting

stewartshea commented 5 years ago

Thanks @cvarjao ... do you want us to add a feature to configure that parameter via an ENV var?

cvarjao commented 5 years ago

@stewartshea, I am already working on a PR with a few changes to that image, as I am trying to get SSO under CD. I am making it more compatible with the Red Hat PostgreSQL image, so it should be easier for teams to switch.

stewartshea commented 5 years ago

@cvarjao sounds good... I have another issue raised to get some version management under control as well. I would prefer not to fork and instead contribute back to Patroni, but of course we may need to fork depending on the changes you want to include.