BCDevOps / platform-services

Collection of platform related tools and configurations
Apache License 2.0

Investigate Prod SSO connection issues #54

Closed stewartshea closed 5 years ago

stewartshea commented 5 years ago

Connection issues continue to push the database to its connection limit and cause SSO to fail. This issue will be used to track the problem.

cvarjao commented 5 years ago

I've adjusted SSO environment variables: DB_MIN_POOL_SIZE=5 DB_MAX_POOL_SIZE=20

That should give you enough room (3 × 20 = 60) to stay under the PostgreSQL max_connections limit, which is currently set to 100. The remaining headroom (~30, since each PostgreSQL instance also uses connections for cluster management) would be available to support rolling deployments.
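The connection-budget arithmetic above can be sketched as follows (an illustrative helper, not part of any deployed code; the replica count and settings are the ones mentioned in this thread):

```python
# Sketch of the connection budget: how many connections remain under
# max_connections once every SSO replica fills its pool. The function
# name and the hard-coded values are assumptions taken from this thread.
def connection_headroom(replicas, max_pool, max_connections):
    """Return connections left over after every replica fills its pool."""
    used_at_peak = replicas * max_pool
    return max_connections - used_at_peak

# 3 SSO replicas * DB_MAX_POOL_SIZE=20 = 60 connections at peak, leaving
# 40 under max_connections=100 (of which Patroni/cluster management also
# consumes some, hence the ~30 figure above).
headroom = connection_headroom(replicas=3, max_pool=20, max_connections=100)
print(headroom)
```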

jefkel commented 5 years ago

Ideally we'd have a pipeline built for Keycloak to ensure our changes are always captured. Some other quick notes/ideas as placeholders:

from:

bootstrap:
  dcs:
    postgresql:
      use_pg_rewind: true
      max_connections: 500
      max_prepared_transactions: 500

to:

bootstrap:
  dcs:
    postgresql:
      use_pg_rewind: true
      parameters:
        max_connections: 500
        max_prepared_transactions: 500
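The difference between the two snippets above can be made concrete with a small sketch (a hypothetical helper, not Patroni code): Patroni only applies PostgreSQL settings nested under `postgresql.parameters`, so settings placed one level too high are silently dropped.

```python
# Hypothetical illustration of why the "from" config above has no effect:
# only the contents of the "parameters" mapping reach PostgreSQL.
def effective_pg_parameters(postgresql_section):
    """Return only the settings that would be passed through to PostgreSQL."""
    return postgresql_section.get("parameters", {})

wrong = {"use_pg_rewind": True, "max_connections": 500}
right = {"use_pg_rewind": True, "parameters": {"max_connections": 500}}

print(effective_pg_parameters(wrong))   # {} -- max_connections dropped
print(effective_pg_parameters(right))   # {'max_connections': 500}
```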
stewartshea commented 5 years ago

Hit this again today; we still seem to be holding on to stale or idle connections. Not sure whether a cron job would help here or not.

select max_conn,used,res_for_super,max_conn-used-res_for_super res_for_normal
from
(select count(*) used from pg_stat_activity) t1,
(select setting::int res_for_super from pg_settings where name=$$superuser_reserved_connections$$) t2,
(select setting::int max_conn from pg_settings where name=$$max_connections$$) t3;

 max_conn | used | res_for_super | res_for_normal
----------+------+---------------+----------------
      100 |  104 |             3 |             -7
(1 row)

postgres=#
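Recomputing the query's arithmetic from the figures it reported makes the failure obvious (values below are the ones in the output above):

```python
# res_for_normal = max_conn - used - res_for_super, per the query above.
max_conn, used, res_for_super = 100, 104, 3
res_for_normal = max_conn - used - res_for_super
print(res_for_normal)  # -7: the connection limit is already over-subscribed
```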
stewartshea commented 5 years ago

Here's one method for patching the config; that said, we may want to build this into the image, or at least parameterize it better with an ENV var.

cvarjao commented 5 years ago

The environment variable names for the MIN/MAX pool size are incorrect. After reading the Red Hat EAP/xPaaS documentation, the variables should actually be:

DB_TX_MIN_POOL_SIZE=5
DB_TX_MAX_POOL_SIZE=20

The original names were missing _TX.

I've updated all 3 environments with a rolling deployment (no outage).

Now the connections are being constrained as expected: 3 replicas * 5 = 15

postgres=# SELECT datname, numbackends FROM pg_stat_database WHERE datname = 'rhsso';
 datname | numbackends 
---------+-------------
 rhsso   |          15
(1 row)

postgres=# 
stewartshea commented 5 years ago

@cvarjao Do you think we need / want to modify anything on the DB side of things? I suspect I'd still like to see if I can use an env var in the entrypoint to dynamically set the connection limit on the db... maybe we want to do this under a separate Issue. Thoughts?

cvarjao commented 5 years ago

@stewartshea, @jefkel, a couple of things ...

  1. The configuration file (/home/postgres/patroni.yml) created in the entrypoint is not correct. It is missing the "parameters" block, so it is effectively being ignored (similar to using the wrong env var name :( )

    postgresql:
      use_pg_rewind: true
      parameters:
        max_connections: 500
        max_prepared_transactions: 500

    Somebody else has made the same mistake: https://github.com/zalando/patroni/issues/727

  2. It can be fixed by editing the annotation/configuration in the ConfigMap/patroni-persistent-config.

    oc edit ConfigMap/patroni-persistent-config

    Note that here we edit the config annotation, which is in JSON format.

  3. After editing the ConfigMap, we will need to reload the configuration and then restart. Reload itself will NOT restart the member, but it will mark the member as needing a restart.

    1. After the configuration has changed, we can restart one member at a time, starting with the followers and leaving the leader for last.

      patronictl reload patroni-persistent
      
      # followers
      patronictl restart patroni-persistent patroni-persistent-0
      patronictl restart patroni-persistent patroni-persistent-2
      
      # leader
      patronictl restart patroni-persistent patroni-persistent-1
    2. Instead of reloading all members at once, we can also reload and restart one member at a time:

      # followers
      patronictl reload patroni-persistent patroni-persistent-0
      patronictl restart patroni-persistent patroni-persistent-0
      
      patronictl reload patroni-persistent patroni-persistent-2
      patronictl restart patroni-persistent patroni-persistent-2
      
      # leader
      patronictl reload patroni-persistent patroni-persistent-1
      patronictl restart patroni-persistent patroni-persistent-1
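The ConfigMap edit in step 2 can be sketched in Python. This is a hypothetical illustration only: the starting annotation content below is invented for the example, and the real edit is done interactively with `oc edit`; the point is just that the fix relocates the settings under a `parameters` key inside the JSON.

```python
import json

# Invented example of a config annotation with the settings at the wrong
# level, mirroring the broken patroni.yml described in point 1 above.
annotation = json.dumps({
    "postgresql": {
        "use_pg_rewind": True,
        "max_connections": 500,           # wrong level: silently ignored
        "max_prepared_transactions": 500,
    }
})

config = json.loads(annotation)
pg = config["postgresql"]
params = pg.setdefault("parameters", {})
for key in ("max_connections", "max_prepared_transactions"):
    if key in pg:
        params[key] = pg.pop(key)         # relocate under "parameters"

print(json.dumps(config))
```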
cvarjao commented 5 years ago

@stewartshea, I believe we will also need to increase max_connections, or the cluster may get very slow under heavy load. But I guess that remains to be seen. I suspect we may still have apps doing unnecessary and overly frequent token checks.

cvarjao commented 5 years ago

I am keeping a troubleshooting wiki in the sso repo: https://github.com/bcgov/ocp-sso/wiki/Troubleshooting

stewartshea commented 5 years ago

Thanks @cvarjao ... do you want us to add a feature to configure that parameter via an ENV var?

cvarjao commented 5 years ago

@stewartshea, I am already working on a PR with a few changes to that image, as I am trying to get SSO under CD. I am making it more compatible with the Red Hat PostgreSQL image, so it should be easier for teams to switch.

stewartshea commented 5 years ago

@cvarjao sounds good... I have another issue raised to get some version management under control as well. I would prefer not to fork and instead contribute back to Patroni, but of course we may need to fork depending on the changes you want to include.