
[bitnami/postgresql-ha] scale down to 1 makes cluster unavailable if primary was not on 0 node #17015

Open farcop opened 1 year ago

farcop commented 1 year ago

Name and Version

bitnami/postgresql-ha 11.7.4

What architecture are you using?

amd64

What steps will reproduce the bug?

  1. helm install postgresql-ha https://charts.bitnami.com/bitnami/postgresql-ha-11.7.4.tgz --set postgresql.replicaCount=3
  2. force the primary to move off node 0 (see the note after these steps for how the state shown below was captured):

    I have no name!@postgresql-ha-postgresql-0:/$ /opt/bitnami/scripts/postgresql-repmgr/entrypoint.sh repmgr -f /opt/bitnami/repmgr/conf/repmgr.conf cluster show
    postgresql-repmgr 07:50:54.00 
    postgresql-repmgr 07:50:54.00 Welcome to the Bitnami postgresql-repmgr container
    postgresql-repmgr 07:50:54.00 Subscribe to project updates by watching https://github.com/bitnami/containers
    postgresql-repmgr 07:50:54.01 Submit issues and feature requests at https://github.com/bitnami/containers/issues
    postgresql-repmgr 07:50:54.01 
    
    ID   | Name                       | Role    | Status    | Upstream                   | Location | Priority | Timeline | Connection string                                                                                                                                                    
    ------+----------------------------+---------+-----------+----------------------------+----------+----------+----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
    1000 | postgresql-ha-postgresql-0 | standby |   running | postgresql-ha-postgresql-1 | default  | 100      | 2        | user=repmgr password=HsboJALhW9 host=postgresql-ha-postgresql-0.postgresql-ha-postgresql-headless.default.svc.cluster.local dbname=repmgr port=5432 connect_timeout=5
    1001 | postgresql-ha-postgresql-1 | primary | * running |                            | default  | 100      | 2        | user=repmgr password=HsboJALhW9 host=postgresql-ha-postgresql-1.postgresql-ha-postgresql-headless.default.svc.cluster.local dbname=repmgr port=5432 connect_timeout=5
    1002 | postgresql-ha-postgresql-2 | standby |   running | postgresql-ha-postgresql-1 | default  | 100      | 1        | user=repmgr password=HsboJALhW9 host=postgresql-ha-postgresql-2.postgresql-ha-postgresql-headless.default.svc.cluster.local dbname=repmgr port=5432 connect_timeout=5
  3. helm upgrade postgresql-ha https://charts.bitnami.com/bitnami/postgresql-ha-11.7.4.tgz \
    --set postgresql.password=$PASSWORD \
    --set postgresql.repmgrPassword=$REPMGR_PASSWORD \
    --set pgpool.adminPassword=$ADMIN_PASSWORD \
    --set postgresql.replicaCount=1
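
Note on step 2: the repmgr cluster show output above was captured from a shell inside pod 0. An equivalent command run from outside the pod (assuming the default namespace, as in the hostnames above) would look roughly like this:

    kubectl exec -n default postgresql-ha-postgresql-0 -- \
      /opt/bitnami/scripts/postgresql-repmgr/entrypoint.sh \
      repmgr -f /opt/bitnami/repmgr/conf/repmgr.conf cluster show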

Are you using any custom parameters or values?

No response

What is the expected behavior?

The cluster comes back up and is available.

What do you see instead?

The cluster is unavailable. The postgresql-repmgr container on node 0 is stuck in Back-off restarting failed container: as the log below shows, it still tries to reach its former upstream postgresql-ha-postgresql-1, whose hostname no longer resolves after the scale-down.

postgresql-repmgr 07:41:13.23 
postgresql-repmgr 07:41:13.23 Welcome to the Bitnami postgresql-repmgr container
postgresql-repmgr 07:41:13.23 Subscribe to project updates by watching https://github.com/bitnami/containers
postgresql-repmgr 07:41:13.24 Submit issues and feature requests at https://github.com/bitnami/containers/issues
postgresql-repmgr 07:41:13.24 
postgresql-repmgr 07:41:13.26 INFO  ==> ** Starting PostgreSQL with Replication Manager setup **
postgresql-repmgr 07:41:13.29 INFO  ==> Validating settings in REPMGR_* env vars...
postgresql-repmgr 07:41:13.29 INFO  ==> Validating settings in POSTGRESQL_* env vars..
postgresql-repmgr 07:41:13.30 INFO  ==> Querying all partner nodes for common upstream node...
postgresql-repmgr 07:41:13.32 INFO  ==> There are no nodes with primary role. Assuming the primary role...
postgresql-repmgr 07:41:13.33 INFO  ==> Preparing PostgreSQL configuration...
postgresql-repmgr 07:41:13.34 INFO  ==> postgresql.conf file not detected. Generating it...
postgresql-repmgr 07:41:13.47 INFO  ==> Preparing repmgr configuration...
postgresql-repmgr 07:41:13.49 INFO  ==> Initializing Repmgr...
postgresql-repmgr 07:41:13.50 INFO  ==> Initializing PostgreSQL database...
postgresql-repmgr 07:41:13.50 INFO  ==> Custom configuration /opt/bitnami/postgresql/conf/postgresql.conf detected
postgresql-repmgr 07:41:13.51 INFO  ==> Custom configuration /opt/bitnami/postgresql/conf/pg_hba.conf detected
postgresql-repmgr 07:41:13.54 INFO  ==> Deploying PostgreSQL with persisted data...
postgresql-repmgr 07:41:13.58 INFO  ==> Configuring replication parameters
postgresql-repmgr 07:41:13.62 INFO  ==> Configuring fsync
postgresql-repmgr 07:41:13.63 INFO  ==> ** PostgreSQL with Replication Manager setup finished! **
postgresql-repmgr 07:41:13.66 INFO  ==> Starting PostgreSQL in background...
pg_ctl: another server might be running; trying to start server anyway
waiting for server to start....2023-06-04 07:41:13.701 GMT [145] LOG:  pgaudit extension initialized
2023-06-04 07:41:13.718 GMT [145] LOG:  redirecting log output to logging collector process
2023-06-04 07:41:13.718 GMT [145] HINT:  Future log output will appear in directory "/opt/bitnami/postgresql/logs".
2023-06-04 07:41:13.718 GMT [145] LOG:  starting PostgreSQL 15.3 on x86_64-pc-linux-gnu, compiled by gcc (Debian 10.2.1-6) 10.2.1 20210110, 64-bit
2023-06-04 07:41:13.719 GMT [145] LOG:  listening on IPv4 address "0.0.0.0", port 5432
2023-06-04 07:41:13.719 GMT [145] LOG:  listening on IPv6 address "::", port 5432
2023-06-04 07:41:13.725 GMT [145] LOG:  listening on Unix socket "/tmp/.s.PGSQL.5432"
2023-06-04 07:41:13.732 GMT [149] LOG:  database system was interrupted while in recovery at log time 2023-06-04 07:33:31 GMT
2023-06-04 07:41:13.732 GMT [149] HINT:  If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
2023-06-04 07:41:13.803 GMT [149] LOG:  entering standby mode
2023-06-04 07:41:13.808 GMT [149] LOG:  redo starts at 0/A000028
2023-06-04 07:41:13.808 GMT [149] LOG:  consistent recovery state reached at 0/B001790
2023-06-04 07:41:13.808 GMT [149] LOG:  invalid record length at 0/B001790: wanted 24, got 0
2023-06-04 07:41:13.808 GMT [145] LOG:  database system is ready to accept read-only connections
2023-06-04 07:41:13.823 GMT [150] FATAL:  could not connect to the primary server: could not translate host name "postgresql-ha-postgresql-1.postgresql-ha-postgresql-headless.default.svc.cluster.local" to address: Name or service not known
2023-06-04 07:41:13.829 GMT [151] FATAL:  could not connect to the primary server: could not translate host name "postgresql-ha-postgresql-1.postgresql-ha-postgresql-headless.default.svc.cluster.local" to address: Name or service not known
2023-06-04 07:41:13.829 GMT [149] LOG:  waiting for WAL to become available at 0/B0017A8
 done
server started
postgresql-repmgr 07:41:13.89 INFO  ==> ** Starting repmgrd **
[2023-06-04 07:41:13] [NOTICE] repmgrd (repmgrd 5.3.3) starting up
INFO:  set_repmgrd_pid(): provided pidfile is /tmp/repmgrd.pid
[2023-06-04 07:41:13] [NOTICE] starting monitoring of node "postgresql-ha-postgresql-0" (ID: 1000)
[2023-06-04 07:41:13] [ERROR] connection to database failed
[2023-06-04 07:41:13] [DETAIL] 
could not translate host name "postgresql-ha-postgresql-1.postgresql-ha-postgresql-headless.default.svc.cluster.local" to address: Name or service not known

[2023-06-04 07:41:13] [DETAIL] attempted to connect using:
  user=repmgr password=d8ZFvEW1pG connect_timeout=5 dbname=repmgr host=postgresql-ha-postgresql-1.postgresql-ha-postgresql-headless.default.svc.cluster.local port=5432 fallback_application_name=repmgr options=-csearch_path=
[2023-06-04 07:41:13] [ERROR] unable connect to upstream node (ID: 1001), terminating
[2023-06-04 07:41:13] [HINT] upstream node must be running before repmgrd can start

Additional information

No response

farcop commented 1 year ago

If I use a witness node (witness.create=true), then only scaling the StatefulSet directly works as expected: kubectl scale statefulsets postgresql-ha-postgresql --replicas=1

[2023-06-04 15:55:14] [WARNING] unable to ping "user=repmgr password=IrbKbpW4Sj host=postgresql-ha-postgresql-0.postgresql-ha-postgresql-headless.default.svc.cluster.local dbname=repmgr port=5432 connect_timeout=5"
[2023-06-04 15:55:14] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2023-06-04 15:55:14] [WARNING] unable to ping "user=repmgr password=IrbKbpW4Sj connect_timeout=5 dbname=repmgr host=postgresql-ha-postgresql-0.postgresql-ha-postgresql-headless.default.svc.cluster.local port=5432 fallback_application_name=repmgr"
[2023-06-04 15:55:14] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2023-06-04 15:55:17] [WARNING] unable to ping "user=repmgr password=IrbKbpW4Sj connect_timeout=5 dbname=repmgr host=postgresql-ha-postgresql-0.postgresql-ha-postgresql-headless.default.svc.cluster.local port=5432 fallback_application_name=repmgr"
[2023-06-04 15:55:17] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2023-06-04 15:55:17] [WARNING] unable to reconnect to node 1000 after 2 attempts
[2023-06-04 15:55:18] [NOTICE] witness node "postgresql-ha-postgresql-witness-0" (ID: 2000) now following new primary node "postgresql-ha-postgresql-1" (ID: 1001)
[2023-06-04 15:58:13] [WARNING] unable to ping "user=repmgr password=IrbKbpW4Sj host=postgresql-ha-postgresql-1.postgresql-ha-postgresql-headless.default.svc.cluster.local dbname=repmgr port=5432 connect_timeout=5"
[2023-06-04 15:58:13] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2023-06-04 15:58:13] [WARNING] unable to ping "user=repmgr password=IrbKbpW4Sj connect_timeout=5 dbname=repmgr host=postgresql-ha-postgresql-1.postgresql-ha-postgresql-headless.default.svc.cluster.local port=5432 fallback_application_name=repmgr"
[2023-06-04 15:58:13] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2023-06-04 15:58:16] [WARNING] unable to ping "user=repmgr password=IrbKbpW4Sj connect_timeout=5 dbname=repmgr host=postgresql-ha-postgresql-1.postgresql-ha-postgresql-headless.default.svc.cluster.local port=5432 fallback_application_name=repmgr"
[2023-06-04 15:58:16] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2023-06-04 15:58:16] [WARNING] unable to reconnect to node 1001 after 2 attempts
[2023-06-04 15:58:17] [NOTICE] witness node "postgresql-ha-postgresql-witness-0" (ID: 2000) now following new primary node "postgresql-ha-postgresql-0" (ID: 1000)

But if I scale with helm upgrade ... --set postgresql.replicaCount=1, as described in the README, the cluster does not come back up, as shown above.
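
For reference, a rough sketch of the workaround, assuming the same release name, chart version, and password environment variables as in the reproduction steps above:

    # Enable the repmgr witness node on the release (it could also be enabled at install time)
    helm upgrade postgresql-ha https://charts.bitnami.com/bitnami/postgresql-ha-11.7.4.tgz \
      --set postgresql.password=$PASSWORD \
      --set postgresql.repmgrPassword=$REPMGR_PASSWORD \
      --set pgpool.adminPassword=$ADMIN_PASSWORD \
      --set witness.create=true

    # Then scale the StatefulSet directly; this works, unlike
    # helm upgrade ... --set postgresql.replicaCount=1
    kubectl scale statefulsets postgresql-ha-postgresql --replicas=1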

corico44 commented 1 year ago

I can't reproduce the error. When executing the command you provided, I don't get the same output as you. Are you executing any other commands to perform that step?

I have no name!@postgresql-ha-postgresql-0:/$ /opt/bitnami/scripts/postgresql-repmgr/entrypoint.sh repmgr -f /opt/bitnami/repmgr/conf/repmgr.conf cluster show
postgresql-repmgr 16:51:49.62 
postgresql-repmgr 16:51:49.62 Welcome to the Bitnami postgresql-repmgr container
postgresql-repmgr 16:51:49.63 Subscribe to project updates by watching https://github.com/bitnami/containers
postgresql-repmgr 16:51:49.63 Submit issues and feature requests at https://github.com/bitnami/containers/issues
postgresql-repmgr 16:51:49.63 

 ID   | Name                       | Role    | Status    | Upstream                   | Location | Priority | Timeline | Connection string                                                                                                                                                    
------+----------------------------+---------+-----------+----------------------------+----------+----------+----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
 1000 | postgresql-ha-postgresql-0 | primary | * running |                            | default  | 100      | 1        | user=repmgr password=wDllZToicy host=postgresql-ha-postgresql-0.postgresql-ha-postgresql-headless.default.svc.cluster.local dbname=repmgr port=5432 connect_timeout=5
 1001 | postgresql-ha-postgresql-1 | standby |   running | postgresql-ha-postgresql-0 | default  | 100      | 1        | user=repmgr password=wDllZToicy host=postgresql-ha-postgresql-1.postgresql-ha-postgresql-headless.default.svc.cluster.local dbname=repmgr port=5432 connect_timeout=5
 1002 | postgresql-ha-postgresql-2 | standby |   running | postgresql-ha-postgresql-0 | default  | 100      | 1        | user=repmgr password=wDllZToicy host=postgresql-ha-postgresql-2.postgresql-ha-postgresql-headless.default.svc.cluster.local dbname=repmgr port=5432 connect_timeout=5
farcop commented 1 year ago

@corico44 Sorry, I did not describe how I force the primary to move off node 0.

To reproduce the state shown below:

I have no name!@postgresql-ha-postgresql-0:/$ /opt/bitnami/scripts/postgresql-repmgr/entrypoint.sh repmgr -f /opt/bitnami/repmgr/conf/repmgr.conf cluster show
postgresql-repmgr 07:50:54.00 
postgresql-repmgr 07:50:54.00 Welcome to the Bitnami postgresql-repmgr container
postgresql-repmgr 07:50:54.00 Subscribe to project updates by watching https://github.com/bitnami/containers
postgresql-repmgr 07:50:54.01 Submit issues and feature requests at https://github.com/bitnami/containers/issues
postgresql-repmgr 07:50:54.01 

 ID   | Name                       | Role    | Status    | Upstream                   | Location | Priority | Timeline | Connection string                                                                                                                                                    
------+----------------------------+---------+-----------+----------------------------+----------+----------+----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
 1000 | postgresql-ha-postgresql-0 | standby |   running | postgresql-ha-postgresql-1 | default  | 100      | 2        | user=repmgr password=HsboJALhW9 host=postgresql-ha-postgresql-0.postgresql-ha-postgresql-headless.default.svc.cluster.local dbname=repmgr port=5432 connect_timeout=5
 1001 | postgresql-ha-postgresql-1 | primary | * running |                            | default  | 100      | 2        | user=repmgr password=HsboJALhW9 host=postgresql-ha-postgresql-1.postgresql-ha-postgresql-headless.default.svc.cluster.local dbname=repmgr port=5432 connect_timeout=5
 1002 | postgresql-ha-postgresql-2 | standby |   running | postgresql-ha-postgresql-1 | default  | 100      | 1        | user=repmgr password=HsboJALhW9 host=postgresql-ha-postgresql-2.postgresql-ha-postgresql-headless.default.svc.cluster.local dbname=repmgr port=5432 connect_timeout=5

I cordon all the nodes in the node group, then delete the pod postgresql-ha-postgresql-0 so that the primary fails over to postgresql-ha-postgresql-1, and then uncordon the nodes.
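
Roughly, the commands are as follows (the node names are placeholders for the nodes in my node group):

    # Cordon every node in the node group so the deleted pod cannot be rescheduled immediately
    kubectl cordon <node-1> <node-2> <node-3>

    # Delete the pod currently holding the primary role; repmgr promotes
    # one of the remaining standbys (postgresql-ha-postgresql-1 in my case)
    kubectl delete pod postgresql-ha-postgresql-0

    # Uncordon the nodes; pod 0 is recreated and rejoins as a standby
    kubectl uncordon <node-1> <node-2> <node-3>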

Please let me know if that is clear, or whether I should be more specific.

farcop commented 1 year ago

@corico44 @javsalgar Hello colleagues! This seems to be a critical defect; could you look into it as a priority?

corico44 commented 1 year ago

Hello @farcop,

I'm going to open an internal task to investigate this behavior. Thanks for reporting it! We will post any updates in this issue.

farcop commented 7 months ago

@corico44 @javsalgar Hi! Any update here?

carrodher commented 7 months ago

Due to other priorities in the team, our internal task is still on our backlog.

If you're interested in contributing a solution, we welcome you to create a pull request. The Bitnami team is excited to review your submission and offer feedback. You can find the contributing guidelines here.

Your contribution will greatly benefit the community. Feel free to reach out if you have any questions or need assistance.

github-actions[bot] commented 7 months ago

This Issue has been automatically marked as "stale" because it has not had recent activity (for 15 days). It will be closed if no further activity occurs. Thanks for the feedback.

farcop commented 7 months ago

Not stale