[bitnami/postgresql-ha] Upgrading repmgr from 5.3 to 5.4

616slayer616 commented 5 months ago

Name and Version

bitnami/postgresql-ha 14.1.x

What architecture are you using?

amd64

What steps will reproduce the bug?

It seems there is a repmgr version 5.4 but there was no migration guide in the documentation. So when I upgraded from 14.0.3 to 14.0.13 or 14.1.2 I got the message

[NOTICE] repmgrd (repmgrd 5.4.1) starting up
[ERROR] an older version of the "repmgr" extension is installed
[DETAIL] extension version 5.3 is installed but newer version 5.4 is available

I used the migration instruction from 8.0.0 and set replicaCount to 1 and upgradeRepmgrExtension to true. Then I set the replicaCount to 3 again and it worked. So far so good. I have the same configuration on 2 more clusters and on those I could not get it to work:

$ helm upgrade --install pg-ha bitnami/postgresql-ha --version 14.1.2 -n db --set clusterDomain=cluster-one --set postgresql.replicaCount=1 --set postgresql.upgradeRepmgrExtension=true -f "postgresql-values.yaml"
Release "pg-ha" has been upgraded. Happy Helming!
NAME: pg-ha
LAST DEPLOYED: Mon May 27 07:08:56 2024
NAMESPACE: db
STATUS: deployed
REVISION: 55
TEST SUITE: None
NOTES:
CHART NAME: postgresql-ha
CHART VERSION: 14.1.2
APP VERSION: 16.3.0
** Please be patient while the chart is being deployed **
PostgreSQL can be accessed through Pgpool via port 5432 on the following DNS name from within your cluster:

    pg-ha-postgresql-ha-pgpool.db.svc.cluster-one

Pgpool acts as a load balancer for PostgreSQL and forward read/write connections to the primary node while read-only connections are forwarded to standby nodes.

To get the password for "postgres" run:

    export POSTGRES_PASSWORD=$(kubectl get secret --namespace db pg-ha-postgresql-ha-postgresql -o jsonpath="{.data.password}" | base64 -d)

To get the password for "repmgr" run:

    export REPMGR_PASSWORD=$(kubectl get secret --namespace db pg-ha-postgresql-ha-postgresql -o jsonpath="{.data.repmgr-password}" | base64 -d)

To connect to your database run the following command:

    kubectl run pg-ha-postgresql-ha-client --rm --tty -i --restart='Never' --namespace db --image docker.io/bitnami/postgresql-repmgr:16.3.0-debian-12-r8 --env="PGPASSWORD=$POSTGRES_PASSWORD"  \
        --command -- psql -h pg-ha-postgresql-ha-pgpool -p 5432 -U postgres -d postgres

To connect to your database from outside the cluster execute the following commands:

    kubectl port-forward --namespace db svc/pg-ha-postgresql-ha-pgpool 5432:5432 &
    psql -h 127.0.0.1 -p 5432 -U postgres -d postgres

WARNING: There are "resources" sections in the chart not set. Using "resourcesPreset" is not recommended for production. For production installations, please set the following values according to your workload needs:
  - volumePermissions.resources
  - witness.resources
+info https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/

$ k logs -n db pg-ha-postgresql-ha-postgresql-0 -f
Defaulted container "postgresql" out of: postgresql, init-chmod-data (init)
postgresql-repmgr 07:09:20.34 INFO  ==> 
postgresql-repmgr 07:09:20.34 INFO  ==> Welcome to the Bitnami postgresql-repmgr container
postgresql-repmgr 07:09:20.35 INFO  ==> Subscribe to project updates by watching https://github.com/bitnami/containers
postgresql-repmgr 07:09:20.35 INFO  ==> Submit issues and feature requests at https://github.com/bitnami/containers/issues
postgresql-repmgr 07:09:20.35 INFO  ==> Upgrade to Tanzu Application Catalog for production environments to access custom-configured and pre-packaged software components. Gain enhanced features, including Software Bill of Materials (SBOM), CVE scan result reports, and VEX documents. To learn more, visit https://bitnami.com/enterprise
postgresql-repmgr 07:09:20.35 INFO  ==> 
postgresql-repmgr 07:09:20.37 INFO  ==> ** Starting PostgreSQL with Replication Manager setup **
postgresql-repmgr 07:09:20.39 INFO  ==> Validating settings in REPMGR_* env vars...
postgresql-repmgr 07:09:20.39 INFO  ==> Validating settings in POSTGRESQL_* env vars..
postgresql-repmgr 07:09:20.40 INFO  ==> Querying all partner nodes for common upstream node...
postgresql-repmgr 07:09:20.42 INFO  ==> There are no nodes with primary role. Assuming the primary role...
postgresql-repmgr 07:09:20.42 INFO  ==> Preparing PostgreSQL configuration...
postgresql-repmgr 07:09:20.42 INFO  ==> postgresql.conf file not detected. Generating it...
postgresql-repmgr 07:09:20.51 INFO  ==> Preparing repmgr configuration...
postgresql-repmgr 07:09:20.53 INFO  ==> Initializing Repmgr...
postgresql-repmgr 07:09:20.53 INFO  ==> Initializing PostgreSQL database...
postgresql-repmgr 07:09:20.54 INFO  ==> Custom configuration /opt/bitnami/postgresql/conf/postgresql.conf detected
postgresql-repmgr 07:09:20.54 INFO  ==> Custom configuration /opt/bitnami/postgresql/conf/pg_hba.conf detected
postgresql-repmgr 07:09:20.56 INFO  ==> Deploying PostgreSQL with persisted data...
postgresql-repmgr 07:09:20.58 INFO  ==> Configuring replication parameters
postgresql-repmgr 07:09:20.61 INFO  ==> Configuring fsync
postgresql-repmgr 07:09:20.62 INFO  ==> Starting PostgreSQL in background...
postgresql-repmgr 07:09:20.90 INFO  ==> Upgrading repmgr extension...
postgresql-repmgr 07:09:20.98 INFO  ==> Stopping PostgreSQL...
waiting for server to shut down.... done
server stopped

postgresql-repmgr 07:09:21.09 INFO  ==> ** PostgreSQL with Replication Manager setup finished! **
postgresql-repmgr 07:09:21.12 INFO  ==> Starting PostgreSQL in background...
waiting for server to start....2024-05-27 07:09:21.146 GMT [166] LOG:  pgaudit extension initialized
2024-05-27 07:09:21.160 GMT [166] LOG:  redirecting log output to logging collector process
2024-05-27 07:09:21.160 GMT [166] HINT:  Future log output will appear in directory "/opt/bitnami/postgresql/logs".
2024-05-27 07:09:21.160 GMT [166] LOG:  starting PostgreSQL 16.3 on x86_64-pc-linux-gnu, compiled by gcc (Debian 12.2.0-14) 12.2.0, 64-bit
2024-05-27 07:09:21.160 GMT [166] LOG:  listening on IPv4 address "0.0.0.0", port 5432
2024-05-27 07:09:21.160 GMT [166] LOG:  listening on IPv6 address "::", port 5432
2024-05-27 07:09:21.184 GMT [166] LOG:  listening on Unix socket "/tmp/.s.PGSQL.5432"
2024-05-27 07:09:21.189 GMT [170] LOG:  database system was shut down in recovery at 2024-05-27 07:09:20 GMT
2024-05-27 07:09:21.190 GMT [170] LOG:  entering standby mode
2024-05-27 07:09:21.194 GMT [170] LOG:  redo starts at 1/CE000028
2024-05-27 07:09:21.195 GMT [170] LOG:  consistent recovery state reached at 1/CF00C230
2024-05-27 07:09:21.195 GMT [170] LOG:  invalid record length at 1/CF00C230: expected at least 24, got 0
2024-05-27 07:09:21.195 GMT [166] LOG:  database system is ready to accept read-only connections
2024-05-27 07:09:21.216 GMT [171] FATAL:  could not connect to the primary server: could not translate host name "pg-ha-postgresql-ha-postgresql-1.pg-ha-postgresql-ha-postgresql-headless.db.svc.cluster-one" to address: Name or service not known
 done
server started
2024-05-27 07:09:21.236 GMT [172] FATAL:  could not connect to the primary server: could not translate host name "pg-ha-postgresql-ha-postgresql-1.pg-ha-postgresql-ha-postgresql-headless.db.svc.cluster-one" to address: Name or service not known
2024-05-27 07:09:21.236 GMT [170] LOG:  waiting for WAL to become available at 1/CF002000
postgresql-repmgr 07:09:21.24 INFO  ==> ** Starting repmgrd **
[2024-05-27 07:09:21] [NOTICE] repmgrd (repmgrd 5.4.1) starting up
[2024-05-27 07:09:21] [ERROR] an older version of the "repmgr" extension is installed
[2024-05-27 07:09:21] [DETAIL] extension version 5.3 is installed but newer version 5.4 is available
[2024-05-27 07:09:21] [HINT] verify the repmgr installation is updated properly before continuing

You can see that it claims to upgrade repmgr ("Upgrading repmgr extension..."), but in the end it tells me "extension version 5.3 is installed but newer version 5.4 is available".
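To make the mismatch easier to spot in a long pod log, the `[DETAIL]` line can be pulled out with a small filter. This is only a sketch: `repmgr_ext_mismatch` is a hypothetical helper name, and it assumes the repmgrd message keeps the exact wording shown in the log above.

```shell
# Hypothetical helper: read log lines on stdin and print the installed and
# available repmgr extension versions taken from the first [DETAIL] line.
repmgr_ext_mismatch() {
  sed -n 's/.*extension version \([0-9.]*\) is installed but newer version \([0-9.]*\) is available.*/installed=\1 available=\2/p' | head -n 1
}

# Example using a line copied from the log above.
echo '[2024-05-27 07:09:21] [DETAIL] extension version 5.3 is installed but newer version 5.4 is available' | repmgr_ext_mismatch
```

Against a live pod, the same filter could be fed from `kubectl logs -n db pg-ha-postgresql-ha-postgresql-0 -c postgresql`. An empty result means the message was not found in the log.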

I downgraded the installation and it works with the old version so at least nothing is broken for now. But upgrading again shows the same issue. So I can reproduce it in this namespace but when I create another namespace, install the chart version 14.0.3 and then upgrade with upgradeRepmgrExtension=true it upgrades correctly. So it is not fully reproducible.

Can anyone help me? And why is there even this migration in a patch release (I think it is 14.0.10) and no mention in the upgrading section?

What is the expected behavior?

postgresql-ha-postgresql-0 statefulset scaling up correctly

What do you see instead?

CrashLoopBackOff and the message "extension version 5.3 is installed but newer version 5.4 is available"

tulsluper commented 5 months ago

I'm facing the same issue when upgrading postgresql-ha from 14.0.5 to 14.0.6.

tulsluper commented 5 months ago

I was able to update the version by setting:

    postgresql.replicaCount=1
    postgresql.upgradeRepmgrExtension=true

as it was mentioned here - https://artifacthub.io/packages/helm/bitnami/postgresql-ha#to-8-0-0

rafariossaa commented 4 months ago

Hi, This issue is under investigation, we will be back as soon as we have news.

kreatoo commented 4 months ago

Also have the same issue. upgradeRepmgrExtension did not help. It says it is upgrading the repmgr extension but then gives the exact same error.


postgresql-repmgr 21:54:06.73 INFO  ==> Upgrading repmgr extension...
postgresql-repmgr 21:54:06.81 INFO  ==> Stopping PostgreSQL...
waiting for server to shut down.... done
server stopped
postgresql-repmgr 21:54:06.92 INFO  ==> ** PostgreSQL with Replication Manager setup finished! **

postgresql-repmgr 21:54:06.95 INFO  ==> Starting PostgreSQL in background...
waiting for server to start....2024-06-07 21:54:07.041 GMT [166] LOG:  pgaudit extension initialized
2024-06-07 21:54:07.051 GMT [166] LOG:  redirecting log output to logging collector process
2024-06-07 21:54:07.051 GMT [166] HINT:  Future log output will appear in directory "/opt/bitnami/postgresql/logs".
2024-06-07 21:54:07.051 GMT [166] LOG:  starting PostgreSQL 16.3 on aarch64-unknown-linux-gnu, compiled by gcc (Debian 12.2.0-14) 12.2.0, 64-bit
2024-06-07 21:54:07.052 GMT [166] LOG:  listening on IPv4 address "0.0.0.0", port 5432
2024-06-07 21:54:07.052 GMT [166] LOG:  listening on IPv6 address "::", port 5432
2024-06-07 21:54:07.058 GMT [166] LOG:  listening on Unix socket "/tmp/.s.PGSQL.5432"
2024-06-07 21:54:07.067 GMT [170] LOG:  database system was shut down in recovery at 2024-06-07 21:54:06 GMT
2024-06-07 21:54:07.067 GMT [170] LOG:  entering standby mode
2024-06-07 21:54:07.073 GMT [170] LOG:  redo starts at 4B/9D000028
2024-06-07 21:54:07.074 GMT [170] LOG:  consistent recovery state reached at 4B/9E004888
2024-06-07 21:54:07.074 GMT [170] LOG:  invalid record length at 4B/9E004888: expected at least 24, got 0
2024-06-07 21:54:07.115 GMT [166] LOG:  database system is ready to accept read-only connections
2024-06-07 21:54:07.121 GMT [171] FATAL:  could not connect to the primary server: could not translate host name "redacted" to address
 done
server started
2024-06-07 21:54:07.126 GMT [172] FATAL:  could not connect to the primary server: could not translate host name "redacted" to address
2024-06-07 21:54:07.126 GMT [170] LOG:  waiting for WAL to become available at 4B/9E002000
postgresql-repmgr 21:54:07.13 INFO  ==> ** Starting repmgrd **
[2024-06-07 21:54:07] [NOTICE] repmgrd (repmgrd 5.4.1) starting up
[2024-06-07 21:54:07] [ERROR] an older version of the "repmgr" extension is installed
[2024-06-07 21:54:07] [DETAIL] extension version 5.3 is installed but newer version 5.4 is available
[2024-06-07 21:54:07] [HINT] verify the repmgr installation is updated properly before continuing

P-n-I commented 4 months ago

Same issue when Argo auto-updated our Gitea install, which has the postgresql-ha chart as a dependency. The fix was as above: scale replicas down to 1 and set upgradeRepmgrExtension to true, wait for it to complete, then 'undo' those changes to the values.

rafariossaa commented 4 months ago

Hi @P-n-I, @kreatoo, could you indicate the versions (from and to) that you are using?

P-n-I commented 4 months ago

14.0.2 (16.2 postgres) to 14.0.3 (16.3 postgres) according to my internal slack channel history.

rafariossaa commented 4 months ago

From 14.0.2 to 14.0.3 I found no issues. This is what I did:

$ helm install mypg bitnami/postgresql-ha --version=14.0.2 --set postgresql.password=adminpwd --set postgresql.repmgrPassword=repmgrpwd --set pgpool.adminPassword=pgpoolpwd

I waited until it was up and running, then checked the status:

$ kubectl exec -it mypg-postgresql-ha-postgresql-0 -- /opt/bitnami/scripts/postgresql-repmgr/
...
 ID   | Name                            | Role    | Status    | Upstream                        | repmgrd | PID | Paused? | Upstream last seen
------+---------------------------------+---------+-----------+---------------------------------+---------+-----+---------+--------------------
 1000 | mypg-postgresql-ha-postgresql-0 | primary | * running |                                 | running | 1   | no      | n/a
 1001 | mypg-postgresql-ha-postgresql-1 | standby |   running | mypg-postgresql-ha-postgresql-0 | running | 1   | no      | 0 second(s) ago
 1002 | mypg-postgresql-ha-postgresql-2 | standby |   running | mypg-postgresql-ha-postgresql-0 | running | 1   | no      | 0 second(s) ago

Then I upgraded and checked the status once all nodes were in the Running state:

$ helm upgrade mypg bitnami/postgresql-ha --version=14.0.3 --set postgresql.password=adminpwd --set postgresql.repmgrPassword=repmgrpwd --set pgpool.adminPassword=pgpoolpwd

$ kubectl exec -it mypg-postgresql-ha-postgresql-0 -- /opt/bitnami/scripts/postgresql-repmgr/entrypoint.sh repmgr -f /opt/bitnami/repmgr/conf/repmgr.conf daemon status
...
 ID   | Name                            | Role    | Status    | Upstream                        | repmgrd | PID | Paused? | Upstream last seen
------+---------------------------------+---------+-----------+---------------------------------+---------+-----+---------+--------------------
 1000 | mypg-postgresql-ha-postgresql-0 | standby |   running | mypg-postgresql-ha-postgresql-1 | running | 1   | no      | 0 second(s) ago
 1001 | mypg-postgresql-ha-postgresql-1 | primary | * running |                                 | running | 1   | no      | n/a
 1002 | mypg-postgresql-ha-postgresql-2 | standby |   running | mypg-postgresql-ha-postgresql-1 | running | 1   | no      | 0 second(s) ago

rafariossaa commented 4 months ago

@616slayer616, when upgrading from 14.0.2 to 14.0.13 I saw the same behavior as you. You need to scale to one node, use postgresql.upgradeRepmgrExtension=true, and then rescale the cluster.

This is not particular to this version "jump"; it applies to any case where the repmgr version was upgraded and the new version is not compatible with the installed extension. A message similar to this appears in the logs:

postgresql-repmgr 13:57:07.50 INFO  ==> ** Starting repmgrd **
[2024-06-12 13:57:07] [NOTICE] repmgrd (repmgrd 5.4.1) starting up
[2024-06-12 13:57:07] [ERROR] an older version of the "repmgr" extension is installed
[2024-06-12 13:57:07] [DETAIL] extension version 5.3 is installed but newer version 5.4 is available

All the nodes in the cluster need to use the same version, hence the scaling to one node, upgrading repmgr, and then rescaling the cluster again.

It is true that there is no information about this in the upgrade section. I will add a note to the README for clarification.

616slayer616 commented 4 months ago

I tried upgrading from 14.0.3 to 14.1.1 and 14.2.0. I also scaled to one node and set postgresql.upgradeRepmgrExtension=true. I tried this about 20 times, and it did not help.

I created a new namespace where I deployed 14.0.3 and upgraded successfully. So it seems not to be a general error but something more specific. But I cannot imagine how any of my configuration could have caused this. Especially since I hardly have any custom configuration.
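One thing that may be worth checking on an affected cluster: the log in the original report shows the remaining pod "entering standby mode", and PostgreSQL does not allow `ALTER EXTENSION ... UPDATE` on a read-only standby. The sketch below asks a pod whether it is in recovery and which repmgr extension version is installed versus available. Several things here are assumptions, not confirmed by this thread: that the extension lives in a database named `repmgr`, that `psql` is on the PATH in the `postgresql` container, and that `POSTGRES_PASSWORD` has been exported as described in the chart notes. `repmgr_ext_state` is a hypothetical helper name.

```shell
# Sketch only. Assumptions: database "repmgr" holds the extension, psql is on
# PATH in the "postgresql" container, and POSTGRES_PASSWORD is already exported.
# Note that the password is passed on the command line inside the pod.
repmgr_ext_state() {
  ns="$1"   # namespace, e.g. db
  pod="$2"  # pod name, e.g. pg-ha-postgresql-ha-postgresql-0
  kubectl exec -n "$ns" "$pod" -c postgresql -- \
    env PGPASSWORD="$POSTGRES_PASSWORD" \
    psql -h 127.0.0.1 -U postgres -d repmgr -c "
      SELECT pg_is_in_recovery() AS in_recovery,
             (SELECT extversion FROM pg_extension
               WHERE extname = 'repmgr') AS installed,
             (SELECT default_version FROM pg_available_extensions
               WHERE name = 'repmgr') AS available;"
}

# Usage (requires access to the cluster):
#   repmgr_ext_state db pg-ha-postgresql-ha-postgresql-0
```

If `in_recovery` comes back true while `installed` is still 5.3, that would be consistent with the extension upgrade step not taking effect on that pod, though it would not by itself prove the cause.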

rafariossaa commented 4 months ago

I have not found any issues when upgrading from 14.0.3 to 14.1.1 or 14.2.0. I used the following commands:

helm install mypg bitnami/postgresql-ha --version=14.0.3 \
    --set postgresql.password=adminpwd \
    --set postgresql.repmgrPassword=repmgrpwd \
    --set pgpool.adminPassword=pgpoolpwd

helm upgrade mypg bitnami/postgresql-ha --version=14.1.1 \
    --set postgresql.password=adminpwd \
    --set postgresql.repmgrPassword=repmgrpwd \
    --set pgpool.adminPassword=pgpoolpwd \
    --set postgresql.upgradeRepmgrExtension=true \
    --set postgresql.replicaCount=1

helm upgrade mypg bitnami/postgresql-ha --version=14.1.1 \
    --set postgresql.password=adminpwd \
    --set postgresql.repmgrPassword=repmgrpwd \
    --set pgpool.adminPassword=pgpoolpwd \
    --set postgresql.replicaCount=3

I am not sure whether it could be related to the database size. In my tests I did not insert any data into the database. I would need a consistent way of reproducing the issue in order to debug it.

616slayer616 commented 4 months ago

I have now tried it again in a new namespace and added all the data using pg_dump and pg_restore, and I was not able to reproduce it. My plan now is to uninstall the chart, delete the PVCs, install it again and run pg_restore. But I would really like to know what the problem is here.

For debugging, we could arrange a video call and I can share my screen. Otherwise I guess we won't get any further.

rafariossaa commented 4 months ago

Hi, Thanks for sharing your progress. Please, don't hesitate to share your findings.

I am sorry, but GitHub issues are our communication channel for solving issues.

github-actions[bot] commented 4 months ago

This Issue has been automatically marked as "stale" because it has not had recent activity (for 15 days). It will be closed if no further activity occurs. Thanks for the feedback.

gsbps01 commented 1 month ago

I am hitting this same issue, and I believe I found the source of the problem. Both of the error logs posted above show the same message:

2024-05-27 07:09:21.216 GMT [171] FATAL:  could not connect to the primary server: could not translate host name "pg-ha-postgresql-ha-postgresql-1.pg-ha-postgresql-ha-postgresql-headless.db.svc.cluster-one" to address: Name or service not known

PostgreSQL still had a reference to the previous master replica, postgresql-ha-postgresql-1, not ...-0. I scaled up my deployment to allow ...-1 to come up and upgrade, and that node appears to have started up just fine. Then I bounced ...-0, and it also appears to have come up on the new version.
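To check which upstream a scaled-down pod is still trying to reach, the host name can be extracted from the FATAL line quoted above. This is a minimal sketch: `unresolved_primary_hosts` is a hypothetical helper name, and it assumes the PostgreSQL message keeps the wording shown in these logs.

```shell
# Hypothetical helper: read log lines on stdin and print each distinct host
# name that PostgreSQL could not resolve when connecting to its primary.
unresolved_primary_hosts() {
  sed -n 's/.*could not connect to the primary server: could not translate host name "\([^"]*\)" to address.*/\1/p' | sort -u
}

# Example using the line quoted above; the pod name is the first DNS label.
host=$(echo '2024-05-27 07:09:21.216 GMT [171] FATAL:  could not connect to the primary server: could not translate host name "pg-ha-postgresql-ha-postgresql-1.pg-ha-postgresql-ha-postgresql-headless.db.svc.cluster-one" to address: Name or service not known' | unresolved_primary_hosts)
echo "stale upstream pod: ${host%%.*}"
```

If the pod printed here is one that was removed by setting replicaCount to 1, the remaining pod is still a standby of a node that no longer exists, which matches the behavior described in this comment.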

I do not believe this is the intended behavior, so I would recommend that this issue be reopened. The upgrade process should be able to support any number of previous replicas, regardless of which replica was previously the master.

Additionally, I am still having trouble getting any replicas other than these two up, though I'm still tracking down whether that is due to the upgrade or to our own implementation.

EDIT: This appears to be directly related to https://github.com/bitnami/charts/issues/17015. The direction for upgrading postgresql-ha cannot be to scale to one replica if there is a known breaking issue with scaling to one replica.
