farcop opened 1 year ago
If I use a witness node (witness.create=true), scaling works as expected only when I scale the StatefulSet directly:
kubectl scale statefulsets postgresql-ha-postgresql --replicas=1
[2023-06-04 15:55:14] [WARNING] unable to ping "user=repmgr password=IrbKbpW4Sj host=postgresql-ha-postgresql-0.postgresql-ha-postgresql-headless.default.svc.cluster.local dbname=repmgr port=5432 connect_timeout=5"
[2023-06-04 15:55:14] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2023-06-04 15:55:14] [WARNING] unable to ping "user=repmgr password=IrbKbpW4Sj connect_timeout=5 dbname=repmgr host=postgresql-ha-postgresql-0.postgresql-ha-postgresql-headless.default.svc.cluster.local port=5432 fallback_application_name=repmgr"
[2023-06-04 15:55:14] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2023-06-04 15:55:17] [WARNING] unable to ping "user=repmgr password=IrbKbpW4Sj connect_timeout=5 dbname=repmgr host=postgresql-ha-postgresql-0.postgresql-ha-postgresql-headless.default.svc.cluster.local port=5432 fallback_application_name=repmgr"
[2023-06-04 15:55:17] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2023-06-04 15:55:17] [WARNING] unable to reconnect to node 1000 after 2 attempts
[2023-06-04 15:55:18] [NOTICE] witness node "postgresql-ha-postgresql-witness-0" (ID: 2000) now following new primary node "postgresql-ha-postgresql-1" (ID: 1001)
[2023-06-04 15:58:13] [WARNING] unable to ping "user=repmgr password=IrbKbpW4Sj host=postgresql-ha-postgresql-1.postgresql-ha-postgresql-headless.default.svc.cluster.local dbname=repmgr port=5432 connect_timeout=5"
[2023-06-04 15:58:13] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2023-06-04 15:58:13] [WARNING] unable to ping "user=repmgr password=IrbKbpW4Sj connect_timeout=5 dbname=repmgr host=postgresql-ha-postgresql-1.postgresql-ha-postgresql-headless.default.svc.cluster.local port=5432 fallback_application_name=repmgr"
[2023-06-04 15:58:13] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2023-06-04 15:58:16] [WARNING] unable to ping "user=repmgr password=IrbKbpW4Sj connect_timeout=5 dbname=repmgr host=postgresql-ha-postgresql-1.postgresql-ha-postgresql-headless.default.svc.cluster.local port=5432 fallback_application_name=repmgr"
[2023-06-04 15:58:16] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2023-06-04 15:58:16] [WARNING] unable to reconnect to node 1001 after 2 attempts
[2023-06-04 15:58:17] [NOTICE] witness node "postgresql-ha-postgresql-witness-0" (ID: 2000) now following new primary node "postgresql-ha-postgresql-0" (ID: 1000)
But if I scale with helm upgrade ... --set postgresql.replicaCount=1, as written in the README, the cluster does not become healthy again, as described above.
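For reference, a minimal sketch of the commands assumed here (the release name postgresql-ha and the replica counts are illustrative; witness.create and postgresql.replicaCount are the only values taken from this report):

# Install with a witness node (illustrative release name and values)
helm repo add bitnami https://charts.bitnami.com/bitnami
helm install postgresql-ha bitnami/postgresql-ha --set witness.create=true --set postgresql.replicaCount=3
# Scale down via helm upgrade, as the chart README describes
helm upgrade postgresql-ha bitnami/postgresql-ha --set witness.create=true --set postgresql.replicaCount=1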
I can't reproduce the error. When executing the command you provided, I don't get the same output as you. Are you executing any other command to reach that state?
I have no name!@postgresql-ha-postgresql-0:/$ /opt/bitnami/scripts/postgresql-repmgr/entrypoint.sh repmgr -f /opt/bitnami/repmgr/conf/repmgr.conf cluster show
postgresql-repmgr 16:51:49.62
postgresql-repmgr 16:51:49.62 Welcome to the Bitnami postgresql-repmgr container
postgresql-repmgr 16:51:49.63 Subscribe to project updates by watching https://github.com/bitnami/containers
postgresql-repmgr 16:51:49.63 Submit issues and feature requests at https://github.com/bitnami/containers/issues
postgresql-repmgr 16:51:49.63
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
------+----------------------------+---------+-----------+----------------------------+----------+----------+----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
1000 | postgresql-ha-postgresql-0 | primary | * running | | default | 100 | 1 | user=repmgr password=wDllZToicy host=postgresql-ha-postgresql-0.postgresql-ha-postgresql-headless.default.svc.cluster.local dbname=repmgr port=5432 connect_timeout=5
1001 | postgresql-ha-postgresql-1 | standby | running | postgresql-ha-postgresql-0 | default | 100 | 1 | user=repmgr password=wDllZToicy host=postgresql-ha-postgresql-1.postgresql-ha-postgresql-headless.default.svc.cluster.local dbname=repmgr port=5432 connect_timeout=5
1002 | postgresql-ha-postgresql-2 | standby | running | postgresql-ha-postgresql-0 | default | 100 | 1 | user=repmgr password=wDllZToicy host=postgresql-ha-postgresql-2.postgresql-ha-postgresql-headless.default.svc.cluster.local dbname=repmgr port=5432 connect_timeout=5
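For reference, the same check can also be run from outside the pod (pod name as in the output above):

kubectl exec -it postgresql-ha-postgresql-0 -- /opt/bitnami/scripts/postgresql-repmgr/entrypoint.sh repmgr -f /opt/bitnami/repmgr/conf/repmgr.conf cluster show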
@corico44 Sorry, I did not describe how I forced the primary to move away from node 0. To emulate the state described above:
I have no name!@postgresql-ha-postgresql-0:/$ /opt/bitnami/scripts/postgresql-repmgr/entrypoint.sh repmgr -f /opt/bitnami/repmgr/conf/repmgr.conf cluster show
postgresql-repmgr 07:50:54.00
postgresql-repmgr 07:50:54.00 Welcome to the Bitnami postgresql-repmgr container
postgresql-repmgr 07:50:54.00 Subscribe to project updates by watching https://github.com/bitnami/containers
postgresql-repmgr 07:50:54.01 Submit issues and feature requests at https://github.com/bitnami/containers/issues
postgresql-repmgr 07:50:54.01
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
------+----------------------------+---------+-----------+----------------------------+----------+----------+----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
1000 | postgresql-ha-postgresql-0 | standby | running | postgresql-ha-postgresql-1 | default | 100 | 2 | user=repmgr password=HsboJALhW9 host=postgresql-ha-postgresql-0.postgresql-ha-postgresql-headless.default.svc.cluster.local dbname=repmgr port=5432 connect_timeout=5
1001 | postgresql-ha-postgresql-1 | primary | * running | | default | 100 | 2 | user=repmgr password=HsboJALhW9 host=postgresql-ha-postgresql-1.postgresql-ha-postgresql-headless.default.svc.cluster.local dbname=repmgr port=5432 connect_timeout=5
1002 | postgresql-ha-postgresql-2 | standby | running | postgresql-ha-postgresql-1 | default | 100 | 1 | user=repmgr password=HsboJALhW9 host=postgresql-ha-postgresql-2.postgresql-ha-postgresql-headless.default.svc.cluster.local dbname=repmgr port=5432 connect_timeout=5
I cordon all the nodes in the node group, then delete the pod postgresql-ha-postgresql-0 so that the primary shifts to postgresql-ha-postgresql-1, and then uncordon the nodes. Roughly, the steps look like the sketch below.
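A rough sketch of those commands (the node names node-a and node-b are placeholders for the nodes in the group):

# Cordon every node in the node group so postgresql-ha-postgresql-0 cannot be rescheduled immediately
kubectl cordon node-a
kubectl cordon node-b
# Delete the pod currently hosting the primary; repmgr promotes postgresql-ha-postgresql-1
kubectl delete pod postgresql-ha-postgresql-0
# Uncordon the nodes; postgresql-ha-postgresql-0 is rescheduled and should rejoin as a standby
kubectl uncordon node-a
kubectl uncordon node-b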
Can you please let me know if that is clear, or should I be more specific?
@corico44 @javsalgar Hello colleagues! This seems to be a critical defect; could you look into it as a priority?
Hello @farcop,
I'm going to open an internal task to investigate this behavior. Thanks for reporting it! We will post any updates in this issue.
@corico44 @javsalgar Hi! Any update here?
Due to other priorities in the team, our internal task is still on our backlog.
If you're interested in contributing a solution, we welcome you to create a pull request. The Bitnami team is excited to review your submission and offer feedback. You can find the contributing guidelines here.
Your contribution will greatly benefit the community. Feel free to reach out if you have any questions or need assistance.
This Issue has been automatically marked as "stale" because it has not had recent activity (for 15 days). It will be closed if no further activity occurs. Thanks for the feedback.
Not stale
This issue is still there. I'll try to bump the priority.
Name and Version
bitnami/postgresql-ha 11.7.4
What architecture are you using?
amd64
What steps will reproduce the bug?
Force the primary to move away from node 0 (cordon the nodes, delete pod postgresql-ha-postgresql-0, then uncordon), as described above.
Are you using any custom parameters or values?
No response
What is the expected behavior?
The cluster becomes healthy again after the primary moves.
What do you see instead?
The cluster is unavailable; the postgresql-repmgr container on node 0 is stuck in "Back-off restarting failed container".
Additional information
No response