CrunchyData / postgres-operator

Production PostgreSQL for Kubernetes, from high availability Postgres clusters to full-scale database-as-a-service.
https://access.crunchydata.com/documentation/postgres-operator/v5/
Apache License 2.0

Postgres replicas bootstrap error at new deployment #3880

Closed mboncalo closed 6 months ago

mboncalo commented 7 months ago

Good day,

I was trying to deploy a Postgres cluster on Kubernetes, but instead of everything being straightforward, as with a typical HA database on k8s, deploying a Postgres instance with 3 replicas gave me 3 separate StatefulSets instead of 3 pods within one instance. Two of the instances are trying to bootstrap from the primary, but they cannot connect to it and stay in an error state. Even though I don't fully understand the Crunchy Postgres architecture, shouldn't a default deployment work without any issues? Nothing was changed besides the replica count in the Helm deployment. I am using the latest Helm chart version, 5.5.1, but the same thing happens with Kustomize. The main primary Postgres is healthy and waiting for connections, and connecting to it works without any issues.

bash-4.4$ patronictl list
+ Cluster: service-1-ha (7353550171163455623) -------------------+---------+---------+----+-----------+
| Member                 | Host                                  | Role    | State   | TL | Lag in MB |
+------------------------+---------------------------------------+---------+---------+----+-----------+
| service-1-pgha1-72dn-0 | service-1-pgha1-72dn-0.service-1-pods | Replica | stopped |    |   unknown |
| service-1-pgha1-7c55-0 | service-1-pgha1-7c55-0.service-1-pods | Leader  | running |  1 |           |
| service-1-pgha1-kx8g-0 | service-1-pgha1-kx8g-0.service-1-pods | Replica | stopped |    |   unknown |
+------------------------+---------------------------------------+---------+---------+----+-----------+
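For what it's worth, the three StatefulSets are expected: PGO intentionally creates one single-pod StatefulSet per Postgres instance, and Patroni then decides which pod is the leader. A quick way to see that mapping (a sketch; assumes the `postgres` namespace and the `service-1` cluster name from the manifest below, and that `kubectl` is available):

```shell
ns=postgres
cluster=service-1

# PGO creates one StatefulSet (each with a single pod) per Postgres
# instance, so `replicas: 3` in an instance set yields three StatefulSets
# rather than one StatefulSet scaled to three replicas.
if command -v kubectl >/dev/null 2>&1; then
  kubectl -n "$ns" get statefulsets,pods \
    -l "postgres-operator.crunchydata.com/cluster=$cluster"
fi
```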

Postgres-ha yaml with kustomize:

apiVersion: postgres-operator.crunchydata.com/v1beta1
kind: PostgresCluster
metadata:
  name: service-1
  namespace: postgres
spec:
  image: registry.developers.crunchydata.com/crunchydata/crunchy-postgres:ubi8-16.2-0
  postgresVersion: 16
  instances:
    - name: pgha1
      replicas: 3
      dataVolumeClaimSpec:
        accessModes:
        - "ReadWriteOnce"
        resources:
          requests:
            storage: 1Gi
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - podAffinityTerm:
              labelSelector:
                matchLabels:
                  postgres-operator.crunchydata.com/cluster: service-1
                  postgres-operator.crunchydata.com/instance-set: pgha1
              topologyKey: failure-domain.beta.kubernetes.io/zone
            weight: 10 
          - podAffinityTerm:
              labelSelector:
                matchLabels:
                  postgres-operator.crunchydata.com/cluster: service-1
                  postgres-operator.crunchydata.com/instance-set: pgha1
              topologyKey: kubernetes.io/hostname
            weight: 10
  backups:
    pgbackrest:
      image: registry.developers.crunchydata.com/crunchydata/crunchy-pgbackrest:ubi8-2.49-0
      global:
        repo1-path: /pgbackrest/postgres-operator/hippo-s3/repo1
      repos:
      - name: repo1
        schedules:
          full: "*/5 * * * *"
        volume:
          volumeClaimSpec:
            accessModes:
            - "ReadWriteOnce"
            resources:
              requests:
                storage: 1Gi
  proxy:
    pgBouncer:
      image: registry.developers.crunchydata.com/crunchydata/crunchy-pgbouncer:ubi8-1.21-3
      replicas: 3
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - podAffinityTerm:
              labelSelector:
                matchLabels:
                  postgres-operator.crunchydata.com/cluster: service-1
                  postgres-operator.crunchydata.com/role: pgbouncer
              topologyKey: failure-domain.beta.kubernetes.io/zone
            weight: 10 
          - podAffinityTerm:
              labelSelector:
                matchLabels:
                  postgres-operator.crunchydata.com/cluster: service-1
                  postgres-operator.crunchydata.com/role: pgbouncer
              topologyKey: kubernetes.io/hostname
            weight: 10
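For reference, this manifest can be applied and inspected with something like the following (a sketch; assumes the manifest lives in the current kustomize directory and `kubectl` is on the PATH):

```shell
# Apply the PostgresCluster manifest via kustomize, then list the pods
# PGO creates for it, using the operator's cluster label.
manifest_dir=.
if command -v kubectl >/dev/null 2>&1; then
  kubectl apply -k "$manifest_dir"
  kubectl -n postgres get pods \
    -l postgres-operator.crunchydata.com/cluster=service-1
fi
```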

With helm:

postgresVersion: 16
instances:
  - name: service-1
    replicas: 3
    dataVolumeClaimSpec:
      accessModes:
      - "ReadWriteOnce"
      resources:
        requests:
          storage: 1Gi
patroni:
  dynamicConfiguration:
    synchronous_mode: true
    postgresql:
      parameters:
        synchronous_commit: "on"
backupsSize: 1Gi
backupsStorageClassName: "default"

Logs from replica instances:

2024-04-02 15:07:19,456 ERROR: Error when fetching backup: pg_basebackup exited with code=1
2024-04-02 15:07:19,457 ERROR: failed to bootstrap from leader 'service-1-service-1-thmj-0'
2024-04-02 15:07:19,457 INFO: Removing data directory: /pgdata/pg16
2024-04-02 15:07:24,409 INFO: Lock owner: service-1-service-1-thmj-0; I am service-1-service-1-p9ks-0
2024-04-02 15:07:24,411 INFO: trying to bootstrap from leader 'service-1-service-1-thmj-0'
pg_basebackup: error: connection to server at "service-1-service-1-thmj-0.service-1-pods" (10.253.11.34), port 5432 failed: server closed the connection unexpectedly
    This probably means the server terminated abnormally
    before or while processing the request.
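Since `pg_basebackup` is failing at the connection stage while the leader itself is healthy, it can help to test plain reachability from a replica pod to the leader (a sketch; the pod and host names are taken from the logs above, and `pg_isready` ships in the crunchy-postgres image):

```shell
ns=postgres
replica_pod=service-1-service-1-p9ks-0
leader_host=service-1-service-1-thmj-0.service-1-pods

# Check whether the replica pod can reach Postgres on the leader at all.
# If this gets no response while the leader is healthy, the problem is in
# the network path between pods (CNI, NetworkPolicy, or a service-mesh
# sidecar) rather than in Postgres or Patroni.
if command -v kubectl >/dev/null 2>&1; then
  kubectl -n "$ns" exec "$replica_pod" -- \
    pg_isready -h "$leader_host" -p 5432
fi
```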
mboncalo commented 6 months ago

Hi guys, it turned out to be an Istio issue: the pods could not connect to each other. The Postgres issue was resolved after Istio was fixed.
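For anyone hitting this with Istio: one commonly used workaround is to keep the sidecar out of the Postgres pods entirely, e.g. via the standard Istio injection annotation propagated through the PostgresCluster's metadata. A sketch against the manifest above (whether excluding the sidecar or fixing mesh policy is the right call depends on your setup):

```
spec:
  metadata:
    annotations:
      # Ask Istio not to inject its sidecar into pods managed by this
      # PostgresCluster, so Patroni/pg_basebackup traffic between pods
      # is not intercepted by the mesh.
      sidecar.istio.io/inject: "false"
```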

andrewlecuyer commented 6 months ago

Thanks for following up @mboncalo! Glad to hear your issue has been resolved.