CrunchyData / postgres-operator

Production PostgreSQL for Kubernetes, from high availability Postgres clusters to full-scale database-as-a-service.
https://access.crunchydata.com/documentation/postgres-operator/v5/
Apache License 2.0

Cluster Startup fails when the old lead node is down #3778

Closed David-Angel closed 8 months ago

David-Angel commented 11 months ago

Overview

After performing a cluster shutdown, the cluster never comes back up if the node that the leader was on is removed or cannot be brought back online.

Environment

Kubernetes, 3 or more node cluster using sync replication

Steps to Reproduce

REPRO

  1. Follow the shutdown instructions: `kubectl patch postgrescluster/hippo -n postgres-operator --type merge --patch '{"spec":{"shutdown": true}}'`

  2. Remove or disable the node the leader was on. This verifies that we can recover the cluster in a disaster scenario.

  3. Start the cluster back up: `kubectl patch postgrescluster/hippo -n postgres-operator --type merge --patch '{"spec":{"shutdown": false}}'`

The leader pod will remain in the Pending state forever. Performing a failover or switchover fails because the target replica must be online.
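For reference, the reproduction sequence can be collected into one shell sketch. The cluster name `hippo` and namespace `postgres-operator` come from the commands above; the node name and the cluster label selector are assumptions to be adjusted for your environment:

```shell
# Shut the cluster down via the operator (from the steps above).
kubectl patch postgrescluster/hippo -n postgres-operator --type merge \
  --patch '{"spec":{"shutdown": true}}'

# Simulate losing the node that hosted the leader.
# NODE is a placeholder for the node the leader pod was scheduled on.
kubectl delete node "$NODE"

# Bring the cluster back up.
kubectl patch postgrescluster/hippo -n postgres-operator --type merge \
  --patch '{"spec":{"shutdown": false}}'

# Watch the leader pod stay Pending (label selector assumed from PGO v5 conventions).
kubectl get pods -n postgres-operator \
  -l postgres-operator.crunchydata.com/cluster=hippo --watch
```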

The workaround I have come up with is:

  1. Turn off CrunchyData processing.
  2. Edit the StatefulSets of the replicas, setting their replica count from 0 to 1.
  3. Delete the PVC (not the PV) of the old leader with `--wait=false`.
  4. Delete the old leader pod.
  5. Turn on CrunchyData processing.

It would be very nice to have a simpler method of resolving this.
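A sketch of those workaround steps as kubectl commands. All object names here are illustrative placeholders, and "turn off CrunchyData processing" is shown as scaling the operator Deployment to zero, which assumes the operator runs as a Deployment named `pgo`:

```shell
NS=postgres-operator    # namespace; adjust as needed

# 1. Turn off CrunchyData processing (assumed: operator Deployment named "pgo").
kubectl scale deployment pgo -n "$NS" --replicas=0

# 2. Bring a surviving replica up by scaling its StatefulSet to 1.
#    "hippo-test1-abcd" is a placeholder instance StatefulSet name.
kubectl scale statefulset hippo-test1-abcd -n "$NS" --replicas=1

# 3. Delete the old leader's PVC (not the PV), without waiting; deletion
#    blocks on the pvc-protection finalizer until the pod is gone.
kubectl delete pvc hippo-test1-wxyz-pgdata -n "$NS" --wait=false

# 4. Delete the old leader pod so the PVC can actually be released.
kubectl delete pod hippo-test1-wxyz-0 -n "$NS"

# 5. Turn CrunchyData processing back on.
kubectl scale deployment pgo -n "$NS" --replicas=1
```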

jmckulk commented 11 months ago

Hey @David-Angel, I did a quick test using two nodes and the primary was recreated on the working node. However, I just noticed that you mentioned sync replication, so I did not test that. If you want to share your spec I'll try to replicate the issue again.

David-Angel commented 11 months ago

Here is a cut-down version with the relevant info.

```shell
cat <<EOF | k apply -f -
apiVersion: postgres-operator.crunchydata.com/v1beta1
kind: PostgresCluster
metadata:
  finalizers:
  - postgres-operator.crunchydata.com/finalizer
  name: postgres-ha
  namespace: default
spec:
  customReplicationTLSSecret:
    name: postgres-ha-repl-tls
  customTLSSecret:
    name: postgres-ha-tls
  instances:
  - dataVolumeClaimSpec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 20Gi
      storageClassName: local-path-with-retention
    name: test1
    replicas: 3
    resources:
      requests:
        cpu: 200m
        memory: 512Mi
  patroni:
    dynamicConfiguration:
      postgresql:
        parameters:
          log_destination: stderr
          log_hostname: false
          log_line_prefix: 'postgres, %t, app=%a, db=%d, user=%u, host=%h, %p: '
          log_statement: none
          logging_collector: "off"
          max_connections: 300
          max_wal_size: 2GB
          max_worker_processes: 8
          pgnodemx.kdapi_enabled: true
          pgnodemx.kdapi_path: /etc/database-containerinfo
          shared_buffers: 64MB
          shared_preload_libraries: timescaledb,pgnodemx
          synchronous_commit: "on"
          timescaledb.license: apache
          timescaledb.max_background_workers: 8
          timescaledb.telemetry_level: "off"
          wal_keep_size: 10GB
      synchronous_mode: true
      synchronous_node_count: 2
    leaderLeaseDurationSeconds: 30
    port: 8008
    syncPeriodSeconds: 10
  port: 5432
  postgresVersion: 14
EOF
```
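A few read-only checks can show where a cluster in this state is stuck. Note that with `synchronous_mode: true` and `synchronous_node_count: 2` Patroni needs enough synchronous standbys online, and with a node-local storage class like `local-path-with-retention` the old leader's PV may be pinned to the lost node. This is a diagnostic sketch only; the pod name is a placeholder and the cluster label selector is assumed from PGO v5 conventions:

```shell
# Patroni's view of the cluster, run from any running instance pod
# ("postgres-ha-test1-abcd-0" is a placeholder pod name).
kubectl exec -n default postgres-ha-test1-abcd-0 -c database -- patronictl list

# Why the leader pod is Pending (look for volume node-affinity conflicts).
kubectl describe pod -n default \
  -l postgres-operator.crunchydata.com/cluster=postgres-ha

# Which PVs back the PVCs, and their node affinity.
kubectl get pvc,pv -n default
```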
jmckulk commented 11 months ago

I tried again with sync replication enabled and was still unable to replicate the issue. Do you have any logs that might point at the issue? How are you taking the node offline?

David-Angel commented 11 months ago

We are using this to shut down: `kubectl patch $(k get postgrescluster -o name) --type merge --patch '{"spec":{"shutdown": true}}'`

No logs to speak of. For taking the node offline we shut it down in vSphere.

tjmoore4 commented 8 months ago

@David-Angel I did some testing for this scenario as well and wasn't able to replicate your error scenario either. It might be helpful to have the describe output from any pending objects (such as StatefulSets, PVCs, etc). Without more to go on, I don't think we'll be able to properly identify the source of the error. If you're able to provide more information, I'd be happy to take a closer look.
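For anyone gathering the requested output, a minimal sketch (the namespace comes from the spec above; the cluster label selector is assumed from PGO v5 conventions):

```shell
# Describe the StatefulSets, PVCs, and pods the operator created for the
# cluster; pending objects will show scheduling/binding events at the bottom.
kubectl describe sts,pvc,pod -n default \
  -l postgres-operator.crunchydata.com/cluster=postgres-ha
```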

tjmoore4 commented 8 months ago

Since we haven't heard back on this issue for some time, I am closing it. If you need further assistance, feel free to re-open the issue or ask a question in our Discord server.