CrunchyData / postgres-operator

Production PostgreSQL for Kubernetes, from high availability Postgres clusters to full-scale database-as-a-service.
https://access.crunchydata.com/documentation/postgres-operator/v5/
Apache License 2.0

Standby Cluster after In-Place PITR Sync Failure #3018

Closed djtaylor closed 2 years ago

djtaylor commented 2 years ago

Overview

This may simply be a documentation gap for this particular scenario, as I can't find anywhere how to handle it. We have two clusters in different datacenters, with data replicated to the standby by way of S3. After performing an in-place PITR on the primary (which works fine), I want to make sure its state is reflected in the standby cluster. I deleted and recreated the standby cluster to force it to sync with the current state of the primary after the in-place PITR, but the standby cluster fails to properly bootstrap/initialize.

Environment

Steps to Reproduce

1.) Have an environment with 2 clusters, one primary and one standby. Use S3 to replicate from primary to standby.
2.) Perform an in-place PITR recovery on the primary as described in: https://access.crunchydata.com/documentation/postgres-operator/5.0.4/tutorial/disaster-recovery/
3.) After the primary is healthy again, delete the standby and recreate it from a blank state.
4.) Observe the standby pods failing to initialize.
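For reference, the in-place PITR in step 2 is driven through the `restore` section of the PostgresCluster spec plus a trigger annotation, per the linked tutorial. A hedged sketch (the target timestamp is a placeholder; verify field names against your operator version):

```yaml
spec:
  backups:
    pgbackrest:
      restore:
        enabled: true
        repoName: repo1
        options:
        - --type=time
        - --target="2022-02-04 18:00:00+00"  # placeholder recovery target
```

The restore is then triggered by updating the `postgres-operator.crunchydata.com/pgbackrest-restore` annotation on the cluster, e.g. setting it to the current timestamp.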

EXPECTED

  1. I expect the standby cluster to initialize properly and end up with the same state as the primary cluster after the in-place PITR

ACTUAL

  1. The standby cluster fails to boot, and the database containers never become healthy

Logs

Extract from one of the standby database containers that is failing to initialize:

2022-02-04 19:57:14,202 INFO: No PostgreSQL configuration items changed, nothing to reload.
2022-02-04 19:57:14,207 INFO: Lock owner: None; I am harbor-db-harbor-db-7thp-0
2022-02-04 19:57:14,230 INFO: trying to bootstrap (without leader)
WARN: --delta or --force specified but unable to find 'PG_VERSION' or 'backup.manifest' in '/pgdata/pg13' to confirm that this is a valid $PGDATA directory.  --delta and --force have been disabled and if any files exist in the destination directories the restore will be aborted.
2022-02-04 19:57:24,707 INFO: Lock owner: None; I am harbor-db-harbor-db-7thp-0
2022-02-04 19:57:24,707 INFO: not healthy enough for leader race
2022-02-04 19:57:24,718 INFO: bootstrap (without leader) in progress
2022-02-04 19:57:26,644 INFO: replica has been created using pgbackrest
2022-02-04 19:57:26,645 INFO: bootstrapped (without leader)
2022-02-04 19:57:26.892 UTC [121] LOG:  pgaudit extension initialized
2022-02-04 19:57:26.907 UTC [121] LOG:  redirecting log output to logging collector process
2022-02-04 19:57:26.907 UTC [121] HINT:  Future log output will appear in directory "log".
2022-02-04 19:57:26,907 INFO: postmaster pid=121
/tmp/postgres:5432 - rejecting connections
/tmp/postgres:5432 - rejecting connections
/tmp/postgres:5432 - no response
2022-02-04 19:57:27,959 WARNING: Postgresql is not running.
2022-02-04 19:57:27,959 INFO: Lock owner: None; I am harbor-db-harbor-db-7thp-0
2022-02-04 19:57:27,964 INFO: pg_controldata:
  pg_control version number: 1300
  Catalog version number: 202007201
  Database system identifier: 7057319451731464255
  Database cluster state: in production
  pg_control last modified: Fri Feb  4 19:30:09 2022
  Latest checkpoint location: 0/CA000060
  Latest checkpoint's REDO location: 0/CA000028
  Latest checkpoint's REDO WAL file: 0000001800000000000000CA
  Latest checkpoint's TimeLineID: 24
  Latest checkpoint's PrevTimeLineID: 24
  Latest checkpoint's full_page_writes: on
  Latest checkpoint's NextXID: 0:897
  Latest checkpoint's NextOID: 57399
  Latest checkpoint's NextMultiXactId: 1
  Latest checkpoint's NextMultiOffset: 0
  Latest checkpoint's oldestXID: 478
  Latest checkpoint's oldestXID's DB: 1
  Latest checkpoint's oldestActiveXID: 897
  Latest checkpoint's oldestMultiXid: 1
  Latest checkpoint's oldestMulti's DB: 1
  Latest checkpoint's oldestCommitTsXid: 0
  Latest checkpoint's newestCommitTsXid: 0
  Time of latest checkpoint: Fri Feb  4 19:30:09 2022
  Fake LSN counter for unlogged rels: 0/3E8
  Minimum recovery ending location: 0/0
  Min recovery ending loc's timeline: 0
  Backup start location: 0/0
  Backup end location: 0/0
  End-of-backup record required: no
  wal_level setting: logical
  wal_log_hints setting: on
  max_connections setting: 100
  max_worker_processes setting: 8
  max_wal_senders setting: 10
  max_prepared_xacts setting: 0
  max_locks_per_xact setting: 64
  track_commit_timestamp setting: off
  Maximum data alignment: 8
  Database block size: 8192
  Blocks per segment of large relation: 131072
  WAL block size: 8192
  Bytes per WAL segment: 16777216
  Maximum length of identifiers: 64
  Maximum columns in an index: 32
  Maximum size of a TOAST chunk: 1996
  Size of a large-object chunk: 2048
  Date/time type storage: 64-bit integers
  Float8 argument passing: by value
  Data page checksum version: 1
  Mock authentication nonce: d318b263511266a6d310d7aad8d801e535aabc0f28fab3c069549b09a0b8161b

2022-02-04 19:57:27,977 INFO: doing crash recovery in a single user mode
2022-02-04 19:57:28,135 WARNING: Postgresql is not running.
2022-02-04 19:57:28,135 INFO: Lock owner: None; I am harbor-db-harbor-db-7thp-0
2022-02-04 19:57:28,143 INFO: pg_controldata:
  pg_control version number: 1300
  Catalog version number: 202007201
  Database system identifier: 7057319451731464255
  Database cluster state: shut down
  pg_control last modified: Fri Feb  4 19:57:28 2022
  Latest checkpoint location: 0/CC000028
  Latest checkpoint's REDO location: 0/CC000028
  Latest checkpoint's REDO WAL file: 0000001800000000000000CC
  Latest checkpoint's TimeLineID: 24
  Latest checkpoint's PrevTimeLineID: 24
  Latest checkpoint's full_page_writes: on
  Latest checkpoint's NextXID: 0:897
  Latest checkpoint's NextOID: 57399
  Latest checkpoint's NextMultiXactId: 1
  Latest checkpoint's NextMultiOffset: 0
  Latest checkpoint's oldestXID: 478
  Latest checkpoint's oldestXID's DB: 1
  Latest checkpoint's oldestActiveXID: 0
  Latest checkpoint's oldestMultiXid: 1
  Latest checkpoint's oldestMulti's DB: 1
  Latest checkpoint's oldestCommitTsXid: 0
  Latest checkpoint's newestCommitTsXid: 0
  Time of latest checkpoint: Fri Feb  4 19:57:28 2022
  Fake LSN counter for unlogged rels: 0/3E8
  Minimum recovery ending location: 0/0
  Min recovery ending loc's timeline: 0
  Backup start location: 0/0
  Backup end location: 0/0
  End-of-backup record required: no
  wal_level setting: logical
  wal_log_hints setting: on
  max_connections setting: 100
  max_worker_processes setting: 8
  max_wal_senders setting: 10
  max_prepared_xacts setting: 0
  max_locks_per_xact setting: 64
  track_commit_timestamp setting: off
  Maximum data alignment: 8
  Database block size: 8192
  Blocks per segment of large relation: 131072
  WAL block size: 8192
  Bytes per WAL segment: 16777216
  Maximum length of identifiers: 64
  Maximum columns in an index: 32
  Maximum size of a TOAST chunk: 1996
  Size of a large-object chunk: 2048
  Date/time type storage: 64-bit integers
  Float8 argument passing: by value
  Data page checksum version: 1
  Mock authentication nonce: d318b263511266a6d310d7aad8d801e535aabc0f28fab3c069549b09a0b8161b

2022-02-04 19:57:28,157 INFO: Lock owner: None; I am harbor-db-harbor-db-7thp-0
2022-02-04 19:57:28,158 INFO: trying to follow a remote master because standby cluster is unhealthy
2022-02-04 19:57:28.386 UTC [146] LOG:  pgaudit extension initialized
2022-02-04 19:57:28,387 INFO: postmaster pid=146
/tmp/postgres:5432 - no response
2022-02-04 19:57:28.401 UTC [146] LOG:  redirecting log output to logging collector process
2022-02-04 19:57:28.401 UTC [146] HINT:  Future log output will appear in directory "log".

Additional Information

Standby cluster YAML:

Spec:
  Backups:
    Pgbackrest:
      Global:
        repo1-path:                 {repo_path}
        repo1-retention-full:       7
        repo1-retention-full-type:  time
        repo1-s3-key:               {s3_access_key}
        repo1-s3-key-secret:        {s3_secret_key}
        repo1-s3-uri-style:         path
      Jobs:
        Resources:
          Limits:
            Cpu:     250m
            Memory:  128Mi
          Requests:
            Cpu:     250m
            Memory:  128Mi
      Manual:
        Options:
          --type=full
        Repo Name:  repo1
      Repos:
        Name:  repo1
        s3:
          Bucket:    {s3_bucket_name}
          Endpoint:  {s3_endpoint}
          Region:    {s3_region}
        Schedules:
          Full:         0 1 * * *
          Incremental:  */30 * * * *
      Restore:
        Enabled:    false
        Repo Name:  repo1
        Resources:
          Limits:
            Cpu:     500m
            Memory:  256Mi
          Requests:
            Cpu:     500m
            Memory:  256Mi
      Sidecars:
        Pgbackrest:
          Resources:
            Limits:
              Cpu:     250m
              Memory:  128Mi
            Requests:
              Cpu:     250m
              Memory:  128Mi
  Instances:
    Data Volume Claim Spec:
      Access Modes:
        ReadWriteOnce
      Resources:
        Requests:
          Storage:         10Gi
      Storage Class Name:  {sc_name}
    Name:                  harbor-db
    Replicas:              2
    Resources:
      Limits:
        Cpu:     1000m
        Memory:  1Gi
      Requests:
        Cpu:     500m
        Memory:  1Gi
    Sidecars:
      Replica Cert Copy:
        Resources:
          Limits:
            Cpu:     125m
            Memory:  64Mi
          Requests:
            Cpu:     125m
            Memory:  64Mi
  Openshift:         false
  Patroni:
    Dynamic Configuration:
      Postgresql:
        pg_hba:
          host all all 0.0.0.0/0 md5
    Leader Lease Duration Seconds:  30
    Port:                           8008
    Sync Period Seconds:            10
  Port:                             5432
  Postgres Version:                 13
  Proxy:
    Pg Bouncer:
      Port:      5432
      Replicas:  2
      Resources:
        Limits:
          Cpu:     500m
          Memory:  256Mi
        Requests:
          Cpu:     500m
          Memory:  256Mi
      Service:
        Type:  NodePort
      Sidecars:
        Pgbouncer Config:
          Resources:
            Limits:
              Cpu:     100m
              Memory:  64Mi
            Requests:
              Cpu:     100m
              Memory:  64Mi
  Standby:
    Enabled:    true
    Repo Name:  repo1
  Users:
    Databases:
      notary_server
      notary_signer
      registry
    Name:  harbor

It would be valuable to know (and document) what steps need to be taken to ensure a standby cluster, replicated from the primary via S3, is properly synced with the primary after an in-place PITR.

djtaylor commented 2 years ago

I figured this out a few minutes after I posted this. When you create the standby cluster, it needs to have restore mode enabled for it to properly sync with the primary:

Spec:
  Backups:
    Pgbackrest:
      Restore:
        Enabled: true
        Repo Name: repo1

When this restore option is set, the standby cluster initializes properly and becomes healthy. Afterwards, you can patch the cluster to disable restore mode.
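For completeness, once the standby has bootstrapped, the restore flag can be flipped back off with a patch along these lines (cluster name `harbor-db` is taken from the logs above; adjust name and namespace for your environment):

```shell
kubectl patch postgrescluster harbor-db --type merge \
  -p '{"spec":{"backups":{"pgbackrest":{"restore":{"enabled":false}}}}}'
```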