CrunchyData / postgres-operator

Production PostgreSQL for Kubernetes, from high availability Postgres clusters to full-scale database-as-a-service.
https://access.crunchydata.com/documentation/postgres-operator/v5/
Apache License 2.0

My wal position exceeds maximum replication lag on replica after primary down #2458

Closed: Martin-Hogge closed this issue 3 years ago

Martin-Hogge commented 3 years ago

Overview

In a cluster with one primary and one replica, when the primary goes down the replica does not fail over. The cluster ends up with two replicas waiting for a primary that will never come, until a manual failover is performed.

Environment

Steps to Reproduce

REPRO

  1. Create the PGO cluster:
    pgo create cluster -n <namespace> <clustername> --metrics --sync-replication --replica-count=1 --pod-anti-affinity=required --pvc-size=12Gi --pgbackrest-s3-bucket=<backup-bucket> --pgbackrest-s3-endpoint=<minio-endpoint> --pgbackrest-s3-key=<minio-key> --pgbackrest-s3-key-secret=<minio-secret> --pgbackrest-s3-uri-style=path --pgbackrest-storage-type=s3 --pgbackrest-s3-region=us-east-1 --pgbackrest-s3-ca-secret=<minio-secret> --pod-anti-affinity-pgbackrest=disabled -d=mydatabase --annotation-postgres=cluster-autoscaler.kubernetes.io/safe-to-evict=true
  2. Run a service that issues CREATE, UPDATE and SELECT statements (to simulate a production workload)
  3. Kill the node (VM) where the primary is running (see the sketch after this list)
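
For reference, a minimal sketch of locating the node for step 3 (assuming a GCP-backed cluster, and assuming v4's pg-cluster and role=master pod labels; adjust for your platform):

    # find the primary pod and the node it is scheduled on
    kubectl get pods -n <namespace> -l "pg-cluster=<clustername>,role=master" -o wide
    # then kill the backing VM, e.g. on GCP (zone is a placeholder):
    gcloud compute instances delete <node-name> --zone=<zone>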

EXPECTED

  1. The replica becomes the new primary;
  2. The old primary is recreated as a replica;

ACTUAL

  1. The replica fails to become the new primary;
  2. The old primary is recreated as a replica;

Logs

The node hosting the primary was killed around 14:04 local time (12:04 UTC in the logs below).

Primary node logs

receiveTimestamp severity textPayload
2021-05-19T12:03:03.021703390Z ERROR 2021-05-19 12:03:02,892 INFO: no action.  i am the leader with the lock
2021-05-19T12:03:03.021703390Z ERROR 2021-05-19 12:03:02,782 INFO: Lock owner: <old-primary-pod-name>; I am <old-primary-pod-name>
2021-05-19T12:03:13.038909884Z ERROR 2021-05-19 12:03:12,884 INFO: no action.  i am the leader with the lock
2021-05-19T12:03:13.038909884Z ERROR 2021-05-19 12:03:12,780 INFO: Lock owner: <old-primary-pod-name>; I am <old-primary-pod-name>
2021-05-19T12:03:23.038097189Z ERROR 2021-05-19 12:03:22,886 INFO: no action.  i am the leader with the lock
2021-05-19T12:03:23.038097189Z ERROR 2021-05-19 12:03:22,780 INFO: Lock owner: <old-primary-pod-name>; I am <old-primary-pod-name>
2021-05-19T12:03:33.018484539Z ERROR 2021-05-19 12:03:32,896 INFO: no action.  i am the leader with the lock
2021-05-19T12:03:33.018484539Z ERROR 2021-05-19 12:03:32,781 INFO: Lock owner: <old-primary-pod-name>; I am <old-primary-pod-name>
2021-05-19T12:03:43.100396115Z ERROR 2021-05-19 12:03:42,889 INFO: no action.  i am the leader with the lock
2021-05-19T12:03:43.100396115Z ERROR 2021-05-19 12:03:42,781 INFO: Lock owner: <old-primary-pod-name>; I am <old-primary-pod-name>
2021-05-19T12:03:53.050433677Z ERROR 2021-05-19 12:03:52,891 INFO: no action.  i am the leader with the lock
2021-05-19T12:03:53.050433677Z ERROR 2021-05-19 12:03:52,781 INFO: Lock owner: <old-primary-pod-name>; I am <old-primary-pod-name>
2021-05-19T12:10:59.066004638Z INFO Wed May 19 12:10:56 UTC 2021 INFO: Node <new-pod-name>v fully initialized for cluster <cluster-name> and is ready for use
2021-05-19T12:10:59.066004638Z INFO /tmp:5432 - accepting connections
2021-05-19T12:10:59.066004638Z INFO /tmp:5432 - rejecting connections
2021-05-19T12:10:59.066004638Z INFO /tmp:5432 - rejecting connections
2021-05-19T12:10:59.066004638Z INFO /tmp:5432 - no response
2021-05-19T12:10:59.066004638Z INFO Wed May 19 12:10:53 UTC 2021 INFO: Running Patroni as PID 1
2021-05-19T12:10:59.066004638Z INFO Wed May 19 12:10:53 UTC 2021 INFO: Initializing cluster bootstrap with command: '/usr/local/bin/patroni /tmp/postgres-ha-bootstrap.yaml'
2021-05-19T12:10:59.066004638Z INFO Wed May 19 12:10:53 UTC 2021 INFO: Now removing "pause" key from patroni.dynamic.json configuration file if present
2021-05-19T12:10:59.066004638Z INFO Wed May 19 12:10:53 UTC 2021 INFO: Starting background process to monitor Patroni initization and restart the database if needed
2021-05-19T12:10:59.066004638Z INFO Wed May 19 12:10:53 UTC 2021 INFO: Starting SSHD..
2021-05-19T12:10:59.066004638Z INFO Wed May 19 12:10:53 UTC 2021 INFO: setting up .ssh directory
2021-05-19T12:10:59.066004638Z INFO Wed May 19 12:10:53 UTC 2021 INFO: Checking for sshd_config in /sshd
2021-05-19T12:10:59.066004638Z INFO Wed May 19 12:10:53 UTC 2021 INFO: Checking for authorized_keys in /sshd
2021-05-19T12:10:59.066004638Z INFO Wed May 19 12:10:53 UTC 2021 INFO: Checking for SSH Host Keys in /sshd..
2021-05-19T12:10:59.066004638Z INFO Wed May 19 12:10:53 UTC 2021 INFO: Applying SSHD..
2021-05-19T12:10:59.066004638Z INFO - host all all 0.0.0.0/0 md5
2021-05-19T12:10:59.066004638Z INFO - host all primaryuser 0.0.0.0/0 reject
2021-05-19T12:10:59.066004638Z INFO - host replication primaryuser 0.0.0.0/0 md5
2021-05-19T12:10:59.066004638Z INFO - local all postgres peer
2021-05-19T12:10:59.066004638Z INFO pg_hba:
2021-05-19T12:10:59.066004638Z INFO on_role_change: /opt/crunchy/bin/postgres-ha/callbacks/pgha-on-role-change.sh
2021-05-19T12:10:59.066004638Z INFO callbacks:
2021-05-19T12:10:59.066004638Z INFO remove_data_directory_on_rewind_failure: true
2021-05-19T12:10:59.066004638Z INFO no_master: 1
2021-05-19T12:10:59.066004638Z INFO no_params: true
2021-05-19T12:10:59.066004638Z INFO keep_data: true
2021-05-19T12:10:59.066004638Z INFO standby'
2021-05-19T12:10:59.066004638Z INFO command: '/opt/crunchy/bin/postgres-ha/pgbackrest/pgbackrest-create-replica.sh
2021-05-19T12:10:59.066004638Z INFO pgbackrest_standby:
2021-05-19T12:10:59.066004638Z INFO no_params: true
2021-05-19T12:10:59.066004638Z INFO keep_data: true
2021-05-19T12:10:59.066004638Z INFO replica'
2021-05-19T12:10:59.066004638Z INFO command: '/opt/crunchy/bin/postgres-ha/pgbackrest/pgbackrest-create-replica.sh
2021-05-19T12:10:59.066004638Z INFO pgbackrest:
2021-05-19T12:10:59.066004638Z INFO - basebackup
2021-05-19T12:10:59.066004638Z INFO - pgbackrest
2021-05-19T12:10:59.066004638Z INFO create_replica_methods:
2021-05-19T12:10:59.066004638Z INFO pgpass: /tmp/.pgpass
2021-05-19T12:10:59.066004638Z INFO use_unix_socket: true
2021-05-19T12:10:59.066004638Z INFO postgresql:
2021-05-19T12:10:59.066004638Z INFO post_bootstrap: /opt/crunchy/bin/postgres-ha/bootstrap/post-bootstrap.sh
2021-05-19T12:10:59.066004638Z INFO synchronous_mode: true
2021-05-19T12:10:59.066004638Z INFO && pgbackrest archive-get %f "%p"'
2021-05-19T12:10:59.066004638Z INFO restore_command: 'source /opt/crunchy/bin/postgres-ha/pgbackrest/pgbackrest-set-env.sh
2021-05-19T12:10:59.066004638Z INFO recovery_conf:
2021-05-19T12:10:59.066004638Z INFO use_slots: false
2021-05-19T12:10:59.066004638Z INFO synchronous_standby_names: "*"
2021-05-19T12:10:59.066004638Z INFO synchronous_commit: "on"
2021-05-19T12:10:59.066004638Z INFO && pgbackrest archive-push "%p"'
2021-05-19T12:10:59.066004638Z INFO archive_command: 'source /opt/crunchy/bin/postgres-ha/pgbackrest/pgbackrest-set-env.sh
2021-05-19T12:10:59.066004638Z INFO archive_mode: on
2021-05-19T12:10:59.066004638Z INFO wal_level: logical
2021-05-19T12:10:59.066004638Z INFO unix_socket_directories: /tmp
2021-05-19T12:10:59.066004638Z INFO jit: off
2021-05-19T12:10:59.066004638Z INFO parameters:
2021-05-19T12:10:59.066004638Z INFO postgresql:
2021-05-19T12:10:59.066004638Z INFO dcs:
2021-05-19T12:10:59.066004638Z INFO keep_existing_recovery_conf: true
2021-05-19T12:10:59.066004638Z INFO command: '/opt/crunchy/bin/postgres-ha/bootstrap/create-from-existing.sh'
2021-05-19T12:10:59.066004638Z INFO existing_init:
2021-05-19T12:10:59.066004638Z INFO keep_existing_recovery_conf: true
2021-05-19T12:10:59.066004638Z INFO primary'
2021-05-19T12:10:59.066004638Z INFO command: '/opt/crunchy/bin/postgres-ha/pgbackrest/pgbackrest-create-replica.sh
2021-05-19T12:10:59.066004638Z INFO pgbackrest_init:
2021-05-19T12:10:59.066004638Z INFO method: existing_init
2021-05-19T12:10:59.066004638Z INFO bootstrap:
2021-05-19T12:10:59.066004638Z INFO **
2021-05-19T12:10:59.066004638Z INFO Patroni configuration file:
2021-05-19T12:10:59.066004638Z INFO **
2021-05-19T12:10:59.066004638Z INFO Patroni bootstrap method: existing_init
2021-05-19T12:10:59.066004638Z INFO **
2021-05-19T12:10:59.066004638Z INFO PATRONI_KUBERNETES_NAMESPACE=<namespace>
2021-05-19T12:10:59.066004638Z INFO PATRONI_KUBERNETES_SCOPE_LABEL=crunchy-pgha-scope
2021-05-19T12:10:59.066004638Z INFO PATRONI_KUBERNETES_LABELS={vendor: "crunchydata"}
2021-05-19T12:10:59.066004638Z INFO PATRONI_LOG_LEVEL=INFO
2021-05-19T12:10:59.066004638Z INFO PATRONI_RESTAPI_CONNECT_ADDRESS=10.60.0.39:8009
2021-05-19T12:10:59.066004638Z INFO PATRONI_POSTGRESQL_DATA_DIR=/pgdata/<cluster-name>-ezzy
2021-05-19T12:10:59.066004638Z INFO PATRONI_RESTAPI_LISTEN=0.0.0.0:8009
2021-05-19T12:10:59.066004638Z INFO PATRONI_SCOPE=<cluster-name>
2021-05-19T12:10:59.066004638Z INFO PATRONI_NAME=<new-pod-name>v
2021-05-19T12:10:59.066004638Z INFO PATRONI_POSTGRESQL_LISTEN=0.0.0.0:5432
2021-05-19T12:10:59.066004638Z INFO PATRONI_POSTGRESQL_CONNECT_ADDRESS=10.60.0.39:5432
2021-05-19T12:10:59.066004638Z INFO **
2021-05-19T12:10:59.066004638Z INFO Patroni env vars:
2021-05-19T12:10:59.066004638Z INFO **
2021-05-19T12:10:59.066004638Z INFO PGHA_DATABASE=<db-name>
2021-05-19T12:10:59.066004638Z INFO PGHA_TLS_ENABLED=false
2021-05-19T12:10:59.066004638Z INFO PGHA_TLS_ONLY=false
2021-05-19T12:10:59.066004638Z INFO PGHA_STANDBY=false
2021-05-19T12:10:59.066004638Z INFO PGHA_PGBACKREST_S3_VERIFY_TLS=true
2021-05-19T12:10:59.066004638Z INFO PGHA_BASE_BOOTSTRAP_CONFIG=true
2021-05-19T12:10:59.066004638Z INFO PGHA_PGBACKREST=true
2021-05-19T12:10:59.066004638Z INFO PGHA_REPLICA_REINIT_ON_START_FAIL=true
2021-05-19T12:10:59.066004638Z INFO PGHA_DEFAULT_CONFIG=true
2021-05-19T12:10:59.066004638Z INFO PGHA_USER=postgres
2021-05-19T12:10:59.066004638Z INFO PGHA_SYNC_REPLICATION=true
2021-05-19T12:10:59.066004638Z INFO PGHA_PGBACKREST_LOCAL_S3_STORAGE=false
2021-05-19T12:10:59.066004638Z INFO PGHA_PG_PORT=5432
2021-05-19T12:10:59.066004638Z INFO PGHA_PATRONI_PORT=8009
2021-05-19T12:10:59.066004638Z INFO PGHA_BASE_PG_CONFIG=true
2021-05-19T12:10:59.066004638Z INFO **
2021-05-19T12:10:59.066004638Z INFO postgres-ha (PGHA) env vars:
2021-05-19T12:10:59.066004638Z INFO **
2021-05-19T12:10:59.066004638Z INFO Wed May 19 12:10:53 UTC 2021 INFO: postgres-ha pre-bootstrap complete!  The following configuration will be utilized to initialize 
2021-05-19T12:10:59.066004638Z INFO Wed May 19 12:10:53 UTC 2021 INFO: Finished building postgres-ha configuration file '/tmp/postgres-ha-bootstrap.yaml'
2021-05-19T12:10:59.066004638Z INFO Wed May 19 12:10:53 UTC 2021 INFO: Custom postgres-ha configuration file not detected
2021-05-19T12:10:59.066004638Z INFO Wed May 19 12:10:53 UTC 2021 INFO: Applying standard (non-TLS) remote connection configuration to pg_hba.conf
2021-05-19T12:10:59.066004638Z INFO Wed May 19 12:10:53 UTC 2021 INFO: Applying synchronous replication settings to postgres-ha configuration
2021-05-19T12:10:59.066004638Z INFO Wed May 19 12:10:53 UTC 2021 INFO: Applying pgbackrest config to postgres-ha configuration
2021-05-19T12:10:59.066004638Z INFO Wed May 19 12:10:52 UTC 2021 INFO: Applying base postgres config to postgres-ha configuration
2021-05-19T12:10:59.066004638Z INFO Wed May 19 12:10:52 UTC 2021 INFO: Applying base bootstrap config to postgres-ha configuration
2021-05-19T12:10:59.066004638Z INFO Wed May 19 12:10:52 UTC 2021 INFO: Setting 'replicator' credentials using file system
2021-05-19T12:10:59.066004638Z INFO Wed May 19 12:10:52 UTC 2021 INFO: Setting 'superuser' credentials using file system
2021-05-19T12:10:59.066004638Z INFO Wed May 19 12:10:52 UTC 2021 INFO: Setting 'pguser' credentials using file system
2021-05-19T12:10:59.066004638Z INFO Wed May 19 12:10:52 UTC 2021 INFO: Setting postgres-ha configuration for database user credentials
2021-05-19T12:10:59.066004638Z INFO Wed May 19 12:10:52 UTC 2021 INFO: Defaults have been set for the following Patroni env vars: PATRONI_NAME, PATRONI_RESTAPI_LISTEN, PATRONI_RESTAPI_CONNECT_ADDRESS, PATRONI_POSTGRESQL_LISTEN, PATRONI_POSTGRESQL_CONNECT_ADDRESS
2021-05-19T12:10:59.066004638Z INFO Wed May 19 12:10:52 UTC 2021 INFO: Defaults have been set for the following postgres-ha env vars: PGHA_PATRONI_PORT
2021-05-19T12:10:59.066004638Z INFO Wed May 19 12:10:52 UTC 2021 INFO: Defaults have been set for the following postgres-ha auto-configuration env vars: PGHA_DEFAULT_CONFIG, PGHA_BASE_BOOTSTRAP_CONFIG, PGHA_BASE_PG_CONFIG
2021-05-19T12:10:59.066004638Z INFO Wed May 19 12:10:52 UTC 2021 INFO: PGHA_PGBACKREST_LOCAL_S3_STORAGE and PGHA_PGBACKREST_INITIALIZE will be ignored if provided
2021-05-19T12:10:59.066004638Z INFO Wed May 19 12:10:52 UTC 2021 INFO: pgBackRest auto-config disabled
2021-05-19T12:10:59.066004638Z INFO Wed May 19 12:10:52 UTC 2021 INFO: postgres-ha pre-bootstrap starting...
2021-05-19T12:10:59.073286754Z ERROR 2021-05-19 12:10:56,360 INFO: following a different leader because i am not the healthiest node
2021-05-19T12:10:59.073286754Z ERROR 2021-05-19 12:10:56,118 INFO: establishing a new patroni connection to the postgres cluster
2021-05-19T12:10:59.073286754Z ERROR 2021-05-19 12:10:53.917 UTC [157] HINT:  Future log output will appear in directory "pg_log".
2021-05-19T12:10:59.073286754Z ERROR 2021-05-19 12:10:53.917 UTC [157] LOG:  redirecting log output to logging collector process
2021-05-19T12:10:59.073286754Z ERROR 2021-05-19 12:10:53,880 INFO: postmaster pid=157
2021-05-19T12:10:59.073286754Z ERROR 2021-05-19 12:10:53.878 UTC [157] LOG:  pgaudit extension initialized
2021-05-19T12:10:59.073286754Z ERROR 2021-05-19 12:10:53,578 INFO: starting as a secondary
2021-05-19T12:10:59.073286754Z ERROR 2021-05-19 12:10:53,521 INFO: Lock owner: None; I am <new-pod-name>v
2021-05-19T12:10:59.073286754Z ERROR 2021-05-19 12:10:53,520 INFO: Reaped pid=151, exit status=0
2021-05-19T12:10:59.073286754Z ERROR 2021-05-19 12:10:53,507 INFO: Lock owner: None; I am <new-pod-name>v
2021-05-19T12:10:59.073286754Z ERROR Mock authentication nonce: be2659d33a39d4ead0617544ec33dd656e1a0810f7adec1977e350c8d942789b
2021-05-19T12:10:59.073286754Z ERROR Data page checksum version: 1
2021-05-19T12:10:59.073286754Z ERROR Float8 argument passing: by value
2021-05-19T12:10:59.073286754Z ERROR Date/time type storage: 64-bit integers
2021-05-19T12:10:59.073286754Z ERROR Size of a large-object chunk: 2048
2021-05-19T12:10:59.073286754Z ERROR Maximum size of a TOAST chunk: 1996
2021-05-19T12:10:59.073286754Z ERROR Maximum columns in an index: 32
2021-05-19T12:10:59.073286754Z ERROR Maximum length of identifiers: 64
2021-05-19T12:10:59.073286754Z ERROR Bytes per WAL segment: 16777216
2021-05-19T12:10:59.073286754Z ERROR WAL block size: 8192
2021-05-19T12:10:59.073286754Z ERROR Blocks per segment of large relation: 131072
2021-05-19T12:10:59.073286754Z ERROR Database block size: 8192
2021-05-19T12:10:59.073286754Z ERROR Maximum data alignment: 8
2021-05-19T12:10:59.073286754Z ERROR track_commit_timestamp setting: off
2021-05-19T12:10:59.073286754Z ERROR max_locks_per_xact setting: 64
2021-05-19T12:10:59.073286754Z ERROR max_prepared_xacts setting: 0
2021-05-19T12:10:59.073286754Z ERROR max_wal_senders setting: 6
2021-05-19T12:10:59.073286754Z ERROR max_worker_processes setting: 8
2021-05-19T12:10:59.073286754Z ERROR max_connections setting: 100
2021-05-19T12:10:59.073286754Z ERROR wal_log_hints setting: on
2021-05-19T12:10:59.073286754Z ERROR wal_level setting: logical
2021-05-19T12:10:59.073286754Z ERROR End-of-backup record required: no
2021-05-19T12:10:59.073286754Z ERROR Backup end location: 0/0
2021-05-19T12:10:59.073286754Z ERROR Backup start location: 0/0
2021-05-19T12:10:59.073286754Z ERROR Min recovery ending loc's timeline: 0
2021-05-19T12:10:59.073286754Z ERROR Minimum recovery ending location: 0/0
2021-05-19T12:10:59.073286754Z ERROR Fake LSN counter for unlogged rels: 0/3E8
2021-05-19T12:10:59.073286754Z ERROR Time of latest checkpoint: Wed May 19 12:04:11 2021
2021-05-19T12:10:59.073286754Z ERROR Latest checkpoint's newestCommitTsXid: 0
2021-05-19T12:10:59.073286754Z ERROR Latest checkpoint's oldestCommitTsXid: 0
2021-05-19T12:10:59.073286754Z ERROR Latest checkpoint's oldestMulti's DB: 1
2021-05-19T12:10:59.073286754Z ERROR Latest checkpoint's oldestMultiXid: 1
2021-05-19T12:10:59.073286754Z ERROR Latest checkpoint's oldestActiveXID: 0
2021-05-19T12:10:59.073286754Z ERROR Latest checkpoint's oldestXID's DB: 1
2021-05-19T12:10:59.073286754Z ERROR Latest checkpoint's oldestXID: 479
2021-05-19T12:10:59.073286754Z ERROR Latest checkpoint's NextMultiOffset: 0
2021-05-19T12:10:59.073286754Z ERROR Latest checkpoint's NextMultiXactId: 1
2021-05-19T12:10:59.073286754Z ERROR Latest checkpoint's NextOID: 22711
2021-05-19T12:10:59.073286754Z ERROR Latest checkpoint's NextXID: 0:168407
2021-05-19T12:10:59.073286754Z ERROR Latest checkpoint's full_page_writes: on
2021-05-19T12:10:59.073286754Z ERROR Latest checkpoint's PrevTimeLineID: 3
2021-05-19T12:10:59.073286754Z ERROR Latest checkpoint's TimeLineID: 3
2021-05-19T12:10:59.073286754Z ERROR Latest checkpoint's REDO WAL file: 000000030000000C00000074
2021-05-19T12:10:59.073286754Z ERROR Latest checkpoint's REDO location: C/74000028
2021-05-19T12:10:59.073286754Z ERROR Latest checkpoint location: C/74000028
2021-05-19T12:10:59.073286754Z ERROR pg_control last modified: Wed May 19 12:04:12 2021
2021-05-19T12:10:59.073286754Z ERROR Database cluster state: shut down
2021-05-19T12:10:59.073286754Z ERROR Database system identifier: 6963238027115905180
2021-05-19T12:10:59.073286754Z ERROR Catalog version number: 202007201
2021-05-19T12:10:59.073286754Z ERROR pg_control version number: 1300
2021-05-19T12:10:59.073286754Z ERROR 2021-05-19 12:10:53,506 INFO: pg_controldata:
2021-05-19T12:10:59.073286754Z ERROR 2021-05-19 12:10:53,505 INFO: Reaped pid=149, exit status=0
2021-05-19T12:10:59.073286754Z ERROR 2021-05-19 12:10:53,500 INFO: Lock owner: None; I am <new-pod-name>v
2021-05-19T12:10:59.073286754Z ERROR 2021-05-19 12:10:53,500 WARNING: Postgresql is not running.
2021-05-19T12:10:59.073286754Z ERROR 2021-05-19 12:10:53,499 INFO: Reaped pid=148, exit status=0
2021-05-19T12:10:59.073286754Z ERROR 2021-05-19 12:10:53,491 INFO: Reaped pid=145, exit status=0
2021-05-19T12:10:59.073286754Z ERROR 2021-05-19 12:10:53,485 INFO: No PostgreSQL configuration items changed, nothing to reload.
2021-05-19T12:10:59.073286754Z ERROR WARNING: 'UsePAM no' is not supported in Fedora and may cause several problems.
2021-05-19T12:11:09.023184122Z ERROR 2021-05-19 12:11:06,626 INFO: following a different leader because i am not the healthiest node
2021-05-19T12:11:19.023086268Z ERROR 2021-05-19 12:11:16,624 INFO: following a different leader because i am not the healthiest node
2021-05-19T12:11:29.022071885Z ERROR 2021-05-19 12:11:26,624 INFO: following a different leader because i am not the healthiest node
2021-05-19T12:11:40.012709532Z ERROR 2021-05-19 12:11:36,624 INFO: following a different leader because i am not the healthiest node
2021-05-19T12:11:49.026812618Z ERROR 2021-05-19 12:11:46,626 INFO: following a different leader because i am not the healthiest node
2021-05-19T12:11:59.029442068Z ERROR 2021-05-19 12:11:56,624 INFO: following a different leader because i am not the healthiest node
2021-05-19T12:12:09.057492951Z ERROR 2021-05-19 12:12:06,624 INFO: following a different leader because i am not the healthiest node
2021-05-19T12:12:19.058773010Z ERROR 2021-05-19 12:12:16,624 INFO: following a different leader because i am not the healthiest node
2021-05-19T12:12:29.029192144Z ERROR 2021-05-19 12:12:26,624 INFO: following a different leader because i am not the healthiest node
2021-05-19T12:12:39.027054871Z ERROR 2021-05-19 12:12:36,628 INFO: following a different leader because i am not the healthiest node
2021-05-19T12:12:49.039960984Z ERROR 2021-05-19 12:12:46,625 INFO: following a different leader because i am not the healthiest node
2021-05-19T12:12:59.024903532Z ERROR 2021-05-19 12:12:56,624 INFO: following a different leader because i am not the healthiest node
2021-05-19T12:13:09.027837323Z ERROR 2021-05-19 12:13:06,624 INFO: following a different leader because i am not the healthiest node
2021-05-19T12:13:19.051154421Z ERROR 2021-05-19 12:13:16,626 INFO: following a different leader because i am not the healthiest node
2021-05-19T12:13:29.026947206Z ERROR 2021-05-19 12:13:26,625 INFO: following a different leader because i am not the healthiest node
2021-05-19T12:13:39.028711352Z ERROR 2021-05-19 12:13:36,624 INFO: following a different leader because i am not the healthiest node
2021-05-19T12:13:49.027312740Z ERROR 2021-05-19 12:13:46,627 INFO: following a different leader because i am not the healthiest node
2021-05-19T12:13:59.030673019Z ERROR 2021-05-19 12:13:56,627 INFO: following a different leader because i am not the healthiest node
2021-05-19T12:14:09.028556846Z ERROR 2021-05-19 12:14:06,625 INFO: following a different leader because i am not the healthiest node
2021-05-19T12:14:19.029514504Z ERROR 2021-05-19 12:14:16,624 INFO: following a different leader because i am not the healthiest node
2021-05-19T12:14:29.030810524Z ERROR 2021-05-19 12:14:26,624 INFO: following a different leader because i am not the healthiest node
2021-05-19T12:14:39.031587907Z ERROR 2021-05-19 12:14:36,626 INFO: following a different leader because i am not the healthiest node
2021-05-19T12:14:49.031315841Z ERROR 2021-05-19 12:14:46,625 INFO: following a different leader because i am not the healthiest node
2021-05-19T12:14:59.054734460Z ERROR 2021-05-19 12:14:56,625 INFO: following a different leader because i am not the healthiest node
2021-05-19T12:15:09.063950025Z ERROR 2021-05-19 12:15:06,625 INFO: following a different leader because i am not the healthiest node
2021-05-19T12:15:19.031738630Z ERROR 2021-05-19 12:15:16,625 INFO: following a different leader because i am not the healthiest node
2021-05-19T12:15:24.047773812Z INFO time="2021-05-19T12:15:16Z" level=info msg="Starting Server: :9187" source="postgres_exporter.go:1672"
2021-05-19T12:15:24.047773812Z INFO time="2021-05-19T12:15:16Z" level=info msg="Semantic Version Changed on \"127.0.0.1:5432\": 0.0.0 -> 13.2.0" source="postgres_exporter.go:1405"
2021-05-19T12:15:24.047773812Z INFO time="2021-05-19T12:15:16Z" level=info msg="Established new database connection to \"127.0.0.1:5432\"." source="postgres_exporter.go:878"
2021-05-19T12:15:24.047773812Z INFO Wed May 19 12:15:16 UTC 2021 INFO: Starting postgres-exporter..
2021-05-19T12:15:24.047773812Z INFO Wed May 19 12:15:15 UTC 2021 INFO: No custom queries detected. Applying default configuration..
2021-05-19T12:15:24.047773812Z INFO (1 row)
2021-05-19T12:15:24.047773812Z INFO 1
2021-05-19T12:15:24.047773812Z INFO ----------
2021-05-19T12:15:24.047773812Z INFO ?column?
2021-05-19T12:15:24.047773812Z INFO Wed May 19 12:15:15 UTC 2021 INFO: Checking if ccp_monitoring is is created..
2021-05-19T12:15:24.047773812Z INFO Wed May 19 12:15:15 UTC 2021 INFO: Waiting for PostgreSQL to be ready..
2021-05-19T12:15:29.034298982Z ERROR 2021-05-19 12:15:26,627 INFO: following a different leader because i am not the healthiest node
2021-05-19T12:15:39.038424421Z ERROR 2021-05-19 12:15:36,625 INFO: following a different leader because i am not the healthiest node
2021-05-19T12:15:49.030632111Z ERROR 2021-05-19 12:15:46,627 INFO: following a different leader because i am not the healthiest node
2021-05-19T12:15:59.044899395Z ERROR 2021-05-19 12:15:56,625 INFO: following a different leader because i am not the healthiest node

As we can see, the primary node was shut down around 14:03 local time (12:03 UTC) and a new pod was created, but it does not consider itself the healthiest node.

Replica logs

receiveTimestamp severity textPayload
2021-05-19T12:03:07.017325622Z ERROR 2021-05-19 12:03:02,896 INFO: no action.  i am a secondary and i am following a leader
2021-05-19T12:03:07.017325622Z ERROR 2021-05-19 12:03:02,834 INFO: does not have lock
2021-05-19T12:03:07.017325622Z ERROR 2021-05-19 12:03:02,834 INFO: Lock owner: <primary-pod-name>; I am <replica-pod-name>
2021-05-19T12:03:17.040610373Z ERROR 2021-05-19 12:03:12,900 INFO: no action.  i am a secondary and i am following a leader
2021-05-19T12:03:17.040610373Z ERROR 2021-05-19 12:03:12,831 INFO: does not have lock
2021-05-19T12:03:17.040610373Z ERROR 2021-05-19 12:03:12,831 INFO: Lock owner: <primary-pod-name>; I am <replica-pod-name>
2021-05-19T12:03:27.021520047Z ERROR 2021-05-19 12:03:22,888 INFO: no action.  i am a secondary and i am following a leader
2021-05-19T12:03:27.021520047Z ERROR 2021-05-19 12:03:22,832 INFO: does not have lock
2021-05-19T12:03:27.021520047Z ERROR 2021-05-19 12:03:22,832 INFO: Lock owner: <primary-pod-name>; I am <replica-pod-name>
2021-05-19T12:03:37.011061049Z ERROR 2021-05-19 12:03:32,913 INFO: no action.  i am a secondary and i am following a leader
2021-05-19T12:03:37.011061049Z ERROR 2021-05-19 12:03:32,837 INFO: does not have lock
2021-05-19T12:03:37.011061049Z ERROR 2021-05-19 12:03:32,836 INFO: Lock owner: <primary-pod-name>; I am <replica-pod-name>
2021-05-19T12:03:47.009963801Z ERROR 2021-05-19 12:03:42,881 INFO: no action.  i am a secondary and i am following a leader
2021-05-19T12:03:47.009963801Z ERROR 2021-05-19 12:03:42,828 INFO: does not have lock
2021-05-19T12:03:47.009963801Z ERROR 2021-05-19 12:03:42,828 INFO: Lock owner: <primary-pod-name>; I am <replica-pod-name>
2021-05-19T12:03:57.054579573Z ERROR 2021-05-19 12:03:52,882 INFO: no action.  i am a secondary and i am following a leader
2021-05-19T12:03:57.054579573Z ERROR 2021-05-19 12:03:52,830 INFO: does not have lock
2021-05-19T12:03:57.054579573Z ERROR 2021-05-19 12:03:52,830 INFO: Lock owner: <primary-pod-name>; I am <replica-pod-name>
2021-05-19T12:04:07.017100663Z ERROR 2021-05-19 12:04:02,897 INFO: no action.  i am a secondary and i am following a leader
2021-05-19T12:04:07.017100663Z ERROR 2021-05-19 12:04:02,829 INFO: does not have lock
2021-05-19T12:04:07.017100663Z ERROR 2021-05-19 12:04:02,829 INFO: Lock owner: <primary-pod-name>; I am <replica-pod-name>
2021-05-19T12:04:17.021264686Z INFO server signaled
2021-05-19T12:04:17.232277232Z ERROR 2021-05-19 12:04:12,392 INFO: following a different leader because i am not the healthiest node
2021-05-19T12:04:17.232277232Z ERROR 2021-05-19 12:04:12,333 INFO: Reaped pid=1493, exit status=0
2021-05-19T12:04:17.232277232Z ERROR 2021-05-19 12:04:12,306 INFO: My wal position exceeds maximum replication lag
2021-05-19T12:04:27.060252771Z ERROR 2021-05-19 12:04:22,813 INFO: following a different leader because i am not the healthiest node
2021-05-19T12:04:27.060252771Z ERROR 2021-05-19 12:04:22,806 INFO: My wal position exceeds maximum replication lag
2021-05-19T12:04:37.073647491Z ERROR 2021-05-19 12:04:32,812 INFO: following a different leader because i am not the healthiest node
2021-05-19T12:04:37.073647491Z ERROR 2021-05-19 12:04:32,806 INFO: My wal position exceeds maximum replication lag
2021-05-19T12:04:47.020805261Z ERROR 2021-05-19 12:04:42,812 INFO: following a different leader because i am not the healthiest node
2021-05-19T12:04:47.020805261Z ERROR 2021-05-19 12:04:42,806 INFO: My wal position exceeds maximum replication lag
2021-05-19T12:04:57.033420746Z ERROR 2021-05-19 12:04:52,817 INFO: following a different leader because i am not the healthiest node
2021-05-19T12:04:57.033420746Z ERROR 2021-05-19 12:04:52,809 INFO: My wal position exceeds maximum replication lag

As we can see, the replica started logging My wal position exceeds maximum replication lag just after the node was killed, and it was still logging the same message an hour later.

Doing a manual failover resolves the issue, but I would like to know how I can avoid this scenario. Why did the replica not take over the primary role?
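
One way to inspect what Patroni thinks of each member (role, state, timeline, lag) is patronictl list, run from inside a database pod; a sketch, with placeholder names:

    # list cluster members as Patroni sees them, including replication lag
    kubectl exec -it <replica-pod-name> -n <namespace> -- patronictl list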

In case this is a "normal" scenario, is there a way to configure the maximum replication lag to be less strict?
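
The Patroni setting behind this log message is maximum_lag_on_failover (default 1048576 bytes, i.e. 1 MiB). Assuming relaxing it is acceptable for your cluster, it could presumably be raised through Patroni's dynamic configuration; a sketch, with placeholder names and an arbitrary 32 MiB value:

    # raise the lag threshold Patroni uses when judging failover candidates
    kubectl exec -it <replica-pod-name> -n <namespace> -- \
        patronictl edit-config --set maximum_lag_on_failover=33554432 --force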

Thanks for reading and thank you in advance for any advice.

jkatz commented 3 years ago

This is documented:

https://access.crunchydata.com/documentation/postgres-operator/latest/tutorial/high-availability/#no-primary-available-after-both-synchronous-replication-instances-fail

Please also see a longer discussion on this in #2132.

Martin-Hogge commented 3 years ago

Thanks for the reply, but the link you provided talks about a scenario where both the primary and the replica go down; in my case only the primary went down. I could indeed do a manual failover as recommended in the documentation you linked, but I would prefer to avoid this situation in the first place.

In case this can't be avoided, do you see any side effects of having a cron job run every x minutes to check whether at least one primary is up and, if not, elect one? (A rough sketch of the idea follows.)
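
Such a watchdog might look roughly like this against the v4 pgo CLI (a sketch only; the grep pattern and the target instance name are assumptions to adapt):

    #!/bin/bash
    # hypothetical watchdog: if pgo reports no primary for the cluster, trigger a failover
    if ! pgo show cluster <clustername> -n <namespace> | grep -q '(primary)'; then
        # list valid failover targets, then promote one of them
        pgo failover <clustername> -n <namespace> --query
        pgo failover <clustername> -n <namespace> --target=<replica-instance> --no-prompt
    fi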

Thank you.

iosifnicolae2 commented 2 years ago

> This is documented:
>
> https://access.crunchydata.com/documentation/postgres-operator/latest/tutorial/high-availability/#no-primary-available-after-both-synchronous-replication-instances-fail
>
> Please also see a longer discussion on this in #2132.

How can I run pgo failover --target using the v5 Crunchy Database Operator?

Thank you!

jkatz commented 2 years ago

This was added in #2794 and will be available in an upcoming release.
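
Until that release ships, one interim option is to drive Patroni directly from inside a v5 instance pod (a sketch; the pod name and candidate member are placeholders, and the database container name is assumed):

    # request a Patroni failover to a specific member, skipping the confirmation prompt
    kubectl exec -it -n <namespace> <instance-pod-name> -c database -- \
        patronictl failover --candidate <target-member> --force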