CrunchyData / postgres-operator

Production PostgreSQL for Kubernetes, from high availability Postgres clusters to full-scale database-as-a-service.
https://access.crunchydata.com/documentation/postgres-operator/v5/
Apache License 2.0

My wal position exceeds maximum replication lag on replica after primary down #2458

Closed: Martin-Hogge closed this issue 3 years ago

Martin-Hogge commented 3 years ago

Overview

In a cluster with one primary and one replica, when the primary goes down the replica does not fail over. The cluster ends up with two replicas waiting for a primary that will never come, until a manual failover is performed.

Environment

Steps to Reproduce

REPRO

  1. Create the PGO cluster:
    pgo create cluster -n <namespace> <clustername> --metrics --sync-replication --replica-count=1 --pod-anti-affinity=required --pvc-size=12Gi --pgbackrest-s3-bucket=<backup-bucket> --pgbackrest-s3-endpoint=<minio-endpoint> --pgbackrest-s3-key=<minio-key> --pgbackrest-s3-key-secret=<minio-secret> --pgbackrest-s3-uri-style=path --pgbackrest-storage-type=s3 --pgbackrest-s3-region=us-east-1 --pgbackrest-s3-ca-secret=<minio-secret> --pod-anti-affinity-pgbackrest=disabled -d=mydatabase --annotation-postgres=cluster-autoscaler.kubernetes.io/safe-to-evict=true
  2. Run a service that issues CREATE, UPDATE and SELECT statements (to simulate a production workload)
  3. Kill the node (VM) where the primary is running (see the sketch after this list)
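
For reference, a minimal sketch of locating the node for step 3 (assuming a GCP-backed cluster, and assuming v4's pg-cluster and role=master pod labels; adjust for your platform):

    # find the primary pod and the node it is scheduled on
    kubectl get pods -n <namespace> -l "pg-cluster=<clustername>,role=master" -o wide
    # then kill the backing VM, e.g. on GCP (zone is a placeholder):
    gcloud compute instances delete <node-name> --zone=<zone>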

EXPECTED

  1. The replica becomes the new primary;
  2. The old primary is recreated as a replica;

ACTUAL

  1. The replica fails to become the new primary;
  2. The old primary is recreated as a replica;

Logs

The node hosting the primary was killed around 14:04 local time (12:04 UTC in the logs below).

Primary node logs

receiveTimestamp severity textPayload
2021-05-19T12:03:03.021703390Z ERROR 2021-05-19 12:03:02,892 INFO: no action.  i am the leader with the lock
2021-05-19T12:03:03.021703390Z ERROR 2021-05-19 12:03:02,782 INFO: Lock owner: <old-primary-pod-name>; I am <old-primary-pod-name>
2021-05-19T12:03:13.038909884Z ERROR 2021-05-19 12:03:12,884 INFO: no action.  i am the leader with the lock
2021-05-19T12:03:13.038909884Z ERROR 2021-05-19 12:03:12,780 INFO: Lock owner: <old-primary-pod-name>; I am <old-primary-pod-name>
2021-05-19T12:03:23.038097189Z ERROR 2021-05-19 12:03:22,886 INFO: no action.  i am the leader with the lock
2021-05-19T12:03:23.038097189Z ERROR 2021-05-19 12:03:22,780 INFO: Lock owner: <old-primary-pod-name>; I am <old-primary-pod-name>
2021-05-19T12:03:33.018484539Z ERROR 2021-05-19 12:03:32,896 INFO: no action.  i am the leader with the lock
2021-05-19T12:03:33.018484539Z ERROR 2021-05-19 12:03:32,781 INFO: Lock owner: <old-primary-pod-name>; I am <old-primary-pod-name>
2021-05-19T12:03:43.100396115Z ERROR 2021-05-19 12:03:42,889 INFO: no action.  i am the leader with the lock
2021-05-19T12:03:43.100396115Z ERROR 2021-05-19 12:03:42,781 INFO: Lock owner: <old-primary-pod-name>; I am <old-primary-pod-name>
2021-05-19T12:03:53.050433677Z ERROR 2021-05-19 12:03:52,891 INFO: no action.  i am the leader with the lock
2021-05-19T12:03:53.050433677Z ERROR 2021-05-19 12:03:52,781 INFO: Lock owner: <old-primary-pod-name>; I am <old-primary-pod-name>
2021-05-19T12:10:59.066004638Z INFO Wed May 19 12:10:56 UTC 2021 INFO: Node <new-pod-name>v fully initialized for cluster <cluster-name> and is ready for use
2021-05-19T12:10:59.066004638Z INFO /tmp:5432 - accepting connections
2021-05-19T12:10:59.066004638Z INFO /tmp:5432 - rejecting connections
2021-05-19T12:10:59.066004638Z INFO /tmp:5432 - rejecting connections
2021-05-19T12:10:59.066004638Z INFO /tmp:5432 - no response
2021-05-19T12:10:59.066004638Z INFO Wed May 19 12:10:53 UTC 2021 INFO: Running Patroni as PID 1
2021-05-19T12:10:59.066004638Z INFO Wed May 19 12:10:53 UTC 2021 INFO: Initializing cluster bootstrap with command: '/usr/local/bin/patroni /tmp/postgres-ha-bootstrap.yaml'
2021-05-19T12:10:59.066004638Z INFO Wed May 19 12:10:53 UTC 2021 INFO: Now removing "pause" key from patroni.dynamic.json configuration file if present
2021-05-19T12:10:59.066004638Z INFO Wed May 19 12:10:53 UTC 2021 INFO: Starting background process to monitor Patroni initization and restart the database if needed
2021-05-19T12:10:59.066004638Z INFO Wed May 19 12:10:53 UTC 2021 INFO: Starting SSHD..
2021-05-19T12:10:59.066004638Z INFO Wed May 19 12:10:53 UTC 2021 INFO: setting up .ssh directory
2021-05-19T12:10:59.066004638Z INFO Wed May 19 12:10:53 UTC 2021 INFO: Checking for sshd_config in /sshd
2021-05-19T12:10:59.066004638Z INFO Wed May 19 12:10:53 UTC 2021 INFO: Checking for authorized_keys in /sshd
2021-05-19T12:10:59.066004638Z INFO Wed May 19 12:10:53 UTC 2021 INFO: Checking for SSH Host Keys in /sshd..
2021-05-19T12:10:59.066004638Z INFO Wed May 19 12:10:53 UTC 2021 INFO: Applying SSHD..
2021-05-19T12:10:59.066004638Z INFO - host all all 0.0.0.0/0 md5
2021-05-19T12:10:59.066004638Z INFO - host all primaryuser 0.0.0.0/0 reject
2021-05-19T12:10:59.066004638Z INFO - host replication primaryuser 0.0.0.0/0 md5
2021-05-19T12:10:59.066004638Z INFO - local all postgres peer
2021-05-19T12:10:59.066004638Z INFO pg_hba:
2021-05-19T12:10:59.066004638Z INFO on_role_change: /opt/crunchy/bin/postgres-ha/callbacks/pgha-on-role-change.sh
2021-05-19T12:10:59.066004638Z INFO callbacks:
2021-05-19T12:10:59.066004638Z INFO remove_data_directory_on_rewind_failure: true
2021-05-19T12:10:59.066004638Z INFO no_master: 1
2021-05-19T12:10:59.066004638Z INFO no_params: true
2021-05-19T12:10:59.066004638Z INFO keep_data: true
2021-05-19T12:10:59.066004638Z INFO standby'
2021-05-19T12:10:59.066004638Z INFO command: '/opt/crunchy/bin/postgres-ha/pgbackrest/pgbackrest-create-replica.sh
2021-05-19T12:10:59.066004638Z INFO pgbackrest_standby:
2021-05-19T12:10:59.066004638Z INFO no_params: true
2021-05-19T12:10:59.066004638Z INFO keep_data: true
2021-05-19T12:10:59.066004638Z INFO replica'
2021-05-19T12:10:59.066004638Z INFO command: '/opt/crunchy/bin/postgres-ha/pgbackrest/pgbackrest-create-replica.sh
2021-05-19T12:10:59.066004638Z INFO pgbackrest:
2021-05-19T12:10:59.066004638Z INFO - basebackup
2021-05-19T12:10:59.066004638Z INFO - pgbackrest
2021-05-19T12:10:59.066004638Z INFO create_replica_methods:
2021-05-19T12:10:59.066004638Z INFO pgpass: /tmp/.pgpass
2021-05-19T12:10:59.066004638Z INFO use_unix_socket: true
2021-05-19T12:10:59.066004638Z INFO postgresql:
2021-05-19T12:10:59.066004638Z INFO post_bootstrap: /opt/crunchy/bin/postgres-ha/bootstrap/post-bootstrap.sh
2021-05-19T12:10:59.066004638Z INFO synchronous_mode: true
2021-05-19T12:10:59.066004638Z INFO && pgbackrest archive-get %f "%p"'
2021-05-19T12:10:59.066004638Z INFO restore_command: 'source /opt/crunchy/bin/postgres-ha/pgbackrest/pgbackrest-set-env.sh
2021-05-19T12:10:59.066004638Z INFO recovery_conf:
2021-05-19T12:10:59.066004638Z INFO use_slots: false
2021-05-19T12:10:59.066004638Z INFO synchronous_standby_names: "*"
2021-05-19T12:10:59.066004638Z INFO synchronous_commit: "on"
2021-05-19T12:10:59.066004638Z INFO && pgbackrest archive-push "%p"'
2021-05-19T12:10:59.066004638Z INFO archive_command: 'source /opt/crunchy/bin/postgres-ha/pgbackrest/pgbackrest-set-env.sh
2021-05-19T12:10:59.066004638Z INFO archive_mode: on
2021-05-19T12:10:59.066004638Z INFO wal_level: logical
2021-05-19T12:10:59.066004638Z INFO unix_socket_directories: /tmp
2021-05-19T12:10:59.066004638Z INFO jit: off
2021-05-19T12:10:59.066004638Z INFO parameters:
2021-05-19T12:10:59.066004638Z INFO postgresql:
2021-05-19T12:10:59.066004638Z INFO dcs:
2021-05-19T12:10:59.066004638Z INFO keep_existing_recovery_conf: true
2021-05-19T12:10:59.066004638Z INFO command: '/opt/crunchy/bin/postgres-ha/bootstrap/create-from-existing.sh'
2021-05-19T12:10:59.066004638Z INFO existing_init:
2021-05-19T12:10:59.066004638Z INFO keep_existing_recovery_conf: true
2021-05-19T12:10:59.066004638Z INFO primary'
2021-05-19T12:10:59.066004638Z INFO command: '/opt/crunchy/bin/postgres-ha/pgbackrest/pgbackrest-create-replica.sh
2021-05-19T12:10:59.066004638Z INFO pgbackrest_init:
2021-05-19T12:10:59.066004638Z INFO method: existing_init
2021-05-19T12:10:59.066004638Z INFO bootstrap:
2021-05-19T12:10:59.066004638Z INFO **
2021-05-19T12:10:59.066004638Z INFO Patroni configuration file:
2021-05-19T12:10:59.066004638Z INFO **
2021-05-19T12:10:59.066004638Z INFO Patroni bootstrap method: existing_init
2021-05-19T12:10:59.066004638Z INFO **
2021-05-19T12:10:59.066004638Z INFO PATRONI_KUBERNETES_NAMESPACE=<namespace>
2021-05-19T12:10:59.066004638Z INFO PATRONI_KUBERNETES_SCOPE_LABEL=crunchy-pgha-scope
2021-05-19T12:10:59.066004638Z INFO PATRONI_KUBERNETES_LABELS={vendor: "crunchydata"}
2021-05-19T12:10:59.066004638Z INFO PATRONI_LOG_LEVEL=INFO
2021-05-19T12:10:59.066004638Z INFO PATRONI_RESTAPI_CONNECT_ADDRESS=10.60.0.39:8009
2021-05-19T12:10:59.066004638Z INFO PATRONI_POSTGRESQL_DATA_DIR=/pgdata/<cluster-name>-ezzy
2021-05-19T12:10:59.066004638Z INFO PATRONI_RESTAPI_LISTEN=0.0.0.0:8009
2021-05-19T12:10:59.066004638Z INFO PATRONI_SCOPE=<cluster-name>
2021-05-19T12:10:59.066004638Z INFO PATRONI_NAME=<new-pod-name>v
2021-05-19T12:10:59.066004638Z INFO PATRONI_POSTGRESQL_LISTEN=0.0.0.0:5432
2021-05-19T12:10:59.066004638Z INFO PATRONI_POSTGRESQL_CONNECT_ADDRESS=10.60.0.39:5432
2021-05-19T12:10:59.066004638Z INFO **
2021-05-19T12:10:59.066004638Z INFO Patroni env vars:
2021-05-19T12:10:59.066004638Z INFO **
2021-05-19T12:10:59.066004638Z INFO PGHA_DATABASE=<db-name>
2021-05-19T12:10:59.066004638Z INFO PGHA_TLS_ENABLED=false
2021-05-19T12:10:59.066004638Z INFO PGHA_TLS_ONLY=false
2021-05-19T12:10:59.066004638Z INFO PGHA_STANDBY=false
2021-05-19T12:10:59.066004638Z INFO PGHA_PGBACKREST_S3_VERIFY_TLS=true
2021-05-19T12:10:59.066004638Z INFO PGHA_BASE_BOOTSTRAP_CONFIG=true
2021-05-19T12:10:59.066004638Z INFO PGHA_PGBACKREST=true
2021-05-19T12:10:59.066004638Z INFO PGHA_REPLICA_REINIT_ON_START_FAIL=true
2021-05-19T12:10:59.066004638Z INFO PGHA_DEFAULT_CONFIG=true
2021-05-19T12:10:59.066004638Z INFO PGHA_USER=postgres
2021-05-19T12:10:59.066004638Z INFO PGHA_SYNC_REPLICATION=true
2021-05-19T12:10:59.066004638Z INFO PGHA_PGBACKREST_LOCAL_S3_STORAGE=false
2021-05-19T12:10:59.066004638Z INFO PGHA_PG_PORT=5432
2021-05-19T12:10:59.066004638Z INFO PGHA_PATRONI_PORT=8009
2021-05-19T12:10:59.066004638Z INFO PGHA_BASE_PG_CONFIG=true
2021-05-19T12:10:59.066004638Z INFO **
2021-05-19T12:10:59.066004638Z INFO postgres-ha (PGHA) env vars:
2021-05-19T12:10:59.066004638Z INFO **
2021-05-19T12:10:59.066004638Z INFO Wed May 19 12:10:53 UTC 2021 INFO: postgres-ha pre-bootstrap complete!  The following configuration will be utilized to initialize 
2021-05-19T12:10:59.066004638Z INFO Wed May 19 12:10:53 UTC 2021 INFO: Finished building postgres-ha configuration file '/tmp/postgres-ha-bootstrap.yaml'
2021-05-19T12:10:59.066004638Z INFO Wed May 19 12:10:53 UTC 2021 INFO: Custom postgres-ha configuration file not detected
2021-05-19T12:10:59.066004638Z INFO Wed May 19 12:10:53 UTC 2021 INFO: Applying standard (non-TLS) remote connection configuration to pg_hba.conf
2021-05-19T12:10:59.066004638Z INFO Wed May 19 12:10:53 UTC 2021 INFO: Applying synchronous replication settings to postgres-ha configuration
2021-05-19T12:10:59.066004638Z INFO Wed May 19 12:10:53 UTC 2021 INFO: Applying pgbackrest config to postgres-ha configuration
2021-05-19T12:10:59.066004638Z INFO Wed May 19 12:10:52 UTC 2021 INFO: Applying base postgres config to postgres-ha configuration
2021-05-19T12:10:59.066004638Z INFO Wed May 19 12:10:52 UTC 2021 INFO: Applying base bootstrap config to postgres-ha configuration
2021-05-19T12:10:59.066004638Z INFO Wed May 19 12:10:52 UTC 2021 INFO: Setting 'replicator' credentials using file system
2021-05-19T12:10:59.066004638Z INFO Wed May 19 12:10:52 UTC 2021 INFO: Setting 'superuser' credentials using file system
2021-05-19T12:10:59.066004638Z INFO Wed May 19 12:10:52 UTC 2021 INFO: Setting 'pguser' credentials using file system
2021-05-19T12:10:59.066004638Z INFO Wed May 19 12:10:52 UTC 2021 INFO: Setting postgres-ha configuration for database user credentials
2021-05-19T12:10:59.066004638Z INFO Wed May 19 12:10:52 UTC 2021 INFO: Defaults have been set for the following Patroni env vars: PATRONI_NAME, PATRONI_RESTAPI_LISTEN, PATRONI_RESTAPI_CONNECT_ADDRESS, PATRONI_POSTGRESQL_LISTEN, PATRONI_POSTGRESQL_CONNECT_ADDRESS
2021-05-19T12:10:59.066004638Z INFO Wed May 19 12:10:52 UTC 2021 INFO: Defaults have been set for the following postgres-ha env vars: PGHA_PATRONI_PORT
2021-05-19T12:10:59.066004638Z INFO Wed May 19 12:10:52 UTC 2021 INFO: Defaults have been set for the following postgres-ha auto-configuration env vars: PGHA_DEFAULT_CONFIG, PGHA_BASE_BOOTSTRAP_CONFIG, PGHA_BASE_PG_CONFIG
2021-05-19T12:10:59.066004638Z INFO Wed May 19 12:10:52 UTC 2021 INFO: PGHA_PGBACKREST_LOCAL_S3_STORAGE and PGHA_PGBACKREST_INITIALIZE will be ignored if provided
2021-05-19T12:10:59.066004638Z INFO Wed May 19 12:10:52 UTC 2021 INFO: pgBackRest auto-config disabled
2021-05-19T12:10:59.066004638Z INFO Wed May 19 12:10:52 UTC 2021 INFO: postgres-ha pre-bootstrap starting...
2021-05-19T12:10:59.073286754Z ERROR 2021-05-19 12:10:56,360 INFO: following a different leader because i am not the healthiest node
2021-05-19T12:10:59.073286754Z ERROR 2021-05-19 12:10:56,118 INFO: establishing a new patroni connection to the postgres cluster
2021-05-19T12:10:59.073286754Z ERROR 2021-05-19 12:10:53.917 UTC [157] HINT:  Future log output will appear in directory "pg_log".
2021-05-19T12:10:59.073286754Z ERROR 2021-05-19 12:10:53.917 UTC [157] LOG:  redirecting log output to logging collector process
2021-05-19T12:10:59.073286754Z ERROR 2021-05-19 12:10:53,880 INFO: postmaster pid=157
2021-05-19T12:10:59.073286754Z ERROR 2021-05-19 12:10:53.878 UTC [157] LOG:  pgaudit extension initialized
2021-05-19T12:10:59.073286754Z ERROR 2021-05-19 12:10:53,578 INFO: starting as a secondary
2021-05-19T12:10:59.073286754Z ERROR 2021-05-19 12:10:53,521 INFO: Lock owner: None; I am <new-pod-name>v
2021-05-19T12:10:59.073286754Z ERROR 2021-05-19 12:10:53,520 INFO: Reaped pid=151, exit status=0
2021-05-19T12:10:59.073286754Z ERROR 2021-05-19 12:10:53,507 INFO: Lock owner: None; I am <new-pod-name>v
2021-05-19T12:10:59.073286754Z ERROR Mock authentication nonce: be2659d33a39d4ead0617544ec33dd656e1a0810f7adec1977e350c8d942789b
2021-05-19T12:10:59.073286754Z ERROR Data page checksum version: 1
2021-05-19T12:10:59.073286754Z ERROR Float8 argument passing: by value
2021-05-19T12:10:59.073286754Z ERROR Date/time type storage: 64-bit integers
2021-05-19T12:10:59.073286754Z ERROR Size of a large-object chunk: 2048
2021-05-19T12:10:59.073286754Z ERROR Maximum size of a TOAST chunk: 1996
2021-05-19T12:10:59.073286754Z ERROR Maximum columns in an index: 32
2021-05-19T12:10:59.073286754Z ERROR Maximum length of identifiers: 64
2021-05-19T12:10:59.073286754Z ERROR Bytes per WAL segment: 16777216
2021-05-19T12:10:59.073286754Z ERROR WAL block size: 8192
2021-05-19T12:10:59.073286754Z ERROR Blocks per segment of large relation: 131072
2021-05-19T12:10:59.073286754Z ERROR Database block size: 8192
2021-05-19T12:10:59.073286754Z ERROR Maximum data alignment: 8
2021-05-19T12:10:59.073286754Z ERROR track_commit_timestamp setting: off
2021-05-19T12:10:59.073286754Z ERROR max_locks_per_xact setting: 64
2021-05-19T12:10:59.073286754Z ERROR max_prepared_xacts setting: 0
2021-05-19T12:10:59.073286754Z ERROR max_wal_senders setting: 6
2021-05-19T12:10:59.073286754Z ERROR max_worker_processes setting: 8
2021-05-19T12:10:59.073286754Z ERROR max_connections setting: 100
2021-05-19T12:10:59.073286754Z ERROR wal_log_hints setting: on
2021-05-19T12:10:59.073286754Z ERROR wal_level setting: logical
2021-05-19T12:10:59.073286754Z ERROR End-of-backup record required: no
2021-05-19T12:10:59.073286754Z ERROR Backup end location: 0/0
2021-05-19T12:10:59.073286754Z ERROR Backup start location: 0/0
2021-05-19T12:10:59.073286754Z ERROR Min recovery ending loc's timeline: 0
2021-05-19T12:10:59.073286754Z ERROR Minimum recovery ending location: 0/0
2021-05-19T12:10:59.073286754Z ERROR Fake LSN counter for unlogged rels: 0/3E8
2021-05-19T12:10:59.073286754Z ERROR Time of latest checkpoint: Wed May 19 12:04:11 2021
2021-05-19T12:10:59.073286754Z ERROR Latest checkpoint's newestCommitTsXid: 0
2021-05-19T12:10:59.073286754Z ERROR Latest checkpoint's oldestCommitTsXid: 0
2021-05-19T12:10:59.073286754Z ERROR Latest checkpoint's oldestMulti's DB: 1
2021-05-19T12:10:59.073286754Z ERROR Latest checkpoint's oldestMultiXid: 1
2021-05-19T12:10:59.073286754Z ERROR Latest checkpoint's oldestActiveXID: 0
2021-05-19T12:10:59.073286754Z ERROR Latest checkpoint's oldestXID's DB: 1
2021-05-19T12:10:59.073286754Z ERROR Latest checkpoint's oldestXID: 479
2021-05-19T12:10:59.073286754Z ERROR Latest checkpoint's NextMultiOffset: 0
2021-05-19T12:10:59.073286754Z ERROR Latest checkpoint's NextMultiXactId: 1
2021-05-19T12:10:59.073286754Z ERROR Latest checkpoint's NextOID: 22711
2021-05-19T12:10:59.073286754Z ERROR Latest checkpoint's NextXID: 0:168407
2021-05-19T12:10:59.073286754Z ERROR Latest checkpoint's full_page_writes: on
2021-05-19T12:10:59.073286754Z ERROR Latest checkpoint's PrevTimeLineID: 3
2021-05-19T12:10:59.073286754Z ERROR Latest checkpoint's TimeLineID: 3
2021-05-19T12:10:59.073286754Z ERROR Latest checkpoint's REDO WAL file: 000000030000000C00000074
2021-05-19T12:10:59.073286754Z ERROR Latest checkpoint's REDO location: C/74000028
2021-05-19T12:10:59.073286754Z ERROR Latest checkpoint location: C/74000028
2021-05-19T12:10:59.073286754Z ERROR pg_control last modified: Wed May 19 12:04:12 2021
2021-05-19T12:10:59.073286754Z ERROR Database cluster state: shut down
2021-05-19T12:10:59.073286754Z ERROR Database system identifier: 6963238027115905180
2021-05-19T12:10:59.073286754Z ERROR Catalog version number: 202007201
2021-05-19T12:10:59.073286754Z ERROR pg_control version number: 1300
2021-05-19T12:10:59.073286754Z ERROR 2021-05-19 12:10:53,506 INFO: pg_controldata:
2021-05-19T12:10:59.073286754Z ERROR 2021-05-19 12:10:53,505 INFO: Reaped pid=149, exit status=0
2021-05-19T12:10:59.073286754Z ERROR 2021-05-19 12:10:53,500 INFO: Lock owner: None; I am <new-pod-name>v
2021-05-19T12:10:59.073286754Z ERROR 2021-05-19 12:10:53,500 WARNING: Postgresql is not running.
2021-05-19T12:10:59.073286754Z ERROR 2021-05-19 12:10:53,499 INFO: Reaped pid=148, exit status=0
2021-05-19T12:10:59.073286754Z ERROR 2021-05-19 12:10:53,491 INFO: Reaped pid=145, exit status=0
2021-05-19T12:10:59.073286754Z ERROR 2021-05-19 12:10:53,485 INFO: No PostgreSQL configuration items changed, nothing to reload.
2021-05-19T12:10:59.073286754Z ERROR WARNING: 'UsePAM no' is not supported in Fedora and may cause several problems.
2021-05-19T12:11:09.023184122Z ERROR 2021-05-19 12:11:06,626 INFO: following a different leader because i am not the healthiest node
2021-05-19T12:11:19.023086268Z ERROR 2021-05-19 12:11:16,624 INFO: following a different leader because i am not the healthiest node
2021-05-19T12:11:29.022071885Z ERROR 2021-05-19 12:11:26,624 INFO: following a different leader because i am not the healthiest node
2021-05-19T12:11:40.012709532Z ERROR 2021-05-19 12:11:36,624 INFO: following a different leader because i am not the healthiest node
2021-05-19T12:11:49.026812618Z ERROR 2021-05-19 12:11:46,626 INFO: following a different leader because i am not the healthiest node
2021-05-19T12:11:59.029442068Z ERROR 2021-05-19 12:11:56,624 INFO: following a different leader because i am not the healthiest node
2021-05-19T12:12:09.057492951Z ERROR 2021-05-19 12:12:06,624 INFO: following a different leader because i am not the healthiest node
2021-05-19T12:12:19.058773010Z ERROR 2021-05-19 12:12:16,624 INFO: following a different leader because i am not the healthiest node
2021-05-19T12:12:29.029192144Z ERROR 2021-05-19 12:12:26,624 INFO: following a different leader because i am not the healthiest node
2021-05-19T12:12:39.027054871Z ERROR 2021-05-19 12:12:36,628 INFO: following a different leader because i am not the healthiest node
2021-05-19T12:12:49.039960984Z ERROR 2021-05-19 12:12:46,625 INFO: following a different leader because i am not the healthiest node
2021-05-19T12:12:59.024903532Z ERROR 2021-05-19 12:12:56,624 INFO: following a different leader because i am not the healthiest node
2021-05-19T12:13:09.027837323Z ERROR 2021-05-19 12:13:06,624 INFO: following a different leader because i am not the healthiest node
2021-05-19T12:13:19.051154421Z ERROR 2021-05-19 12:13:16,626 INFO: following a different leader because i am not the healthiest node
2021-05-19T12:13:29.026947206Z ERROR 2021-05-19 12:13:26,625 INFO: following a different leader because i am not the healthiest node
2021-05-19T12:13:39.028711352Z ERROR 2021-05-19 12:13:36,624 INFO: following a different leader because i am not the healthiest node
2021-05-19T12:13:49.027312740Z ERROR 2021-05-19 12:13:46,627 INFO: following a different leader because i am not the healthiest node
2021-05-19T12:13:59.030673019Z ERROR 2021-05-19 12:13:56,627 INFO: following a different leader because i am not the healthiest node
2021-05-19T12:14:09.028556846Z ERROR 2021-05-19 12:14:06,625 INFO: following a different leader because i am not the healthiest node
2021-05-19T12:14:19.029514504Z ERROR 2021-05-19 12:14:16,624 INFO: following a different leader because i am not the healthiest node
2021-05-19T12:14:29.030810524Z ERROR 2021-05-19 12:14:26,624 INFO: following a different leader because i am not the healthiest node
2021-05-19T12:14:39.031587907Z ERROR 2021-05-19 12:14:36,626 INFO: following a different leader because i am not the healthiest node
2021-05-19T12:14:49.031315841Z ERROR 2021-05-19 12:14:46,625 INFO: following a different leader because i am not the healthiest node
2021-05-19T12:14:59.054734460Z ERROR 2021-05-19 12:14:56,625 INFO: following a different leader because i am not the healthiest node
2021-05-19T12:15:09.063950025Z ERROR 2021-05-19 12:15:06,625 INFO: following a different leader because i am not the healthiest node
2021-05-19T12:15:19.031738630Z ERROR 2021-05-19 12:15:16,625 INFO: following a different leader because i am not the healthiest node
2021-05-19T12:15:24.047773812Z INFO time="2021-05-19T12:15:16Z" level=info msg="Starting Server: :9187" source="postgres_exporter.go:1672"
2021-05-19T12:15:24.047773812Z INFO time="2021-05-19T12:15:16Z" level=info msg="Semantic Version Changed on \"127.0.0.1:5432\": 0.0.0 -> 13.2.0" source="postgres_exporter.go:1405"
2021-05-19T12:15:24.047773812Z INFO time="2021-05-19T12:15:16Z" level=info msg="Established new database connection to \"127.0.0.1:5432\"." source="postgres_exporter.go:878"
2021-05-19T12:15:24.047773812Z INFO Wed May 19 12:15:16 UTC 2021 INFO: Starting postgres-exporter..
2021-05-19T12:15:24.047773812Z INFO Wed May 19 12:15:15 UTC 2021 INFO: No custom queries detected. Applying default configuration..
2021-05-19T12:15:24.047773812Z INFO (1 row)
2021-05-19T12:15:24.047773812Z INFO 1
2021-05-19T12:15:24.047773812Z INFO ----------
2021-05-19T12:15:24.047773812Z INFO ?column?
2021-05-19T12:15:24.047773812Z INFO Wed May 19 12:15:15 UTC 2021 INFO: Checking if ccp_monitoring is is created..
2021-05-19T12:15:24.047773812Z INFO Wed May 19 12:15:15 UTC 2021 INFO: Waiting for PostgreSQL to be ready..
2021-05-19T12:15:29.034298982Z ERROR 2021-05-19 12:15:26,627 INFO: following a different leader because i am not the healthiest node
2021-05-19T12:15:39.038424421Z ERROR 2021-05-19 12:15:36,625 INFO: following a different leader because i am not the healthiest node
2021-05-19T12:15:49.030632111Z ERROR 2021-05-19 12:15:46,627 INFO: following a different leader because i am not the healthiest node
2021-05-19T12:15:59.044899395Z ERROR 2021-05-19 12:15:56,625 INFO: following a different leader because i am not the healthiest node

As we can see, the primary node was shut down around 14:03 local time (12:03 UTC) and a new pod was created, but it does not consider itself the healthiest node.

Replica logs

receiveTimestamp severity textPayload
2021-05-19T12:03:07.017325622Z ERROR 2021-05-19 12:03:02,896 INFO: no action.  i am a secondary and i am following a leader
2021-05-19T12:03:07.017325622Z ERROR 2021-05-19 12:03:02,834 INFO: does not have lock
2021-05-19T12:03:07.017325622Z ERROR 2021-05-19 12:03:02,834 INFO: Lock owner: <primary-pod-name>; I am <replica-pod-name>
2021-05-19T12:03:17.040610373Z ERROR 2021-05-19 12:03:12,900 INFO: no action.  i am a secondary and i am following a leader
2021-05-19T12:03:17.040610373Z ERROR 2021-05-19 12:03:12,831 INFO: does not have lock
2021-05-19T12:03:17.040610373Z ERROR 2021-05-19 12:03:12,831 INFO: Lock owner: <primary-pod-name>; I am <replica-pod-name>
2021-05-19T12:03:27.021520047Z ERROR 2021-05-19 12:03:22,888 INFO: no action.  i am a secondary and i am following a leader
2021-05-19T12:03:27.021520047Z ERROR 2021-05-19 12:03:22,832 INFO: does not have lock
2021-05-19T12:03:27.021520047Z ERROR 2021-05-19 12:03:22,832 INFO: Lock owner: <primary-pod-name>; I am <replica-pod-name>
2021-05-19T12:03:37.011061049Z ERROR 2021-05-19 12:03:32,913 INFO: no action.  i am a secondary and i am following a leader
2021-05-19T12:03:37.011061049Z ERROR 2021-05-19 12:03:32,837 INFO: does not have lock
2021-05-19T12:03:37.011061049Z ERROR 2021-05-19 12:03:32,836 INFO: Lock owner: <primary-pod-name>; I am <replica-pod-name>
2021-05-19T12:03:47.009963801Z ERROR 2021-05-19 12:03:42,881 INFO: no action.  i am a secondary and i am following a leader
2021-05-19T12:03:47.009963801Z ERROR 2021-05-19 12:03:42,828 INFO: does not have lock
2021-05-19T12:03:47.009963801Z ERROR 2021-05-19 12:03:42,828 INFO: Lock owner: <primary-pod-name>; I am <replica-pod-name>
2021-05-19T12:03:57.054579573Z ERROR 2021-05-19 12:03:52,882 INFO: no action.  i am a secondary and i am following a leader
2021-05-19T12:03:57.054579573Z ERROR 2021-05-19 12:03:52,830 INFO: does not have lock
2021-05-19T12:03:57.054579573Z ERROR 2021-05-19 12:03:52,830 INFO: Lock owner: <primary-pod-name>; I am <replica-pod-name>
2021-05-19T12:04:07.017100663Z ERROR 2021-05-19 12:04:02,897 INFO: no action.  i am a secondary and i am following a leader
2021-05-19T12:04:07.017100663Z ERROR 2021-05-19 12:04:02,829 INFO: does not have lock
2021-05-19T12:04:07.017100663Z ERROR 2021-05-19 12:04:02,829 INFO: Lock owner: <primary-pod-name>; I am <replica-pod-name>
2021-05-19T12:04:17.021264686Z INFO server signaled
2021-05-19T12:04:17.232277232Z ERROR 2021-05-19 12:04:12,392 INFO: following a different leader because i am not the healthiest node
2021-05-19T12:04:17.232277232Z ERROR 2021-05-19 12:04:12,333 INFO: Reaped pid=1493, exit status=0
2021-05-19T12:04:17.232277232Z ERROR 2021-05-19 12:04:12,306 INFO: My wal position exceeds maximum replication lag
2021-05-19T12:04:27.060252771Z ERROR 2021-05-19 12:04:22,813 INFO: following a different leader because i am not the healthiest node
2021-05-19T12:04:27.060252771Z ERROR 2021-05-19 12:04:22,806 INFO: My wal position exceeds maximum replication lag
2021-05-19T12:04:37.073647491Z ERROR 2021-05-19 12:04:32,812 INFO: following a different leader because i am not the healthiest node
2021-05-19T12:04:37.073647491Z ERROR 2021-05-19 12:04:32,806 INFO: My wal position exceeds maximum replication lag
2021-05-19T12:04:47.020805261Z ERROR 2021-05-19 12:04:42,812 INFO: following a different leader because i am not the healthiest node
2021-05-19T12:04:47.020805261Z ERROR 2021-05-19 12:04:42,806 INFO: My wal position exceeds maximum replication lag
2021-05-19T12:04:57.033420746Z ERROR 2021-05-19 12:04:52,817 INFO: following a different leader because i am not the healthiest node
2021-05-19T12:04:57.033420746Z ERROR 2021-05-19 12:04:52,809 INFO: My wal position exceeds maximum replication lag

As we can see, the replica started logging My wal position exceeds maximum replication lag just after the node was killed, and it was still logging the same message an hour later.

Doing a manual failover resolves the issue, but I would like to know how I can avoid this scenario. Why did the replica not take over the primary role?
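
One way to inspect what Patroni thinks of each member (role, state, timeline, lag) is patronictl list, run from inside a database pod; a sketch, with placeholder names:

    # list cluster members as Patroni sees them, including replication lag
    kubectl exec -it <replica-pod-name> -n <namespace> -- patronictl list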

In case this is a "normal" scenario, is there a way to configure the maximum replication lag to be less strict?
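
The Patroni setting behind this log message is maximum_lag_on_failover (default 1048576 bytes, i.e. 1 MiB). Assuming relaxing it is acceptable for your cluster, it could presumably be raised through Patroni's dynamic configuration; a sketch, with placeholder names and an arbitrary 32 MiB value:

    # raise the lag threshold Patroni uses when judging failover candidates
    kubectl exec -it <replica-pod-name> -n <namespace> -- \
        patronictl edit-config --set maximum_lag_on_failover=33554432 --force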

Thanks for reading and thank you in advance for any advice.

jkatz commented 3 years ago

This is documented:

https://access.crunchydata.com/documentation/postgres-operator/latest/tutorial/high-availability/#no-primary-available-after-both-synchronous-replication-instances-fail

Please also see a longer discussion on this in #2132.

Martin-Hogge commented 3 years ago

Thanks for the reply, but the link you provided talks about a scenario where both the primary and the replica go down; in my case only the primary went down. I could indeed do a manual failover as recommended in the documentation you linked, but I would prefer to avoid this situation in the first place.

In case this can't be avoided, do you see any side effects of having a cron job run every x minutes to check whether at least one primary is up and, if not, elect one? (A rough sketch of the idea follows.)
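
Such a watchdog might look roughly like this against the v4 pgo CLI (a sketch only; the grep pattern and the target instance name are assumptions to adapt):

    #!/bin/bash
    # hypothetical watchdog: if pgo reports no primary for the cluster, trigger a failover
    if ! pgo show cluster <clustername> -n <namespace> | grep -q '(primary)'; then
        # list valid failover targets, then promote one of them
        pgo failover <clustername> -n <namespace> --query
        pgo failover <clustername> -n <namespace> --target=<replica-instance> --no-prompt
    fi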

Thank you.

iosifnicolae2 commented 2 years ago

> This is documented:
>
> https://access.crunchydata.com/documentation/postgres-operator/latest/tutorial/high-availability/#no-primary-available-after-both-synchronous-replication-instances-fail
>
> Please also see a longer discussion on this in #2132.

How can I run pgo failover --target using the v5 Crunchy Database Operator?

Thank you!

jkatz commented 2 years ago

This was added in #2794 and will be available in an upcoming release.
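
Until that release ships, one interim option is to drive Patroni directly from inside a v5 instance pod (a sketch; the pod name and candidate member are placeholders, and the database container name is assumed):

    # request a Patroni failover to a specific member, skipping the confirmation prompt
    kubectl exec -it -n <namespace> <instance-pod-name> -c database -- \
        patronictl failover --candidate <target-member> --force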