[question]: PG replica instance waiting for the instances to become active #99

Closed wesbragagt closed 3 months ago

wesbragagt commented 3 months ago

What is your question?

I noticed that the primary instance of my pg cluster is stuck in a pending/initializing state where it cannot find a WAL file in s3:

{"level":"info","ts":"2024-08-08T15:49:51Z","logger":"wal-restore","msg":"WAL file not found in the recovery object store","logging_pod":"pg-88cc-1","walName":"0000000500000008000000C2","options":["--cloud-provider","aws-s3","s3://implentio-pg-88cc-backup-64c632f1cc08906d/","pg-88cc"],"startTime":"2024-08-08T15:49:50Z","endTime":"2024-08-08T15:49:51Z","elapsedWalTime":0.416124562}

I've followed the recommendation of extracting the cluster status info:

kubectl cnpg status -v pg-88cc -n implentio
Cluster Summary
Name:                pg-88cc
Namespace:           implentio
System ID:           7398708564887441431
PostgreSQL Image:    ghcr.io/cloudnative-pg/postgresql:16.3
Primary instance:    pg-88cc-2
Primary start time:  2024-08-08 01:38:45 +0000 UTC (uptime 14h31m26s)
Status:              Waiting for the instances to become active Some instances are not yet active. Please wait.
Instances:           2
Ready instances:     1
Current Write LSN:   F/563B4000 (Timeline: 8 - WAL File: 000000080000000F00000056)

PostgreSQL Configuration
archive_command = '/controller/manager wal-archive --log-destination /controller/log/postgres.json %p'
archive_mode = 'on'
archive_timeout = '5min'
cluster_name = 'pg-88cc'
dynamic_shared_memory_type = 'posix'
effective_cache_size = '5000MB'
full_page_writes = 'on'
hot_standby = 'true'
listen_addresses = '*'
log_destination = 'csvlog'
log_directory = '/controller/log'
log_filename = 'postgres'
log_rotation_age = '0'
log_rotation_size = '0'
log_truncate_on_rotation = 'false'
logging_collector = 'on'
maintenance_work_mem = '500MB'
max_connections = '150'
max_parallel_workers = '32'
max_replication_slots = '32'
max_worker_processes = '32'
port = '5432'
restart_after_crash = 'false'
shared_buffers = '2500MB'
shared_memory_type = 'mmap'
shared_preload_libraries = ''
ssl = 'on'
ssl_ca_file = '/controller/certificates/client-ca.crt'
ssl_cert_file = '/controller/certificates/server.crt'
ssl_key_file = '/controller/certificates/server.key'
ssl_max_protocol_version = 'TLSv1.3'
ssl_min_protocol_version = 'TLSv1.3'
unix_socket_directories = '/controller/run'
wal_buffers = '32MB'
wal_keep_size = '1024MB'
wal_level = 'logical'
wal_log_hints = 'on'
wal_receiver_timeout = '5s'
wal_sender_timeout = '5s'
work_mem = '17MB'
cnpg.config_sha256 = '364094cab00a0b5357db217f9a3db02842cc1fd6556f1e6b6e434c7d5d22556e'

PostgreSQL HBA Rules

#
# FIXED RULES
#

# Grant local access ('local' user map)
local all all peer map=local

# Require client certificate authentication for the streaming_replica user
hostssl postgres streaming_replica all cert
hostssl replication streaming_replica all cert
hostssl all cnpg_pooler_pgbouncer all cert

#
# USER-DEFINED RULES
#

#
# DEFAULT RULES
#
host all all all scram-sha-256

Certificates Status
Certificate Name      Expiration Date                Days Left Until Expiration
----------------      ---------------                --------------------------
pg-client-certs-3e15  2024-08-17 00:36:26 +0000 UTC  8.35
pg-server-certs-20d4  2024-08-17 00:36:26 +0000 UTC  8.35

Continuous Backup status
First Point of Recoverability:  Not Available
Working WAL archiving:          Failing
WALs waiting to be archived:    1685
Last Archived WAL:              -
Last Failed WAL:                00000002.history   @   2024-08-08T16:10:00.508687Z

Physical backups
No running physical backups found

Streaming Replication status
Not available yet

Unmanaged Replication Slot Status
No unmanaged replication slots found

Managed roles status
No roles managed

Tablespaces status
No managed tablespaces

Instances status
Name       Database Size  Current LSN  Replication role  Status             QoS        Manager Version  Node
----       -------------  -----------  ----------------  ------             ---        ---------------  ----
pg-88cc-2  5446 MB        F/563B4000   Primary           OK                 Burstable  1.23.1           ip-10-0-158-226.us-west-2.compute.internal
pg-88cc-1  -              -            -                 pod not available  Burstable  -                ip-10-0-99-215.us-west-2.compute.internal

I've extracted the logs for the instance

pg-88cc-1 postgres {"level":"info","ts":"2024-08-08T16:13:47Z","msg":"DB not available, will retry","controller":"cluster","controllerGroup":"postgresql.cnpg.io","controllerKind":"Cluster","Cluster":{"name":"pg-88cc","namespace":"implentio"},"namespace":"implentio","name":"pg-88cc","reconcileID":"33411a71-c816-485f-9571-68a926ec0c13","logging_pod":"pg-88cc-1","err":"failed to connect to `host=/controller/run user=postgres database=postgres`: server error (FATAL: the database system is starting up (SQLSTATE 57P03))"}
pg-88cc-1 postgres {"level":"info","ts":"2024-08-08T16:13:47Z","logger":"wal-restore","msg":"WAL file not found in the recovery object store","logging_pod":"pg-88cc-1","walName":"00000009.history","options":["--cloud-provider","aws-s3","s3://implentio-pg-88cc-backup-64c632f1cc08906d/","pg-88cc"],"startTime":"2024-08-08T16:13:47Z","endTime":"2024-08-08T16:13:47Z","elapsedWalTime":0.356192327}
pg-88cc-1 postgres {"level":"info","ts":"2024-08-08T16:13:48Z","logger":"postgres","msg":"record","logging_pod":"pg-88cc-1","record":{"log_time":"2024-08-08 16:13:48.499 UTC","user_name":"postgres","database_name":"postgres","process_id":"26895","connection_from":"[local]","session_id":"66b4eebc.690f","session_line_num":"1","session_start_time":"2024-08-08 16:13:48 UTC","transaction_id":"0","error_severity":"FATAL","sql_state_code":"57P03","message":"the database system is starting up","backend_type":"client backend","query_id":"0"}}
pg-88cc-1 postgres {"level":"info","ts":"2024-08-08T16:13:48Z","logger":"wal-restore","msg":"WAL file not found in the recovery object store","logging_pod":"pg-88cc-1","walName":"00000008.history","options":["--cloud-provider","aws-s3","s3://implentio-pg-88cc-backup-64c632f1cc08906d/","pg-88cc"],"startTime":"2024-08-08T16:13:48Z","endTime":"2024-08-08T16:13:48Z","elapsedWalTime":0.358369271}

What primary components of the stack does this relate to?

terraform

fullykubed commented 3 months ago

Can you provide logs from the primary?

Also do you have backups enabled? Can you provide the YAML for the cluster manifest?
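
For reference, both can be pulled with standard kubectl commands, e.g. (names taken from this thread):

kubectl logs pg-88cc-2 -n implentio -c postgres
kubectl get clusters.postgresql.cnpg.io pg-88cc -n implentio -o yaml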

A healthy system with backups enabled would have a section that looks like this:

Continuous Backup status
First Point of Recoverability:  2024-07-28T00:00:07Z
Working WAL archiving:          OK
WALs waiting to be archived:    0
Last Archived WAL:              0000018100000067000000AB   @   2024-08-08T16:01:27.885872Z
Last Failed WAL:                00000181.history           @   2024-08-08T12:36:19.907956Z

Based on what I am seeing in your logs, your WAL archiving has never been working (which is why Last Archived WAL shows -). As a result, your system is not able to recover from a failure.

fullykubed commented 3 months ago

Also, I'd like to see the entire logs from first launch of the replica pod that isn't starting. The snippet you posted does not provide enough context as to why it is trying to do a WAL restore.

wesbragagt commented 3 months ago

Logs from initialization:

{"level":"info","ts":"2024-08-08T16:29:28Z","logger":"setup","msg":"starting tablespace manager","logging_pod":"pg-88cc-1"}
{"level":"info","ts":"2024-08-08T16:29:28Z","logger":"setup","msg":"starting external server manager","logging_pod":"pg-88cc-1"}
{"level":"info","ts":"2024-08-08T16:29:28Z","logger":"setup","msg":"starting controller-runtime manager","logging_pod":"pg-88cc-1"}
{"level":"info","ts":"2024-08-08T16:29:28Z","msg":"Starting EventSource","controller":"cluster","controllerGroup":"postgresql.cnpg.io","controllerKind":"Cluster","source":"kind source: *v1.Cluster"}
{"level":"info","ts":"2024-08-08T16:29:28Z","msg":"Starting Controller","controller":"cluster","controllerGroup":"postgresql.cnpg.io","controllerKind":"Cluster"}
{"level":"info","ts":"2024-08-08T16:29:28Z","msg":"Starting webserver","logging_pod":"pg-88cc-1","address":":9187"}
{"level":"info","ts":"2024-08-08T16:29:28Z","logger":"roles_reconciler","msg":"starting up the runnable","logging_pod":"pg-88cc-1"}
{"level":"info","ts":"2024-08-08T16:29:28Z","logger":"roles_reconciler","msg":"skipping the RoleSynchronizer in replicas","logging_pod":"pg-88cc-1"}
{"level":"info","ts":"2024-08-08T16:29:28Z","logger":"roles_reconciler","msg":"setting up RoleSynchronizer loop","logging_pod":"pg-88cc-1"}
{"level":"info","ts":"2024-08-08T16:29:28Z","msg":"Starting webserver","logging_pod":"pg-88cc-1","address":"localhost:8010"}
{"level":"info","ts":"2024-08-08T16:29:28Z","msg":"Starting EventSource","controller":"cluster","controllerGroup":"postgresql.cnpg.io","controllerKind":"Cluster","source":"kind source: *v1.Cluster"}
{"level":"info","ts":"2024-08-08T16:29:28Z","msg":"Starting Controller","controller":"cluster","controllerGroup":"postgresql.cnpg.io","controllerKind":"Cluster"}
{"level":"info","ts":"2024-08-08T16:29:28Z","msg":"Starting webserver","logging_pod":"pg-88cc-1","address":":8000"}
{"level":"info","ts":"2024-08-08T16:29:28Z","msg":"Starting EventSource","controller":"cluster","controllerGroup":"postgresql.cnpg.io","controllerKind":"Cluster","source":"kind source: *v1.Cluster"}
{"level":"info","ts":"2024-08-08T16:29:28Z","msg":"Starting Controller","controller":"cluster","controllerGroup":"postgresql.cnpg.io","controllerKind":"Cluster"}
{"level":"info","ts":"2024-08-08T16:29:28Z","msg":"Starting workers","controller":"cluster","controllerGroup":"postgresql.cnpg.io","controllerKind":"Cluster","worker count":1}
{"level":"info","ts":"2024-08-08T16:29:28Z","msg":"Starting workers","controller":"cluster","controllerGroup":"postgresql.cnpg.io","controllerKind":"Cluster","worker count":1}
{"level":"info","ts":"2024-08-08T16:29:28Z","msg":"Starting workers","controller":"cluster","controllerGroup":"postgresql.cnpg.io","controllerKind":"Cluster","worker count":1}
{"level":"info","ts":"2024-08-08T16:29:28Z","msg":"Refreshed configuration file","controller":"cluster","controllerGroup":"postgresql.cnpg.io","controllerKind":"Cluster","Cluster":{"name":"pg-88cc","namespace":"implentio"},"namespace":"implentio","name":"pg-88cc","reconcileID":"8e50668a-5664-4fbd-b9f2-d3fb5e35fd28","logging_pod":"pg-88cc-1","filename":"/controller/certificates/server.crt","secret":"pg-server-certs-20d4"}
{"level":"info","ts":"2024-08-08T16:29:28Z","msg":"Refreshed configuration file","controller":"cluster","controllerGroup":"postgresql.cnpg.io","controllerKind":"Cluster","Cluster":{"name":"pg-88cc","namespace":"implentio"},"namespace":"implentio","name":"pg-88cc","reconcileID":"8e50668a-5664-4fbd-b9f2-d3fb5e35fd28","logging_pod":"pg-88cc-1","filename":"/controller/certificates/server.key","secret":"pg-server-certs-20d4"}
{"level":"info","ts":"2024-08-08T16:29:28Z","msg":"Refreshed configuration file","controller":"cluster","controllerGroup":"postgresql.cnpg.io","controllerKind":"Cluster","Cluster":{"name":"pg-88cc","namespace":"implentio"},"namespace":"implentio","name":"pg-88cc","reconcileID":"8e50668a-5664-4fbd-b9f2-d3fb5e35fd28","logging_pod":"pg-88cc-1","filename":"/controller/certificates/streaming_replica.crt","secret":"pg-client-certs-3e15"}
{"level":"info","ts":"2024-08-08T16:29:28Z","msg":"Refreshed configuration file","controller":"cluster","controllerGroup":"postgresql.cnpg.io","controllerKind":"Cluster","Cluster":{"name":"pg-88cc","namespace":"implentio"},"namespace":"implentio","name":"pg-88cc","reconcileID":"8e50668a-5664-4fbd-b9f2-d3fb5e35fd28","logging_pod":"pg-88cc-1","filename":"/controller/certificates/streaming_replica.key","secret":"pg-client-certs-3e15"}
{"level":"info","ts":"2024-08-08T16:29:28Z","msg":"Refreshed configuration file","controller":"cluster","controllerGroup":"postgresql.cnpg.io","controllerKind":"Cluster","Cluster":{"name":"pg-88cc","namespace":"implentio"},"namespace":"implentio","name":"pg-88cc","reconcileID":"8e50668a-5664-4fbd-b9f2-d3fb5e35fd28","logging_pod":"pg-88cc-1","filename":"/controller/certificates/client-ca.crt","secret":"pg-client-certs-3e15"}
{"level":"info","ts":"2024-08-08T16:29:28Z","msg":"Refreshed configuration file","controller":"cluster","controllerGroup":"postgresql.cnpg.io","controllerKind":"Cluster","Cluster":{"name":"pg-88cc","namespace":"implentio"},"namespace":"implentio","name":"pg-88cc","reconcileID":"8e50668a-5664-4fbd-b9f2-d3fb5e35fd28","logging_pod":"pg-88cc-1","filename":"/controller/certificates/server-ca.crt","secret":"pg-server-certs-20d4"}
{"level":"info","ts":"2024-08-08T16:29:28Z","msg":"Installed configuration file","logging_pod":"pg-88cc-1","pgdata":"/var/lib/postgresql/data/pgdata","filename":"pg_ident.conf"}
{"level":"info","ts":"2024-08-08T16:29:28Z","msg":"Updated replication settings","logging_pod":"pg-88cc-1","filename":"override.conf"}
{"level":"info","ts":"2024-08-08T16:29:28Z","msg":"Found previous run flag","logging_pod":"pg-88cc-1","filename":"/var/lib/postgresql/data/pgdata/cnpg_initialized-pg-88cc-1"}
{"level":"info","ts":"2024-08-08T16:29:28Z","msg":"Extracting pg_controldata information","logging_pod":"pg-88cc-1","reason":"postmaster start up"}
{"level":"info","ts":"2024-08-08T16:29:28Z","logger":"pg_controldata","msg":"pg_control version number:            1300\nCatalog version number:               202307071\nDatabase system identifier:           7398708564887441431\nDatabase cluster state:               shut down in recovery\npg_control last modified:             Thu 08 Aug 2024 04:29:17 PM UTC\nLatest checkpoint location:           D/20005D38\nLatest checkpoint's REDO location:    D/20005D00\nLatest checkpoint's REDO WAL file:    000000080000000D00000020\nLatest checkpoint's TimeLineID:       8\nLatest checkpoint's PrevTimeLineID:   8\nLatest checkpoint's full_page_writes: on\nLatest checkpoint's NextXID:          0:3065\nLatest checkpoint's NextOID:          33843\nLatest checkpoint's NextMultiXactId:  300\nLatest checkpoint's NextMultiOffset:  926\nLatest checkpoint's oldestXID:        722\nLatest checkpoint's oldestXID's DB:   1\nLatest checkpoint's oldestActiveXID:  3065\nLatest checkpoint's oldestMultiXid:   1\nLatest checkpoint's oldestMulti's DB: 1\nLatest checkpoint's oldestCommitTsXid:0\nLatest checkpoint's newestCommitTsXid:0\nTime of latest checkpoint:            Thu 08 Aug 2024 04:26:40 AM UTC\nFake LSN counter for unlogged rels:   0/3E8\nMinimum recovery ending location:     D/21000028\nMin recovery ending loc's timeline:   8\nBackup start location:                0/0\nBackup end location:                  0/0\nEnd-of-backup record required:        no\nwal_level setting:                    logical\nwal_log_hints setting:                on\nmax_connections setting:              150\nmax_worker_processes setting:         32\nmax_wal_senders setting:              10\nmax_prepared_xacts setting:           0\nmax_locks_per_xact setting:           64\ntrack_commit_timestamp setting:       off\nMaximum data alignment:               8\nDatabase block size:                  8192\nBlocks per segment of large relation: 131072\nWAL block size:                       8192\nBytes per WAL segment:                16777216\nMaximum length of identifiers:        64\nMaximum columns in an index:          32\nMaximum size of a TOAST chunk:        1996\nSize of a large-object chunk:         2048\nDate/time type storage:               64-bit integers\nFloat8 argument passing:              by value\nData page checksum version:           0\nMock authentication nonce:            863b16d8e15e0d221fb3ac60ce9cd0be9c3dce507dfd5575febc2e8956bb948d\n","pipe":"stdout","logging_pod":"pg-88cc-1"}
{"level":"info","ts":"2024-08-08T16:29:28Z","msg":"Instance is still down, will retry in 1 second","controller":"cluster","controllerGroup":"postgresql.cnpg.io","controllerKind":"Cluster","Cluster":{"name":"pg-88cc","namespace":"implentio"},"namespace":"implentio","name":"pg-88cc","reconcileID":"8e50668a-5664-4fbd-b9f2-d3fb5e35fd28","logging_pod":"pg-88cc-1"}
{"level":"info","ts":"2024-08-08T16:29:28Z","logger":"postgres","msg":"2024-08-08 16:29:28.477 UTC [26] LOG:  redirecting log output to logging collector process","pipe":"stderr","logging_pod":"pg-88cc-1"}
{"level":"info","ts":"2024-08-08T16:29:28Z","logger":"postgres","msg":"2024-08-08 16:29:28.477 UTC [26] HINT:  Future log output will appear in directory \"/controller/log\".","pipe":"stderr","logging_pod":"pg-88cc-1"}
{"level":"info","ts":"2024-08-08T16:29:28Z","logger":"postgres","msg":"record","logging_pod":"pg-88cc-1","record":{"log_time":"2024-08-08 16:29:28.477 UTC","process_id":"26","session_id":"66b4f268.1a","session_line_num":"1","session_start_time":"2024-08-08 16:29:28 UTC","transaction_id":"0","error_severity":"LOG","sql_state_code":"00000","message":"ending log output to stderr","hint":"Future log output will go to log destination \"csvlog\".","backend_type":"postmaster","query_id":"0"}}
{"level":"info","ts":"2024-08-08T16:29:28Z","logger":"postgres","msg":"record","logging_pod":"pg-88cc-1","record":{"log_time":"2024-08-08 16:29:28.477 UTC","process_id":"26","session_id":"66b4f268.1a","session_line_num":"2","session_start_time":"2024-08-08 16:29:28 UTC","transaction_id":"0","error_severity":"LOG","sql_state_code":"00000","message":"starting PostgreSQL 16.3 (Debian 16.3-1.pgdg110+1) on aarch64-unknown-linux-gnu, compiled by gcc (Debian 10.2.1-6) 10.2.1 20210110, 64-bit","backend_type":"postmaster","query_id":"0"}}
{"level":"info","ts":"2024-08-08T16:29:28Z","logger":"postgres","msg":"record","logging_pod":"pg-88cc-1","record":{"log_time":"2024-08-08 16:29:28.477 UTC","process_id":"26","session_id":"66b4f268.1a","session_line_num":"3","session_start_time":"2024-08-08 16:29:28 UTC","transaction_id":"0","error_severity":"LOG","sql_state_code":"00000","message":"listening on IPv4 address \"0.0.0.0\", port 5432","backend_type":"postmaster","query_id":"0"}}
{"level":"info","ts":"2024-08-08T16:29:28Z","logger":"postgres","msg":"record","logging_pod":"pg-88cc-1","record":{"log_time":"2024-08-08 16:29:28.477 UTC","process_id":"26","session_id":"66b4f268.1a","session_line_num":"4","session_start_time":"2024-08-08 16:29:28 UTC","transaction_id":"0","error_severity":"LOG","sql_state_code":"00000","message":"listening on IPv6 address \"::\", port 5432","backend_type":"postmaster","query_id":"0"}}
{"level":"info","ts":"2024-08-08T16:29:28Z","logger":"postgres","msg":"2024-08-08 16:29:28.477 UTC [26] LOG:  ending log output to stderr","source":"/controller/log/postgres","logging_pod":"pg-88cc-1"}
{"level":"info","ts":"2024-08-08T16:29:28Z","logger":"postgres","msg":"2024-08-08 16:29:28.477 UTC [26] HINT:  Future log output will go to log destination \"csvlog\".","source":"/controller/log/postgres","logging_pod":"pg-88cc-1"}
{"level":"info","ts":"2024-08-08T16:29:28Z","logger":"postgres","msg":"record","logging_pod":"pg-88cc-1","record":{"log_time":"2024-08-08 16:29:28.481 UTC","process_id":"26","session_id":"66b4f268.1a","session_line_num":"5","session_start_time":"2024-08-08 16:29:28 UTC","transaction_id":"0","error_severity":"LOG","sql_state_code":"00000","message":"listening on Unix socket \"/controller/run/.s.PGSQL.5432\"","backend_type":"postmaster","query_id":"0"}}
{"level":"info","ts":"2024-08-08T16:29:28Z","logger":"postgres","msg":"record","logging_pod":"pg-88cc-1","record":{"log_time":"2024-08-08 16:29:28.487 UTC","process_id":"30","session_id":"66b4f268.1e","session_line_num":"1","session_start_time":"2024-08-08 16:29:28 UTC","transaction_id":"0","error_severity":"LOG","sql_state_code":"00000","message":"database system was shut down in recovery at 2024-08-08 16:29:17 UTC","backend_type":"startup","query_id":"0"}}
{"level":"info","ts":"2024-08-08T16:29:28Z","logger":"wal-restore","msg":"WAL file not found in the recovery object store","logging_pod":"pg-88cc-1","walName":"00000009.history","options":["--cloud-provider","aws-s3","s3://implentio-pg-88cc-backup-64c632f1cc08906d/","pg-88cc"],"startTime":"2024-08-08T16:29:28Z","endTime":"2024-08-08T16:29:28Z","elapsedWalTime":0.375684911}
wesbragagt commented 3 months ago

@fullykubed I was using the default kube_pg_cluster backup schedule, which runs once a week on Sundays. I don't think it ever triggered, since the database was bootstrapped on Monday and the loaded data is only a couple of days old.
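
For what it's worth, any scheduled or completed backups should show up as CNPG resources, e.g.:

kubectl get scheduledbackups,backups -n implentio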

wesbragagt commented 3 months ago

Yaml for cluster manifest:

apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  annotations:
    config.linkerd.io/skip-inbound-ports: "5432"
    panfactum.com/admin-role: admin-implentio-pg-88cc
    panfactum.com/db: "true"
    panfactum.com/db-type: PostgreSQL
    panfactum.com/reader-role: reader-implentio-pg-88cc
    panfactum.com/service: pg-88cc-pooler-rw.implentio
    panfactum.com/service-port: "5432"
    panfactum.com/superuser-role: superuser-implentio-pg-88cc
    panfactum.com/vault-mount: db/implentio-pg-88cc
  creationTimestamp: "2024-08-03T00:47:36Z"
  generation: 11
  labels:
    id: pg-pg-88cc-295bc33c00cc195b
    panfactum.com/environment: production
    panfactum.com/local: "false"
    panfactum.com/module: kube_pg_cluster
    panfactum.com/prevent-lifetime-eviction: "true"
    panfactum.com/region: us-west-2
    panfactum.com/root-module: implentio_db
    panfactum.com/scheduler: "true"
    panfactum.com/stack-commit: d60fefd99d4be7a50c582ee25dfcb00976762f4c
    panfactum.com/stack-version: edge.24-07-08
    panfactum.com/workload: pg-pg-88cc
  name: pg-88cc
  namespace: implentio
  resourceVersion: "43972319"
  uid: 95d5a4a3-9968-43e8-99ef-a1a0f6962963
spec:
  affinity:
    enablePodAntiAffinity: true
    podAntiAffinityType: required
    tolerations:
    - effect: NoSchedule
      key: spot
      operator: Equal
      value: "true"
    - effect: NoSchedule
      key: burstable
      operator: Equal
      value: "true"
    - effect: NoSchedule
      key: arm64
      operator: Equal
      value: "true"
    topologyKey: node.kubernetes.io/instance-type
  backup:
    barmanObjectStore:
      data:
        compression: bzip2
        jobs: 8
      destinationPath: s3://implentio-pg-88cc-backup-64c632f1cc08906d/
      s3Credentials:
        inheritFromIAMRole: true
      wal:
        compression: bzip2
        maxParallel: 8
    retentionPolicy: 7d
    target: prefer-standby
  bootstrap:
    initdb:
      database: app
      encoding: UTF8
      localeCType: C
      localeCollate: C
      owner: app
      postInitApplicationSQL:
      - REVOKE ALL ON SCHEMA public FROM PUBLIC;
      - CREATE ROLE reader NOINHERIT;
      - GRANT pg_read_all_data TO reader;
      - GRANT USAGE ON SCHEMA public TO reader;
      - CREATE ROLE writer NOINHERIT;
      - GRANT pg_write_all_data, pg_read_all_data TO writer;
      - GRANT ALL PRIVILEGES ON SCHEMA public TO writer;
      - GRANT CONNECT ON DATABASE app TO cnpg_pooler_pgbouncer;
      - GRANT ALL PRIVILEGES ON SCHEMA public TO cnpg_pooler_pgbouncer;
      - CREATE OR REPLACE FUNCTION user_search(uname TEXT) RETURNS TABLE (usename
        name, passwd text) LANGUAGE sql SECURITY DEFINER AS 'SELECT usename, passwd
        FROM pg_shadow WHERE usename=$1;'
      - REVOKE ALL ON FUNCTION user_search(text) FROM public;
      - GRANT EXECUTE ON FUNCTION user_search(text) TO cnpg_pooler_pgbouncer;
      postInitSQL:
      - REVOKE ALL ON SCHEMA public FROM PUBLIC;
      - CREATE ROLE cnpg_pooler_pgbouncer WITH LOGIN;
      - GRANT ALL PRIVILEGES ON SCHEMA public TO cnpg_pooler_pgbouncer;
      - GRANT CONNECT ON DATABASE postgres TO cnpg_pooler_pgbouncer;
      - CREATE OR REPLACE FUNCTION user_search(uname TEXT) RETURNS TABLE (usename
        name, passwd text) LANGUAGE sql SECURITY DEFINER AS 'SELECT usename, passwd
        FROM pg_shadow WHERE usename=$1;'
      - REVOKE ALL ON FUNCTION user_search(text) FROM public;
      - GRANT EXECUTE ON FUNCTION user_search(text) TO cnpg_pooler_pgbouncer;
  certificates:
    clientCASecret: pg-client-certs-3e15
    replicationTLSSecret: pg-client-certs-3e15
    serverCASecret: pg-server-certs-20d4
    serverTLSSecret: pg-server-certs-20d4
  enablePDB: false
  enableSuperuserAccess: true
  failoverDelay: 60
  imageName: ghcr.io/cloudnative-pg/postgresql:16.3
  inheritedMetadata:
    annotations:
      config.linkerd.io/skip-inbound-ports: "5432"
      linkerd.io/inject: enabled
      resize.topolvm.io/increase: 100%
      resize.topolvm.io/storage_limit: 200Gi
      resize.topolvm.io/threshold: 35%
    labels:
      id: pg-pg-88cc-295bc33c00cc195b
      panfactum.com/environment: production
      panfactum.com/local: "false"
      panfactum.com/module: kube_pg_cluster
      panfactum.com/prevent-lifetime-eviction: "true"
      panfactum.com/region: us-west-2
      panfactum.com/root-module: implentio_db
      panfactum.com/scheduler: "true"
      panfactum.com/stack-commit: d60fefd99d4be7a50c582ee25dfcb00976762f4c
      panfactum.com/stack-version: edge.24-07-08
      panfactum.com/workload: pg-pg-88cc
      pg-cluster: implentio-pg-88cc
  instances: 2
  logLevel: info
  maxSyncReplicas: 0
  minSyncReplicas: 0
  monitoring:
    customQueriesConfigMap:
    - key: queries
      name: cnpg-default-monitoring
    disableDefaultQueries: false
    enablePodMonitor: true
  postgresGID: 26
  postgresUID: 26
  postgresql:
    parameters:
      archive_mode: "on"
      archive_timeout: 5min
      dynamic_shared_memory_type: posix
      effective_cache_size: 5000MB
      log_destination: csvlog
      log_directory: /controller/log
      log_filename: postgres
      log_rotation_age: "0"
      log_rotation_size: "0"
      log_truncate_on_rotation: "false"
      logging_collector: "on"
      maintenance_work_mem: 500MB
      max_connections: "150"
      max_parallel_workers: "32"
      max_replication_slots: "32"
      max_worker_processes: "32"
      shared_buffers: 2500MB
      shared_memory_type: mmap
      shared_preload_libraries: ""
      ssl_max_protocol_version: TLSv1.3
      ssl_min_protocol_version: TLSv1.3
      wal_buffers: 32MB
      wal_keep_size: 1024MB
      wal_level: logical
      wal_log_hints: "on"
      wal_receiver_timeout: 5s
      wal_sender_timeout: 5s
      work_mem: 17MB
    syncReplicaElectionConstraint:
      enabled: false
  primaryUpdateMethod: switchover
  primaryUpdateStrategy: unsupervised
  priorityClassName: database
  replicationSlots:
    highAvailability:
      enabled: true
      slotPrefix: _cnpg_
    synchronizeReplicas:
      enabled: true
    updateInterval: 30
  resources:
    limits:
      memory: 13000Mi
    requests:
      cpu: "4"
      memory: 10000Mi
  schedulerName: panfactum
  serviceAccountTemplate:
    metadata:
      annotations:
        eks.amazonaws.com/role-arn: arn:aws:iam::590183845935:role/pg-88cc-20240803003625866900000002
  smartShutdownTimeout: 20
  startDelay: 600
  stopDelay: 30
  storage:
    pvcTemplate:
      resources:
        requests:
          storage: 20Gi
      storageClassName: ebs-standard
    resizeInUseVolumes: true
  superuserSecret:
    name: pg-88cc-superuser-e6942fc715ffdc1fbafb7ca800e7270d1dc807cee47608a13abadb97cfa09888
  switchoverDelay: 30
  topologySpreadConstraints:
  - labelSelector:
      matchLabels:
        id: pg-pg-88cc-295bc33c00cc195b
    maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
status:
  availableArchitectures:
  - goArch: amd64
    hash: 94527128605ac5100415106fe26c480531d094b3f36626e562a8135f342b89e4
  - goArch: arm64
    hash: 9b7b08592e917ed3b20bb3ae404ea4c0c958bdee73e5411c452d6c464d77f0b4
  certificates:
    clientCASecret: pg-client-certs-3e15
    expirations:
      pg-client-certs-3e15: 2024-08-17 00:36:26 +0000 UTC
      pg-server-certs-20d4: 2024-08-17 00:36:26 +0000 UTC
    replicationTLSSecret: pg-client-certs-3e15
    serverAltDNSNames:
    - pg-88cc-rw
    - pg-88cc-rw.implentio
    - pg-88cc-rw.implentio.svc
    - pg-88cc-r
    - pg-88cc-r.implentio
    - pg-88cc-r.implentio.svc
    - pg-88cc-ro
    - pg-88cc-ro.implentio
    - pg-88cc-ro.implentio.svc
    serverCASecret: pg-server-certs-20d4
    serverTLSSecret: pg-server-certs-20d4
  cloudNativePGCommitHash: 336ddf53
  cloudNativePGOperatorHash: 9b7b08592e917ed3b20bb3ae404ea4c0c958bdee73e5411c452d6c464d77f0b4
  conditions:
  - lastTransitionTime: "2024-08-08T01:38:34Z"
    message: Cluster Is Not Ready
    reason: ClusterIsNotReady
    status: "False"
    type: Ready
  - lastTransitionTime: "2024-08-03T00:48:08Z"
    message: 'unexpected failure invoking barman-cloud-wal-archive: exit status 1'
    reason: ContinuousArchivingFailing
    status: "False"
    type: ContinuousArchiving
  configMapResourceVersion:
    metrics:
      cnpg-default-monitoring: "37155643"
  currentPrimary: pg-88cc-2
  currentPrimaryTimestamp: "2024-08-08T01:38:44.770147Z"
  healthyPVC:
  - pg-88cc-1
  - pg-88cc-2
  image: ghcr.io/cloudnative-pg/postgresql:16.3
  instanceNames:
  - pg-88cc-1
  - pg-88cc-2
  instances: 2
  instancesReportedState:
    pg-88cc-1:
      isPrimary: false
    pg-88cc-2:
      isPrimary: true
      timeLineID: 8
  instancesStatus:
    healthy:
    - pg-88cc-2
    replicating:
    - pg-88cc-1
  latestGeneratedNode: 2
  managedRolesStatus: {}
  phase: Waiting for the instances to become active
  phaseReason: Some instances are not yet active. Please wait.
  poolerIntegrations:
    pgBouncerIntegration: {}
  pvcCount: 2
  readService: pg-88cc-r
  readyInstances: 1
  secretsResourceVersion:
    applicationSecretVersion: "37155621"
    clientCaSecretVersion: "37146447"
    replicationSecretVersion: "37146447"
    serverCaSecretVersion: "37146439"
    serverSecretVersion: "37146439"
    superuserSecretVersion: "37146383"
  switchReplicaClusterStatus: {}
  targetPrimary: pg-88cc-2
  targetPrimaryTimestamp: "2024-08-08T01:38:34.845444Z"
  timelineID: 8
  topology:
    instances:
      pg-88cc-1: {}
      pg-88cc-2: {}
    nodesUsed: 2
    successfullyExtracted: true
  writeService: pg-88cc-rw
fullykubed commented 3 months ago

@wesbragagt ScheduledBackups are not your issue here. Those are full database snapshots and are only used for worst-case emergency recovery.

The continuous archiving is the problem. Continuous archiving is the process of writing the WAL files to S3 as they are generated so that you can do point-in-time recoveries. These files are also used to resync replica nodes when they are created, to avoid putting load on the primary.

The issue you are experiencing is that the WAL archiving is not working and never has been. As a result, cluster nodes obviously will not be able to restore from the non-existent archive.

Your next step is to figure out why the WAL archiving is not working. To do this, you need to look at the logs for the primary, as that is where the WAL files are generated and shipped to S3.
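
Two quick ways to check, assuming kubectl access to the namespace (pod name and socket path taken from the status output above):

kubectl logs pg-88cc-2 -n implentio -c postgres | grep -i wal-archive
kubectl exec -n implentio pg-88cc-2 -c postgres -- psql -h /controller/run -U postgres -c "SELECT archived_count, failed_count, last_failed_wal, last_failed_time FROM pg_stat_archiver;"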

fullykubed commented 3 months ago

For a conceptual overview, you should review the documentation here: https://cloudnative-pg.io/documentation/current/backup/#wal-archive

wesbragagt commented 3 months ago

@fullykubed thank you. I was able to extract this error from the primary:

{"level":"info","ts":"2024-08-08T16:38:26Z","logger":"barman-cloud-check-wal-archive","msg":"2024-08-08 16:38:26,038 [57] ERROR: WAL archive check failed for server pg-88cc: Expected empty archive","pipe":"stderr","logging_pod":"pg-88cc-2"}
{"level":"error","ts":"2024-08-08T16:38:26Z","logger":"wal-archive","msg":"Error invoking barman-cloud-check-wal-archive","logging_pod":"pg-88cc-2","currentPrimary":"pg-88cc-2","targetPrimary":"pg-88cc-2","options":["--cloud-provider","aws-s3","s3://implentio-pg-88cc-backup-64c632f1cc08906d/","pg-88cc"],"exitCode":-1,"error":"exit status 1","stacktrace":"github.com/cloudnative-pg/cloudnative-pg/pkg/management/log.(*logger).Error\n\tpkg/management/log/log.go:125\ngithub.com/cloudnative-pg/cloudnative-pg/pkg/management/barman/archiver.(*WALArchiver).CheckWalArchiveDestination\n\tpkg/management/barman/archiver/archiver.go:257\ngithub.com/cloudnative-pg/cloudnative-pg/internal/cmd/manager/walarchive.checkWalArchive\n\tinternal/cmd/manager/walarchive/cmd.go:417\ngithub.com/cloudnative-pg/cloudnative-pg/internal/cmd/manager/walarchive.run\n\tinternal/cmd/manager/walarchive/cmd.go:214\ngithub.com/cloudnative-pg/cloudnative-pg/internal/cmd/manager/walarchive.NewCmd.func1\n\tinternal/cmd/manager/walarchive/cmd.go:89\ngithub.com/spf13/cobra.(*Command).execute\n\tpkg/mod/github.com/spf13/cobra@v1.8.0/command.go:983\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\tpkg/mod/github.com/spf13/cobra@v1.8.0/command.go:1115\ngithub.com/spf13/cobra.(*Command).Execute\n\tpkg/mod/github.com/spf13/cobra@v1.8.0/command.go:1039\nmain.main\n\tcmd/manager/main.go:66\nruntime.main\n\t/opt/hostedtoolcache/go/1.22.2/x64/src/runtime/proc.go:271"}
{"level":"error","ts":"2024-08-08T16:38:26Z","msg":"while barman-cloud-check-wal-archive","logging_pod":"pg-88cc-2","error":"unexpected failure invoking barman-cloud-wal-archive: exit status 1","stacktrace":"github.com/cloudnative-pg/cloudnative-pg/pkg/management/log.(*logger).Error\n\tpkg/management/log/log.go:125\ngithub.com/cloudnative-pg/cloudnative-pg/pkg/management/log.Error\n\tpkg/management/log/log.go:163\ngithub.com/cloudnative-pg/cloudnative-pg/internal/cmd/manager/walarchive.checkWalArchive\n\tinternal/cmd/manager/walarchive/cmd.go:418\ngithub.com/cloudnative-pg/cloudnative-pg/internal/cmd/manager/walarchive.run\n\tinternal/cmd/manager/walarchive/cmd.go:214\ngithub.com/cloudnative-pg/cloudnative-pg/internal/cmd/manager/walarchive.NewCmd.func1\n\tinternal/cmd/manager/walarchive/cmd.go:89\ngithub.com/spf13/cobra.(*Command).execute\n\tpkg/mod/github.com/spf13/cobra@v1.8.0/command.go:983\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\tpkg/mod/github.com/spf13/cobra@v1.8.0/command.go:1115\ngithub.com/spf13/cobra.(*Command).Execute\n\tpkg/mod/github.com/spf13/cobra@v1.8.0/command.go:1039\nmain.main\n\tcmd/manager/main.go:66\nruntime.main\n\t/opt/hostedtoolcache/go/1.22.2/x64/src/runtime/proc.go:271"}
{"level":"error","ts":"2024-08-08T16:38:26Z","logger":"wal-archive","msg":"failed to run wal-archive command","logging_pod":"pg-88cc-2","error":"unexpected failure invoking barman-cloud-wal-archive: exit status 1","stacktrace":"github.com/cloudnative-pg/cloudnative-pg/pkg/management/log.(*logger).Error\n\tpkg/management/log/log.go:125\ngithub.com/cloudnative-pg/cloudnative-pg/internal/cmd/manager/walarchive.NewCmd.func1\n\tinternal/cmd/manager/walarchive/cmd.go:95\ngithub.com/spf13/cobra.(*Command).execute\n\tpkg/mod/github.com/spf13/cobra@v1.8.0/command.go:983\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\tpkg/mod/github.com/spf13/cobra@v1.8.0/command.go:1115\ngithub.com/spf13/cobra.(*Command).Execute\n\tpkg/mod/github.com/spf13/cobra@v1.8.0/command.go:1039\nmain.main\n\tcmd/manager/main.go:66\nruntime.main\n\t/opt/hostedtoolcache/go/1.22.2/x64/src/runtime/proc.go:271"}
{"level":"info","ts":"2024-08-08T16:38:26Z","logger":"postgres","msg":"record","logging_pod":"pg-88cc-2","record":{"log_time":"2024-08-08 16:38:26.077 UTC","process_id":"38","session_id":"66b4f481.26","session_line_num":"1","session_start_time":"2024-08-08 16:38:25 UTC","transaction_id":"0","error_severity":"LOG","sql_state_code":"00000","message":"archive command failed with exit code 1","detail":"The failed archive command was: /controller/manager wal-archive --log-destination /controller/log/postgres.json pg_wal/00000002.history","backend_type":"archiver","query_id":"0"}}
{"level":"info","ts":"2024-08-08T16:38:28Z","logger":"barman-cloud-check-wal-archive","msg":"2024-08-08 16:38:28,948 [123] ERROR: WAL archive check failed for server pg-88cc: Expected empty archive","pipe":"stderr","logging_pod":"pg-88cc-2"}
{"level":"error","ts":"2024-08-08T16:38:28Z","logger":"wal-archive","msg":"Error invoking barman-cloud-check-wal-archive","logging_pod":"pg-88cc-2","currentPrimary":"pg-88cc-2","targetPrimary":"pg-88cc-2","options":["--cloud-provider","aws-s3","s3://implentio-pg-88cc-backup-64c632f1cc08906d/","pg-88cc"],"exitCode":-1,"error":"exit status 1","stacktrace":"github.com/cloudnative-pg/cloudnative-pg/pkg/management/log.(*logger).Error\n\tpkg/management/log/log.go:125\ngithub.com/cloudnative-pg/cloudnative-pg/pkg/management/barman/archiver.(*WALArchiver).CheckWalArchiveDestination\n\tpkg/management/barman/archiver/archiver.go:257\ngithub.com/cloudnative-pg/cloudnative-pg/internal/cmd/manager/walarchive.checkWalArchive\n\tinternal/cmd/manager/walarchive/cmd.go:417\ngithub.com/cloudnative-pg/cloudnative-pg/internal/cmd/manager/walarchive.run\n\tinternal/cmd/manager/walarchive/cmd.go:214\ngithub.com/cloudnative-pg/cloudnative-pg/internal/cmd/manager/walarchive.NewCmd.func1\n\tinternal/cmd/manager/walarchive/cmd.go:89\ngithub.com/spf13/cobra.(*Command).execute\n\tpkg/mod/github.com/spf13/cobra@v1.8.0/command.go:983\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\tpkg/mod/github.com/spf13/cobra@v1.8.0/command.go:1115\ngithub.com/spf13/cobra.(*Command).Execute\n\tpkg/mod/github.com/spf13/cobra@v1.8.0/command.go:1039\nmain.main\n\tcmd/manager/main.go:66\nruntime.main\n\t/opt/hostedtoolcache/go/1.22.2/x64/src/runtime/proc.go:271"}
{"level":"error","ts":"2024-08-08T16:38:28Z","msg":"while barman-cloud-check-wal-archive","logging_pod":"pg-88cc-2","error":"unexpected failure invoking barman-cloud-wal-archive: exit status 1","stacktrace":"github.com/cloudnative-pg/cloudnative-pg/pkg/management/log.(*logger).Error\n\tpkg/management/log/log.go:125\ngithub.com/cloudnative-pg/cloudnative-pg/pkg/management/log.Error\n\tpkg/management/log/log.go:163\ngithub.com/cloudnative-pg/cloudnative-pg/internal/cmd/manager/walarchive.checkWalArchive\n\tinternal/cmd/manager/walarchive/cmd.go:418\ngithub.com/cloudnative-pg/cloudnative-pg/internal/cmd/manager/walarchive.run\n\tinternal/cmd/manager/walarchive/cmd.go:214\ngithub.com/cloudnative-pg/cloudnative-pg/internal/cmd/manager/walarchive.NewCmd.func1\n\tinternal/cmd/manager/walarchive/cmd.go:89\ngithub.com/spf13/cobra.(*Command).execute\n\tpkg/mod/github.com/spf13/cobra@v1.8.0/command.go:983\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\tpkg/mod/github.com/spf13/cobra@v1.8.0/command.go:1115\ngithub.com/spf13/cobra.(*Command).Execute\n\tpkg/mod/github.com/spf13/cobra@v1.8.0/command.go:1039\nmain.main\n\tcmd/manager/main.go:66\nruntime.main\n\t/opt/hostedtoolcache/go/1.22.2/x64/src/runtime/proc.go:271"}
{"level":"error","ts":"2024-08-08T16:38:28Z","logger":"wal-archive","msg":"failed to run wal-archive command","logging_pod":"pg-88cc-2","error":"unexpected failure invoking barman-cloud-wal-archive: exit status 1","stacktrace":"github.com/cloudnative-pg/cloudnative-pg/pkg/management/log.(*logger).Error\n\tpkg/management/log/log.go:125\ngithub.com/cloudnative-pg/cloudnative-pg/internal/cmd/manager/walarchive.NewCmd.func1\n\tinternal/cmd/manager/walarchive/cmd.go:95\ngithub.com/spf13/cobra.(*Command).execute\n\tpkg/mod/github.com/spf13/cobra@v1.8.0/command.go:983\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\tpkg/mod/github.com/spf13/cobra@v1.8.0/command.go:1115\ngithub.com/spf13/cobra.(*Command).Execute\n\tpkg/mod/github.com/spf13/cobra@v1.8.0/command.go:1039\nmain.main\n\tcmd/manager/main.go:66\nruntime.main\n\t/opt/hostedtoolcache/go/1.22.2/x64/src/runtime/proc.go:271"}
{"level":"info","ts":"2024-08-08T16:38:28Z","logger":"postgres","msg":"record","logging_pod":"pg-88cc-2","record":{"log_time":"2024-08-08 16:38:28.986 UTC","process_id":"38","session_id":"66b4f481.26","session_line_num":"3","session_start_time":"2024-08-08 16:38:25 UTC","transaction_id":"0","error_severity":"LOG","sql_state_code":"00000","message":"archive command failed with exit code 1","detail":"The failed archive command was: /controller/manager wal-archive --log-destination /controller/log/postgres.json pg_wal/00000002.history","backend_type":"archiver","query_id":"0"}}
{"level":"info","ts":"2024-08-08T16:38:28Z","logger":"postgres","msg":"record","logging_pod":"pg-88cc-2","record":{"log_time":"2024-08-08 16:38:28.986 UTC","process_id":"38","session_id":"66b4f481.26","session_line_num":"4","session_start_time":"2024-08-08 16:38:25 UTC","transaction_id":"0","error_severity":"WARNING","sql_state_code":"01000","message":"archiving write-ahead log file \"00000002.history\" failed too many times, will try again later","backend_type":"archiver","query_id":"0"}}
{"level":"info","ts":"2024-08-08T16:38:34Z","logger":"postgres","msg":"record","logging_pod":"pg-88cc-2","record":{"log_time":"2024-08-08 16:38:34.419 UTC","user_name":"streaming_replica","process_id":"152","connection_from":"10.0.99.58:60350","session_id":"66b4f48a.98","session_line_num":"1","command_tag":"START_REPLICATION","session_start_time":"2024-08-08 16:38:34 UTC","virtual_transaction_id":"3/0","transaction_id":"0","error_severity":"ERROR","sql_state_code":"58P01","message":"requested WAL segment 0000000700000008000000B0 has already been removed","query":"START_REPLICATION SLOT \"_cnpg_pg_88cc_1\" 8/B0000000 TIMELINE 7","application_name":"pg-88cc-1","backend_type":"walsender","query_id":"0"}}
{"level":"info","ts":"2024-08-08T16:38:40Z","logger":"postgres","msg":"record","logging_pod":"pg-88cc-2","record":{"log_time":"2024-08-08 16:38:40.128 UTC","user_name":"streaming_replica","process_id":"190","connection_from":"10.0.99.58:60356","session_id":"66b4f490.be","session_line_num":"1","command_tag":"START_REPLICATION","session_start_time":"2024-08-08 16:38:40 UTC","virtual_transaction_id":"3/0","transaction_id":"0","error_severity":"ERROR","sql_state_code":"58P01","message":"requested WAL segment 0000000700000008000000B0 has already been removed","query":"START_REPLICATION SLOT \"_cnpg_pg_88cc_1\" 8/B0000000 TIMELINE 7","application_name":"pg-88cc-1","backend_type":"walsender","query_id":"0"}}
fullykubed commented 3 months ago

@wesbragagt That error means a Postgres cluster was previously writing to this S3 bucket before you created the current Postgres cluster.

Perhaps you ran a partial destroy of the module or deleted the Postgres cluster directly via k9s? Does that sound familiar?

A safety feature that cnpg provides is that it won't overwrite existing WAL files if you accidentally point a cluster to an S3 bucket with existing WAL files (as this would result in data loss for the original owner of the bucket).

You need to empty the S3 bucket and then restart your primary. The WAL archiving process should then continue successfully.

You can then try to restart the replica.
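
As a rough sketch, assuming AWS CLI access and that you have confirmed the existing objects are safe to discard (bucket, cluster, and pod names taken from this thread; the restarts use the same cnpg kubectl plugin as above):

aws s3 rm s3://implentio-pg-88cc-backup-64c632f1cc08906d/ --recursive
kubectl cnpg restart pg-88cc pg-88cc-2 -n implentio
kubectl cnpg restart pg-88cc pg-88cc-1 -n implentio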

wesbragagt commented 3 months ago

@fullykubed Just to confirm, this action would not cause any data loss, correct? And I did have a partial apply when attempting to scale up the CPU at one point this morning.

fullykubed commented 3 months ago

@wesbragagt I cannot say definitively without knowing what system generated the files that are currently in your bucket. You need to determine what system owns those. If it is a system you previously removed and have no need for the data, you can empty the bucket without impacting the currently running system.

However, what I can say is that your current live database (a) does not own those files and (b) is not backed up in S3 currently.
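
If you want to sanity-check what is in the bucket, one option (assuming AWS CLI access) is to list the objects and compare their timestamps against the lifetime of the current cluster:

aws s3 ls s3://implentio-pg-88cc-backup-64c632f1cc08906d/ --recursive --human-readable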

wesbragagt commented 3 months ago

Closing the issue, as @fullykubed was able to walk me through the following manual steps to recover (an example command sequence follows the list):

  1. Identify the cluster name and take a backup: kubectl cnpg backup -n <namespace> <cluster-name>; wait until the backup is done
  2. Identify the replica pod that is stuck and unable to find the WAL file in s3
  3. Given the primary instance is working, scale down to 1 instance and terragrunt apply
  4. Delete the stuck replica's PVC (NOT THE PRIMARY's)
  5. Delete the stuck replica's PV (NOT THE PRIMARY's)
  6. Delete the stuck replica pod
  7. Make sure the apply is successful, as it will stay in a pending state until the replica pod is destroyed
  8. Make sure the primary instance's WAL archiving is working properly
  9. Scale the infrastructure back up to 2 instances and terragrunt apply
  10. Everything should be in a stable state now
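
A rough sketch of steps 1 and 4-6 using the names from this cluster (the PVC/PV names are illustrative; double-check the actual names with kubectl before deleting anything):

kubectl cnpg backup pg-88cc -n implentio
kubectl get pvc,pv -n implentio
kubectl delete pvc pg-88cc-1 -n implentio
kubectl delete pv <pv-bound-to-that-pvc>
kubectl delete pod pg-88cc-1 -n implentio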