Closed: wesbragagt closed this issue 3 months ago
Can you provide logs from the primary?
Also do you have backups enabled? Can you provide the YAML for the cluster manifest?
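If it helps, roughly the following should gather all of that (a sketch; it assumes the cnpg kubectl plugin is installed and uses placeholder names):
# Cluster manifest as currently applied
kubectl get clusters.postgresql.cnpg.io <cluster-name> -n <namespace> -o yaml
# Logs from the current primary pod
kubectl logs <primary-pod-name> -n <namespace>
# Overall status, including the continuous backup section
kubectl cnpg status <cluster-name> -n <namespace>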
A healthy system with backups enabled would have a section that looks like this:
Continuous Backup status
First Point of Recoverability: 2024-07-28T00:00:07Z
Working WAL archiving: OK
WALs waiting to be archived: 0
Last Archived WAL: 0000018100000067000000AB @ 2024-08-08T16:01:27.885872Z
Last Failed WAL: 00000181.history @ 2024-08-08T12:36:19.907956Z
Based on what I am seeing from your logs, your WAL archiving has never been working (hence the Last Archived WAL showing -). As a result, your system is not able to recover from a failure.
Also, I'd like to see the entire logs from the first launch of the replica pod that isn't starting. The snippet you posted does not provide enough context as to why it is attempting a WAL restore.
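One quick way to confirm the archiver state from the primary itself is to query pg_stat_archiver (a sketch; pod and namespace are placeholders, the socket path is the one CNPG logs at startup, and default local peer auth is assumed):
kubectl exec -n <namespace> <primary-pod-name> -c postgres -- \
  psql -h /controller/run -U postgres -c "SELECT archived_count, last_archived_wal, failed_count, last_failed_wal, last_failed_time FROM pg_stat_archiver;"
An archived_count of 0 together with a populated last_failed_wal would confirm archiving has never succeeded.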
Logs from initialization:
{"level":"info","ts":"2024-08-08T16:29:28Z","logger":"setup","msg":"starting tablespace manager","logging_pod":"pg-88cc-1"}
{"level":"info","ts":"2024-08-08T16:29:28Z","logger":"setup","msg":"starting external server manager","logging_pod":"pg-88cc-1"}
{"level":"info","ts":"2024-08-08T16:29:28Z","logger":"setup","msg":"starting controller-runtime manager","logging_pod":"pg-88cc-1"}
{"level":"info","ts":"2024-08-08T16:29:28Z","msg":"Starting EventSource","controller":"cluster","controllerGroup":"postgresql.cnpg.io","controllerKind":"Cluster","source":"kind source: *v1.Cluster"}
{"level":"info","ts":"2024-08-08T16:29:28Z","msg":"Starting Controller","controller":"cluster","controllerGroup":"postgresql.cnpg.io","controllerKind":"Cluster"}
{"level":"info","ts":"2024-08-08T16:29:28Z","msg":"Starting webserver","logging_pod":"pg-88cc-1","address":":9187"}
{"level":"info","ts":"2024-08-08T16:29:28Z","logger":"roles_reconciler","msg":"starting up the runnable","logging_pod":"pg-88cc-1"}
{"level":"info","ts":"2024-08-08T16:29:28Z","logger":"roles_reconciler","msg":"skipping the RoleSynchronizer in replicas","logging_pod":"pg-88cc-1"}
{"level":"info","ts":"2024-08-08T16:29:28Z","logger":"roles_reconciler","msg":"setting up RoleSynchronizer loop","logging_pod":"pg-88cc-1"}
{"level":"info","ts":"2024-08-08T16:29:28Z","msg":"Starting webserver","logging_pod":"pg-88cc-1","address":"localhost:8010"}
{"level":"info","ts":"2024-08-08T16:29:28Z","msg":"Starting EventSource","controller":"cluster","controllerGroup":"postgresql.cnpg.io","controllerKind":"Cluster","source":"kind source: *v1.Cluster"}
{"level":"info","ts":"2024-08-08T16:29:28Z","msg":"Starting Controller","controller":"cluster","controllerGroup":"postgresql.cnpg.io","controllerKind":"Cluster"}
{"level":"info","ts":"2024-08-08T16:29:28Z","msg":"Starting webserver","logging_pod":"pg-88cc-1","address":":8000"}
{"level":"info","ts":"2024-08-08T16:29:28Z","msg":"Starting EventSource","controller":"cluster","controllerGroup":"postgresql.cnpg.io","controllerKind":"Cluster","source":"kind source: *v1.Cluster"}
{"level":"info","ts":"2024-08-08T16:29:28Z","msg":"Starting Controller","controller":"cluster","controllerGroup":"postgresql.cnpg.io","controllerKind":"Cluster"}
{"level":"info","ts":"2024-08-08T16:29:28Z","msg":"Starting workers","controller":"cluster","controllerGroup":"postgresql.cnpg.io","controllerKind":"Cluster","worker count":1}
{"level":"info","ts":"2024-08-08T16:29:28Z","msg":"Starting workers","controller":"cluster","controllerGroup":"postgresql.cnpg.io","controllerKind":"Cluster","worker count":1}
{"level":"info","ts":"2024-08-08T16:29:28Z","msg":"Starting workers","controller":"cluster","controllerGroup":"postgresql.cnpg.io","controllerKind":"Cluster","worker count":1}
{"level":"info","ts":"2024-08-08T16:29:28Z","msg":"Refreshed configuration file","controller":"cluster","controllerGroup":"postgresql.cnpg.io","controllerKind":"Cluster","Cluster":{"name":"pg-88cc","namespace":"implentio"},"namespace":"implentio","name":"pg-88cc","reconcileID":"8e50668a-5664-4fbd-b9f2-d3fb5e35fd28","logging_pod":"pg-88cc-1","filename":"/controller/certificates/server.crt","secret":"pg-server-certs-20d4"}
{"level":"info","ts":"2024-08-08T16:29:28Z","msg":"Refreshed configuration file","controller":"cluster","controllerGroup":"postgresql.cnpg.io","controllerKind":"Cluster","Cluster":{"name":"pg-88cc","namespace":"implentio"},"namespace":"implentio","name":"pg-88cc","reconcileID":"8e50668a-5664-4fbd-b9f2-d3fb5e35fd28","logging_pod":"pg-88cc-1","filename":"/controller/certificates/server.key","secret":"pg-server-certs-20d4"}
{"level":"info","ts":"2024-08-08T16:29:28Z","msg":"Refreshed configuration file","controller":"cluster","controllerGroup":"postgresql.cnpg.io","controllerKind":"Cluster","Cluster":{"name":"pg-88cc","namespace":"implentio"},"namespace":"implentio","name":"pg-88cc","reconcileID":"8e50668a-5664-4fbd-b9f2-d3fb5e35fd28","logging_pod":"pg-88cc-1","filename":"/controller/certificates/streaming_replica.crt","secret":"pg-client-certs-3e15"}
{"level":"info","ts":"2024-08-08T16:29:28Z","msg":"Refreshed configuration file","controller":"cluster","controllerGroup":"postgresql.cnpg.io","controllerKind":"Cluster","Cluster":{"name":"pg-88cc","namespace":"implentio"},"namespace":"implentio","name":"pg-88cc","reconcileID":"8e50668a-5664-4fbd-b9f2-d3fb5e35fd28","logging_pod":"pg-88cc-1","filename":"/controller/certificates/streaming_replica.key","secret":"pg-client-certs-3e15"}
{"level":"info","ts":"2024-08-08T16:29:28Z","msg":"Refreshed configuration file","controller":"cluster","controllerGroup":"postgresql.cnpg.io","controllerKind":"Cluster","Cluster":{"name":"pg-88cc","namespace":"implentio"},"namespace":"implentio","name":"pg-88cc","reconcileID":"8e50668a-5664-4fbd-b9f2-d3fb5e35fd28","logging_pod":"pg-88cc-1","filename":"/controller/certificates/client-ca.crt","secret":"pg-client-certs-3e15"}
{"level":"info","ts":"2024-08-08T16:29:28Z","msg":"Refreshed configuration file","controller":"cluster","controllerGroup":"postgresql.cnpg.io","controllerKind":"Cluster","Cluster":{"name":"pg-88cc","namespace":"implentio"},"namespace":"implentio","name":"pg-88cc","reconcileID":"8e50668a-5664-4fbd-b9f2-d3fb5e35fd28","logging_pod":"pg-88cc-1","filename":"/controller/certificates/server-ca.crt","secret":"pg-server-certs-20d4"}
{"level":"info","ts":"2024-08-08T16:29:28Z","msg":"Installed configuration file","logging_pod":"pg-88cc-1","pgdata":"/var/lib/postgresql/data/pgdata","filename":"pg_ident.conf"}
{"level":"info","ts":"2024-08-08T16:29:28Z","msg":"Updated replication settings","logging_pod":"pg-88cc-1","filename":"override.conf"}
{"level":"info","ts":"2024-08-08T16:29:28Z","msg":"Found previous run flag","logging_pod":"pg-88cc-1","filename":"/var/lib/postgresql/data/pgdata/cnpg_initialized-pg-88cc-1"}
{"level":"info","ts":"2024-08-08T16:29:28Z","msg":"Extracting pg_controldata information","logging_pod":"pg-88cc-1","reason":"postmaster start up"}
{"level":"info","ts":"2024-08-08T16:29:28Z","logger":"pg_controldata","msg":"pg_control version number: 1300\nCatalog version number: 202307071\nDatabase system identifier: 7398708564887441431\nDatabase cluster state: shut down in recovery\npg_control last modified: Thu 08 Aug 2024 04:29:17 PM UTC\nLatest checkpoint location: D/20005D38\nLatest checkpoint's REDO location: D/20005D00\nLatest checkpoint's REDO WAL file: 000000080000000D00000020\nLatest checkpoint's TimeLineID: 8\nLatest checkpoint's PrevTimeLineID: 8\nLatest checkpoint's full_page_writes: on\nLatest checkpoint's NextXID: 0:3065\nLatest checkpoint's NextOID: 33843\nLatest checkpoint's NextMultiXactId: 300\nLatest checkpoint's NextMultiOffset: 926\nLatest checkpoint's oldestXID: 722\nLatest checkpoint's oldestXID's DB: 1\nLatest checkpoint's oldestActiveXID: 3065\nLatest checkpoint's oldestMultiXid: 1\nLatest checkpoint's oldestMulti's DB: 1\nLatest checkpoint's oldestCommitTsXid:0\nLatest checkpoint's newestCommitTsXid:0\nTime of latest checkpoint: Thu 08 Aug 2024 04:26:40 AM UTC\nFake LSN counter for unlogged rels: 0/3E8\nMinimum recovery ending location: D/21000028\nMin recovery ending loc's timeline: 8\nBackup start location: 0/0\nBackup end location: 0/0\nEnd-of-backup record required: no\nwal_level setting: logical\nwal_log_hints setting: on\nmax_connections setting: 150\nmax_worker_processes setting: 32\nmax_wal_senders setting: 10\nmax_prepared_xacts setting: 0\nmax_locks_per_xact setting: 64\ntrack_commit_timestamp setting: off\nMaximum data alignment: 8\nDatabase block size: 8192\nBlocks per segment of large relation: 131072\nWAL block size: 8192\nBytes per WAL segment: 16777216\nMaximum length of identifiers: 64\nMaximum columns in an index: 32\nMaximum size of a TOAST chunk: 1996\nSize of a large-object chunk: 2048\nDate/time type storage: 64-bit integers\nFloat8 argument passing: by value\nData page checksum version: 0\nMock authentication nonce: 863b16d8e15e0d221fb3ac60ce9cd0be9c3dce507dfd5575febc2e8956bb948d\n","pipe":"stdout","logging_pod":"pg-88cc-1"}
{"level":"info","ts":"2024-08-08T16:29:28Z","msg":"Instance is still down, will retry in 1 second","controller":"cluster","controllerGroup":"postgresql.cnpg.io","controllerKind":"Cluster","Cluster":{"name":"pg-88cc","namespace":"implentio"},"namespace":"implentio","name":"pg-88cc","reconcileID":"8e50668a-5664-4fbd-b9f2-d3fb5e35fd28","logging_pod":"pg-88cc-1"}
{"level":"info","ts":"2024-08-08T16:29:28Z","logger":"postgres","msg":"2024-08-08 16:29:28.477 UTC [26] LOG: redirecting log output to logging collector process","pipe":"stderr","logging_pod":"pg-88cc-1"}
{"level":"info","ts":"2024-08-08T16:29:28Z","logger":"postgres","msg":"2024-08-08 16:29:28.477 UTC [26] HINT: Future log output will appear in directory \"/controller/log\".","pipe":"stderr","logging_pod":"pg-88cc-1"}
{"level":"info","ts":"2024-08-08T16:29:28Z","logger":"postgres","msg":"record","logging_pod":"pg-88cc-1","record":{"log_time":"2024-08-08 16:29:28.477 UTC","process_id":"26","session_id":"66b4f268.1a","session_line_num":"1","session_start_time":"2024-08-08 16:29:28 UTC","transaction_id":"0","error_severity":"LOG","sql_state_code":"00000","message":"ending log output to stderr","hint":"Future log output will go to log destination \"csvlog\".","backend_type":"postmaster","query_id":"0"}}
{"level":"info","ts":"2024-08-08T16:29:28Z","logger":"postgres","msg":"record","logging_pod":"pg-88cc-1","record":{"log_time":"2024-08-08 16:29:28.477 UTC","process_id":"26","session_id":"66b4f268.1a","session_line_num":"2","session_start_time":"2024-08-08 16:29:28 UTC","transaction_id":"0","error_severity":"LOG","sql_state_code":"00000","message":"starting PostgreSQL 16.3 (Debian 16.3-1.pgdg110+1) on aarch64-unknown-linux-gnu, compiled by gcc (Debian 10.2.1-6) 10.2.1 20210110, 64-bit","backend_type":"postmaster","query_id":"0"}}
{"level":"info","ts":"2024-08-08T16:29:28Z","logger":"postgres","msg":"record","logging_pod":"pg-88cc-1","record":{"log_time":"2024-08-08 16:29:28.477 UTC","process_id":"26","session_id":"66b4f268.1a","session_line_num":"3","session_start_time":"2024-08-08 16:29:28 UTC","transaction_id":"0","error_severity":"LOG","sql_state_code":"00000","message":"listening on IPv4 address \"0.0.0.0\", port 5432","backend_type":"postmaster","query_id":"0"}}
{"level":"info","ts":"2024-08-08T16:29:28Z","logger":"postgres","msg":"record","logging_pod":"pg-88cc-1","record":{"log_time":"2024-08-08 16:29:28.477 UTC","process_id":"26","session_id":"66b4f268.1a","session_line_num":"4","session_start_time":"2024-08-08 16:29:28 UTC","transaction_id":"0","error_severity":"LOG","sql_state_code":"00000","message":"listening on IPv6 address \"::\", port 5432","backend_type":"postmaster","query_id":"0"}}
{"level":"info","ts":"2024-08-08T16:29:28Z","logger":"postgres","msg":"2024-08-08 16:29:28.477 UTC [26] LOG: ending log output to stderr","source":"/controller/log/postgres","logging_pod":"pg-88cc-1"}
{"level":"info","ts":"2024-08-08T16:29:28Z","logger":"postgres","msg":"2024-08-08 16:29:28.477 UTC [26] HINT: Future log output will go to log destination \"csvlog\".","source":"/controller/log/postgres","logging_pod":"pg-88cc-1"}
{"level":"info","ts":"2024-08-08T16:29:28Z","logger":"postgres","msg":"record","logging_pod":"pg-88cc-1","record":{"log_time":"2024-08-08 16:29:28.481 UTC","process_id":"26","session_id":"66b4f268.1a","session_line_num":"5","session_start_time":"2024-08-08 16:29:28 UTC","transaction_id":"0","error_severity":"LOG","sql_state_code":"00000","message":"listening on Unix socket \"/controller/run/.s.PGSQL.5432\"","backend_type":"postmaster","query_id":"0"}}
{"level":"info","ts":"2024-08-08T16:29:28Z","logger":"postgres","msg":"record","logging_pod":"pg-88cc-1","record":{"log_time":"2024-08-08 16:29:28.487 UTC","process_id":"30","session_id":"66b4f268.1e","session_line_num":"1","session_start_time":"2024-08-08 16:29:28 UTC","transaction_id":"0","error_severity":"LOG","sql_state_code":"00000","message":"database system was shut down in recovery at 2024-08-08 16:29:17 UTC","backend_type":"startup","query_id":"0"}}
{"level":"info","ts":"2024-08-08T16:29:28Z","logger":"wal-restore","msg":"WAL file not found in the recovery object store","logging_pod":"pg-88cc-1","walName":"00000009.history","options":["--cloud-provider","aws-s3","s3://implentio-pg-88cc-backup-64c632f1cc08906d/","pg-88cc"],"startTime":"2024-08-08T16:29:28Z","endTime":"2024-08-08T16:29:28Z","elapsedWalTime":0.375684911}
@fullykubed I was using the default kube_pg_cluster backup schedule, which runs once a week on Sundays. I don't think it ever triggered, since the database was bootstrapped on Monday and the loaded data is not much older than a couple of days.
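For reference, a quick way to check whether that schedule ever produced a Backup object (a sketch, assuming the CNPG CRDs are installed) would be:
kubectl get scheduledbackups.postgresql.cnpg.io,backups.postgresql.cnpg.io -n implentio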
YAML for the cluster manifest:
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
annotations:
config.linkerd.io/skip-inbound-ports: "5432"
panfactum.com/admin-role: admin-implentio-pg-88cc
panfactum.com/db: "true"
panfactum.com/db-type: PostgreSQL
panfactum.com/reader-role: reader-implentio-pg-88cc
panfactum.com/service: pg-88cc-pooler-rw.implentio
panfactum.com/service-port: "5432"
panfactum.com/superuser-role: superuser-implentio-pg-88cc
panfactum.com/vault-mount: db/implentio-pg-88cc
creationTimestamp: "2024-08-03T00:47:36Z"
generation: 11
labels:
id: pg-pg-88cc-295bc33c00cc195b
panfactum.com/environment: production
panfactum.com/local: "false"
panfactum.com/module: kube_pg_cluster
panfactum.com/prevent-lifetime-eviction: "true"
panfactum.com/region: us-west-2
panfactum.com/root-module: implentio_db
panfactum.com/scheduler: "true"
panfactum.com/stack-commit: d60fefd99d4be7a50c582ee25dfcb00976762f4c
panfactum.com/stack-version: edge.24-07-08
panfactum.com/workload: pg-pg-88cc
name: pg-88cc
namespace: implentio
resourceVersion: "43972319"
uid: 95d5a4a3-9968-43e8-99ef-a1a0f6962963
spec:
affinity:
enablePodAntiAffinity: true
podAntiAffinityType: required
tolerations:
- effect: NoSchedule
key: spot
operator: Equal
value: "true"
- effect: NoSchedule
key: burstable
operator: Equal
value: "true"
- effect: NoSchedule
key: arm64
operator: Equal
value: "true"
topologyKey: node.kubernetes.io/instance-type
backup:
barmanObjectStore:
data:
compression: bzip2
jobs: 8
destinationPath: s3://implentio-pg-88cc-backup-64c632f1cc08906d/
s3Credentials:
inheritFromIAMRole: true
wal:
compression: bzip2
maxParallel: 8
retentionPolicy: 7d
target: prefer-standby
bootstrap:
initdb:
database: app
encoding: UTF8
localeCType: C
localeCollate: C
owner: app
postInitApplicationSQL:
- REVOKE ALL ON SCHEMA public FROM PUBLIC;
- CREATE ROLE reader NOINHERIT;
- GRANT pg_read_all_data TO reader;
- GRANT USAGE ON SCHEMA public TO reader;
- CREATE ROLE writer NOINHERIT;
- GRANT pg_write_all_data, pg_read_all_data TO writer;
- GRANT ALL PRIVILEGES ON SCHEMA public TO writer;
- GRANT CONNECT ON DATABASE app TO cnpg_pooler_pgbouncer;
- GRANT ALL PRIVILEGES ON SCHEMA public TO cnpg_pooler_pgbouncer;
- CREATE OR REPLACE FUNCTION user_search(uname TEXT) RETURNS TABLE (usename
name, passwd text) LANGUAGE sql SECURITY DEFINER AS 'SELECT usename, passwd
FROM pg_shadow WHERE usename=$1;'
- REVOKE ALL ON FUNCTION user_search(text) FROM public;
- GRANT EXECUTE ON FUNCTION user_search(text) TO cnpg_pooler_pgbouncer;
postInitSQL:
- REVOKE ALL ON SCHEMA public FROM PUBLIC;
- CREATE ROLE cnpg_pooler_pgbouncer WITH LOGIN;
- GRANT ALL PRIVILEGES ON SCHEMA public TO cnpg_pooler_pgbouncer;
- GRANT CONNECT ON DATABASE postgres TO cnpg_pooler_pgbouncer;
- CREATE OR REPLACE FUNCTION user_search(uname TEXT) RETURNS TABLE (usename
name, passwd text) LANGUAGE sql SECURITY DEFINER AS 'SELECT usename, passwd
FROM pg_shadow WHERE usename=$1;'
- REVOKE ALL ON FUNCTION user_search(text) FROM public;
- GRANT EXECUTE ON FUNCTION user_search(text) TO cnpg_pooler_pgbouncer;
certificates:
clientCASecret: pg-client-certs-3e15
replicationTLSSecret: pg-client-certs-3e15
serverCASecret: pg-server-certs-20d4
serverTLSSecret: pg-server-certs-20d4
enablePDB: false
enableSuperuserAccess: true
failoverDelay: 60
imageName: ghcr.io/cloudnative-pg/postgresql:16.3
inheritedMetadata:
annotations:
config.linkerd.io/skip-inbound-ports: "5432"
linkerd.io/inject: enabled
resize.topolvm.io/increase: 100%
resize.topolvm.io/storage_limit: 200Gi
resize.topolvm.io/threshold: 35%
labels:
id: pg-pg-88cc-295bc33c00cc195b
panfactum.com/environment: production
panfactum.com/local: "false"
panfactum.com/module: kube_pg_cluster
panfactum.com/prevent-lifetime-eviction: "true"
panfactum.com/region: us-west-2
panfactum.com/root-module: implentio_db
panfactum.com/scheduler: "true"
panfactum.com/stack-commit: d60fefd99d4be7a50c582ee25dfcb00976762f4c
panfactum.com/stack-version: edge.24-07-08
panfactum.com/workload: pg-pg-88cc
pg-cluster: implentio-pg-88cc
instances: 2
logLevel: info
maxSyncReplicas: 0
minSyncReplicas: 0
monitoring:
customQueriesConfigMap:
- key: queries
name: cnpg-default-monitoring
disableDefaultQueries: false
enablePodMonitor: true
postgresGID: 26
postgresUID: 26
postgresql:
parameters:
archive_mode: "on"
archive_timeout: 5min
dynamic_shared_memory_type: posix
effective_cache_size: 5000MB
log_destination: csvlog
log_directory: /controller/log
log_filename: postgres
log_rotation_age: "0"
log_rotation_size: "0"
log_truncate_on_rotation: "false"
logging_collector: "on"
maintenance_work_mem: 500MB
max_connections: "150"
max_parallel_workers: "32"
max_replication_slots: "32"
max_worker_processes: "32"
shared_buffers: 2500MB
shared_memory_type: mmap
shared_preload_libraries: ""
ssl_max_protocol_version: TLSv1.3
ssl_min_protocol_version: TLSv1.3
wal_buffers: 32MB
wal_keep_size: 1024MB
wal_level: logical
wal_log_hints: "on"
wal_receiver_timeout: 5s
wal_sender_timeout: 5s
work_mem: 17MB
syncReplicaElectionConstraint:
enabled: false
primaryUpdateMethod: switchover
primaryUpdateStrategy: unsupervised
priorityClassName: database
replicationSlots:
highAvailability:
enabled: true
slotPrefix: _cnpg_
synchronizeReplicas:
enabled: true
updateInterval: 30
resources:
limits:
memory: 13000Mi
requests:
cpu: "4"
memory: 10000Mi
schedulerName: panfactum
serviceAccountTemplate:
metadata:
annotations:
eks.amazonaws.com/role-arn: arn:aws:iam::590183845935:role/pg-88cc-20240803003625866900000002
smartShutdownTimeout: 20
startDelay: 600
stopDelay: 30
storage:
pvcTemplate:
resources:
requests:
storage: 20Gi
storageClassName: ebs-standard
resizeInUseVolumes: true
superuserSecret:
name: pg-88cc-superuser-e6942fc715ffdc1fbafb7ca800e7270d1dc807cee47608a13abadb97cfa09888
switchoverDelay: 30
topologySpreadConstraints:
- labelSelector:
matchLabels:
id: pg-pg-88cc-295bc33c00cc195b
maxSkew: 1
topologyKey: topology.kubernetes.io/zone
whenUnsatisfiable: DoNotSchedule
status:
availableArchitectures:
- goArch: amd64
hash: 94527128605ac5100415106fe26c480531d094b3f36626e562a8135f342b89e4
- goArch: arm64
hash: 9b7b08592e917ed3b20bb3ae404ea4c0c958bdee73e5411c452d6c464d77f0b4
certificates:
clientCASecret: pg-client-certs-3e15
expirations:
pg-client-certs-3e15: 2024-08-17 00:36:26 +0000 UTC
pg-server-certs-20d4: 2024-08-17 00:36:26 +0000 UTC
replicationTLSSecret: pg-client-certs-3e15
serverAltDNSNames:
- pg-88cc-rw
- pg-88cc-rw.implentio
- pg-88cc-rw.implentio.svc
- pg-88cc-r
- pg-88cc-r.implentio
- pg-88cc-r.implentio.svc
- pg-88cc-ro
- pg-88cc-ro.implentio
- pg-88cc-ro.implentio.svc
serverCASecret: pg-server-certs-20d4
serverTLSSecret: pg-server-certs-20d4
cloudNativePGCommitHash: 336ddf53
cloudNativePGOperatorHash: 9b7b08592e917ed3b20bb3ae404ea4c0c958bdee73e5411c452d6c464d77f0b4
conditions:
- lastTransitionTime: "2024-08-08T01:38:34Z"
message: Cluster Is Not Ready
reason: ClusterIsNotReady
status: "False"
type: Ready
- lastTransitionTime: "2024-08-03T00:48:08Z"
message: 'unexpected failure invoking barman-cloud-wal-archive: exit status 1'
reason: ContinuousArchivingFailing
status: "False"
type: ContinuousArchiving
configMapResourceVersion:
metrics:
cnpg-default-monitoring: "37155643"
currentPrimary: pg-88cc-2
currentPrimaryTimestamp: "2024-08-08T01:38:44.770147Z"
healthyPVC:
- pg-88cc-1
- pg-88cc-2
image: ghcr.io/cloudnative-pg/postgresql:16.3
instanceNames:
- pg-88cc-1
- pg-88cc-2
instances: 2
instancesReportedState:
pg-88cc-1:
isPrimary: false
pg-88cc-2:
isPrimary: true
timeLineID: 8
instancesStatus:
healthy:
- pg-88cc-2
replicating:
- pg-88cc-1
latestGeneratedNode: 2
managedRolesStatus: {}
phase: Waiting for the instances to become active
phaseReason: Some instances are not yet active. Please wait.
poolerIntegrations:
pgBouncerIntegration: {}
pvcCount: 2
readService: pg-88cc-r
readyInstances: 1
secretsResourceVersion:
applicationSecretVersion: "37155621"
clientCaSecretVersion: "37146447"
replicationSecretVersion: "37146447"
serverCaSecretVersion: "37146439"
serverSecretVersion: "37146439"
superuserSecretVersion: "37146383"
switchReplicaClusterStatus: {}
targetPrimary: pg-88cc-2
targetPrimaryTimestamp: "2024-08-08T01:38:34.845444Z"
timelineID: 8
topology:
instances:
pg-88cc-1: {}
pg-88cc-2: {}
nodesUsed: 2
successfullyExtracted: true
writeService: pg-88cc-rw
@wesbragagt ScheduledBackups are not your issue here. Those are full database snapshots and are only used for worst-case emergency recovery.
The continuous archiving is the problem. Continuous archiving is the process of writing WAL files to S3 as they are generated so you can do point-in-time recoveries. These files are also used to resync replica nodes when they are created, to avoid putting load on the primary.
The issue you are experiencing is that WAL archiving is not working and has never worked. As a result, cluster nodes obviously cannot restore from a non-existent archive.
Your next step is to figure out why WAL archiving is failing. To do this, you need to look at the logs for the primary, as that is where the WAL files are generated and shipped to S3.
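A sketch for pulling just the archiver-related lines out of the primary's logs (the pod name is whatever kubectl cnpg status reports as the current primary):
kubectl logs -n <namespace> <primary-pod-name> | grep -Ei 'wal-archive|barman|archive command' | tail -n 50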
For a conceptual overview, you should review the documentation here: https://cloudnative-pg.io/documentation/current/backup/#wal-archive
@fullykubed thank you. I was able to extract this error from the primary:
{"level":"info","ts":"2024-08-08T16:38:26Z","logger":"barman-cloud-check-wal-archive","msg":"2024-08-08 16:38:26,038 [57] ERROR: WAL archive check failed for server pg-88cc: Expected empty archive","pipe":"stderr","logging_pod":"pg-88cc-2"}
{"level":"error","ts":"2024-08-08T16:38:26Z","logger":"wal-archive","msg":"Error invoking barman-cloud-check-wal-archive","logging_pod":"pg-88cc-2","currentPrimary":"pg-88cc-2","targetPrimary":"pg-88cc-2","options":["--cloud-provider","aws-s3","s3://implentio-pg-88cc-backup-64c632f1cc08906d/","pg-88cc"],"exitCode":-1,"error":"exit status 1","stacktrace":"github.com/cloudnative-pg/cloudnative-pg/pkg/management/log.(*logger).Error\n\tpkg/management/log/log.go:125\ngithub.com/cloudnative-pg/cloudnative-pg/pkg/management/barman/archiver.(*WALArchiver).CheckWalArchiveDestination\n\tpkg/management/barman/archiver/archiver.go:257\ngithub.com/cloudnative-pg/cloudnative-pg/internal/cmd/manager/walarchive.checkWalArchive\n\tinternal/cmd/manager/walarchive/cmd.go:417\ngithub.com/cloudnative-pg/cloudnative-pg/internal/cmd/manager/walarchive.run\n\tinternal/cmd/manager/walarchive/cmd.go:214\ngithub.com/cloudnative-pg/cloudnative-pg/internal/cmd/manager/walarchive.NewCmd.func1\n\tinternal/cmd/manager/walarchive/cmd.go:89\ngithub.com/spf13/cobra.(*Command).execute\n\tpkg/mod/github.com/spf13/cobra@v1.8.0/command.go:983\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\tpkg/mod/github.com/spf13/cobra@v1.8.0/command.go:1115\ngithub.com/spf13/cobra.(*Command).Execute\n\tpkg/mod/github.com/spf13/cobra@v1.8.0/command.go:1039\nmain.main\n\tcmd/manager/main.go:66\nruntime.main\n\t/opt/hostedtoolcache/go/1.22.2/x64/src/runtime/proc.go:271"}
{"level":"error","ts":"2024-08-08T16:38:26Z","msg":"while barman-cloud-check-wal-archive","logging_pod":"pg-88cc-2","error":"unexpected failure invoking barman-cloud-wal-archive: exit status 1","stacktrace":"github.com/cloudnative-pg/cloudnative-pg/pkg/management/log.(*logger).Error\n\tpkg/management/log/log.go:125\ngithub.com/cloudnative-pg/cloudnative-pg/pkg/management/log.Error\n\tpkg/management/log/log.go:163\ngithub.com/cloudnative-pg/cloudnative-pg/internal/cmd/manager/walarchive.checkWalArchive\n\tinternal/cmd/manager/walarchive/cmd.go:418\ngithub.com/cloudnative-pg/cloudnative-pg/internal/cmd/manager/walarchive.run\n\tinternal/cmd/manager/walarchive/cmd.go:214\ngithub.com/cloudnative-pg/cloudnative-pg/internal/cmd/manager/walarchive.NewCmd.func1\n\tinternal/cmd/manager/walarchive/cmd.go:89\ngithub.com/spf13/cobra.(*Command).execute\n\tpkg/mod/github.com/spf13/cobra@v1.8.0/command.go:983\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\tpkg/mod/github.com/spf13/cobra@v1.8.0/command.go:1115\ngithub.com/spf13/cobra.(*Command).Execute\n\tpkg/mod/github.com/spf13/cobra@v1.8.0/command.go:1039\nmain.main\n\tcmd/manager/main.go:66\nruntime.main\n\t/opt/hostedtoolcache/go/1.22.2/x64/src/runtime/proc.go:271"}
{"level":"error","ts":"2024-08-08T16:38:26Z","logger":"wal-archive","msg":"failed to run wal-archive command","logging_pod":"pg-88cc-2","error":"unexpected failure invoking barman-cloud-wal-archive: exit status 1","stacktrace":"github.com/cloudnative-pg/cloudnative-pg/pkg/management/log.(*logger).Error\n\tpkg/management/log/log.go:125\ngithub.com/cloudnative-pg/cloudnative-pg/internal/cmd/manager/walarchive.NewCmd.func1\n\tinternal/cmd/manager/walarchive/cmd.go:95\ngithub.com/spf13/cobra.(*Command).execute\n\tpkg/mod/github.com/spf13/cobra@v1.8.0/command.go:983\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\tpkg/mod/github.com/spf13/cobra@v1.8.0/command.go:1115\ngithub.com/spf13/cobra.(*Command).Execute\n\tpkg/mod/github.com/spf13/cobra@v1.8.0/command.go:1039\nmain.main\n\tcmd/manager/main.go:66\nruntime.main\n\t/opt/hostedtoolcache/go/1.22.2/x64/src/runtime/proc.go:271"}
{"level":"info","ts":"2024-08-08T16:38:26Z","logger":"postgres","msg":"record","logging_pod":"pg-88cc-2","record":{"log_time":"2024-08-08 16:38:26.077 UTC","process_id":"38","session_id":"66b4f481.26","session_line_num":"1","session_start_time":"2024-08-08 16:38:25 UTC","transaction_id":"0","error_severity":"LOG","sql_state_code":"00000","message":"archive command failed with exit code 1","detail":"The failed archive command was: /controller/manager wal-archive --log-destination /controller/log/postgres.json pg_wal/00000002.history","backend_type":"archiver","query_id":"0"}}
{"level":"info","ts":"2024-08-08T16:38:28Z","logger":"barman-cloud-check-wal-archive","msg":"2024-08-08 16:38:28,948 [123] ERROR: WAL archive check failed for server pg-88cc: Expected empty archive","pipe":"stderr","logging_pod":"pg-88cc-2"}
{"level":"error","ts":"2024-08-08T16:38:28Z","logger":"wal-archive","msg":"Error invoking barman-cloud-check-wal-archive","logging_pod":"pg-88cc-2","currentPrimary":"pg-88cc-2","targetPrimary":"pg-88cc-2","options":["--cloud-provider","aws-s3","s3://implentio-pg-88cc-backup-64c632f1cc08906d/","pg-88cc"],"exitCode":-1,"error":"exit status 1","stacktrace":"github.com/cloudnative-pg/cloudnative-pg/pkg/management/log.(*logger).Error\n\tpkg/management/log/log.go:125\ngithub.com/cloudnative-pg/cloudnative-pg/pkg/management/barman/archiver.(*WALArchiver).CheckWalArchiveDestination\n\tpkg/management/barman/archiver/archiver.go:257\ngithub.com/cloudnative-pg/cloudnative-pg/internal/cmd/manager/walarchive.checkWalArchive\n\tinternal/cmd/manager/walarchive/cmd.go:417\ngithub.com/cloudnative-pg/cloudnative-pg/internal/cmd/manager/walarchive.run\n\tinternal/cmd/manager/walarchive/cmd.go:214\ngithub.com/cloudnative-pg/cloudnative-pg/internal/cmd/manager/walarchive.NewCmd.func1\n\tinternal/cmd/manager/walarchive/cmd.go:89\ngithub.com/spf13/cobra.(*Command).execute\n\tpkg/mod/github.com/spf13/cobra@v1.8.0/command.go:983\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\tpkg/mod/github.com/spf13/cobra@v1.8.0/command.go:1115\ngithub.com/spf13/cobra.(*Command).Execute\n\tpkg/mod/github.com/spf13/cobra@v1.8.0/command.go:1039\nmain.main\n\tcmd/manager/main.go:66\nruntime.main\n\t/opt/hostedtoolcache/go/1.22.2/x64/src/runtime/proc.go:271"}
{"level":"error","ts":"2024-08-08T16:38:28Z","msg":"while barman-cloud-check-wal-archive","logging_pod":"pg-88cc-2","error":"unexpected failure invoking barman-cloud-wal-archive: exit status 1","stacktrace":"github.com/cloudnative-pg/cloudnative-pg/pkg/management/log.(*logger).Error\n\tpkg/management/log/log.go:125\ngithub.com/cloudnative-pg/cloudnative-pg/pkg/management/log.Error\n\tpkg/management/log/log.go:163\ngithub.com/cloudnative-pg/cloudnative-pg/internal/cmd/manager/walarchive.checkWalArchive\n\tinternal/cmd/manager/walarchive/cmd.go:418\ngithub.com/cloudnative-pg/cloudnative-pg/internal/cmd/manager/walarchive.run\n\tinternal/cmd/manager/walarchive/cmd.go:214\ngithub.com/cloudnative-pg/cloudnative-pg/internal/cmd/manager/walarchive.NewCmd.func1\n\tinternal/cmd/manager/walarchive/cmd.go:89\ngithub.com/spf13/cobra.(*Command).execute\n\tpkg/mod/github.com/spf13/cobra@v1.8.0/command.go:983\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\tpkg/mod/github.com/spf13/cobra@v1.8.0/command.go:1115\ngithub.com/spf13/cobra.(*Command).Execute\n\tpkg/mod/github.com/spf13/cobra@v1.8.0/command.go:1039\nmain.main\n\tcmd/manager/main.go:66\nruntime.main\n\t/opt/hostedtoolcache/go/1.22.2/x64/src/runtime/proc.go:271"}
{"level":"error","ts":"2024-08-08T16:38:28Z","logger":"wal-archive","msg":"failed to run wal-archive command","logging_pod":"pg-88cc-2","error":"unexpected failure invoking barman-cloud-wal-archive: exit status 1","stacktrace":"github.com/cloudnative-pg/cloudnative-pg/pkg/management/log.(*logger).Error\n\tpkg/management/log/log.go:125\ngithub.com/cloudnative-pg/cloudnative-pg/internal/cmd/manager/walarchive.NewCmd.func1\n\tinternal/cmd/manager/walarchive/cmd.go:95\ngithub.com/spf13/cobra.(*Command).execute\n\tpkg/mod/github.com/spf13/cobra@v1.8.0/command.go:983\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\tpkg/mod/github.com/spf13/cobra@v1.8.0/command.go:1115\ngithub.com/spf13/cobra.(*Command).Execute\n\tpkg/mod/github.com/spf13/cobra@v1.8.0/command.go:1039\nmain.main\n\tcmd/manager/main.go:66\nruntime.main\n\t/opt/hostedtoolcache/go/1.22.2/x64/src/runtime/proc.go:271"}
{"level":"info","ts":"2024-08-08T16:38:28Z","logger":"postgres","msg":"record","logging_pod":"pg-88cc-2","record":{"log_time":"2024-08-08 16:38:28.986 UTC","process_id":"38","session_id":"66b4f481.26","session_line_num":"3","session_start_time":"2024-08-08 16:38:25 UTC","transaction_id":"0","error_severity":"LOG","sql_state_code":"00000","message":"archive command failed with exit code 1","detail":"The failed archive command was: /controller/manager wal-archive --log-destination /controller/log/postgres.json pg_wal/00000002.history","backend_type":"archiver","query_id":"0"}}
{"level":"info","ts":"2024-08-08T16:38:28Z","logger":"postgres","msg":"record","logging_pod":"pg-88cc-2","record":{"log_time":"2024-08-08 16:38:28.986 UTC","process_id":"38","session_id":"66b4f481.26","session_line_num":"4","session_start_time":"2024-08-08 16:38:25 UTC","transaction_id":"0","error_severity":"WARNING","sql_state_code":"01000","message":"archiving write-ahead log file \"00000002.history\" failed too many times, will try again later","backend_type":"archiver","query_id":"0"}}
{"level":"info","ts":"2024-08-08T16:38:34Z","logger":"postgres","msg":"record","logging_pod":"pg-88cc-2","record":{"log_time":"2024-08-08 16:38:34.419 UTC","user_name":"streaming_replica","process_id":"152","connection_from":"10.0.99.58:60350","session_id":"66b4f48a.98","session_line_num":"1","command_tag":"START_REPLICATION","session_start_time":"2024-08-08 16:38:34 UTC","virtual_transaction_id":"3/0","transaction_id":"0","error_severity":"ERROR","sql_state_code":"58P01","message":"requested WAL segment 0000000700000008000000B0 has already been removed","query":"START_REPLICATION SLOT \"_cnpg_pg_88cc_1\" 8/B0000000 TIMELINE 7","application_name":"pg-88cc-1","backend_type":"walsender","query_id":"0"}}
{"level":"info","ts":"2024-08-08T16:38:40Z","logger":"postgres","msg":"record","logging_pod":"pg-88cc-2","record":{"log_time":"2024-08-08 16:38:40.128 UTC","user_name":"streaming_replica","process_id":"190","connection_from":"10.0.99.58:60356","session_id":"66b4f490.be","session_line_num":"1","command_tag":"START_REPLICATION","session_start_time":"2024-08-08 16:38:40 UTC","virtual_transaction_id":"3/0","transaction_id":"0","error_severity":"ERROR","sql_state_code":"58P01","message":"requested WAL segment 0000000700000008000000B0 has already been removed","query":"START_REPLICATION SLOT \"_cnpg_pg_88cc_1\" 8/B0000000 TIMELINE 7","application_name":"pg-88cc-1","backend_type":"walsender","query_id":"0"}}
@wesbragagt That error means a previous Postgres cluster was writing to this S3 bucket before you created the current one.
Perhaps you ran a partial destroy of the module or deleted the Postgres cluster directly via k9s? Does that sound familiar?
A safety feature of CNPG is that it will not overwrite existing WAL files if you accidentally point a cluster at an S3 bucket that already contains them, as doing so would mean data loss for the bucket's original owner.
You need to empty the S3 bucket and then restart your primary. The WAL archiving process should then continue successfully.
You can then try to restart the replica.
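Roughly, the sequence would look like this (a sketch; double-check the bucket name against spec.backup.barmanObjectStore.destinationPath before deleting anything, and confirm nothing else needs the data in that bucket):
# 1. Empty the bucket holding the stale WAL archive
aws s3 rm s3://implentio-pg-88cc-backup-64c632f1cc08906d/ --recursive
# 2. Restart the current primary so it retries the WAL archive check
kubectl cnpg restart pg-88cc pg-88cc-2 -n implentio
# 3. Once archiving reports healthy, delete the stuck replica pod so it is recreated
kubectl delete pod pg-88cc-1 -n implentio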
@fullykubed Just to confirm, this action would not cause any data loss, correct? And I did have a partial apply when attempting to scale up CPU at one point this morning.
@wesbragagt I cannot say definitively without knowing what system generated the files that are currently in your bucket. You need to determine what system owns those. If it is a system you previously removed and have no need for the data, you can empty the bucket without impacting the currently running system.
However, what I can say is that your current live database (a) does not own those files and (b) is not backed up in S3 currently.
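If it helps, you can inspect what is actually in the bucket before deleting anything (assuming your shell has the same IAM access the cluster uses); the object timestamps usually make it obvious which system wrote them:
aws s3 ls s3://implentio-pg-88cc-backup-64c632f1cc08906d/ --recursive --human-readable | head -n 30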
Closing the issue, as @fullykubed was able to walk me through the following manual steps to recover:
1. kubectl cnpg backup -n <namespace> <cluster-name>
2. Wait until the backup is done
3. terragrunt apply
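A sketch of how to confirm each step completed before moving on (resource names as above):
# Confirm the on-demand backup finished
kubectl get backups.postgresql.cnpg.io -n <namespace>
# Confirm WAL archiving is healthy and a First Point of Recoverability now exists
kubectl cnpg status <cluster-name> -n <namespace>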
What is your question?
I noticed that the primary instance of my pg cluster is stuck in a pending/initializing state where it cannot find a WAL file in S3:
{"level":"info","ts":"2024-08-08T15:49:51Z","logger":"wal-restore","msg":"WAL file not found in the recovery object store","logging_pod":"pg-88cc-1","walName":"0000000500000008 │ │ 000000C2","options":["--cloud-provider","aws-s3","s3://implentio-pg-88cc-backup-64c632f1cc08906d/","pg-88cc"],"startTime":"2024-08-08T15:49:50Z","endTime":"2024-08-08T15:49:51Z","elapse │ │ dWalTime":0.416124562}
I've followed the recommendation of extracting the cluster status info
I've extracted the logs for the instance
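(Roughly via the following; exact flags may have differed:)
kubectl cnpg status pg-88cc -n implentio
kubectl logs pg-88cc-1 -n implentio --tail=500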
What primary components of the stack does this relate to?
terraform