CrunchyData / postgres-operator

Production PostgreSQL for Kubernetes, from high availability Postgres clusters to full-scale database-as-a-service.
https://access.crunchydata.com/documentation/postgres-operator/v5/
Apache License 2.0

Getting error ERROR: [078]: unable to remap invalid link 'pg_wal' for standby cluster creation of Postgres #3827

Closed Sreeragsrg77 closed 9 months ago

Sreeragsrg77 commented 9 months ago

ERROR: [078]: unable to remap invalid link 'pg_wal'
Fri Jan 19 04:26:54 UTC 2024 ERROR: pgBackRest standby Creation: pgBackRest restore failed when creating standby
2024-01-19 04:26:54,853 ERROR: Error creating replica using method pgbackrest_standby: /opt/crunchy/bin/postgres-ha/pgbackrest/pgbackrest-create-replica.sh standby exited with code=78
2024-01-19 04:26:54,854 ERROR: failed to bootstrap clone from remote master None
2024-01-19 04:26:54,855 INFO: Removing data directory: /pgdata/
2024-01-19 04:27:04,547 INFO: removing initialize key after failed attempt to bootstrap the cluster
Process Process-1:
Traceback (most recent call last):
  File "/usr/lib64/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib64/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.6/site-packages/patroni/__init__.py", line 139, in patroni_main
    abstract_main(Patroni, schema)
  File "/usr/local/lib/python3.6/site-packages/patroni/daemon.py", line 100, in abstract_main
    controller.run()
  File "/usr/local/lib/python3.6/site-packages/patroni/__init__.py", line 109, in run
    super(Patroni, self).run()
  File "/usr/local/lib/python3.6/site-packages/patroni/daemon.py", line 59, in run
    self._run_cycle()
  File "/usr/local/lib/python3.6/site-packages/patroni/__init__.py", line 112, in _run_cycle
    logger.info(self.ha.run_cycle())
  File "/usr/local/lib/python3.6/site-packages/patroni/ha.py", line 1469, in run_cycle
    info = self._run_cycle()
  File "/usr/local/lib/python3.6/site-packages/patroni/ha.py", line 1343, in _run_cycle
    return self.post_bootstrap()
  File "/usr/local/lib/python3.6/site-packages/patroni/ha.py", line 1236, in post_bootstrap
    self.cancel_initialization()
  File "/usr/local/lib/python3.6/site-packages/patroni/ha.py", line 1229, in cancel_initialization
    raise PatroniFatalException('Failed to bootstrap cluster')
patroni.exceptions.PatroniFatalException: 'Failed to bootstrap cluster'

dsessler7 commented 9 months ago

@Sreeragsrg77, what version of PGO, postgres, and pgbackrest are you using? Can you provide PGO logs? Did you follow the directions in the doc below when setting up the standby cluster?

https://access.crunchydata.com/documentation/postgres-operator/latest/tutorials/backups-disaster-recovery/disaster-recovery#standby-cluster

Sreeragsrg77 commented 9 months ago

@dsessler7 My PGO version is 4.7.4, with psql (PostgreSQL) 13.5 and pgBackRest 2.33.

Below are the current PGO logs:

Defaulted container "apiserver" out of: apiserver, operator, scheduler, event
time="2023-12-14T06:05:35Z" level=info msg="debug flag set to false"
time="2023-12-14T06:05:40Z" level=info msg="postgres-operator apiserver starts" func="main.main()" file="cmd/apiserver/main.go:111" version=4.7.4
time="2023-12-14T06:05:40Z" level=info msg="Pgo Namespace is [pgo]" func="internal/apiserver.Initialize()" file="internal/apiserver/root.go:100" version=4.7.4
time="2023-12-14T06:05:40Z" level=info msg="InstallationName is [devtest]" func="internal/apiserver.Initialize()" file="internal/apiserver/root.go:107" version=4.7.4
time="2023-12-14T06:05:40Z" level=info msg="apiserver starts" func="internal/apiserver.Initialize()" file="internal/apiserver/root.go:119" version=4.7.4
time="2023-12-14T06:05:40Z" level=info msg="loading PermMap with 56 Permissions\n" func="internal/apiserver.initializePerms()" file="internal/apiserver/perms.go:179" version=4.7.4
time="2023-12-14T06:05:40Z" level=info msg="Config: \"pgo-config\" ConfigMap found, using config files from the configmap" func="internal/config.initialize()" file="internal/config/pgoconfig.go:751" version=4.7.4
I1214 06:05:41.823141 1 request.go:668] Waited for 1.012642997s due to client-side throttling, not priority and fairness, request: GET:https://10.237.0.1:443/apis/monitoring.coreos.com/v1?timeout=32s
time="2023-12-14T06:05:43Z" level=info msg="default instance memory set to [128Mi]" func="internal/config.(*PgoConfig).Validate()" file="internal/config/pgoconfig.go:393" version=4.7.4
time="2023-12-14T06:05:43Z" level=info msg="default pgbackrest repository memory set to [48Mi]" func="internal/config.(*PgoConfig).Validate()" file="internal/config/pgoconfig.go:399" version=4.7.4
time="2023-12-14T06:05:43Z" level=info msg="default pgbouncer memory set to [24Mi]" func="internal/config.(*PgoConfig).Validate()" file="internal/config/pgoconfig.go:405" version=4.7.4
time="2023-12-14T06:05:43Z" level=info msg="BasicAuth is true" func="internal/apiserver.initConfig()" file="internal/apiserver/root.go:190" version=4.7.4
time="2023-12-14T06:05:43Z" level=info msg="Namespace operating mode is 'dynamic'" func="internal/apiserver.Initialize()" file="internal/apiserver/root.go:151" version=4.7.4
time="2023-12-14T06:05:43Z" level=info msg="pgo.tls Secret NOT found in namespace pgo" func="internal/apiserver.WriteTLSCert()" file="internal/apiserver/root.go:407" version=4.7.4
time="2023-12-14T06:05:43Z" level=info msg="listening on port 8443" func="main.main()" file="cmd/apiserver/main.go:182" version=4.7.4
2024/01/16 08:42:12 http: TLS handshake error from 127.0.0.1:37826: tls: failed to verify client certificate: x509: certificate has expired or is not yet valid: current time 2024-01-16T08:42:12Z is after 2023-12-10T08:13:22Z
2024/01/16 08:46:52 http: TLS handshake error from 127.0.0.1:37830: tls: failed to verify client certificate: x509: certificate has expired or is not yet valid: current time 2024-01-16T08:46:52Z is after 2023-12-10T08:13:22Z
2024/01/16 09:14:45 http: TLS handshake error from 127.0.0.1:37842: tls: failed to verify client certificate: x509: certificate has expired or is not yet valid: current time 2024-01-16T09:14:45Z is after 2023-12-10T08:13:22Z
2024/01/16 09:18:07 http: TLS handshake error from 127.0.0.1:37846: tls: failed to verify client certificate: x509: certificate signed by unknown authority (possibly because of "x509: ECDSA verification failure" while trying to verify candidate authority certificate "serial:7978071590624879016")
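
The apiserver log above repeats an x509 "certificate has expired" handshake error. As a rough aid for reading such logs, here is a minimal Python sketch that pulls the expiry date out of those lines and reports how overdue the certificate was. The names `EXPIRED`, `parse_ts`, and `expired_certs` are hypothetical helpers written for this illustration and are not part of PGO. Nothing in this thread establishes that the expired client certificate caused the `pg_wal` restore failure, so treat this only as a way to summarize the log.

```python
import re
from datetime import datetime, timezone

# Matches the "current time <ts> is after <ts>" tail of the expiry error
# as it appears in the apiserver log. (Hypothetical helper, not a PGO API.)
EXPIRED = re.compile(
    r"certificate has expired or is not yet valid: "
    r"current time (?P<now>\S+) is after (?P<not_after>\S+)"
)


def parse_ts(value):
    # The timestamps in the log look like 2024-01-16T08:42:12Z (UTC).
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def expired_certs(lines):
    """Yield (not_after, days_overdue) for each expiry error found."""
    for line in lines:
        match = EXPIRED.search(line)
        if match:
            now = parse_ts(match.group("now"))
            not_after = parse_ts(match.group("not_after"))
            yield not_after, (now - not_after).days


# One line copied from the log above.
sample = [
    "2024/01/16 08:42:12 http: TLS handshake error from 127.0.0.1:37826: "
    "tls: failed to verify client certificate: x509: certificate has expired "
    "or is not yet valid: current time 2024-01-16T08:42:12Z is after "
    "2023-12-10T08:13:22Z",
]

for not_after, days in expired_certs(sample):
    print(f"client certificate expired on {not_after:%Y-%m-%d}, "
          f"{days} days before the failed handshake")
```

To run it against a saved log instead of the sample, pass an open file object to `expired_certs`; lines without the expiry message are skipped.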

Sreeragsrg77 commented 9 months ago

I was able to resolve this by restarting my PGO pods and redeploying the standby cluster.

dsessler7 commented 9 months ago

Glad you got it sorted out. I will say that 4.7 is very old at this point and is essentially EOL (except for customers with an "Extended Support Subscription" as indicated here), so I highly recommend upgrading to the latest and greatest CPK (currently 5.5.0).