CrunchyData / postgres-operator

Production PostgreSQL for Kubernetes, from high availability Postgres clusters to full-scale database-as-a-service.
https://access.crunchydata.com/documentation/postgres-operator/v5/
Apache License 2.0

how to perform a manual failover without applying any outstanding WAL? #3957

Open mzwettler2 opened 4 months ago

mzwettler2 commented 4 months ago

We have a simple configuration with 2 replicas (1 primary + 1 standby).

We configured the standby to run 3 hours behind the primary:
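A delayed replica of this kind could be sketched roughly as follows, assuming PGO's Patroni `dynamicConfiguration` passthrough and PostgreSQL's stock `recovery_min_apply_delay` parameter (the cluster name and exact field layout here are illustrative, not the poster's actual configuration):

```yaml
# Illustrative PostgresCluster fragment: hold WAL replay on the
# replica back by 3 hours via a standard PostgreSQL parameter.
apiVersion: postgres-operator.crunchydata.com/v1beta1
kind: PostgresCluster
metadata:
  name: hippo
spec:
  patroni:
    dynamicConfiguration:
      postgresql:
        parameters:
          # Replica does not apply WAL records until they are
          # at least this old (PostgreSQL 12+ GUC).
          recovery_min_apply_delay: 3h
```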

In case there is a logical problem on the primary (wrong data processing, a misguided application upgrade, …) we want to perform a manual failover to the standby, which still contains the older, correct state of the data. That means the standby must not apply any further WAL from the 3-hour lag window in the event of a manual failover.

I could not find any working solution. Whatever I tried, the standby first applied all outstanding WAL (which I don't want) and only then was promoted. Any idea how to get this working?

andrewlecuyer commented 3 months ago

Hi @mzwettler2, if you want to restore your data back to a specific time (e.g. to three hours prior, specifically due to an issue with your data), this sounds like a Disaster Recovery (DR) scenario. More specifically, it sounds like you want a point-in-time restore (PITR), as discussed in the following docs:

https://access.crunchydata.com/documentation/postgres-operator/latest/tutorials/backups-disaster-recovery/disaster-recovery#perform-an-in-place-point-in-time-recovery-pitr

I therefore recommend this as the best and safest way to meet your use case.
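For reference, an in-place PITR of the kind the linked docs describe might look something like the sketch below; the cluster name, repo name, and target timestamp are placeholders, and the field names follow the PGO v5 documentation rather than this thread:

```yaml
# Illustrative in-place PITR fragment for a PostgresCluster spec:
# ask pgBackRest to restore the cluster to a specific point in time.
spec:
  backups:
    pgbackrest:
      restore:
        enabled: true
        repoName: repo1   # which pgBackRest repo to restore from
        options:
        - --type=time
        - --target="2024-01-01 12:00:00+00"  # placeholder target time
```

Per the PGO docs, the restore is then triggered by annotating the cluster, along the lines of `kubectl annotate postgrescluster hippo --overwrite postgres-operator.crunchydata.com/pgbackrest-restore="$(date)"`.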

mzwettler2 commented 3 months ago

Hi @andrewlecuyer, thanks for your answer.

We have to be back online very quickly in the event of an error. The database is very large, so a PITR would take too long. That's why we want to work with a lagging standby database, which is a very common setup; we also do this in classic (non-Kubernetes) database operations. Unfortunately, the Crunchy implementation does not currently enable it on Kubernetes.