mrkurt closed this 3 years ago
Yeah, I think this will be an easy fix.
If we want to communicate how far behind a standby is in "size", we can get the current WAL position of the primary by calling pg_current_wal_lsn(), then calculate the diff against the last WAL position each client reports as replayed or flushed (replay_lsn/flush_lsn). If we just want to communicate time, we can query replay_lag/flush_lag on the pg_stat_replication view.
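A rough sketch of the "size" calculation, in case it helps. The query runs on the primary and lets Postgres do the subtraction with pg_wal_lsn_diff(). The Go helpers do the same arithmetic client-side for cases where we only have the two LSN strings. The names lagQuery, parseLSN and lagBytes are made up for illustration and are not taken from the postgres-ha code.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// lagQuery is meant to run on the primary. It reports, per connected standby,
// the replay lag in bytes and the replay lag as an interval.
// (Hypothetical constant, not from the postgres-ha repo.)
const lagQuery = `
SELECT application_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes,
       replay_lag
FROM pg_stat_replication`

// parseLSN converts a textual LSN such as "16/B374D848" into a byte position.
// The part before the slash is the high 32 bits, the part after is the low 32 bits.
func parseLSN(lsn string) (uint64, error) {
	parts := strings.SplitN(lsn, "/", 2)
	if len(parts) != 2 {
		return 0, fmt.Errorf("invalid LSN %q", lsn)
	}
	hi, err := strconv.ParseUint(parts[0], 16, 32)
	if err != nil {
		return 0, fmt.Errorf("invalid LSN %q: %w", lsn, err)
	}
	lo, err := strconv.ParseUint(parts[1], 16, 32)
	if err != nil {
		return 0, fmt.Errorf("invalid LSN %q: %w", lsn, err)
	}
	return hi<<32 | lo, nil
}

// lagBytes returns how many bytes of WAL the standby is behind the primary.
// It returns 0 if the standby position is equal to or ahead of the primary's.
func lagBytes(primaryLSN, standbyLSN string) (uint64, error) {
	p, err := parseLSN(primaryLSN)
	if err != nil {
		return 0, err
	}
	s, err := parseLSN(standbyLSN)
	if err != nil {
		return 0, err
	}
	if s >= p {
		return 0, nil
	}
	return p - s, nil
}
```

For example, lagBytes("0/3000060", "0/3000000") gives 96, since the two positions differ by 0x60 bytes.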
This has been addressed here: https://github.com/fly-apps/postgres-ha/blob/f33491928fa1571ee1fd422d8466fbfc47ab23ad/pkg/flycheck/pg.go#L124
New PG apps will have these changes, but I haven't rolled them out everywhere yet.
Replication lag checks should only run against the primary. I tried to build them into the replicas and that was not a good choice. :)
These checks will be important for future app work. We need to be able to tell our proxy to stop serving traffic to apps in a region with a stale replica. So adding multiple "checks" to the list for each replica might make sense.
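To make the "one check per replica" idea concrete, here is a minimal sketch. It assumes the primary has already read each standby's replay_lag and that we pick some staleness limit. The types replicaLag and checkResult, the function replicaChecks, and the "repl-lag-" naming are all hypothetical, not the flycheck API.

```go
package main

import (
	"fmt"
	"time"
)

// replicaLag is one row read on the primary (hypothetical type).
type replicaLag struct {
	Name      string        // standby name, e.g. application_name
	ReplayLag time.Duration // replay_lag reported for that standby
}

// checkResult is one named check entry (hypothetical type).
type checkResult struct {
	Name    string
	Passing bool
	Message string
}

// replicaChecks emits one check per replica. A replica passes while its
// replay lag is at or below maxLag, so a proxy could stop routing to the
// region of any replica whose check is failing.
func replicaChecks(rows []replicaLag, maxLag time.Duration) []checkResult {
	results := make([]checkResult, 0, len(rows))
	for _, r := range rows {
		results = append(results, checkResult{
			Name:    "repl-lag-" + r.Name,
			Passing: r.ReplayLag <= maxLag,
			Message: fmt.Sprintf("replay lag %s (limit %s)", r.ReplayLag, maxLag),
		})
	}
	return results
}
```

Keeping the evaluation on the primary matches the point above about where lag checks should run. The threshold would need tuning per app.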