Replication lag health check is wrong

fly-apps / postgres-ha

Postgres + Stolon for HA clusters as Fly apps.

Replication lag health check is wrong #30

Status: Closed (mrkurt closed 3 years ago)

mrkurt commented 3 years ago

Replication lag checks should only run against the primary. I tried to build them into the replicas, and that was not a good choice. :)

These checks will be important for future app work. We need to be able to tell our proxy to stop serving traffic to apps in a region with a stale replica. So adding multiple "checks" to the list for each replica might make sense.

davissp14 commented 3 years ago

Yeah, I think this will be an easy fix.

If we want to communicate how far behind a standby is in "size", we can get the primary's current WAL position by calling pg_current_wal_lsn() and diff it against the last WAL position each standby reports as flushed or replayed. If we just want to communicate time, we can query replay_lag/flush_lag on the pg_stat_replication view.

davissp14 commented 3 years ago

This has been addressed here: https://github.com/fly-apps/postgres-ha/blob/f33491928fa1571ee1fd422d8466fbfc47ab23ad/pkg/flycheck/pg.go#L124

New PG apps will have these changes, but I haven't rolled them out everywhere yet.