fly-apps / postgres-flex

Postgres HA setup using repmgr

Panic when the number of replicas is bigger than `max_wal_senders` #176

Open rugwirobaker opened 1 year ago

rugwirobaker commented 1 year ago

When you try to add a replica beyond the max_wal_senders limit, the new replica panics on boot.

2023-03-20T22:09:14Z app[21781770c63089] den [info]Provisioning standby
2023-03-20T22:09:15Z app[21781770c63089] den [info]repmgr -h fdaa:0:c688:a7b:d5a6:6646:f15a:2 -p 5433 -d repmgr -U repmgr -f /data/repmgr.conf standby clone -c -F
2023-03-20T22:09:22Z app[21781770c63089] den [info]panic: failed to clone primary: failed to clone primary: exit status 1
2023-03-20T22:09:22Z app[21781770c63089] den [info]goroutine 1 [running]:
2023-03-20T22:09:22Z app[21781770c63089] den [info]main.panicHandler({0x9a0c40?, 0xc0004303e0})
2023-03-20T22:09:22Z app[21781770c63089] den [info] /go/src/github.com/fly-examples/fly-postgres/cmd/start/main.go:100 +0x55
2023-03-20T22:09:22Z app[21781770c63089] den [info]main.main()
2023-03-20T22:09:22Z app[21781770c63089] den [info] /go/src/github.com/fly-examples/fly-postgres/cmd/start/main.go:34 +0xadd

We should handle this gracefully by logging an error or even returning one to the user. Perhaps we should even automatically check the max_wal_senders setting and optionally update it before adding the new replica.

davissp14 commented 1 year ago

This only impacts the new replica coming up, right?

rugwirobaker commented 1 year ago

Yep, only the new replica goes into a restart loop, because it can't pull repmgr from the primary.

guillaumervls commented 1 year ago

I'm adding the "not now" flag since this looks like a pain only when you have >10 replicas (by default, max_wal_senders is 10, right?).

Maybe this can be "temp fixed" by a note in the docs?

rugwirobaker commented 1 year ago

Yep, this only becomes an issue when you have >=10 replicas.