Checks timeout, probably waiting for a connection to primary

mrkurt commented 3 years ago

The check code uses all of the app's IPs to find the primary. I think this might be causing some of the check timeouts. It would be quicker, and hopefully less brittle, to use IPs from the primary region only.

davissp14 commented 3 years ago

Agreed, I have a refactor in progress that should address this. Also, I'm thinking we could probably just pull leadership information from the local node. We can resolve it by running show primary_conninfo; on the secondary, and if we wanted to be super quick about it we could read it straight from the postgresql.conf file.
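
As a rough illustration of the idea above (not the refactor itself), here is a minimal Go sketch that pulls the primary's host out of a primary_conninfo value. It assumes the value has already been fetched, either via show primary_conninfo; or from postgresql.conf, and that it uses the libpq keyword/value form. The names parseConninfo and primaryHost and the example address are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// parseConninfo splits a libpq keyword/value string such as
// "host=fdaa:0:1::3 port=5433 user=repluser" into a map.
// Single-quoted values with backslash escapes are handled.
// (Hypothetical helper, not taken from the postgres-ha code.)
func parseConninfo(s string) (map[string]string, error) {
	isSpace := func(c byte) bool { return c == ' ' || c == '\t' || c == '\n' }
	out := map[string]string{}
	i := 0
	for i < len(s) {
		// Skip whitespace between pairs.
		for i < len(s) && isSpace(s[i]) {
			i++
		}
		if i >= len(s) {
			break
		}
		eq := strings.IndexByte(s[i:], '=')
		if eq < 0 {
			return nil, fmt.Errorf("missing '=' after %q", s[i:])
		}
		key := strings.TrimSpace(s[i : i+eq])
		if key == "" {
			return nil, fmt.Errorf("empty keyword at offset %d", i)
		}
		i += eq + 1
		// Spaces are allowed around '='.
		for i < len(s) && isSpace(s[i]) {
			i++
		}
		var val strings.Builder
		if i < len(s) && s[i] == '\'' {
			// Quoted value: read up to the closing quote.
			i++
			closed := false
			for i < len(s) {
				if s[i] == '\\' && i+1 < len(s) {
					val.WriteByte(s[i+1])
					i += 2
					continue
				}
				if s[i] == '\'' {
					closed = true
					i++
					break
				}
				val.WriteByte(s[i])
				i++
			}
			if !closed {
				return nil, fmt.Errorf("unterminated quote in value for %q", key)
			}
		} else {
			// Unquoted value: read up to the next whitespace.
			for i < len(s) && !isSpace(s[i]) {
				val.WriteByte(s[i])
				i++
			}
		}
		out[key] = val.String()
	}
	return out, nil
}

// primaryHost returns the host keyword from a primary_conninfo value.
// An empty value yields an error, which most likely means the local
// node is not configured as a standby.
func primaryHost(conninfo string) (string, error) {
	kv, err := parseConninfo(conninfo)
	if err != nil {
		return "", err
	}
	host := kv["host"]
	if host == "" {
		return "", fmt.Errorf("no host in primary_conninfo")
	}
	return host, nil
}
```

With this, primaryHost("host=fdaa:0:1::3 port=5433 user=repluser") yields fdaa:0:1::3 without any network round trip to other nodes.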

davissp14 commented 2 years ago

I made quite a few optimizations on this front. While we may continue to see timeouts occasionally, we will at least get details about exactly what timed out.

If it's a particular check that caused the timeout, we will see check output that looks something like:

[✓] transactions: readonly (239.05µs)
[✗] replication: Timed out (9.99s)
[-] connections: Not processed

If we see a connection timeout before the checks are processed, we will see the same old "Context deadline exceeded" error, except it will now include additional information about the node it's failing to connect to.

These changes are available in the latest PG12/PG13 images.

fly-apps / postgres-ha

Checks timeout, probably waiting for a connection to primary #31