Panfactum / stack

The Panfactum Stack
https://panfactum.com

[question]: kube_pg_cluster failover delay too long #110

Closed · wesbragagt closed 3 months ago

wesbragagt commented 3 months ago

What is your question?

I'm trying to understand whether I should lower the failover delay for switching primaries in a cluster, in order to allow quicker connection recovery for long-running ETL jobs that might face a connection hiccup. There are a few areas where I need clarity in order to make a better decision when configuring my cluster:

  1. `stopDelay` is set to a computed local: `stopDelay = var.pg_shutdown_timeout != null ? var.pg_shutdown_timeout : (var.burstable_instances_enabled || var.spot_instances_enabled ? (10 + 60) : (15 * 60 + 10))`. Could you help me understand where those numbers come from?
  2. Is it OK to set a faster `failoverDelay`, such as 0? Are there any trade-offs I might not be aware of?
  3. Is PgBouncer something that can help applications reconnect after connection hiccups caused by pod restarts?

What primary components of the stack does this relate to?

terraform

fullykubed commented 3 months ago

There are a few different things to consider here:

Question 1

See these docs.

We set these limits to a low value (70 seconds) when running on spot-style nodes (burstable/spot) because we need to ensure that graceful termination completes before the instance is shut down by a spot interruption. This prevents data loss. The value is higher (15 minutes) when using on-demand instances because node termination isn't forced.
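
Spelling out the arithmetic in that local (same expression as in your question, just annotated; the comments restate the reasoning above, plus the fact that AWS gives a ~2-minute spot interruption notice):

```hcl
locals {
  # A user-supplied override always wins; otherwise the default depends
  # on the node class the cluster runs on.
  stopDelay = var.pg_shutdown_timeout != null ? var.pg_shutdown_timeout : (
    var.burstable_instances_enabled || var.spot_instances_enabled
    ? (10 + 60)      # 70s: completes safely inside the ~2-minute spot interruption notice
    : (15 * 60 + 10) # 910s: ~15 minutes for queries to drain plus a small buffer; termination isn't forced
  )
}
```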

The key tradeoff is this: a larger `smartShutdownTimeout` gives existing queries more time to finish executing, but it also increases the amount of time during which new queries cannot be submitted (i.e., the database will appear to be down to clients).

It would be perfectly reasonable to set `smartShutdownTimeout` to 0 and thereby minimize your overall downtime, keeping in mind that this will immediately cancel any running queries when a pg pod terminates. In fact, I think this should probably be the default for the module, and I will likely update this in the near future.
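
For illustration, opting into that behavior today would look something like the sketch below, assuming `pg_shutdown_timeout` is the input that feeds this timeout (as in the snippet from your question); the module label and source here are placeholders:

```hcl
module "pg_cluster" {
  source = "..." # placeholder: however you currently reference kube_pg_cluster

  # Short-circuits the computed default above: running queries are
  # canceled immediately on pod termination, minimizing the window in
  # which the database appears down to clients.
  pg_shutdown_timeout = 0
}
```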

Question 2

The `failoverDelay` controls how long the primary pod must be in an unhealthy / terminating state and/or missing before CNPG will promote a replica to primary.

This value overrides `smartShutdownTimeout`, so if you set `failoverDelay` to 0, then `smartShutdownTimeout` is effectively 0 for the primary pod. This means that any running queries will be canceled immediately, rather than allowed to complete, as soon as the pod receives a termination signal.

Upon reviewing this logic and running several tests, I think that in almost all circumstances `failoverDelay` should equal `smartShutdownTimeout`, so I am going to update the module to reflect that.
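
Concretely, the change amounts to deriving both knobs from a single value, along these lines (an illustrative sketch, not the actual diff):

```hcl
locals {
  # Illustrative only: couple the two settings so the failover trigger
  # and the smart-shutdown window always agree. With a shared value of 0,
  # the replica is promoted as soon as the primary starts terminating.
  shutdown_timeout = 0

  smart_shutdown_timeout = local.shutdown_timeout # maps to smartShutdownTimeout
  failover_delay         = local.shutdown_timeout # maps to failoverDelay
}
```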

Question 3

This depends on what you mean.

PgBouncer will not automatically retry queries that are interrupted due to a database failover.

However, it will ensure that connections from your application clients are not closed even if PgBouncer itself needs to reset its connections to the upstream database.

Additionally, PgBouncer can queue connections / queries so that your clients can still connect even while the upstream database is down. The queries are simply deferred until the upstream becomes available again.

This can reduce the impact a database failover has on your client application, since the application does not need to reset all of its connections. That said, this depends on the `pool_mode` you are using.
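
For illustration only, selecting the pool mode through the module might look like the sketch below; `pgbouncer_pool_mode` is an assumed input name, so check the module's variables for the real one:

```hcl
module "pg_cluster" {
  source = "..." # placeholder: however you currently reference kube_pg_cluster

  # "transaction" hands a server connection to a client only for the
  # duration of a transaction, so PgBouncer can re-establish upstream
  # connections after a failover without dropping idle client connections.
  pgbouncer_pool_mode = "transaction"
}
```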

This article might be helpful: https://blog.digitalis.io/using-pgbouncer-with-consul-for-postgresql-high-availability-686049c232fc?gi=cd89a7461c11

wesbragagt commented 3 months ago

@fullykubed Thank you! This detailed explanation helped me, and I'm sure it will help future users with the same question when deciding on their optimal setup.