citusdata / citus

Connection management improvements for multi-database setup #7244

Open ivyazmitinov opened 9 months ago

ivyazmitinov commented 9 months ago

This issue is more of an invitation to a technical discussion than a solid proposal.

Currently, the main issue with having multiple DATABASEs in a cluster is connection management, which is not adapted to this scenario. The issue can be split into two parts:

  1. The most problematic areas are the transaction recovery and distributed deadlock detection mechanisms, which do not respect citus.max_shared_pool_size and open at least n * d connections per worker, where n is the number of nodes in the cluster and d is the number of DATABASEs (for example, 10 nodes and 20 databases already mean at least 200 connections per worker).
  2. Even if they respected citus.max_shared_pool_size, the limit is applied per DATABASE, meaning that it would still require n * d connections.

In order to overcome this, I propose to:

  1. Make citus.max_shared_pool_size cluster-wide. A per-database setting could then be used to set a connection quota for a database within the global citus.max_shared_pool_size.
  2. Make the transaction recovery and distributed deadlock detection respect the improved citus.max_shared_pool_size.

Since those changes mostly make sense for the multi-database setup, they could be enabled by a single GUC, such as citus.multi_database, or by separate GUCs controlling the behaviour of citus.max_shared_pool_size, transaction recovery, and distributed deadlock detection.
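
To make the first proposal a bit more concrete, here is a minimal sketch of the intended accounting, assuming a cluster-wide limit with an optional per-database quota. The function name, parameters, and the absence of any locking are simplifications for illustration only, not existing Citus code:

```c
#include <stdbool.h>

/*
 * Hypothetical admission check: a new worker connection is allowed only if
 * both the cluster-wide pool and the database's own quota have room.
 * In a real implementation this would run under the lock protecting the
 * shared-memory counters.
 */
static bool
TryReserveSharedConnection(int *clusterWideCount, int clusterWideLimit,
						   int *databaseCount, int databaseQuota)
{
	if (*clusterWideCount >= clusterWideLimit)
	{
		/* global citus.max_shared_pool_size is exhausted */
		return false;
	}

	if (databaseQuota >= 0 && *databaseCount >= databaseQuota)
	{
		/* this database's quota within the global pool is exhausted */
		return false;
	}

	(*clusterWideCount)++;
	(*databaseCount)++;
	return true;
}
```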

marcoslot commented 9 months ago

Make citus.max_shared_pool_size cluster-wide. A per-database setting could then be used to set a connection quota for a database within the global citus.max_shared_pool_size.

Seems reasonable, not sure why we decided to include database in the key.
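
For reference, the key in question looks roughly like the sketch below (names and lengths are approximate, not a verbatim copy of shared_connection_stats.c); dropping the database component is essentially what would make the limit cluster-wide:

```c
#include "postgres.h"

/* approximate sketch of the shared pool hash key, not the exact definition */
#define WORKER_NAME_LENGTH 256		/* placeholder for the actual constant */

typedef struct SharedConnStatsHashKey
{
	char hostname[WORKER_NAME_LENGTH];
	int32 port;

	/* this field is what scopes citus.max_shared_pool_size per database */
	Oid databaseOid;
} SharedConnStatsHashKey;
```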

Make the transaction recovery and distributed deadlock detection respect the improved citus.max_shared_pool_size.

I think as a short-term fix they probably should not count towards max_shared_pool_size, since there is not much to be gained from it.

The longer term fix is to avoid a persistent connection per database. Connections from the maintenance daemon are cached because of how frequently the deadlock detector asks for lock graphs. However, we don't actually need a deadlock detector per database. Just one is enough.

The other (database-specific) operations in the maintenance daemon are much less frequent and could afford to open a new connection every time, and either honour max_shared_pool_size (block if none available) or have a separate shared pool.
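
As a rough illustration of the first option (honour max_shared_pool_size and block while the pool is full), a maintenance operation could be structured like the sketch below. All helper names and types are hypothetical, not existing Citus APIs; the point is only the open-use-close pattern bounded by the shared pool:

```c
#include "postgres.h"

/* hypothetical helpers, declared here only to keep the sketch self-contained */
typedef struct MaintenanceConnection MaintenanceConnection;
extern void WaitForSharedPoolSlot(const char *hostname, int port);
extern void ReleaseSharedPoolSlot(const char *hostname, int port);
extern MaintenanceConnection *OpenMaintenanceConnection(const char *hostname, int port);
extern void ExecuteMaintenanceCommand(MaintenanceConnection *connection,
									  const char *command);
extern void CloseMaintenanceConnection(MaintenanceConnection *connection);

static void
RunMaintenanceCommandOnWorker(const char *hostname, int port, const char *command)
{
	/* block until the shared pool has a free slot for this worker */
	WaitForSharedPoolSlot(hostname, port);

	/* open a short-lived connection instead of caching one per database */
	MaintenanceConnection *connection = OpenMaintenanceConnection(hostname, port);
	ExecuteMaintenanceCommand(connection, command);
	CloseMaintenanceConnection(connection);

	/* always return the slot so user queries are not starved */
	ReleaseSharedPoolSlot(hostname, port);
}
```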

Thanks for sharing your feedback on this!

ivyazmitinov commented 9 months ago

not sure why we decided to include database in the key

According to the comment, it is there specifically to support the per-database setting :slightly_smiling_face:

The longer term fix is to avoid a persistent connection per database. Connections from the maintenance daemon are cached because of how frequently the deadlock detector asks for lock graphs. However, we don't actually need a deadlock detector per database. Just one is enough. The other (database-specific) operations in the maintenance daemon are much less frequent and could afford to open a new connection every time, and either honour max_shared_pool_size (block if none available) or have a separate shared pool.

Agreed, this improvement looks great! I was more focused on limiting the number of connections to workers than on their caching, but removing this unnecessary caching will also be beneficial.

either honour max_shared_pool_size (block if none available) or have a separate shared pool.

I would go with a separate shared pool, but introduce an option to configure it as a percentage of max_shared_pool_size, like max_shared_pool_maintenance_percent, dedicated to maintenance operations only. This way, all operations would be bounded by the actual number of connection slots available, but maintenance operations wouldn't compete with user queries.
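
A minimal sketch of the arithmetic, assuming the proposed max_shared_pool_maintenance_percent GUC; the backing C variables and values here are hypothetical:

```c
#include "postgres.h"

/* hypothetical variables backing the GUCs discussed above */
static int MaxSharedPoolSize = 100;					/* citus.max_shared_pool_size */
static int MaxSharedPoolMaintenancePercent = 10;	/* proposed percentage GUC */

/*
 * With the values above, maintenance operations would share 10 connection
 * slots per worker and regular (user query) connections the remaining 90.
 */
static int
MaintenancePoolSize(void)
{
	int maintenanceSlots = MaxSharedPoolSize * MaxSharedPoolMaintenancePercent / 100;

	/* keep at least one slot so maintenance can always make progress */
	return Max(1, maintenanceSlots);
}

static int
UserPoolSize(void)
{
	return MaxSharedPoolSize - MaintenancePoolSize();
}
```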

Thanks for sharing your feedback on this!

I am actually up for contributing (we have suffered from this for long enough), so if you approve, I may start right away :slightly_smiling_face:

onderkalaci commented 9 months ago

either honour max_shared_pool_size (block if none available) or have a separate shared pool.

I would go with a separate shared pool, but introduce an option to configure it as a percentage of max_shared_pool_size, like max_shared_pool_maintenance_percent, dedicated to maintenance operations only. This way, all operations would be bounded by the actual number of connection slots available, but maintenance operations wouldn't compete with user queries.

The main reason we implemented citus.max_shared_pool_size is to prevent the coordinator from establishing more connections than any worker can accept (e.g., by keeping citus.max_shared_pool_size below the worker's max_connections).

I think making citus.max_shared_pool_size global (i.e., not per database) sounds useful and aligns with why we added it. One important caveat here is that citus.max_shared_pool_size cannot and should not be settable per database, otherwise things might become complicated. As far as I know, SIGHUP GUCs -- such as citus.max_shared_pool_size -- cannot be set per database.
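
For context, this is roughly how such a GUC is registered; the snippet is a simplified sketch with illustrative default/min/max values, not the actual Citus registration code. The PGC_SIGHUP context is what makes it changeable via postgresql.conf or ALTER SYSTEM plus pg_reload_conf(), but not via ALTER DATABASE ... SET:

```c
#include "postgres.h"
#include "utils/guc.h"

#include <limits.h>

/* backing variable for the GUC; name and default are illustrative */
static int MaxSharedPoolSize = 0;

static void
RegisterSharedPoolSizeGuc(void)
{
	DefineCustomIntVariable(
		"citus.max_shared_pool_size",
		gettext_noop("Cluster-wide cap on connections opened to worker nodes."),
		NULL,
		&MaxSharedPoolSize,
		0,				/* boot value (illustrative) */
		-1,				/* min (illustrative) */
		INT_MAX,		/* max */
		PGC_SIGHUP,		/* reloadable, but not settable per database */
		0,				/* flags */
		NULL, NULL, NULL);
}
```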

When it comes to a separate pool vs some percentage in the pool, I'm very much in favor of the latter. The former could be very hard to implement.

I think even the latter is non-trivial to implement; there are a bunch of areas that need to be thought through, such as making sure that this is not broken: https://github.com/citusdata/citus/blob/main/src/backend/distributed/connection/locally_reserved_shared_connections.c or https://github.com/citusdata/citus/blob/main/src/backend/distributed/executor/adaptive_executor.c#L4707

So, please make sure we have good test coverage involving these areas as well as base scenarios with multiple databases.

The longer term fix is to avoid a persistent connection per database

Also, see https://github.com/citusdata/citus/issues/3671 as a reference.

I am actually up for contributing (we have suffered from this for long enough), so if you approve, I may start right away 🙂

Great!

ivyazmitinov commented 9 months ago

Thanks for the useful insights, @onderkalaci!

Before I start, I have one question regarding the deadlock detector:

However, we don't actually need a deadlock detector per database. Just one is enough

I assume the way to solve this is to have one "management" database specifically for such cluster-wide procedures, and that this is being implemented now in the pool_2pc branch?

marcoslot commented 9 months ago

I assume the way to solve this is to have one "management" database specifically for such cluster-wide procedures, and that this is being implemented now in the pool_2pc branch?

Yes, I think so. The goal of that infrastructure is to be able to sync and track role & database-related commands across the cluster by funneling them all through the management database (even from non-Citus databases). We could then also decide to run the deadlock detector only in the management database.