citusdata / citus

Distributed PostgreSQL as an extension
https://www.citusdata.com

Connection Management improvements when the shards are on the node itself #4178

Closed: onderkalaci closed this issue 3 years ago

onderkalaci commented 4 years ago

When a node holds both the shards and the metadata, two things could happen concurrently:

The above situation makes it hard to do automatic connection management. In adaptive connection management (#3692), the coordinator is the only node establishing connections to the workers, so it can control all outgoing connections and react accordingly (e.g., automatically adjust the pool size when the total number of outgoing connections gets close to max_connections on the workers).
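For context, a minimal sketch of the knobs this kind of management works against, assuming the citus.max_shared_pool_size and citus.max_adaptive_executor_pool_size GUCs from the adaptive connection management work; the values below are purely illustrative, not recommendations:

```sql
-- Illustrative only: coordinator-side caps that keep outgoing connections
-- well below each worker's max_connections.
SHOW max_connections;                               -- slots available on this node

-- Node-wide cap on connections kept open per worker.
ALTER SYSTEM SET citus.max_shared_pool_size = 80;
SELECT pg_reload_conf();

-- Per-query cap on parallel connections per worker.
SET citus.max_adaptive_executor_pool_size = 16;
```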

In the two cases explained above, the problem is different: regular clients also take slots from max_connections, and Citus competes with them when it needs to connect back to itself for parallel queries.
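The competition is visible in the regular PostgreSQL catalogs; a quick check (nothing Citus-specific assumed here):

```sql
-- How many of the node's connection slots are already taken, by regular
-- clients and Citus-initiated backends alike.
SELECT count(*)                                 AS backends_in_use,
       current_setting('max_connections')::int  AS max_connections
FROM pg_stat_activity;
-- Citus-initiated backends can usually be told apart by their
-- application_name, though the exact value depends on the Citus version.
```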

The above is valid for two cases:

The former is much more pressing because all the load is on a single node, sharing a single max_connections pool.
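To make that pressure concrete (the numbers are illustrative only): if max_connections is 300 and a parallel multi-shard query opens up to one connection per shard it touches, a query over 32 shards occupies its own client slot plus up to 32 extra slots on the same node, so nine such queries running concurrently already use roughly 297 slots, leaving regular clients to fight over what remains.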

We have several ideas to tackle this problem, and we might end up doing more than one of them:

1) Significantly reduce parallelism above some #connections threshold to keep a safe margin
2) Make the executor fall back to local execution after connection failure
3) Count all of the backends and make sure to never exceed max_connections (see the sketch after this list)
4) Increase max_connections significantly on Hyperscale (Citus). The current default is 300, but is that really a good default?
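A minimal sketch of the bookkeeping behind (2) and (3): the real logic would live in the adaptive executor in C, but the arithmetic it needs is roughly the following (superuser_reserved_connections slots are never available to regular backends):

```sql
-- Illustrative only: how many connection slots remain for Citus to use when
-- it connects back to this node. If this hits zero, falling back to local
-- execution is preferable to failing the query.
SELECT current_setting('max_connections')::int
     - current_setting('superuser_reserved_connections')::int
     - count(*) AS free_slots
FROM pg_stat_activity;
```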

For a combination of (2) and (3), I have this prototype: https://github.com/citusdata/citus/tree/move_local_execution_to_end

onderkalaci commented 3 years ago

Fixed by #4338. Some minor improvements can be done via separate issues/PRs.