When a multi-shard SELECT query has a CTE or non-pushdownable subquery, we execute the query in a distributed transaction even if the SELECT is not in a transaction block. This means that we first send BEGIN on each connection and send commit/abort at the end of the transaction on the coordinator. Between the execution of individual steps, the connections are in an idle in transaction state. Postgres connections are a scarce resource and we should not allow them to be idle, since it increases the likelihood of the worker node running out of connections (or memory?).
The reason we use a distributed transaction when there is a CTE/non-pushdownable subquery is that the intermediate result is coupled to the distributed transaction ID. Queries can only access the intermediate result if they are part of the same distributed transaction and the result is removed when the distributed transaction ends. However, we can actually commit/abort individual worker transactions and later start a new transaction which is assigned the same distibuted transaction ID. To avoid adding extra round-trips, we could send the transaction block as a single batch BEGIN; SELECT assign_distributed_transaction_id(...); SELECT ... ; COMMIT. That way, the connection is released into the pgbouncer pool immediately after the SELECT finishes and can be used for other work.
None of this applies if the multi-shard SELECT is in a transaction block, especially if it needs to be able to read uncommitted results. However, it could provide much better throughput for single-statement, multi-shard SELECTs.
It may actually be worth doing this for all multi-shard SELECTs to allow operators to know which query on a worker belongs to which distributed query on the coordinator.
When a multi-shard SELECT query has a CTE or non-pushdownable subquery, we execute the query in a distributed transaction even if the SELECT is not in a transaction block. This means that we first send BEGIN on each connection and send commit/abort at the end of the transaction on the coordinator. Between the execution of individual steps, the connections are in an idle in transaction state. Postgres connections are a scarce resource and we should not allow them to be idle, since it increases the likelihood of the worker node running out of connections (or memory?).
The reason we use a distributed transaction when there is a CTE/non-pushdownable subquery is that the intermediate result is coupled to the distributed transaction ID. Queries can only access the intermediate result if they are part of the same distributed transaction and the result is removed when the distributed transaction ends. However, we can actually commit/abort individual worker transactions and later start a new transaction which is assigned the same distibuted transaction ID. To avoid adding extra round-trips, we could send the transaction block as a single batch
BEGIN; SELECT assign_distributed_transaction_id(...); SELECT ... ; COMMIT
. That way, the connection is released into the pgbouncer pool immediately after the SELECT finishes and can be used for other work.None of this applies if the multi-shard SELECT is in a transaction block, especially if it needs to be able to read uncommitted results. However, it could provide much better throughput for single-statement, multi-shard SELECTs.