citusdata / citus

Distributed PostgreSQL as an extension
https://www.citusdata.com
GNU Affero General Public License v3.0

master_add_node gets stuck waiting for a lock #2604

Open mtuncer opened 5 years ago

mtuncer commented 5 years ago

We have recently found an issue where master_add_node gets stuck while there is some load on the cluster.

There is an internal discussion at https://groups.google.com/a/citusdata.com/forum/?utm_medium=email&utm_source=footer#!msg/dev/en0Oq1Ufm78/nfFBD8-cAgAJ

It has not been reported by any user yet.

The initial investigation did not produce any tangible results.

@marcocitus had the following comment:

When master_add_node replicates the reference table it takes an ExclusiveLock on the shard ID of the reference table, which conflicts with any kind of write to the reference table, but as far as I can tell we do not lock the table itself.
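To make the lock interaction above concrete, here is a minimal sketch of PostgreSQL's lock-mode conflict rules, transcribed from the "Explicit Locking" table in the PostgreSQL documentation. The helper name `conflicts` is made up for illustration, and which modes Citus actually requests on the shard ID for reads and writes is an assumption here, not something confirmed in this thread.

```python
# PostgreSQL lock modes and the modes each one conflicts with
# (from the PostgreSQL "Explicit Locking" documentation).
ALL_MODES = [
    "AccessShareLock", "RowShareLock", "RowExclusiveLock",
    "ShareUpdateExclusiveLock", "ShareLock", "ShareRowExclusiveLock",
    "ExclusiveLock", "AccessExclusiveLock",
]

CONFLICTS = {
    "AccessShareLock": {"AccessExclusiveLock"},
    "RowShareLock": {"ExclusiveLock", "AccessExclusiveLock"},
    "RowExclusiveLock": {"ShareLock", "ShareRowExclusiveLock",
                         "ExclusiveLock", "AccessExclusiveLock"},
    "ShareUpdateExclusiveLock": {"ShareUpdateExclusiveLock", "ShareLock",
                                 "ShareRowExclusiveLock", "ExclusiveLock",
                                 "AccessExclusiveLock"},
    "ShareLock": {"RowExclusiveLock", "ShareUpdateExclusiveLock",
                  "ShareRowExclusiveLock", "ExclusiveLock",
                  "AccessExclusiveLock"},
    "ShareRowExclusiveLock": {"RowExclusiveLock", "ShareUpdateExclusiveLock",
                              "ShareLock", "ShareRowExclusiveLock",
                              "ExclusiveLock", "AccessExclusiveLock"},
    "ExclusiveLock": set(ALL_MODES) - {"AccessShareLock"},
    "AccessExclusiveLock": set(ALL_MODES),
}


def conflicts(held: str, requested: str) -> bool:
    """Hypothetical helper: True if `requested` must wait while `held` is held."""
    return requested in CONFLICTS[held]


# ExclusiveLock (the mode the comment says is taken on the shard ID) blocks
# RowExclusiveLock, the mode ordinary writes take on a table in PostgreSQL,
# but it does not block AccessShareLock, the mode plain reads take.
print(conflicts("ExclusiveLock", "RowExclusiveLock"))
print(conflicts("ExclusiveLock", "AccessShareLock"))
```

Under these rules an ExclusiveLock holder stalls writers but not plain readers, which matches the description of the shard ID lock blocking "any kind of write".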

The fact that there was a create_distributed_table involved is suspicious. As far as I can tell, we do not lock the shard ID of the reference table when we create a foreign key to a reference table. That's not usually a problem because the (access exclusive) table locks on the coordinator will block any kind of read or write, but not replication... What could end up happening is that master_add_node / reference table replication goes through, then the new worker opens a connection to an existing worker to read the contents of the reference table shard, but that blocks because another session is creating a foreign key to the reference table.

Again, this needs to be looked at more carefully, but this might be a somewhat critical bug. The connection that fetches a shard from another node does not do assign_distributed_transaction_id, which means distributed deadlock detection does not kick in and stuff gets stuck.
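The hypothesis above can be sketched as a wait-for graph in which one waiter is invisible to the detector. This is only an illustrative model, not Citus code: the session names, the `find_cycle` helper, and in particular the third edge (the foreign-key session waiting on something master_add_node holds) are assumptions added to show how a missing distributed transaction ID could hide a cycle.

```python
def find_cycle(edges):
    """Return one cycle in a wait-for graph as a list of nodes, or None.

    `edges` is an iterable of (waiter, holder) pairs.
    """
    graph = {}
    for waiter, holder in edges:
        graph.setdefault(waiter, []).append(holder)

    visiting, done = [], set()

    def visit(node):
        if node in done:
            return None
        if node in visiting:
            # Reached a node already on the current path: that is a cycle.
            return visiting[visiting.index(node):]
        visiting.append(node)
        for nxt in graph.get(node, []):
            cycle = visit(nxt)
            if cycle:
                return cycle
        visiting.pop()
        done.add(node)
        return None

    for node in list(graph):
        cycle = visit(node)
        if cycle:
            return cycle
    return None


# (waiter, holder, waiter_has_distributed_transaction_id)
WAITS = [
    ("master_add_node", "shard_fetch_connection", True),
    # The fetch connection never calls assign_distributed_transaction_id.
    ("shard_fetch_connection", "create_distributed_table", False),
    # Hypothetical edge, assumed here only to close the cycle.
    ("create_distributed_table", "master_add_node", True),
]

all_edges = [(w, h) for w, h, _ in WAITS]
visible_edges = [(w, h) for w, h, has_id in WAITS if has_id]

print(find_cycle(all_edges))      # the real situation: a cycle exists
print(find_cycle(visible_edges))  # what the detector can see: no cycle
```

In this model the full graph contains a cycle, but once the edge from the untagged connection is dropped the remaining edges are acyclic, so a detector that only follows tagged transactions would never cancel anything.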

mtuncer commented 5 years ago

We saw this happening on one of the clusters in Citus Cloud. It used to occur during every cluster reconfiguration; however, we could not reproduce it manually by triggering a reconfigure.

seaurching commented 2 years ago

I get the same error when adding a node. This is a new cluster without any data. DB version 14.2. [Screenshot: WeChat screenshot 20220217145112]

When CREATE EXTENSION citus is executed on the postgres database, SELECT * from citus_add_node('10.90.196.187', 5432); returns success.