citusdata / citus

Distributed PostgreSQL as an extension
https://www.citusdata.com
GNU Affero General Public License v3.0
10.5k stars 666 forks source link

The citus cluster has collapsed #7667

Open YMikheev opened 2 months ago

YMikheev commented 2 months ago

For an unknown reason, the cluster stopped working. His condition is: Coordinator: nodeid groupid nodename hasmetadata isactive metadatasynced shouldhaveshards 54 11 worker21 TRUE TRUE FALSE TRUE 51 12 worker23 TRUE TRUE TRUE TRUE

pg_dist_node is locked by ExclusiveLock with process without pid (because of that it is not possible to do anything with metadata: meta sync for example ) It is advisory lock - I do not now how to kill that lock ?

Worker21: nodeid groupid nodename isactive metadatasynced shouldhaveshards 54 11 worker21 TRUE TRUE TRUE 51 12 worker22 TRUE TRUE TRUE

the worker22 is not active (existing) worker. postgres log: WARNING: connection to the remote node worker22:**** failed with the following error: server closed the connection unexpectedly

Worker23: nodeid groupid nodename isactive metadatasynced shouldhaveshards 54 11 worker21 TRUE FALSE TRUE 51 12 worker23 TRUE TRUE TRUE

It is possible build the cluster?

emelsimsek commented 2 months ago

Can you see the locks following https://wiki.postgresql.org/wiki/Lock_Monitoring instructions?

YMikheev commented 2 months ago

if you try to synchronize metadata: SELECT start_metadata_sync_to_all_nodes();

select relation::regclass, * from pg_locks where not granted;

relation locktype database relation-2 page tuple virtualxid transactionid classid objid objsubid virtualtransaction pid mode granted fastpath waitstart pg_dist_node relation 16385 16546 NULL NULL NULL NULL NULL NULL NULL 9/4571 27664 RowShareLock FALSE FALSE 2024-08-06 16:54:35.652722+00

because of advisory lock is granted:

Screenshot 2024-08-06 at 19 51 16
emelsimsek commented 2 months ago

You can try to use this query https://wiki.postgresql.org/wiki/Lock_Monitoring#:~:text=information%20in%20pg_stat_activity-,%D0%A1ombination%20of%20blocked%20and%20blocking%20activity,-The%20following%20query

to find the blocking pid.

YMikheev commented 1 month ago

I killed the prepared transaction and now see: coordinator: the process 9420 "Citus Metadata Sync Daemon" is blocking pg_dist_node workers 0 and 3 (they have metadatasynced = false), the process has started: Picture1

after some days nothing happens: pg_dist_node is locked (on coordinator), worker 0 and 3 wait event "SyncRep"

YMikheev commented 1 month ago

I fixed worker 0 and 3.

YMikheev commented 1 month ago

there is still a problem with the coordinator. the process "Citus Metadata Sync Daemon" is blocking pg_dist_node pg_dist_node has metadatasynced = false

Screenshot 2024-08-27 at 21 48 55

when I try to kill a process "Citus Metadata Sync Daemon", it restarts and continues to block pg_dist_node.

It is possible to fix this problem?

YMikheev commented 1 month ago

Hello @emelsimsek.

Can you help me? It is very critical for us.

c2main commented 1 month ago

the worker22 is not active (existing) worker. postgres log: WARNING: connection to the remote node worker22:**** failed with the following error: server closed the connection unexpectedly

You probably want to fix that so citus can proceed...

YMikheev commented 1 month ago

Yes. Now we have one problem: the "Citus Metadata Sync Daemon" is blocking pg_dist_node.