basho / riak_core

Distributed systems infrastructure used by Riak.
Apache License 2.0

Joining a node after committing a plan - transfers freeze & cluster state is stuck #996

Open martinsumner opened 1 year ago

martinsumner commented 1 year ago

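To replicate: commit a plan that joins dev2, dev3 and dev4 to dev1, then stage joins for dev5 and dev6 while the resulting transfers are still running (commands and output below).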

The transfers stop at the point the additional nodes join, and the cluster ends up stuck in that state:

$ dev/dev4/riak/bin/riak admin cluster plan
=============================== Staged Changes ================================
Action         Details(s)
-------------------------------------------------------------------------------
join           'dev2@127.0.0.1'
join           'dev3@127.0.0.1'
join           'dev4@127.0.0.1'
-------------------------------------------------------------------------------

NOTE: Applying these changes will result in 1 cluster transition

###############################################################################
                         After cluster transition 1/1
###############################################################################

================================= Membership ==================================
Status     Ring    Pending    Node
-------------------------------------------------------------------------------
valid     100.0%     25.0%    dev1@127.0.0.1
valid       0.0%     25.0%    dev2@127.0.0.1
valid       0.0%     25.0%    dev3@127.0.0.1
valid       0.0%     25.0%    dev4@127.0.0.1
-------------------------------------------------------------------------------
Valid:4 / Leaving:0 / Exiting:0 / Joining:0 / Down:0

Transfers resulting from cluster changes: 48
  16 transfers from 'dev1@127.0.0.1' to 'dev4@127.0.0.1'
  16 transfers from 'dev1@127.0.0.1' to 'dev3@127.0.0.1'
  16 transfers from 'dev1@127.0.0.1' to 'dev2@127.0.0.1'

$ dev/dev4/riak/bin/riak admin cluster commit
Cluster changes committed
$ dev/dev4/riak/bin/riak admin cluster status
---- Cluster Status ----
Ring ready: true

+--------------------+------+-------+-----+-------+
|        node        |status| avail |ring |pending|
+--------------------+------+-------+-----+-------+
| (C) dev1@127.0.0.1 |valid |  up   |100.0|  25.0 |
|     dev2@127.0.0.1 |valid |  up   |  0.0|  25.0 |
|     dev3@127.0.0.1 |valid |  up   |  0.0|  25.0 |
|     dev4@127.0.0.1 |valid |  up   |  0.0|  25.0 |
+--------------------+------+-------+-----+-------+

Key: (C) = Claimant; availability marked with '!' is unexpected
$ dev/dev5/riak/bin/riak admin cluster join dev1@127.0.0.1
Success: staged join request for 'dev5@127.0.0.1' to 'dev1@127.0.0.1'
$ dev/dev6/riak/bin/riak admin cluster join dev1@127.0.0.1
Success: staged join request for 'dev6@127.0.0.1' to 'dev1@127.0.0.1'
$ dev/dev4/riak/bin/riak admin cluster status
---- Cluster Status ----
Ring ready: true

+--------------------+-------+-------+-----+-------+
|        node        |status | avail |ring |pending|
+--------------------+-------+-------+-----+-------+
|     dev5@127.0.0.1 |joining|  up   |  0.0|   0.0 |
|     dev6@127.0.0.1 |joining|  up   |  0.0|   0.0 |
| (C) dev1@127.0.0.1 | valid |  up   | 71.9|  25.0 |
|     dev2@127.0.0.1 | valid |  up   |  9.4|  25.0 |
|     dev3@127.0.0.1 | valid |  up   | 10.9|  25.0 |
|     dev4@127.0.0.1 | valid |  up   |  7.8|  25.0 |
+--------------------+-------+-------+-----+-------+

Key: (C) = Claimant; availability marked with '!' is unexpected
$ dev/dev4/riak/bin/riak admin cluster status
---- Cluster Status ----
Ring ready: false

+--------------------+-------+-------+-----+-------+
|        node        |status | avail |ring |pending|
+--------------------+-------+-------+-----+-------+
|     dev5@127.0.0.1 |joining|  up   |  0.0|   0.0 |
|     dev6@127.0.0.1 |joining|  up   |  0.0|   0.0 |
| (C) dev1@127.0.0.1 | valid |  up   | 71.9|  25.0 |
|     dev2@127.0.0.1 | valid |  up   |  9.4|  25.0 |
|     dev3@127.0.0.1 | valid |  up   | 10.9|  25.0 |
|     dev4@127.0.0.1 | valid |  up   |  7.8|  25.0 |
+--------------------+-------+-------+-----+-------+

Key: (C) = Claimant; availability marked with '!' is unexpected

$ dev/dev4/riak/bin/riak admin cluster status
---- Cluster Status ----
Ring ready: true

+--------------------+-------+-------+-----+-------+
|        node        |status | avail |ring |pending|
+--------------------+-------+-------+-----+-------+
|     dev5@127.0.0.1 |joining|  up   |  0.0|   0.0 |
|     dev6@127.0.0.1 |joining|  up   |  0.0|   0.0 |
| (C) dev1@127.0.0.1 | valid |  up   | 71.9|  25.0 |
|     dev2@127.0.0.1 | valid |  up   |  9.4|  25.0 |
|     dev3@127.0.0.1 | valid |  up   | 10.9|  25.0 |
|     dev4@127.0.0.1 | valid |  up   |  7.8|  25.0 |
+--------------------+-------+-------+-----+-------+

Key: (C) = Claimant; availability marked with '!' is unexpected
$ dev/dev4/riak/bin/riak admin cluster status
---- Cluster Status ----
Ring ready: true

+--------------------+-------+-------+-----+-------+
|        node        |status | avail |ring |pending|
+--------------------+-------+-------+-----+-------+
|     dev5@127.0.0.1 |joining|  up   |  0.0|   0.0 |
|     dev6@127.0.0.1 |joining|  up   |  0.0|   0.0 |
| (C) dev1@127.0.0.1 | valid |  up   | 71.9|  25.0 |
|     dev2@127.0.0.1 | valid |  up   |  9.4|  25.0 |
|     dev3@127.0.0.1 | valid |  up   | 10.9|  25.0 |
|     dev4@127.0.0.1 | valid |  up   |  7.8|  25.0 |
+--------------------+-------+-------+-----+-------+

Key: (C) = Claimant; availability marked with '!' is unexpected
$ dev/dev4/riak/bin/riak admin transfers
'dev6@127.0.0.1' waiting to handoff 30 partitions
'dev5@127.0.0.1' waiting to handoff 30 partitions
'dev4@127.0.0.1' waiting to handoff 27 partitions
'dev3@127.0.0.1' waiting to handoff 30 partitions
'dev2@127.0.0.1' waiting to handoff 11 partitions

Active Transfers:

$ dev/dev4/riak/bin/riak admin transfers
'dev6@127.0.0.1' waiting to handoff 30 partitions
'dev5@127.0.0.1' waiting to handoff 30 partitions
'dev4@127.0.0.1' waiting to handoff 27 partitions
'dev3@127.0.0.1' waiting to handoff 30 partitions
'dev2@127.0.0.1' waiting to handoff 11 partitions

Active Transfers:
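
The two identical `riak admin transfers` runs above are what "stuck" looks like here: partitions are waiting, the counts do not move, and nothing is listed under `Active Transfers:`. A minimal sketch of checking that automatically follows. It assumes the output format is exactly as pasted in this issue. The helper names (`parse_waiting`, `has_active_transfers`, `looks_stalled`) are hypothetical and not part of riak or riak_core, and the check is only a heuristic, since two samples taken close together can match on a healthy cluster too.

```python
import re

# Matches lines such as: 'dev6@127.0.0.1' waiting to handoff 30 partitions
# (format assumed from the output pasted in this issue)
WAITING_RE = re.compile(
    r"^'(?P<node>[^']+)' waiting to handoff (?P<count>\d+) partitions$"
)


def parse_waiting(output: str) -> dict[str, int]:
    """Map node name -> partitions waiting to hand off, for one transfers run."""
    waiting = {}
    for line in output.splitlines():
        match = WAITING_RE.match(line.strip())
        if match:
            waiting[match.group("node")] = int(match.group("count"))
    return waiting


def has_active_transfers(output: str) -> bool:
    """True if any non-blank line follows the 'Active Transfers:' header."""
    _, _, tail = output.partition("Active Transfers:")
    return any(line.strip() for line in tail.splitlines())


def looks_stalled(first: str, second: str) -> bool:
    """Heuristic: partitions are waiting, no count changed, nothing is active."""
    before, after = parse_waiting(first), parse_waiting(second)
    return bool(after) and before == after and not has_active_transfers(second)


# Snapshot copied from the output above; both runs in the issue are identical.
SNAPSHOT = """\
'dev6@127.0.0.1' waiting to handoff 30 partitions
'dev5@127.0.0.1' waiting to handoff 30 partitions
'dev4@127.0.0.1' waiting to handoff 27 partitions
'dev3@127.0.0.1' waiting to handoff 30 partitions
'dev2@127.0.0.1' waiting to handoff 11 partitions

Active Transfers:
"""

print(looks_stalled(SNAPSHOT, SNAPSHOT))  # prints True
```

In practice the two snapshots would come from running the command twice with a delay in between, long enough for at least one handoff to normally complete.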