basho / riak

Riak is a decentralized datastore from Basho Technologies.
http://docs.basho.com
Apache License 2.0

Infinite waiting handoff #1006

Open sAws opened 4 years ago

sAws commented 4 years ago

Hello.

server_[1:3] - riak 2.9.0p5
server_[4:6] - riak 2.1.1
# riak-admin transfers
'riak@server_1' waiting to handoff 3 partitions
'riak@server_2' waiting to handoff 3 partitions
'riak@server_3' waiting to handoff 3 partitions
'riak@server_4' waiting to handoff 19 partitions
'riak@server_5' waiting to handoff 20 partitions
'riak@server_6' waiting to handoff 22 partitions

Active Transfers:
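
To tell whether the handoffs are really stuck or just slow, it can help to snapshot this output periodically and compare the totals. The following is a minimal sketch, assuming the line layout shown above; `pending_handoffs` and `snapshot` are hypothetical names used only for illustration, not part of Riak.

```python
import re

# Line layout copied from the `riak-admin transfers` output quoted above.
WAITING = re.compile(
    r"^'(?P<node>[^']+)' waiting to handoff (?P<count>\d+) partitions?$"
)


def pending_handoffs(output):
    """Return {node: partitions waiting to hand off} parsed from the output."""
    pending = {}
    for line in output.splitlines():
        m = WAITING.match(line.strip())
        if m:
            pending[m.group("node")] = int(m.group("count"))
    return pending


# The snapshot pasted in this issue.
snapshot = """\
'riak@server_1' waiting to handoff 3 partitions
'riak@server_2' waiting to handoff 3 partitions
'riak@server_3' waiting to handoff 3 partitions
'riak@server_4' waiting to handoff 19 partitions
'riak@server_5' waiting to handoff 20 partitions
'riak@server_6' waiting to handoff 22 partitions

Active Transfers:
"""

pending = pending_handoffs(snapshot)
total = sum(pending.values())
print(total)
```

If the total stays the same across several snapshots while Active Transfers remains empty, the handoffs are not progressing.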

I ran riak-admin repair-2i on one server and see this in the log:

2020-02-13 17:17:43.893 [error] <0.14955.87>@riak_kv_2i_aae:repair_partition:297 Failed to acquire hashtree lock on partition 1027618338748291114361965898003636498195577569280
2020-02-13 17:17:43.893 [error] <0.14974.87>@riak_kv_2i_aae:repair_partition:297 Failed to acquire hashtree lock on partition 890602560248518965780370444936484965102833893376
2020-02-13 17:17:43.893 [error] <0.14867.87>@riak_kv_2i_aae:repair_partition:297 Failed to acquire hashtree lock on partition 1233142006497949337234359077604363797834693083136
2020-02-13 17:17:43.894 [error] <0.14912.87>@riak_kv_2i_aae:repair_partition:297 Failed to acquire hashtree lock on partition 1370157784997721485815954530671515330927436759040
2020-02-13 17:17:43.894 [error] <0.14824.87>@riak_kv_2i_aae:repair_partition:297 Failed to acquire hashtree lock on partition 685078892498860742907977265335757665463718379520
2020-02-13 17:17:43.894 [error] <0.14949.87>@riak_kv_2i_aae:repair_partition:297 Failed to acquire hashtree lock on partition 822094670998632891489572718402909198556462055424
2020-02-13 17:17:43.895 [error] <0.14915.87>@riak_kv_2i_aae:repair_partition:297 Failed to acquire hashtree lock on partition 1438665674247607560106752257205091097473808596992
2020-02-13 17:17:43.895 [error] <0.14950.87>@riak_kv_2i_aae:repair_partition:297 Failed to acquire hashtree lock on partition 753586781748746817198774991869333432010090217472
2020-02-13 17:17:43.895 [error] <0.14918.87>@riak_kv_2i_aae:repair_partition:297 Failed to acquire hashtree lock on partition 342539446249430371453988632667878832731859189760
2020-02-13 17:17:43.896 [error] <0.14953.87>@riak_kv_2i_aae:repair_partition:297 Failed to acquire hashtree lock on partition 411047335499316445744786359201454599278231027712
2020-02-13 17:17:43.896 [error] <0.13682.87>@riak_kv_2i_aae:repair_partition:297 Failed to acquire hashtree lock on partition 137015778499772148581595453067151533092743675904
2020-02-13 17:17:43.896 [error] <0.14947.87>@riak_kv_2i_aae:repair_partition:297 Failed to acquire hashtree lock on partition 274031556999544297163190906134303066185487351808
2020-02-13 17:17:43.897 [error] <0.14879.87>@riak_kv_2i_aae:repair_partition:297 Failed to acquire hashtree lock on partition 1096126227998177188652763624537212264741949407232
2020-02-13 17:17:43.897 [error] <0.14945.87>@riak_kv_2i_aae:repair_partition:297 Failed to acquire hashtree lock on partition 0
2020-02-13 17:17:43.898 [error] <0.14924.87>@riak_kv_2i_aae:repair_partition:297 Failed to acquire hashtree lock on partition 959110449498405040071168171470060731649205731328
2020-02-13 17:20:28.861 [error] <0.18978.87>@riak_kv_2i_aae:repair_partition:297 Failed to acquire hashtree lock on partition 205523667749658222872393179600727299639115513856
2020-02-13 17:20:28.861 [error] <0.17227.87>@riak_kv_2i_aae:repair_partition:297 Failed to acquire hashtree lock on partition 1301649895747835411525156804137939564381064921088
2020-02-13 17:20:28.861 [error] <0.17884.87>@riak_kv_2i_aae:repair_partition:297 Failed to acquire hashtree lock on partition 479555224749202520035584085735030365824602865664
2020-02-13 17:20:28.862 [error] <0.17780.87>@riak_kv_2i_aae:repair_partition:297 Failed to acquire hashtree lock on partition 548063113999088594326381812268606132370974703616
2020-02-13 17:20:28.862 [error] <0.18994.87>@riak_kv_2i_aae:repair_partition:297 Failed to acquire hashtree lock on partition 1164634117248063262943561351070788031288321245184
2020-02-13 17:20:28.862 [error] <0.18913.87>@riak_kv_2i_aae:repair_partition:297 Failed to acquire hashtree lock on partition 616571003248974668617179538802181898917346541568

What can be done?

martinsumner commented 4 years ago

The combination of a cluster-wide 2i repair with pending handoffs in a mixed-version cluster is not tested. 2i repair is only tested with a common version, with the cluster in a stable state (and only with the eleveldb backend - it doesn't work with other 2i-supporting backends).

I think it was release 2.2 that introduced a version uplift to the AAE trees. So it might be that it is failing to lock based on a version mismatch. There were different scenarios tested for this uplift, but cluster-wide 2i repair wasn't one of them.

There's unlikely to be a quick and simple answer to "what can be done?". You may just have a lot of trees locked for rebuilds, and as the rebuilds complete you will be free to run 2i repair again. But the problem might be more involved, and the only way to be sure that 2i repair will behave as expected would be to run it in its tested state.

Sorry that this is a bit of an unhelpful answer. Perhaps someone else might have the time to dig deeper and give you a better answer.
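
One way to check whether the locks are clearing as rebuilds complete is to compare which partitions fail to lock on each repair attempt. The following is a minimal sketch, assuming the log line layout quoted in this issue; grouping lines by the second they were logged is only a heuristic for "one repair attempt", and `lock_failures_by_attempt` is a hypothetical helper, not part of Riak.

```python
import re
from collections import defaultdict

# Matches the error lines quoted in this issue; the layout is taken from them.
LOCK_FAILURE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\.\d+ \[error\] .*"
    r"Failed to acquire hashtree lock on partition (?P<partition>\d+)$"
)


def lock_failures_by_attempt(lines):
    """Group partitions that failed to lock by the second they were logged."""
    attempts = defaultdict(set)
    for line in lines:
        m = LOCK_FAILURE.match(line.strip())
        if m:
            attempts[m.group("ts")].add(int(m.group("partition")))
    return dict(attempts)


# Three lines copied from the log above.
sample = [
    "2020-02-13 17:17:43.897 [error] <0.14945.87>@riak_kv_2i_aae:repair_partition:297 "
    "Failed to acquire hashtree lock on partition 0",
    "2020-02-13 17:17:43.898 [error] <0.14924.87>@riak_kv_2i_aae:repair_partition:297 "
    "Failed to acquire hashtree lock on partition "
    "959110449498405040071168171470060731649205731328",
    "2020-02-13 17:20:28.861 [error] <0.18978.87>@riak_kv_2i_aae:repair_partition:297 "
    "Failed to acquire hashtree lock on partition "
    "205523667749658222872393179600727299639115513856",
]

attempts = lock_failures_by_attempt(sample)
for ts in sorted(attempts):
    print(ts, len(attempts[ts]), "partitions failed to lock")
```

If the set of failing partitions shrinks between attempts, the rebuilds are likely releasing their locks; if it stays the same, the problem is probably the more involved case.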

sAws commented 4 years ago

Thanks! That prompted me to try riak-admin down. I will post the result tomorrow.

sAws commented 4 years ago

riak-admin down didn't help. But now I see this error:

Partition: 662242929415565384811044689824565743281594433536
Error: {no_aae_pid,undefined_aae_pid}