basho / riak_core

Distributed systems infrastructure used by Riak.
Apache License 2.0

Add Rack awareness support #967

Closed systream closed 3 years ago

systream commented 3 years ago

In this PR, location (you could also call it site or availability zone) has been introduced. When claiming a new ring, the list of nodes is ordered taking the location of the individual nodes into consideration, so that adjacent nodes are preferably from different locations.
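
For illustration only, here is a minimal sketch of the interleaving idea (not the code in this PR; the module, function names, and data shapes are hypothetical): group the candidate nodes by location and pick round-robin across the groups, so adjacent positions in the claim order tend to come from different locations.

%% Sketch only -- not the claim code in this PR. Orders candidate nodes so
%% that adjacent positions tend to come from different locations, by taking
%% one node from each location group per pass.
-module(location_order_sketch).
-export([order_by_location/1]).

%% NodeLocations :: [{node(), Location :: atom()}]
order_by_location(NodeLocations) ->
    Grouped = lists:foldl(
                fun({Node, Loc}, Acc) ->
                        maps:update_with(Loc, fun(Ns) -> [Node | Ns] end, [Node], Acc)
                end, #{}, NodeLocations),
    Groups = [lists:reverse(Ns) || Ns <- maps:values(Grouped)],
    round_robin(Groups, []).

%% Take the head of every group on each pass, then recurse on the non-empty tails.
round_robin([], Passes) ->
    lists:append(lists:reverse(Passes));
round_robin(Groups, Passes) ->
    Heads = [hd(G) || G <- Groups],
    Tails = [tl(G) || G <- Groups, tl(G) =/= []],
    round_robin(Tails, [Heads | Passes]).

When the locations are unbalanced, the later passes have fewer distinct locations left, which is where the tail/wrap-around cases discussed below come from.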

martinsumner commented 3 years ago

Thank you for this. I've taken a quick look, and I want to make sure I understand this correctly.

  1. The aim is data diversity across locations to improve data protection (e.g. try to make sure data is in multiple Availability Zones).
  2. This doesn't offer any guarantee that data for any particular preflist is stored in diverse locations, it just tries to bias the feed into the algorithm to make it more likely that data is stored in diverse locations.

Is my understanding correct?

systream commented 3 years ago

1. Yes, exactly. 2. You are right: there are circumstances where it cannot guarantee location diversity, much like the non-optimal node count/ring size combinations where tail violations can happen.

It should be used responsibly, like the other options.

I tested it across different ring size/node count/location combinations and it works quite well, but, for example, it won't optimise transfers between the old and new ring.

The aim was to not modify the original claiming algorithm until one or more locations have been set.

martinsumner commented 3 years ago

The challenge is how much value there is without any guarantee. Perhaps it would be nice if there were a CLI command that reported on how location-safe the cluster plan is.
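
As a rough idea of what such a report could check (a sketch under assumed data shapes, not an existing riak_core API): walk each window of n_val consecutive partitions in ring order, look up the owners' locations, and flag the windows with too few distinct locations.

%% Sketch of a location-safety check; hypothetical helper, not an existing
%% riak_core API. Owners is the ring's {Partition, Node} list in ring order,
%% Locations maps node() -> location atom.
-module(location_report_sketch).
-export([unsafe_preflists/4]).

unsafe_preflists(Owners, Locations, NVal, MinDistinctLocations) ->
    Nodes = [N || {_Partition, N} <- Owners],
    %% Preference lists wrap around the ring, so extend the node list to
    %% cover the windows that cross position 0.
    Extended = Nodes ++ lists:sublist(Nodes, NVal - 1),
    Windows = [lists:sublist(Extended, I, NVal) || I <- lists:seq(1, length(Nodes))],
    [W || W <- Windows,
          length(lists:usort([maps:get(N, Locations) || N <- W])) < MinDistinctLocations].

An empty result would mean every preference list spans at least MinDistinctLocations locations; the riak_core_location:check_ring/3 call used later in this thread plays a similar role.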

Previously, some consideration had been given to resurrecting claim v3 to implement things like location awareness. The claim v3 algorithm was a radical change that treated claim as an optimisation problem. With claim v3, if you weren't happy with the optimised plan, you could reject it and roll the dice again, giving some control to the operator.

Location awareness is on the to-do list for riak-core-lite; perhaps that team has some thoughts on this.

I will add a link to the PR in the Slack channel as well, to try to canvass some broader feedback.

systream commented 3 years ago

Cool, thanks.

There is something I'd like to clarify. It is true that if you pick the wrong node count or set unbalanced locations across the nodes, diversity cannot be guaranteed. (I think this is normal; a warning message is a great idea.)

Example:

---- Cluster Status ----
Ring ready: true

+--------------------+------+-------+-----+-------+--------+
|        node        |status| avail |ring |pending|location|
+--------------------+------+-------+-----+-------+--------+
|     dev1@127.0.0.1 |valid |  up   | 31.3|  --   | loc_A  |
| (C) dev2@127.0.0.1 |valid |  up   | 37.5|  --   | loc_B  |
|     dev3@127.0.0.1 |valid |  up   | 31.3|  --   | loc_C  |
+--------------------+------+-------+-----+-------+--------+

Key: (C) = Claimant; availability marked with '!' is unexpected

The first and last partitions are on the same node (and location); because preference lists wrap around the end of the ring, this could be a problem, and either way it is a problem (see the note after the ring print below).

(dev1@127.0.0.1)8> riak_core_location:print_ring_with_location().
'dev2@127.0.0.1' ("loc_B")       0
'dev3@127.0.0.1' ("loc_C")       91343852333181432387730302044767688728495783936
'dev1@127.0.0.1' ("loc_A")       182687704666362864775460604089535377456991567872
'dev2@127.0.0.1' ("loc_B")       274031556999544297163190906134303066185487351808
'dev3@127.0.0.1' ("loc_C")       365375409332725729550921208179070754913983135744
'dev1@127.0.0.1' ("loc_A")       456719261665907161938651510223838443642478919680
'dev2@127.0.0.1' ("loc_B")       548063113999088594326381812268606132370974703616
'dev3@127.0.0.1' ("loc_C")       639406966332270026714112114313373821099470487552
'dev1@127.0.0.1' ("loc_A")       730750818665451459101842416358141509827966271488
'dev2@127.0.0.1' ("loc_B")       822094670998632891489572718402909198556462055424
'dev3@127.0.0.1' ("loc_C")       913438523331814323877303020447676887284957839360
'dev1@127.0.0.1' ("loc_A")       1004782375664995756265033322492444576013453623296
'dev2@127.0.0.1' ("loc_B")       1096126227998177188652763624537212264741949407232
'dev3@127.0.0.1' ("loc_C")       1187470080331358621040493926581979953470445191168
'dev1@127.0.0.1' ("loc_A")       1278813932664540053428224228626747642198940975104
'dev2@127.0.0.1' ("loc_B")       1370157784997721485815954530671515330927436759040
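
A quick illustration of why the wrap-around bites in this 3-node, 16-partition example (my own note, not from the PR): preference lists wrap past the end of the ring, and because 16 is not a multiple of 3, a strict 3-way rotation must give the ring-adjacent positions 0 and 15 the same rotation slot, hence the same node and location.

%% Positions 0 and 15 are adjacent once the ring wraps; under a strict
%% 3-way rotation they map to the same slot, so the same node/location.
1> {0 rem 3, 15 rem 3}.
{0,0}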

If you have a proper node count/location setup, it will work:

Example:

---- Cluster Status ----
Ring ready: true

+--------------------+------+-------+-----+-------+--------+
|        node        |status| avail |ring |pending|location|
+--------------------+------+-------+-----+-------+--------+
|     dev1@127.0.0.1 |valid |  up   | 25.0|  --   | loc_A  |
| (C) dev2@127.0.0.1 |valid |  up   | 25.0|  --   | loc_B  |
|     dev3@127.0.0.1 |valid |  up   | 25.0|  --   | loc_C  |
|     dev4@127.0.0.1 |valid |  up   | 25.0|  --   | Loc_D  |
+--------------------+------+-------+-----+-------+--------+

Key: (C) = Claimant; availability marked with '!' is unexpected
(dev1@127.0.0.1)11> riak_core_location:print_ring_with_location().
'dev2@127.0.0.1' ("loc_B")       0
'dev3@127.0.0.1' ("loc_C")       91343852333181432387730302044767688728495783936
'dev4@127.0.0.1' ("Loc_D")       182687704666362864775460604089535377456991567872
'dev1@127.0.0.1' ("loc_A")       274031556999544297163190906134303066185487351808
'dev2@127.0.0.1' ("loc_B")       365375409332725729550921208179070754913983135744
'dev3@127.0.0.1' ("loc_C")       456719261665907161938651510223838443642478919680
'dev4@127.0.0.1' ("Loc_D")       548063113999088594326381812268606132370974703616
'dev1@127.0.0.1' ("loc_A")       639406966332270026714112114313373821099470487552
'dev2@127.0.0.1' ("loc_B")       730750818665451459101842416358141509827966271488
'dev3@127.0.0.1' ("loc_C")       822094670998632891489572718402909198556462055424
'dev4@127.0.0.1' ("Loc_D")       913438523331814323877303020447676887284957839360
'dev1@127.0.0.1' ("loc_A")       1004782375664995756265033322492444576013453623296
'dev2@127.0.0.1' ("loc_B")       1096126227998177188652763624537212264741949407232
'dev3@127.0.0.1' ("loc_C")       1187470080331358621040493926581979953470445191168
'dev4@127.0.0.1' ("Loc_D")       1278813932664540053428224228626747642198940975104
'dev1@127.0.0.1' ("loc_A")       1370157784997721485815954530671515330927436759040
---- Cluster Status ----
Ring ready: true

+--------------------+------+-------+-----+-------+--------+
|        node        |status| avail |ring |pending|location|
+--------------------+------+-------+-----+-------+--------+
|     dev1@127.0.0.1 |valid |  up   | 25.0|  --   | Loc_A  |
| (C) dev2@127.0.0.1 |valid |  up   | 25.0|  --   | Loc_A  |
|     dev3@127.0.0.1 |valid |  up   | 25.0|  --   | Loc_B  |
|     dev4@127.0.0.1 |valid |  up   | 25.0|  --   | Loc_B  |
+--------------------+------+-------+-----+-------+--------+

Key: (C) = Claimant; availability marked with '!' is unexpected
(dev1@127.0.0.1)13> riak_core_location:print_ring_with_location().
'dev3@127.0.0.1' ("Loc_B")       0
'dev1@127.0.0.1' ("Loc_A")       91343852333181432387730302044767688728495783936
'dev4@127.0.0.1' ("Loc_B")       182687704666362864775460604089535377456991567872
'dev2@127.0.0.1' ("Loc_A")       274031556999544297163190906134303066185487351808
'dev3@127.0.0.1' ("Loc_B")       365375409332725729550921208179070754913983135744
'dev1@127.0.0.1' ("Loc_A")       456719261665907161938651510223838443642478919680
'dev4@127.0.0.1' ("Loc_B")       548063113999088594326381812268606132370974703616
'dev2@127.0.0.1' ("Loc_A")       639406966332270026714112114313373821099470487552
'dev3@127.0.0.1' ("Loc_B")       730750818665451459101842416358141509827966271488
'dev1@127.0.0.1' ("Loc_A")       822094670998632891489572718402909198556462055424
'dev4@127.0.0.1' ("Loc_B")       913438523331814323877303020447676887284957839360
'dev2@127.0.0.1' ("Loc_A")       1004782375664995756265033322492444576013453623296
'dev3@127.0.0.1' ("Loc_B")       1096126227998177188652763624537212264741949407232
'dev1@127.0.0.1' ("Loc_A")       1187470080331358621040493926581979953470445191168
'dev4@127.0.0.1' ("Loc_B")       1278813932664540053428224228626747642198940975104
'dev2@127.0.0.1' ("Loc_A")       1370157784997721485815954530671515330927436759040
---- Cluster Status ----
Ring ready: true

+--------------------+------+-------+-----+-------+--------+
|        node        |status| avail |ring |pending|location|
+--------------------+------+-------+-----+-------+--------+
|     dev1@127.0.0.1 |valid |  up   | 25.0|  --   | Loc_A  |
| (C) dev2@127.0.0.1 |valid |  up   | 25.0|  --   | Loc_B  |
|     dev3@127.0.0.1 |valid |  up   | 25.0|  --   | Loc_A  |
|     dev4@127.0.0.1 |valid |  up   | 25.0|  --   | Loc_B  |
+--------------------+------+-------+-----+-------+--------+

Key: (C) = Claimant; availability marked with '!' is unexpected
(dev1@127.0.0.1)14> riak_core_location:print_ring_with_location().
'dev2@127.0.0.1' ("Loc_B")       0
'dev1@127.0.0.1' ("Loc_A")       91343852333181432387730302044767688728495783936
'dev4@127.0.0.1' ("Loc_B")       182687704666362864775460604089535377456991567872
'dev3@127.0.0.1' ("Loc_A")       274031556999544297163190906134303066185487351808
'dev2@127.0.0.1' ("Loc_B")       365375409332725729550921208179070754913983135744
'dev1@127.0.0.1' ("Loc_A")       456719261665907161938651510223838443642478919680
'dev4@127.0.0.1' ("Loc_B")       548063113999088594326381812268606132370974703616
'dev3@127.0.0.1' ("Loc_A")       639406966332270026714112114313373821099470487552
'dev2@127.0.0.1' ("Loc_B")       730750818665451459101842416358141509827966271488
'dev1@127.0.0.1' ("Loc_A")       822094670998632891489572718402909198556462055424
'dev4@127.0.0.1' ("Loc_B")       913438523331814323877303020447676887284957839360
'dev3@127.0.0.1' ("Loc_A")       1004782375664995756265033322492444576013453623296
'dev2@127.0.0.1' ("Loc_B")       1096126227998177188652763624537212264741949407232
'dev1@127.0.0.1' ("Loc_A")       1187470080331358621040493926581979953470445191168
'dev4@127.0.0.1' ("Loc_B")       1278813932664540053428224228626747642198940975104
'dev3@127.0.0.1' ("Loc_A")       1370157784997721485815954530671515330927436759040
martinsumner commented 3 years ago

I don't know if you're seeing the CI failures, but there are dialyzer issues:

src/riak_core_cluster_cli.erl
  40: Function register_cli/0 has no local return
  44: Function register_all_usage/0 has no local return
  54: Function register_all_commands/0 will never be called
  63: Function status_register/0 will never be called
 150: Function partition_count_register/0 will never be called
 183: Function partitions_register/0 will never be called
 231: Function partition_register/0 will never be called
 290: Function location_register/0 will never be called
===> Warnings written to /home/travis/build/basho/riak_core/_build/default/22.3.dialyzer_warnings
===> Warnings occurred running dialyzer: 8

The next stage is riak_test testing. There is a test group defined for core tests. If you've not used riak_test before, there are some setup instructions.

There hasn't been any positive feedback via Slack etc. with regard to pushing for this change. There is interest in rack awareness with stricter guarantees; however, I think any complete solution would be such a radical change as to be unrealistic compared to this relatively simple enhancement.

There is no firm process for getting something like this approved at the moment. I can see positive aspects of the change, but would want a broader consensus behind it. @martincox do you have an opinion?

systream commented 3 years ago

Sorry, the dialyzer issues are fixed. I can accept that it might not be the best solution. We have strict security policies, so I can't say much about how we use Riak, but to be able to scale without increasing the n_val we need a feature like this. I'd like to help make rack awareness happen. :)

systream commented 3 years ago

I put riak_core in _checkouts and ran the tests:

That's 100.0% for those keeping score
making summary
bucket_props_roundtrip-bitcask: pass
bucket_props_validation-bitcask: pass
bucket_types-bitcask: pass
http_bucket_types-bitcask: pass
cluster_meta_basic-bitcask: pass
cluster_meta_rmr-bitcask: pass
gh_riak_core_154-bitcask: pass
gh_riak_core_155-bitcask: pass
gh_riak_core_176-bitcask: pass
http_bucket_types-bitcask: pass
verify_build_cluster-bitcask: pass
verify_claimant-bitcask: pass
verify_dynamic_ring-bitcask: pass
verify_leave-bitcask: pass
verify_reset_bucket_props-bitcask: pass
verify_riak_lager-bitcask: pass

Before I write more tests, it would be nice to know whether to continue with this solution or not.

martincox commented 3 years ago

I think it sounds like a good starting point, even with the caveat over the lack of strict guarantees. It could potentially be developed further in the future, if anyone can invest more time into it. It makes sense to me to adopt it as is; it looks to provide some improvement. I'd be happy to see it merged. 👍

martinsumner commented 3 years ago

@systream just a note on timescales. I'm going to proceed with release 3.0.4 this week without this change, as I need something fixed ASAP, but once we have a riak_test we're satisfied with, I will set off the process for producing a Riak 3.0.5 which includes this rack awareness PR.

I have a few other things on for the next few days, but I might have some time towards the end of next week to help with a riak_test test if required.

systream commented 3 years ago

Okay, cool, thanks! I have already started writing new tests in riak_test, but unfortunately I had urgent tasks to finish. I think it will be finished in a couple of days.

systream commented 3 years ago

@martinsumner I wrote tests in riak_test. I'm not totally sure whether they are appropriate; could you check them? Thanks. https://github.com/basho/riak_test/pull/1353.

martinsumner commented 3 years ago

I will have a look later in the week. Thanks.

martinsumner commented 3 years ago

I think I'm fine with this from a test and code perspective now. Sorry for the delay, I've been distracted with other things.

@systream - for others to use it, a write-up would be helpful (just some markdown in docs/). Is this something you could do?

@martincox - anything to add?

systream commented 3 years ago

Sure, I will do the docs

systream commented 3 years ago

What's the next step? Is there anything I need to do?

martinsumner commented 3 years ago

Sorry, it is on me now. I will start work on finalising a release next week. Apologies for the delay.

systream commented 3 years ago

Sorry, I didn't mean to rush you; I just got the feeling that I might need to do something.

martinsumner commented 3 years ago

@systream - sorry, but some late suggestions for the docs.

I think the instructions need to be extended to make it clear that locations need to be allocated sequentially in order for the claim to have the desired effect. That is to say:

node1 -> loc1
node2 -> loc2
node3 -> loc3
node4 -> loc1
node5 -> loc2
node6 -> loc3

should provide a diverse spread across locations, but the following mapping will not:

node1 -> loc1
node2 -> loc1
node3 -> loc2
node4 -> loc2
node5 -> loc3
node6 -> loc3

Also, in order to check a planned but uncommitted cluster change, this script would be useful:

PlannedRing = element(1, lists:last(element(3, riak_core_claimant:plan()))).
riak_core_location:check_ring(PlannedRing, 3, 2).
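
For interpreting the result (assuming check_ring/3 returns the list of violating preference lists, as its use further down this thread suggests), something like the following could be run in the console after riak admin cluster plan but before committing:

%% An empty list means the planned ring meets the location constraint for
%% n_val = 3 with at least 2 distinct locations per preference list.
PlannedRing = element(1, lists:last(element(3, riak_core_claimant:plan()))),
case riak_core_location:check_ring(PlannedRing, 3, 2) of
    []         -> io:format("Planned ring is location-safe~n");
    Violations -> io:format("~b preference lists violate the location constraint~n",
                            [length(Violations)])
end.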
systream commented 3 years ago

It doesn't (or shouldn't) matter which node is where, as long as the number of distinct locations is right. Both of the node-to-location assignments should work. I did check it.

martinsumner commented 3 years ago

Perhaps I did something wrong, but I've just checked it, and not allocating nodes in sequence caused an issue:

$ dev/dev6/riak/bin/riak admin cluster plan
=============================== Staged Changes ================================
Action         Details(s)
-------------------------------------------------------------------------------
set-location  'dev1@127.0.0.1' to rack_a
set-location  'dev2@127.0.0.1' to rack_a
set-location  'dev3@127.0.0.1' to rack_b
set-location  'dev4@127.0.0.1' to rack_b
set-location  'dev5@127.0.0.1' to rack_c
set-location  'dev6@127.0.0.1' to rack_c
-------------------------------------------------------------------------------

NOTE: Applying these changes will result in 1 cluster transition

###############################################################################
                         After cluster transition 1/1
###############################################################################

================================= Membership ==================================
Status     Ring    Pending    Node
-------------------------------------------------------------------------------
valid     100.0%     17.2%    dev1@127.0.0.1 (rack_a)
valid       0.0%     15.6%    dev2@127.0.0.1 (rack_a)
valid       0.0%     17.2%    dev3@127.0.0.1 (rack_b)
valid       0.0%     15.6%    dev4@127.0.0.1 (rack_b)
valid       0.0%     17.2%    dev5@127.0.0.1 (rack_c)
valid       0.0%     17.2%    dev6@127.0.0.1 (rack_c)
-------------------------------------------------------------------------------
Valid:6 / Leaving:0 / Exiting:0 / Joining:0 / Down:0

WARNING: Not all replicas will be on distinct locations

Transfers resulting from cluster changes: 53
  10 transfers from 'dev1@127.0.0.1' to 'dev2@127.0.0.1'
  10 transfers from 'dev1@127.0.0.1' to 'dev4@127.0.0.1'
  11 transfers from 'dev1@127.0.0.1' to 'dev6@127.0.0.1'
  11 transfers from 'dev1@127.0.0.1' to 'dev3@127.0.0.1'
  11 transfers from 'dev1@127.0.0.1' to 'dev5@127.0.0.1'

$ dev/dev6/riak/bin/riak admin cluster commit
Cluster changes committed
Eshell V10.7  (abort with ^G)
(dev6@127.0.0.1)1> {ok, R} = riak_core_ring_manager:get_my_ring().
(dev6@127.0.0.1)3> length(riak_core_location:check_ring(R, 3, 2)).
40

Whereas a sequential allocation had no such violations.

I'm going to have a play around to see if I can see how this might not be working for me.

martinsumner commented 3 years ago

Hang on ... I think I might have prematurely checked before transfers were complete!

Sorry, my mistake. That was the case.