Improve gcomm operation in presence of high packet loss

temeo commented 10 years ago

This will be parent ticket for further packet loss related work.

Original report: https://bugs.launchpad.net/percona-xtradb-cluster/+bug/1274192

Introducing high delay and packet loss into network will make inter node communication highly unreliable in form of duplicated, lost or delayed messages. These kind of conditions will bring up some EVS related bugs, like currently open #37 and #40. EVS protocol related bugs are not in the scope of this ticket and should be reported and fixed separately.

Further work can be divided roughly in three parts, which will be outlined below.

1. Monitoring and manual eviction

Automatic node isolation or eviction is currently not possible since EVS lacks some necessary elements like proper per-node statistics collection and protocol to communicate current view of node states without running full membership protocol. Running full membership protocol in unstable network to should be avoided as it may result in oscillating membership if the network conditions are difficult enough. Therefore the first implementation of node eviction should be based on monitoring and manual eviction until better understanding about automatic eviction has been gained.

Proposed implementation:

Expose list of delayed node UUIDs and communication endpoints as wsrep status variables for monitoring purposes
Implement evs.evict wsrep provider option taking list of UUIDs that should be evicted manually from cluster
2. Join time health check

Joining a node with bad network connection is problematic since join operation will start EVS membership protocol round which in turn performs poorly in unstable network. To avoid starting join operation over unstable network, additional health check phase for GMCast should be deviced.

When GComm protocol stack is started, joiner should first connect to all known peers in GMCast network and exchange keep alive packets to verify that the network is ok. Joining node will start upper GComm protocol layers only after health check passes. Other nodes should not treat joining node as fully qualified member in the GMCast network until joiner sends first upper level protocol packet.

3. Automatic eviction

After enough understanding has been gained about how to properly identify the node that is causing turbulence for group communication, automatic eviction protocol can be enabled. This work will require proper per-node statistics collection on EVS level and protocol extension to communicate statistics related view to other nodes.

temeo commented 10 years ago

Outline of current design:

EVS monitors response time for each other node. If the response time exceeds evs.delayed_period, node is added to list of delayed nodes.

The node on list of delayed nodes has associated state (OK or DELAYED, initialized to DELAYED) and counter (0...255, initialzed to 0). Each time check for delayed nodes is done (once in evs.inactive_check_period) and state change is detected counter is incremented. Once counter reaches over 1, node UUID and counter is reported to other nodes.

If node state stays OK over evs.delayed_decay_period, its counter is decremented by one. Once the counter reaches zero, node is removed from delayed list.

Delayed list can be monitored via wsrep_evs_delayed status variable. List is in form of comma separated values, with single element being of format

  0aac8e2a-0379-11e4-ba9c-afbf6afe14bc:tcp://192.168.17.13:10031:1

The first part is node UUID. Second part starting from tcp:// is low level connectivity endpoint (if known) and the final number is the value of state counter.

Node can be manually evicted from the cluster by assigning node UUID into evs.evict wsrep provider option:

set global wsrep_provider_options='evs.evict=0aac8e2a-0379-11e4-ba9c-afbf6afe14bc';

This will add node UUID to evicted list and will trigger group membership change.

At the moment it is possible to evict manually only one node at the time. This will cause problems if several nodes reside behind bad link (like in multi data center case) since membership protocol runs poorly over bad network. Parameter evs.evict parsing should be enhanced to allow several UUIDs at once.

Setting parameter evs.auto_evict to value greater than zero will enable automatic eviction. The operation is roughly the following:

Every node listens to messages reporting delayed nodes. When such a message is received, it is stored for evs.decay_period time.

Messages are iterated over and the following counters are updated

How many messages have a node reported with counter value over evs.auto_evict
How many messages have a node reported Messages containing processing node's UUID are ignored.

If at least one candidate is found, all nodes that have been reported by majority of the group are evicted automatically.

Also this approach has a problem if there are several nodes behind bad link. Some kind of heuristics should be applied to either try to keep group intact or to evict all delayed candidates without loosing majority of the group.

This kind of design emerged because:

There must be some way to gather statistics to make eviction decision, and doing it along with inactive check was the easies choise.
Simply having OK and DELAYED states for nodes to be evicted would not be enough, since eviction would happen with each partition and this would defeat the purpose of partition tolerant algorithm. Therefore the state change counter is used as a measure if the communication is ok (state OK and counter 0, or node not in list), completely failed (state DELAYED and counter 0) or acting randomly (counter over 1).
Slowly decrementing state counter was chosen to keep the entries with most state changes longer on the list.
Auto eviction algorithm aims to always keep majority of the group live or to avoid eviction. Otherwise cluster would disintegrate quickly by cross auto eviction if the connectivity between all nodes would have problems.

Suggested parameter values for testing with default EVS values:

evs.delayed_period=PT2S
evs.delayed_decay_period=PT30S
evs.auto_evict=5

ayurchen commented 10 years ago

On 2014-07-04 16:59, temeo wrote:

Suggested parameter values for testing with default EVS values:
evs.delayed_period=PT2S

But not 'period'! Just 'evs.delay' or evs.response_delay.

2 seconds default is probably too much. 1 second should be more than enough.

Also, could this be linked to suspect_timeout?

evs.delayed_decay_period=PT30S evs.auto_evict=5

temeo commented 10 years ago

@ayurchen Evs response delay should be higher than keepalive period, otherwise there will be false positives in idle cluster.

Linking it to suspect timeout would be nice but it would be hard to define good default for that.

ronin13 commented 10 years ago

This is still reproducible:

Log files:

http://files.wnohang.net/files/results-140.tar.gz

Console: http://jenkins.percona.com/job/PXC-5.6-netem/140/label_exp=qaserver-04/console

Also here: http://files.wnohang.net/files/results-140-consoleText

The test is as follows:

a) Start one node.

b) Load 20 tables into it with sysbench, 1000 records each.

c) Start 9 other nodes - each with a random segment in [1,5]. - gmcast.segment

d) Make one node have "150ms 20ms distribution normal and loss 20% ".

e) Select 9 other nodes to write to. (sockets in "sysbench on sockets")

f) Start writing with sysbench - oltp test, 20 threads, 360 seconds, --oltp_index_updates=20 --oltp_non_index_updates=20 --oltp_distinct_ranges=15

g) sysbench exits with error due to network partitioning. All nodes except one with loss are in non-PC.

h) Sleep for 60 seconds and then run sanity tests.

i) The sanity tests also fail and test exits.

With last few commits, the point to network partitioning has increased (it is stable for upto 230 seconds as seen in console whereas earlier it may have been failing much earlier).

Now, what is the general design of solution expected here? Is it to isolate the node with bad network - STONITH - implemented in galera? Is it to mark it as down etc. If this is the case, as soon as the node is isolated, the cluster should return to normal, but it is not happening.

Also, note that specifically, the writes are done to all nodes except the one maimed for loss. This is to avoid sysbench itself getting affected by network.

ronin13 commented 10 years ago

Note that this is upto c5353b1 in galera-3.x tree.

dirtysalt commented 10 years ago

I think gh71 is supposed to fix network partitioning problem, but is not merged into 3.x yet.

ronin13 commented 10 years ago

With gh71 merged, I am seeing 3 nodes evicted in lieu of 1 node with packet loss.

Full logs: http://files.wnohang.net/files/results-142.tar.gz Console: http://jenkins.percona.com/job/PXC-5.6-netem/142/label_exp=qaserver-04/console

Dock3.log:[Aug 16 19:56:11.002] 2014-08-16 20:56:11 1 [Warning] WSREP: handling gmcast protocol message failed: this node has been evicted out of the cluster, gcomm backend restart is required (FATAL)
Dock3.log:[Aug 16 19:56:11.002] 2014-08-16 20:56:11 1 [ERROR] WSREP: exception from gcomm, backend must be restarted: this node has been evicted out of the cluster, gcomm backend restart is required (FATAL)
Dock4.log:[Aug 16 19:55:56.474] 2014-08-16 20:55:56 1 [Warning] WSREP: handling gmcast protocol message failed: this node has been evicted out of the cluster, gcomm backend restart is required (FATAL)
Dock4.log:[Aug 16 19:55:56.474] 2014-08-16 20:55:56 1 [ERROR] WSREP: exception from gcomm, backend must be restarted: this node has been evicted out of the cluster, gcomm backend restart is required (FATAL)
Dock5.log:[Aug 16 19:55:56.044] 2014-08-16 20:55:56 1 [Warning] WSREP: handling gmcast protocol message failed: this node has been evicted out of the cluster, gcomm backend restart is required (FATAL)
Dock5.log:[Aug 16 19:55:56.045] 2014-08-16 20:55:56 1 [ERROR] WSREP: exception from gcomm, backend must be restarted: this node has been evicted out of the cluster, gcomm backend restart is required (FATAL)

ronin13 commented 10 years ago

The earlier logs were with multiple segments.

http://files.wnohang.net/files/results-145.tar.gz is with 0 segment for all.

In this, Dock3 and Dock10 get evicted, even though Dock3 is the one with loss/delay.

ronin13 commented 10 years ago

Logs:

http://files.wnohang.net/files/results-154.tar.gz

Dock2 and Dock5 are evicted even if only Dock5 has the loss.

http://jenkins.percona.com/job/PXC-5.6-netem/154/label_exp=qaserver-04/console

Nodes failed to reach primary too, but the cluster lasted a bit longer before going non-primary.

ronin13 commented 10 years ago

http://files.wnohang.net/files/results-156.tar.gz http://jenkins.percona.com/job/PXC-5.6-netem/156/label_exp=qaserver-04/console

This is with evs.info_log_mask=0x3 in addition to others.

Dock9 is the one with loss, Dock1 is evicted.

dirtysalt commented 9 years ago

LGTM.

temeo commented 9 years ago

Monitoring, manual eviction and automatic eviction from original plan has been implemented. If there is still need for join time health check, it should be reported as a separate issue.

codership / galera

Improve gcomm operation in presence of high packet loss #71

1. Monitoring and manual eviction

2. Join time health check

3. Automatic eviction