After a network split, a node can make a write progress and end-up with a diverged local seqno

bogdando commented 8 years ago

The details are given here: https://bugs.launchpad.net/codership-mysql/+bug/1583521

temeo commented 8 years ago

Hi, I took a look at the logs and noticed that the view ids jump backwards:

2016-05-18 09:06:00 12954 [Note] WSREP: New cluster view: global state: e71ec250-1919-11e6-9f84-be5f61687f0f:92582, view# 99: Primary, number of nodes: 5, my index: 4, protocol version 3
...
2016-05-18 09:12:20 17875 [Note] WSREP: New cluster view: global state: e71ec250-1919-11e6-9f84-be5f61687f0f:92604, view# 2: Primary, number of nodes: 4, my index: 3, protocol version 3

This suggests that the cluster may have been re-bootstrapped between test runs. Jepsen seems to start a cluster so that the primary node is given the --wsrep-new-cluster option, which in turn bootstraps a new cluster. If the nodes were shut down in the order n1, n2, ... n5, so that n1 was not up to date with n5 (n5 was the most advanced node in the cluster, n1 was in a non-primary state), then restarting the nodes in the order n1, n2, ... n5 causes n5 to detect data divergence when it tries to join the cluster.

While this kind of data divergence is real, it is not the result of a replication protocol malfunction but rather of the way the cluster is managed (use of --wsrep-new-cluster) and of current limitations of Galera cluster management.

While I can't be certain from the logs that this is the case, it is the most probable explanation. Also, the Galera provider version used in the tests is rather old, 3.8, while the most recent release is 3.16.
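The backwards jump in view ids can also be checked mechanically. The following is a minimal Python sketch, not part of Galera or Jepsen: the regular expression is an assumption derived only from the two log lines quoted above, and the helper names `parse_views` and `find_view_regressions` are made up for illustration.

```python
import re

# Pattern inferred from the two "New cluster view" lines quoted in this thread;
# other Galera versions may format the line differently (assumption).
VIEW_RE = re.compile(
    r"New cluster view: global state: (?P<uuid>[0-9a-f-]+):(?P<seqno>-?\d+), "
    r"view# (?P<view>-?\d+): (?P<status>[\w-]+), number of nodes: (?P<nodes>\d+)"
)


def parse_views(lines):
    """Extract uuid, seqno, view id, status and node count from log lines."""
    views = []
    for line in lines:
        match = VIEW_RE.search(line)
        if match:
            views.append({
                "uuid": match.group("uuid"),
                "seqno": int(match.group("seqno")),
                "view": int(match.group("view")),
                "status": match.group("status"),
                "nodes": int(match.group("nodes")),
            })
    return views


def find_view_regressions(views):
    """Return (previous, current) pairs where the Primary view id goes down."""
    regressions = []
    previous = None
    for view in views:
        if view["status"] != "Primary":
            continue  # only compare Primary views with each other
        if previous is not None and view["view"] < previous["view"]:
            regressions.append((previous, view))
        previous = view
    return regressions


LOG = [
    "2016-05-18 09:06:00 12954 [Note] WSREP: New cluster view: global state: "
    "e71ec250-1919-11e6-9f84-be5f61687f0f:92582, view# 99: Primary, "
    "number of nodes: 5, my index: 4, protocol version 3",
    "2016-05-18 09:12:20 17875 [Note] WSREP: New cluster view: global state: "
    "e71ec250-1919-11e6-9f84-be5f61687f0f:92604, view# 2: Primary, "
    "number of nodes: 4, my index: 3, protocol version 3",
]

for before, after in find_view_regressions(parse_views(LOG)):
    print(f"view# {before['view']} -> {after['view']} "
          f"(seqno {before['seqno']} -> {after['seqno']})")
```

A drop in the view id while the seqno keeps growing under the same UUID is consistent with a re-bootstrap, but it does not prove one on its own.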

bogdando commented 8 years ago

Note, I modified the Jepsen test to rely on external cluster bootstrapping, which is driven by a Pacemaker OCF RA. It searches for a seed node (the one that will be started with --wsrep-new-cluster) as the one that has:

the UUID seen most often across the visible network partition that has a Pacemaker quorum (so it is not in the minority partition), then
the max value of SEQNO across the nodes that have that UUID.

Note that, by the layout of the test environment, the Galera network partitions are always aligned (node by node) with the Pacemaker ones. This means that Galera itself presumably must handle connections made from clients to nodes without a write quorum, so that writes never make progress on the minority partition nodes.
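The seed search described above can be sketched roughly as follows. This is only an illustration in Python, not the actual OCF RA code: `NodeState`, `pick_seed`, the strict-majority quorum check, and the handling of ties are all assumptions, and the second UUID in the example is invented.

```python
from collections import Counter
from typing import List, NamedTuple, Optional


class NodeState(NamedTuple):
    """Saved Galera position of one node (hypothetical container)."""
    name: str
    uuid: str
    seqno: int


def pick_seed(visible: List[NodeState], cluster_size: int) -> Optional[NodeState]:
    """Pick a bootstrap seed among the nodes visible in this partition.

    Returns None when the partition has no quorum or the choice is ambiguous.
    """
    # Assumption: quorum means a strict majority of the full cluster.
    if len(visible) * 2 <= cluster_size:
        return None

    # Step 1: find the UUID seen on the most nodes.
    counts = Counter(node.uuid for node in visible)
    top = max(counts.values())
    top_uuids = [uuid for uuid, count in counts.items() if count == top]
    if len(top_uuids) != 1:
        return None  # assumption: refuse to guess when two UUIDs tie

    # Step 2: among the nodes with that UUID, take the highest seqno.
    # Ties on seqno go to the first node listed (arbitrary choice).
    candidates = [node for node in visible if node.uuid == top_uuids[0]]
    return max(candidates, key=lambda node: node.seqno)


CLUSTER_UUID = "e71ec250-1919-11e6-9f84-be5f61687f0f"
OTHER_UUID = "00000000-0000-0000-0000-000000000001"  # invented for the example

majority = [
    NodeState("n3", CLUSTER_UUID, 92582),
    NodeState("n4", OTHER_UUID, 99999),
    NodeState("n5", CLUSTER_UUID, 92604),
]
minority = [
    NodeState("n1", CLUSTER_UUID, 92582),
    NodeState("n2", CLUSTER_UUID, 92582),
]

print(pick_seed(majority, cluster_size=5))
print(pick_seed(minority, cluster_size=5))
```

In this sketch a partition without a strict majority never yields a seed, so a minority side cannot bootstrap on its own. Whether the real resource agent behaves the same way is not confirmed by this thread.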

However, the Pacemaker cluster's DC node may NOT be aligned with the Galera PRIMARY. I mean, could the primary node belong to the minority partition? Do you think that was the case with the "ids jump", and do you see such a pattern in the logs? Do you think this issue has nothing to do with Galera, but is specific to the bootstrap (or to the seed/DC worldview of the clusters)? Which way of searching for a seed node do you recommend?

codership / galera

After a network split, a node can make a write progress and end-up with a diverged local seqno #401