codership / galera

Synchronous multi-master replication library
GNU General Public License v2.0

Node eviction prevents primary component formation #194

Open ronin13 opened 9 years ago

ronin13 commented 9 years ago

In one of the healthy nodes I see this message:

"Nodes 71518ee5 c7973d37 are still in unknown state, unable to rebootstrap new prim"

However, both of those nodes have already been evicted by that time.

I also see:

2014-12-03T16:31:19.343942455Z 2014-12-03 16:31:19 7 [Note] WSREP: (ad0355f3, 'tcp://0.0.0.0:4567') reconnecting to 71518ee5 (tcp://172.17.0.83:4567), attempt 60
2014-12-03T16:31:31.844531795Z 2014-12-03 16:31:31 7 [Note] WSREP: (ad0355f3, 'tcp://0.0.0.0:4567') reconnecting to 7f83d198 (tcp://172.17.0.86:4567), attempt 120
2014-12-03T16:31:36.345113914Z 2014-12-03 16:31:36 7 [Note] WSREP: (ad0355f3, 'tcp://0.0.0.0:4567') reconnecting to d6b6fd86 (tcp://172.17.0.94:4567), attempt 150
2014-12-03T16:31:45.345643080Z 2014-12-03 16:31:45 7 [Note] WSREP: (ad0355f3, 'tcp://0.0.0.0:4567') reconnecting to c7973d37 (tcp://172.17.0.93:4567), attempt 30
2014-12-03T16:32:24.348411790Z 2014-12-03 16:32:24 7 [Note] WSREP: (ad0355f3, 'tcp://0.0.0.0:4567') reconnecting to 71518ee5 (tcp://172.17.0.83:4567), attempt 90
2014-12-03T16:32:29.848931488Z 2014-12-03 16:32:29 7 [Note] WSREP: (ad0355f3, 'tcp://0.0.0.0:4567') reconnecting to 7901cf91 (tcp://172.17.0.85:4567), attempt 180
2014-12-03T16:32:37.849453471Z 2014-12-03 16:32:37 7 [Note] WSREP: (ad0355f3, 'tcp://0.0.0.0:4567') reconnecting to 7f83d198 (tcp://172.17.0.86:4567), attempt 150
2014-12-03T16:32:49.850588765Z 2014-12-03 16:32:49 7 [Note] WSREP: (ad0355f3, 'tcp://0.0.0.0:4567') reconnecting to c7973d37 (tcp://172.17.0.93:4567), attempt 60
2014-12-03T16:32:55.850898878Z 2014-12-03 16:32:55 7 [Note] WSREP: (ad0355f3, 'tcp://0.0.0.0:4567') reconnecting to d6b6fd86 (tcp://172.17.0.94:4567), attempt 180
2014-12-03T16:33:33.853656729Z 2014-12-03 16:33:33 7 [Note] WSREP: (ad0355f3, 'tcp://0.0.0.0:4567') reconnecting to 71518ee5 (tcp://172.17.0.83:4567), attempt 120
2014-12-03T16:33:48.855150447Z 2014-12-03 16:33:48 7 [Note] WSREP: (ad0355f3, 'tcp://0.0.0.0:4567') reconnecting to 7f83d198 (tcp://172.17.0.86:4567), attempt 180
2014-12-03T16:33:51.855395537Z 2014-12-03 16:33:51 7 [Note] WSREP: (ad0355f3, 'tcp://0.0.0.0:4567') reconnecting to 7901cf91 (tcp://172.17.0.85:4567), attempt 210
2014-12-03T16:34:01.856537649Z 2014-12-03 16:34:01 7 [Note] WSREP: (ad0355f3, 'tcp://0.0.0.0:4567') reconnecting to c7973d37 (tcp://172.17.0.93:4567), attempt 90

on other nodes, long after the pinged nodes have been evicted.

The configuration is:

13 nodes, out of which 5 are selected
40% packet loss
Delay of 150ms with 20ms jitter, Pareto distribution (adjusted per gmcast.segment later)

Note that sysbench is also modified here to reconnect/retry on failures (MySQL errors 1047, 1213).

Final qdisc rules look like:

07:44:54 Rules in place
07:44:54 qdisc pfifo_fast 0: dev eth0 root refcnt 2 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
07:44:54 qdisc pfifo_fast 0: dev eth0 root refcnt 2 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
07:44:54 qdisc prio 1: dev eth0 root refcnt 2 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
07:44:54 qdisc netem 30: dev eth0 parent 1:2 limit 1000 delay 900.0ms  20.0ms loss 40%
07:44:54 qdisc prio 1: dev eth0 root refcnt 2 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
07:44:54 qdisc netem 30: dev eth0 parent 1:2 limit 1000 delay 300.0ms  20.0ms loss 40%
07:44:54 qdisc pfifo_fast 0: dev eth0 root refcnt 2 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
07:44:54 qdisc pfifo_fast 0: dev eth0 root refcnt 2 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
07:44:54 qdisc pfifo_fast 0: dev eth0 root refcnt 2 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
07:44:54 qdisc prio 1: dev eth0 root refcnt 2 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
07:44:54 qdisc netem 30: dev eth0 parent 1:2 limit 1000 delay 300.0ms  20.0ms loss 40%
07:44:54 qdisc pfifo_fast 0: dev eth0 root refcnt 2 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
07:44:54 qdisc prio 1: dev eth0 root refcnt 2 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
07:44:54 qdisc netem 30: dev eth0 parent 1:2 limit 1000 delay 900.0ms  20.0ms loss 40%
07:44:55 qdisc pfifo_fast 0: dev eth0 root refcnt 2 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
07:44:55 qdisc prio 1: dev eth0 root refcnt 2 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
07:44:55 qdisc netem 30: dev eth0 parent 1:2 limit 1000 delay 300.0ms  20.0ms loss 40%
07:44:55 qdisc pfifo_fast 0: dev eth0 root refcnt 2 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1

Console: http://jenkins.percona.com/job/PXC-5.6-netem/235/btype=release,label_exp=qaserver-04/console

Logs: https://files.wnohang.net/files/results-235.tar.gz https://files.wnohang.net/files/results-234.tar.gz

Now, from gcomm::pc::Proto::is_prim():

    // First determine if there are any nodes still in unknown state.
    std::set<UUID> un;
    for (NodeMap::const_iterator i(instances_.begin());
         i != instances_.end(); ++i)
    {
        if (NodeMap::value(i).un() == true &&
            current_view_.members().find(NodeMap::key(i)) ==
            current_view_.members().end())
        {
            un.insert(NodeMap::key(i));
        }
    }

    if (un.empty() == false)
    {
        std::ostringstream oss;
        std::copy(un.begin(), un.end(),
                  std::ostream_iterator<UUID>(oss, " "));
        log_info << "Nodes " << oss.str() << "are still in unknown state, "
                 << "unable to rebootstrap new prim";
        return false;
    }

it looks like the code first checks for nodes in 'unknown' state, and only after that does it collect the evicted nodes:

    // Collect last prim members and evicted from state messages
    MultiMap<ViewId, UUID> last_prim_uuids;
    std::set<UUID> evicted;

    for (SMMap::const_iterator i = state_msgs_.begin();
         i != state_msgs_.end();
         ++i)
    {
        for (NodeMap::const_iterator
                 j = SMMap::value(i).node_map().begin();
             j != SMMap::value(i).node_map().end(); ++j)
        {
            const UUID& uuid(NodeMap::key(j));
            const Node& inst(NodeMap::value(j));

            if (inst.last_prim().type() != V_NON_PRIM &&
                std::find<MultiMap<ViewId, UUID>::iterator,
                          std::pair<const ViewId, UUID> >(
                              last_prim_uuids.begin(),
                              last_prim_uuids.end(),
                              std::make_pair(inst.last_prim(), uuid)) ==
                last_prim_uuids.end())
            {
                last_prim_uuids.insert(std::make_pair(inst.last_prim(), uuid));
            }
            if (inst.evicted() == true)
            {
                evicted.insert(uuid);
            }
        }
    }

    if (last_prim_uuids.empty() == true)
    {
        log_warn << "no nodes coming from prim view, prim not possible";
        return false;
    }

So, either evicted nodes should not be considered to be in 'unknown' state, or some wait is required to avoid the race condition.
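
A minimal sketch of the first option, assuming the Node objects in instances_ expose the same evicted() accessor that the state-message nodes use in the second snippet above; this is only an illustration of the proposed change, not a tested patch:

    // Proposed: nodes that have already been evicted should not count
    // as 'unknown', so that they cannot block formation of a new prim.
    std::set<UUID> un;
    for (NodeMap::const_iterator i(instances_.begin());
         i != instances_.end(); ++i)
    {
        if (NodeMap::value(i).un() == true &&
            NodeMap::value(i).evicted() == false && // skip evicted nodes
            current_view_.members().find(NodeMap::key(i)) ==
            current_view_.members().end())
        {
            un.insert(NodeMap::key(i));
        }
    }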

ronin13 commented 9 years ago

Note that eviction happens correctly here, i.e. all the bad nodes are correctly evicted.

It is just that the remaining nodes don't form a primary component (PC) even though 8 nodes are left at the end.

Nodes are as follows:

Segment 0: Dock13, Dock1
Segment 1: Dock4, Dock8, Dock9, Dock12
Segment 2: Dock6, Dock7, Dock11
Segment 3: (none)
Segment 4: Dock5
Segment 5: Dock2, Dock3, Dock10

The delay here is calculated as original_delay * (segment + 1) for every node; hence in the qdisc output above you can see it ranges from 300ms to 900ms (e.g. segment 1 gets 150ms * 2 = 300ms and segment 5 gets 150ms * 6 = 900ms).
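
For illustration, a self-contained sketch of that computation (a hypothetical helper, not part of the actual test harness):

    #include <chrono>
    #include <iostream>

    // Hypothetical helper mirroring the formula above: the base netem delay
    // is scaled by (segment + 1) for each node's gmcast.segment value.
    std::chrono::milliseconds segment_delay(int segment)
    {
        const std::chrono::milliseconds base(150);
        return base * (segment + 1);
    }

    int main()
    {
        std::cout << segment_delay(1).count() << "ms\n"; // 300ms for segment 1
        std::cout << segment_delay(5).count() << "ms\n"; // 900ms for segment 5
    }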

ronin13 commented 9 years ago

One more thing:

2014-12-03T16:31:31.844531795Z 2014-12-03 16:31:31 7 [Note] WSREP: (ad0355f3, 'tcp://0.0.0.0:4567') reconnecting to 7f83d198 (tcp://172.17.0.86:4567), attempt 120
2014-12-03T16:31:36.345113914Z 2014-12-03 16:31:36 7 [Note] WSREP: (ad0355f3, 'tcp://0.0.0.0:4567') reconnecting to d6b6fd86 (tcp://172.17.0.94:4567), attempt 150
2014-12-03T16:31:45.345643080Z 2014-12-03 16:31:45 7 [Note] WSREP: (ad0355f3, 'tcp://0.0.0.0:4567') reconnecting to c7973d37 (tcp://172.17.0.93:4567), attempt 30
2014-12-03T16:32:24.348411790Z 2014-12-03 16:32:24 7 [Note] WSREP: (ad0355f3, 'tcp://0.0.0.0:4567') reconnecting to 71518ee5 (tcp://172.17.0.83:4567), attempt 90
2014-12-03T16:32:29.848931488Z 2014-12-03 16:32:29 7 [Note] WSREP: (ad0355f3, 'tcp://0.0.0.0:4567') reconnecting to 7901cf91 (tcp://172.17.0.85:4567), attempt 180
2014-12-03T16:32:37.849453471Z 2014-12-03 16:32:37 7 [Note] WSREP: (ad0355f3, 'tcp://0.0.0.0:4567') reconnecting to 7f83d198 (tcp://172.17.0.86:4567), attempt 150
2014-12-03T16:32:49.850588765Z 2014-12-03 16:32:49 7 [Note] WSREP: (ad0355f3, 'tcp://0.0.0.0:4567') reconnecting to c7973d37 (tcp://172.17.0.93:4567), attempt 60

gmcast keeps trying to reconnect to nodes long after they have been evicted. This shouldn't happen.
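
A heavily hedged sketch of the kind of guard being suggested; the names here (should_reconnect, the evicted set) are hypothetical and do not correspond to actual gmcast internals, which are not shown in this issue:

    #include <iostream>
    #include <set>
    #include <string>

    // Hypothetical sketch: before scheduling another reconnect attempt,
    // gmcast should consult its eviction list and skip evicted peers.
    bool should_reconnect(const std::string& peer_uuid,
                          const std::set<std::string>& evicted)
    {
        if (evicted.count(peer_uuid) != 0)
        {
            std::cerr << "skipping reconnect to evicted peer "
                      << peer_uuid << '\n';
            return false;
        }
        return true;
    }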