Open ronin13 opened 9 years ago
Note that eviction happens correctly here, i.e. all the bad nodes are correctly evicted.
The problem is that the remaining nodes don't form a primary component (PC), even though 8 are left at the end.
Nodes are as follows:
Segment 0: Dock13, Dock1
Segment 1: Dock4, Dock8, Dock9, Dock12
Segment 2: Dock6, Dock7, Dock11
Segment 3: (none)
Segment 4: Dock5
Segment 5: Dock2, Dock3, Dock10
The delay for every node is calculated as (original-delay * (segment + 1)), hence the qdisc rules show it ranging from 300ms to 900ms.
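To make the delay formula concrete, here is a minimal sketch that applies it to the segment layout listed above. The base delay is not stated in this issue, so `BASE_DELAY_MS` is a placeholder and the printed values are not meant to match the 300ms to 900ms seen in the qdisc rules. The function and variable names are also illustrative.

```python
# Placeholder base delay: the issue does not state the original delay,
# so this value is an assumption for illustration only.
BASE_DELAY_MS = 100

# Segment layout as listed in this issue (segment 3 has no nodes).
SEGMENTS = {
    0: ["Dock13", "Dock1"],
    1: ["Dock4", "Dock8", "Dock9", "Dock12"],
    2: ["Dock6", "Dock7", "Dock11"],
    3: [],
    4: ["Dock5"],
    5: ["Dock2", "Dock3", "Dock10"],
}


def node_delays(base_ms, segments):
    """Return {node: delay_ms} using delay = base * (segment + 1)."""
    return {
        node: base_ms * (segment + 1)
        for segment, nodes in segments.items()
        for node in nodes
    }


if __name__ == "__main__":
    delays = node_delays(BASE_DELAY_MS, SEGMENTS)
    # Print nodes ordered by their computed delay.
    for node, delay in sorted(delays.items(), key=lambda kv: kv[1]):
        print(f"{node}: {delay}ms")
```

With this formula, nodes in the same segment share one delay, and each higher segment adds one more multiple of the base delay.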
One more thing:
2014-12-03T16:31:31.844531795Z 2014-12-03 16:31:31 7 [Note] WSREP: (ad0355f3, 'tcp://0.0.0.0:4567') reconnecting to 7f83d198 (tcp://172.17.0.86:4567), attempt 120
2014-12-03T16:31:36.345113914Z 2014-12-03 16:31:36 7 [Note] WSREP: (ad0355f3, 'tcp://0.0.0.0:4567') reconnecting to d6b6fd86 (tcp://172.17.0.94:4567), attempt 150
2014-12-03T16:31:45.345643080Z 2014-12-03 16:31:45 7 [Note] WSREP: (ad0355f3, 'tcp://0.0.0.0:4567') reconnecting to c7973d37 (tcp://172.17.0.93:4567), attempt 30
2014-12-03T16:32:24.348411790Z 2014-12-03 16:32:24 7 [Note] WSREP: (ad0355f3, 'tcp://0.0.0.0:4567') reconnecting to 71518ee5 (tcp://172.17.0.83:4567), attempt 90
2014-12-03T16:32:29.848931488Z 2014-12-03 16:32:29 7 [Note] WSREP: (ad0355f3, 'tcp://0.0.0.0:4567') reconnecting to 7901cf91 (tcp://172.17.0.85:4567), attempt 180
2014-12-03T16:32:37.849453471Z 2014-12-03 16:32:37 7 [Note] WSREP: (ad0355f3, 'tcp://0.0.0.0:4567') reconnecting to 7f83d198 (tcp://172.17.0.86:4567), attempt 150
2014-12-03T16:32:49.850588765Z 2014-12-03 16:32:49 7 [Note] WSREP: (ad0355f3, 'tcp://0.0.0.0:4567') reconnecting to c7973d37 (tcp://172.17.0.93:4567), attempt 60
gmcast tries reconnecting to nodes long after they have been evicted. This shouldn't be done.
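To see how long gmcast keeps retrying each peer, the log lines above can be summarized with a small script. This is only a sketch: the regular expression is written against the line format shown in this issue, and the function name is illustrative rather than part of any Galera tooling.

```python
import re
from collections import defaultdict

# Matches the "reconnecting to <uuid> (tcp://<addr>), attempt <n>" part
# of the WSREP log lines quoted in this issue.
RECONNECT_RE = re.compile(
    r"reconnecting to (?P<uuid>[0-9a-f]{8}) "
    r"\(tcp://(?P<addr>[\d.]+:\d+)\), attempt (?P<attempt>\d+)"
)


def max_reconnect_attempts(lines):
    """Map each peer (uuid, addr) to the highest attempt number seen."""
    peaks = defaultdict(int)
    for line in lines:
        match = RECONNECT_RE.search(line)
        if match:
            key = (match.group("uuid"), match.group("addr"))
            peaks[key] = max(peaks[key], int(match.group("attempt")))
    return dict(peaks)


# A few of the log lines from this issue, copied verbatim.
SAMPLE = [
    "2014-12-03T16:31:31.844531795Z 2014-12-03 16:31:31 7 [Note] WSREP: (ad0355f3, 'tcp://0.0.0.0:4567') reconnecting to 7f83d198 (tcp://172.17.0.86:4567), attempt 120",
    "2014-12-03T16:31:45.345643080Z 2014-12-03 16:31:45 7 [Note] WSREP: (ad0355f3, 'tcp://0.0.0.0:4567') reconnecting to c7973d37 (tcp://172.17.0.93:4567), attempt 30",
    "2014-12-03T16:32:37.849453471Z 2014-12-03 16:32:37 7 [Note] WSREP: (ad0355f3, 'tcp://0.0.0.0:4567') reconnecting to 7f83d198 (tcp://172.17.0.86:4567), attempt 150",
    "2014-12-03T16:32:49.850588765Z 2014-12-03 16:32:49 7 [Note] WSREP: (ad0355f3, 'tcp://0.0.0.0:4567') reconnecting to c7973d37 (tcp://172.17.0.93:4567), attempt 60",
]

if __name__ == "__main__":
    for (uuid, addr), attempt in max_reconnect_attempts(SAMPLE).items():
        print(f"{uuid} {addr}: up to attempt {attempt}")
```

Comparing the peers reported here against the list of evicted nodes shows which evicted nodes are still being retried.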
In one of the healthy nodes I see this message:
However, both of those are already evicted by that time.
I also see:
on other nodes, long after the pinged nodes have been evicted.
The configuration is:
Note that sysbench is also modified here to reconnect/retry on failures with error codes 1047 and 1213.
Final qdisc rules look like:
Console: http://jenkins.percona.com/job/PXC-5.6-netem/235/btype=release,label_exp=qaserver-04/console
Logs: https://files.wnohang.net/files/results-235.tar.gz https://files.wnohang.net/files/results-234.tar.gz
Now, from gcomm::pc::Proto::is_prim():
It looks like it checks for nodes in the 'unknown' state first, and only after that checks for evicted nodes in:
So either evicted nodes should not be considered to be in the 'unknown' state, or some wait is required to avoid a race condition.
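The first option can be illustrated with a toy model. This is not the Galera code: the `Node` class and both functions are hypothetical, and the model assumes that a single node in the 'unknown' state is enough to make the check fail, which is how the behaviour reads from this issue but is not verified against the gcomm source.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    state_known: bool  # False models the 'unknown' state
    evicted: bool


def can_form_pc_unknown_first(nodes):
    """Model of the suspected behaviour: the unknown-state check runs
    over every node, so an evicted node in unknown state blocks the PC."""
    return all(node.state_known for node in nodes)


def can_form_pc_skip_evicted(nodes):
    """Model of the proposed alternative: evicted nodes are excluded
    before the unknown-state check runs."""
    return all(node.state_known for node in nodes if not node.evicted)


if __name__ == "__main__":
    # 8 healthy nodes with known state, 5 evicted nodes left in unknown state.
    healthy = [Node(f"healthy{i}", True, False) for i in range(8)]
    evicted = [Node(f"evicted{i}", False, True) for i in range(5)]
    cluster = healthy + evicted

    print(can_form_pc_unknown_first(cluster))
    print(can_form_pc_skip_evicted(cluster))
```

In this model the 8 remaining nodes are blocked only by the evicted nodes' unknown state, so filtering evicted nodes first is enough to let the check pass. The second option, waiting before the check, is not modelled here.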