codership / galera

Synchronous multi-master replication library
GNU General Public License v2.0
448 stars 176 forks source link

gcs protocol violation with JOIN and SYNC messages #101

Open temeo opened 10 years ago

temeo commented 10 years ago

Setup: Two nodes (db1, db2) and garbd. Node db1 was joined and got SST from db2. After SST was finished garbd logged:

WARN: Protocol violation. JOIN message sender 1.0 (db2) is not in state transfer (SYNCED). Message ignored.
WARN: Rejecting JOIN message from 2.0 (db1): new State Transfer required.
WARN: SYNC message sender from non-JOINED 2.0 (db1). Ignored.

but no such a messages were found from db2 log and db1 join was successful.

The issue is why garbd had a wrong view about db1 state.

dirtysalt commented 10 years ago

Seems garbd has wrong view of db1 and db2 state.

Rejecting JOIN message from 2.0 (db1): new State Transfer required.

db1 should be in JOINER state, but garbd thinks db1 is in PRIM state.

JOIN message sender 1.0 (db2) is not in state transfer (SYNCED)

db2 should be in DONOR state, but garbd thinks db2 is in SYNCED state.

ronin13 commented 10 years ago

Noticed same without garbd in a recent test:

2014-10-08 21:43:58 21085 [Note] WSREP: Signalling provider to continue.
2014-10-08 21:43:58 21085 [Note] WSREP: inited wsrep sidno 1
2014-10-08 21:43:58 21085 [Note] WSREP: SST received: e8fb9b5a-f7d6-11e3-813d-376a55693bec:8277710506
2014-10-08 21:43:58 21085 [Note] WSREP: Receiving IST: 58780 writesets, seqnos 8277710506-8277769286
2014-10-08 21:43:58 21085 [Note] /usr/sbin/mysqld: ready for connections.
Version: '5.6.21-69.0-56'  socket: '/var/run/mysqld/mysqld.sock'  port: 3306  Percona XtraDB Cluster (GPL), Release rel69.0, Revision 909, WSREP version 25.8, wsrep_25.8.r4126
2014-10-08 21:43:59 21085 [Note] WSREP: (9869c3e6, 'tcp://0.0.0.0:4567') turning message relay requesting off
2014-10-08 21:44:17 21085 [Note] WSREP: IST received: e8fb9b5a-f7d6-11e3-813d-376a55693bec:8277769286
2014-10-08 21:44:17 21085 [Note] WSREP: 0.0 (nuc4): State transfer from 1.0 (nuc3) complete.
2014-10-08 21:44:17 21085 [Note] WSREP: Shifting JOINER -> JOINED (TO: 8277780572)
2014-10-08 21:44:17 21085 [Warning] WSREP: Protocol violation. JOIN message sender 1.0 (nuc3) is not in state transfer (SYNCED). Message ignored.
2014-10-08 21:44:22 21085 [Note] WSREP: Member 0.0 (nuc4) synced with group.
2014-10-08 21:44:22 21085 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 8277783013)
2014-10-08 21:44:22 21085 [Note] WSREP: Synchronized with group, ready for connections
2014-10-08 21:44:22 21085 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
dirtysalt commented 10 years ago

related to #106

philip-galera commented 9 years ago

Saw this when the garbd test is run parallel to other workload:

galera_3nodes.galera_garbd               w3 [ fail ]  Found warnings/errors in server log file!
        Test ended at 2015-03-31 03:19:14
line
2015-03-31 13:18:48 41508 [Warning] WSREP: Protocol violation. JOIN message sender 1.0 (localhost.localdomain) is not in state transfer (SYNCED). Message ignored.
^ Found warnings in /home/philips/git/mysql-wsrep-parallel/mysql-test/var/3/log/mysqld.1.err
ok
line
2015-03-31 13:18:48 41599 [Warning] WSREP: Protocol violation. JOIN message sender 1.0 (localhost.localdomain) is not in state transfer (SYNCED). Message ignored.
^ Found warnings in /home/philips/git/mysql-wsrep-parallel/mysql-test/var/3/log/mysqld.2.err
ok
line
2015-03-31 13:18:48 41929 [Warning] WSREP: Protocol violation. JOIN message sender 1.0 (localhost.localdomain) is not in state transfer (SYNCED). Message ignored.
^ Found warnings in /home/philips/git/mysql-wsrep-parallel/mysql-test/var/3/log/mysqld.3.err
ok

 - saving '/home/philips/git/mysql-wsrep-parallel/mysql-test/var/3/log/galera_3nodes.galera_garbd/' to '/home/philips/git/mysql-wsrep-parallel/mysql-test/var/log/galera_3nodes.galera_garbd/'
ronin13 commented 9 years ago

Yes, these protocol violations are still logged during tests on our end as well. Some of the logs attached to other bugs do have them.

Not all messages are identical though.

So, is this indicative of any bug or harmless always. Asking for latter because I tend to see this in conjunction with other bugs.

ayurchen commented 9 years ago

This is inevitable as long as different stack layers have asynchronous state machines. So it will show under stress. This is logged as something to look for. But its significance depends on the actual situation. For example, above, there is a stray JOIN message from a node that's already SYNCED. So it may be a simple redundancy, which is nothing. Or that everybody erroneously thinks that it is SYNCED, while it is not - and then it is a serious issue. That message by itself is not a statement of what is going on. It is an additional info.