autopilotpattern/mysql

Implementation of the autopilot pattern for MySQL
Mozilla Public License 2.0

Failover not reporting to consul #81

Open mbbender opened 7 years ago

mbbender commented 7 years ago

If I bring up the project fresh, then bring up a second MySQL slave, then destroy the primary, the failover itself appears to work (I can still connect to MySQL via CNS), but based on the error logs the ContainerPilot onChange handler isn't doing what it should to inform Consul of the primary change.
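The quickest way to see what Consul believes is to hit the same endpoint the manage script queries. A minimal sketch, assuming the Consul agent is reachable on localhost:8500 as in the logs below:

# Minimal sketch: ask Consul which node, if any, is currently advertised as
# mysql-primary. Assumes a Consul agent on localhost:8500, as in the logs.
import requests

CONSUL = "http://localhost:8500"

resp = requests.get(CONSUL + "/v1/health/service/mysql-primary", params={"passing": 1})
resp.raise_for_status()
instances = resp.json()

if not instances:
    print("Consul has no passing mysql-primary instance")   # what this issue shows
else:
    for entry in instances:
        svc = entry["Service"]
        print("primary advertised as %s at %s:%s" % (svc["ID"], svc["Address"], svc["Port"]))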

It's definitely getting further than it was before: the MySQL primary came up fine and I saw replication working. However, when I killed the master to test failover, on the slave server I get:

Version: '5.6.34-79.1-log'  socket: '/var/run/mysqld/mysqld.sock'  port: 3306  Percona Server (GPL), Release 79.1, Revision 1c589f9
2017/03/12 22:58:54 INFO manage Setting up replication.
2017-03-12 22:58:54 58858 [Warning] Neither --relay-log nor --relay-log-index were used; so replication may break when this MySQL server acts as a slave and has his hostname changed!! Please use '--relay-log=mysqld-relay-bin' to avoid this problem.
2017-03-12 22:58:54 58858 [Note] 'CHANGE MASTER TO executed'. Previous state master_host='', master_port= 3306, master_log_file='', master_log_pos= 4, master_bind=''. New state master_host='192.168.129.171', master_port= 3306, master_log_file='', master_log_pos= 4, master_bind=''.
2017-03-12 22:58:54 58858 [Warning] Storing MySQL user name or password information in the master info repository is not secure and is therefore not recommended. Please consider using the USER and PASSWORD connection options for START SLAVE; see the 'START SLAVE Syntax' in the MySQL Manual for more information.
2017-03-12 22:58:54 58858 [Warning] Slave SQL: If a crash happens this configuration does not guarantee that the relay log info will be consistent, Error_code: 0
2017-03-12 22:58:54 58858 [Note] Slave SQL thread initialized, starting replication in log 'FIRST' at position 0, relay log './mysqld-relay-bin.000001' position: 4
2017-03-12 22:58:54 58858 [Note] Slave I/O thread: connected to master 'repluser@192.168.129.171:3306',replication started in log 'FIRST' at position 4
2017/03/12 22:58:54     2017/03/12 22:58:54 [ERR] http: Request PUT /v1/agent/check/pass/mysql-3709f21efad3?note=ok, error: CheckID "mysql-3709f21efad3" does not have associated TTL from=127.0.0.1:45860
2017/03/12 22:58:54 Unexpected response code: 500 (CheckID "mysql-3709f21efad3" does not have associated TTL)
Service not registered, registering...
2017/03/12 22:58:54     2017/03/12 22:58:54 [INFO] agent: Synced service 'mysql-3709f21efad3'
2017/03/12 22:58:54     2017/03/12 22:58:54 [INFO] agent: Synced check 'mysql-3709f21efad3'
2017/03/12 22:58:54     2017/03/12 22:58:54 [INFO] agent: Synced check 'mysql-3709f21efad3'
2017/03/12 23:04:39 INFO manage [on_change] Executing failover with candidates: [u'192.168.129.173']
2017-03-12 23:04:39 58858 [Note] Error reading relay log event: slave SQL thread was killed
2017-03-12 23:04:39 58858 [Note] Slave I/O thread killed while reading event
2017-03-12 23:04:39 58858 [Note] Slave I/O thread exiting, read up to log 'mysql-bin.000001', position 1209
2017/03/12 23:04:39 WARNING: Using a password on the command line interface can be insecure.
2017/03/12 23:04:39 # Checking privileges.
2017/03/12 23:04:39 # Checking privileges on candidates.
2017/03/12 23:04:39 # Performing failover.
2017/03/12 23:04:39 # Candidate slave 192.168.129.173:3306 will become the new master.
2017/03/12 23:04:39 # Checking slaves status (before failover).
2017/03/12 23:04:39 # Preparing candidate for failover.
2017/03/12 23:04:39 # Creating replication user if it does not exist.
2017/03/12 23:04:39 # Stopping slaves.
2017/03/12 23:04:39 # Performing STOP on all slaves.
2017/03/12 23:04:39 # Switching slaves to new master.
2017/03/12 23:04:39 # Disconnecting new master as slave.
2017/03/12 23:04:39 # Starting slaves.
2017/03/12 23:04:39 # Performing START on all slaves.
2017/03/12 23:04:39 # Checking slaves for errors.
2017/03/12 23:04:39 # Failover complete.
2017/03/12 23:04:39 #
2017/03/12 23:04:39 # Replication Topology Health:
2017/03/12 23:04:39 +------------------+-------+---------+--------+------------+---------+
2017/03/12 23:04:39 | host             | port  | role    | state  | gtid_mode  | health  |
2017/03/12 23:04:39 +------------------+-------+---------+--------+------------+---------+
2017/03/12 23:04:39 | 192.168.129.173  | 3306  | MASTER  | UP     | ON         | OK      |
2017/03/12 23:04:39 +------------------+-------+---------+--------+------------+---------+
2017/03/12 23:04:39 # ...done.
2017/03/12 23:04:40 ERROR manage [on_change] this node is neither primary or replica after failover; check replication status on cluster.
2017/03/12 23:04:44 ERROR manage Cannot determine MySQL state; failing health check.
2017/03/12 23:04:45     2017/03/12 23:04:45 [INFO] memberlist: Suspect a5b06f708612 has failed, no acks received
2017/03/12 23:04:48     2017/03/12 23:04:48 [INFO] memberlist: Suspect a5b06f708612 has failed, no acks received
2017/03/12 23:04:49 ERROR manage Cannot determine MySQL state; failing health check.
2017/03/12 23:04:50     2017/03/12 23:04:50 [INFO] serf: EventMemberFailed: a5b06f708612 192.168.129.171
2017/03/12 23:04:50     2017/03/12 23:04:50 [INFO] memberlist: Suspect a5b06f708612 has failed, no acks received
2017/03/12 23:04:54 ERROR manage Cannot determine MySQL state; failing health check.
2017/03/12 23:04:59 ERROR manage Cannot determine MySQL state; failing health check.
2017/03/12 23:05:04 ERROR manage Cannot determine MySQL state; failing health check.
2017/03/12 23:05:04     2017/03/12 23:05:04 [WARN] agent: Check 'mysql-3709f21efad3' missed TTL, is now critical
2017/03/12 23:05:04     2017/03/12 23:05:04 [INFO] agent: Synced check 'mysql-3709f21efad3'
2017/03/12 23:05:09 ERROR manage Cannot determine MySQL state; failing health check.
2017/03/12 23:05:14 ERROR manage Cannot determine MySQL state; failing health check.
2017/03/12 23:05:19 ERROR manage Cannot determine MySQL state; failing health check.
2017/03/12 23:05:24 ERROR manage Cannot determine MySQL state; failing health check.
2017/03/12 23:05:29 ERROR manage Cannot determine MySQL state; failing health check.
2017/03/12 23:05:34 ERROR manage Cannot determine MySQL state; failing health check.
2017/03/12 23:05:39 ERROR manage Cannot determine MySQL state; failing health check.
2017/03/12 23:05:44 ERROR manage Cannot determine MySQL state; failing health check.
2017/03/12 23:05:48     2017/03/12 23:05:48 [INFO] serf: attempting reconnect to a5b06f708612 192.168.129.171:8301
2017/03/12 23:05:49 ERROR manage Cannot determine MySQL state; failing health check.
2017/03/12 23:05:54 ERROR manage Cannot determine MySQL state; failing health check.
2017/03/12 23:05:59 ERROR manage Cannot determine MySQL state; failing health check.
2017/03/12 23:06:04 ERROR manage Cannot determine MySQL state; failing health check.
2017/03/12 23:06:09 ERROR manage Cannot determine MySQL state; failing health check.
2017/03/12 23:06:14 ERROR manage Cannot determine MySQL state; failing health check.
2017/03/12 23:06:19 ERROR manage Cannot determine MySQL state; failing health check.
2017/03/12 23:06:24 ERROR manage Cannot determine MySQL state; failing health check.
2017/03/12 23:06:29 ERROR manage Cannot determine MySQL state; failing health check.
2017/03/12 23:06:34 ERROR manage Cannot determine MySQL state; failing health check.
2017/03/12 23:06:39 ERROR manage Cannot determine MySQL state; failing health check.
2017/03/12 23:06:44 ERROR manage Cannot determine MySQL state; failing health check.
2017/03/12 23:06:49 ERROR manage Cannot determine MySQL state; failing health check.
2017/03/12 23:06:54 ERROR manage Cannot determine MySQL state; failing health check.
2017/03/12 23:06:58     2017/03/12 23:06:58 [INFO] serf: attempting reconnect to a5b06f708612 192.168.129.171:8301
2017/03/12 23:06:59 ERROR manage Cannot determine MySQL state; failing health check.
2017/03/12 23:07:04 ERROR manage Cannot determine MySQL state; failing health check.
2017/03/12 23:07:09 ERROR manage Cannot determine MySQL state; failing health check.
2017/03/12 23:07:14 ERROR manage Cannot determine MySQL state; failing health check.
2017/03/12 23:07:19 ERROR manage Cannot determine MySQL state; failing health check.
2017/03/12 23:07:24 ERROR manage Cannot determine MySQL state; failing health check.
2017/03/12 23:07:29 ERROR manage Cannot determine MySQL state; failing health check.
2017/03/12 23:07:34 ERROR manage Cannot determine MySQL state; failing health check.
2017/03/12 23:07:38     2017/03/12 23:07:38 [INFO] serf: attempting reconnect to a5b06f708612 192.168.129.171:8301
2017/03/12 23:07:39 ERROR manage Cannot determine MySQL state; failing health check.
2017/03/12 23:07:44 ERROR manage Cannot determine MySQL state; failing health check.
2017/03/12 23:07:49 ERROR manage Cannot determine MySQL state; failing health check.
2017/03/12 23:07:54 ERROR manage Cannot determine MySQL state; failing health check.
2017/03/12 23:07:59 ERROR manage Cannot determine MySQL state; failing health check.
2017/03/12 23:08:04 ERROR manage Cannot determine MySQL state; failing health check.
2017/03/12 23:08:09 ERROR manage Cannot determine MySQL state; failing health check.
2017/03/12 23:08:14 ERROR manage Cannot determine MySQL state; failing health check.
2017/03/12 23:08:19 ERROR manage Cannot determine MySQL state; failing health check.
2017/03/12 23:08:24 ERROR manage Cannot determine MySQL state; failing health check.
2017/03/12 23:08:29 ERROR manage Cannot determine MySQL state; failing health check.
2017/03/12 23:08:34 ERROR manage Cannot determine MySQL state; failing health check.
2017/03/12 23:08:39 ERROR manage Cannot determine MySQL state; failing health check.
2017/03/12 23:08:44 ERROR manage Cannot determine MySQL state; failing health check.
2017/03/12 23:08:49 ERROR manage Cannot determine MySQL state; failing health check.
2017/03/12 23:08:54 ERROR manage Cannot determine MySQL state; failing health check.
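Two separate things show up in that log. The early 500 ('CheckID "mysql-3709f21efad3" does not have associated TTL') just means a pass was sent before a TTL check existed for that service; the agent output right after shows the service and check being registered and synced. For reference, a minimal sketch of the register-then-pass sequence against the Consul agent HTTP API (the TTL value is illustrative; the IDs are taken from the log):

# Minimal sketch of registering a service with a TTL check and then passing it
# via the Consul agent HTTP API. Passing a check that was never registered is
# what produces the 500 above; the TTL value here is illustrative.
import requests

AGENT = "http://localhost:8500"
SERVICE_ID = "mysql-3709f21efad3"   # service/check ID as seen in the log above

# Register the service itself.
requests.put(AGENT + "/v1/agent/service/register",
             json={"ID": SERVICE_ID, "Name": "mysql", "Port": 3306}).raise_for_status()

# Register a TTL check tied to that service, reusing the same ID the log shows.
requests.put(AGENT + "/v1/agent/check/register",
             json={"ID": SERVICE_ID, "Name": "mysql", "ServiceID": SERVICE_ID,
                   "TTL": "30s"}).raise_for_status()

# Heartbeat; this must recur within the TTL window or the agent flips the
# check to critical ("missed TTL, is now critical" in the log).
requests.put(AGENT + "/v1/agent/check/pass/" + SERVICE_ID,
             params={"note": "ok"}).raise_for_status()

The later failure is the real problem: after on_change runs the failover, the node decides it is neither primary nor replica, every subsequent health check fails, and the check eventually misses its TTL and goes critical.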
tgross commented 7 years ago

@mbbender just wanted to check in on this -- I've been working heavily on ContainerPilot v3 the last few weeks, so I haven't had a chance to dive back into this problem, but it's on my list.

tgross commented 7 years ago

Just as an FYI, now that I've done some test/deploy improvement work to get the iteration speed up, I'm going to focus on this project for the next few weeks to get it fixed.

dperezsan commented 7 years ago

I have run into a similar issue. I spun up a primary and a slave and then deleted the primary MySQL node to see the slave take over. The failover itself works fine behind the scenes, as can be seen in the logs, but the ContainerPilot onChange handler seems to fail to inform Consul of the new state of either node.

Primary: 192.168.128.190
Replica: 192.168.128.191

...
2017/08/09 14:10:25 DEBUG manage [on_change] mysql.get_primary end: (4868, u'192.168.128.190')
2017/08/09 14:10:25 DEBUG manage [on_change] node.is_primary end: False
2017/08/09 14:10:25 DEBUG manage [on_change] consul.get_primary start
2017/08/09 14:10:25 DEBUG manage Starting new HTTP connection (1): localhost
2017/08/09 14:10:25 check.mysql exited without error
2017/08/09 14:10:25 event: {ExitSuccess check.mysql}
2017/08/09 14:10:25 check.mysql.Run end
2017/08/09 14:10:25 check.mysql.kill
2017/08/09 14:10:25 killing command 'check.mysql' at pid: 19152
2017/08/09 14:10:25 event: {StatusHealthy mysql}
2017/08/09 14:10:25 DEBUG manage http://localhost:8500 "GET /v1/health/service/mysql-primary?passing=1 HTTP/1.1" 200 2
2017/08/09 14:10:25 DEBUG manage []
2017/08/09 14:10:25 DEBUG manage [on_change] no primary from consul: No primary found
2017/08/09 14:10:25 DEBUG manage [on_change] consul.lock_failover start
2017/08/09 14:10:25 DEBUG manage [on_change] consul.get_session start
2017/08/09 14:10:25 DEBUG manage [on_change] consul.create_session start
2017/08/09 14:10:25 DEBUG manage http://localhost:8500 "PUT /v1/session/create HTTP/1.1" 200 45
2017/08/09 14:10:25 DEBUG manage [on_change] consul.create_session end: 858eb1df-9c51-8b54-7d4b-b12edffaf1c4
2017/08/09 14:10:25 DEBUG manage [on_change] consul.get_session end: 858eb1df-9c51-8b54-7d4b-b12edffaf1c4
2017/08/09 14:10:25 DEBUG manage [on_change] consul.lock start
2017/08/09 14:10:25 DEBUG manage http://localhost:8500 "PUT /v1/kv/FAILOVER_IN_PROGRESS?acquire=858eb1df-9c51-8b54-7d4b-b12edffaf1c4 HTTP/1.1" 200 4
2017/08/09 14:10:25 DEBUG manage [on_change] consul.lock end: True
2017/08/09 14:10:25 DEBUG manage [on_change] consul.lock_failover end
2017/08/09 14:10:25 DEBUG manage http://localhost:8500 "GET /v1/health/service/mysql?passing=1 HTTP/1.1" 200 802
2017/08/09 14:10:25 INFO manage [on_change] Executing failover with candidates: [u'192.168.128.191']
2017/08/09 14:10:25 DEBUG manage [on_change] mysql.failover start
2017/08/09 14:10:26 2017-08-09 14:10:26 14899 [Note] Error reading relay log event: slave SQL thread was killed
2017/08/09 14:10:26 2017-08-09 14:10:26 14899 [Note] Slave I/O thread killed during or after a reconnect done to recover from failed read
2017/08/09 14:10:26 2017-08-09 14:10:26 14899 [Note] Slave I/O thread exiting, read up to log 'mysql-bin.000002', position 151
2017/08/09 14:10:26 WARNING: Using a password on the command line interface can be insecure.
2017/08/09 14:10:26 # Checking privileges.
2017/08/09 14:10:26 # Checking privileges on candidates.
2017/08/09 14:10:26 # Performing failover.
2017/08/09 14:10:26 # Candidate slave 192.168.128.191:3306 will become the new master.
2017/08/09 14:10:26 # Checking slaves status (before failover).
2017/08/09 14:10:26 # Preparing candidate for failover.
2017/08/09 14:10:26 # Creating replication user if it does not exist.
2017/08/09 14:10:26 # Stopping slaves.
2017/08/09 14:10:26 # Performing STOP on all slaves.
2017/08/09 14:10:26 # Switching slaves to new master.
2017/08/09 14:10:26 # Disconnecting new master as slave.
2017/08/09 14:10:26 # Starting slaves.
2017/08/09 14:10:26 # Performing START on all slaves.
2017/08/09 14:10:26 # Checking slaves for errors.
2017/08/09 14:10:26 # Failover complete.
2017/08/09 14:10:26 #
2017/08/09 14:10:26 # Replication Topology Health:
2017/08/09 14:10:26 +------------------+-------+---------+--------+------------+---------+
2017/08/09 14:10:26 | host             | port  | role    | state  | gtid_mode  | health  |
2017/08/09 14:10:26 +------------------+-------+---------+--------+------------+---------+
2017/08/09 14:10:26 | 192.168.128.191  | 3306  | MASTER  | UP     | ON         | OK      |
2017/08/09 14:10:26 +------------------+-------+---------+--------+------------+---------+
2017/08/09 14:10:26 # ...done.
2017/08/09 14:10:26 DEBUG manage [on_change] mysql.failover end
2017/08/09 14:10:26 DEBUG manage [on_change] node.is_primary start
2017/08/09 14:10:26 DEBUG manage [on_change] mysql.get_primary start
2017/08/09 14:10:26 DEBUG manage [on_change] mysql.query start
2017/08/09 14:10:26 DEBUG manage show slave status ()
2017/08/09 14:10:26 DEBUG manage [on_change] mysql.query end: []
2017/08/09 14:10:26 DEBUG manage [on_change] mysql.query start
2017/08/09 14:10:26 DEBUG manage show slave hosts ()
2017/08/09 14:10:26 DEBUG manage [on_change] mysql.query end: []
2017/08/09 14:10:26 DEBUG manage could not determine primary via mysqld status: no prior replication setup found
2017/08/09 14:10:26 DEBUG manage [on_change] consul.get_primary start
2017/08/09 14:10:26 DEBUG manage http://localhost:8500 "GET /v1/health/service/mysql-primary?passing=1 HTTP/1.1" 200 2
2017/08/09 14:10:26 DEBUG manage []
2017/08/09 14:10:26 DEBUG manage could not determine primary via Consul: No primary found
2017/08/09 14:10:26 DEBUG manage [on_change] consul.read_lock start
2017/08/09 14:10:26 DEBUG manage http://localhost:8500 "GET /v1/kv/mysql-primary HTTP/1.1" 200 119
2017/08/09 14:10:26 DEBUG manage [on_change] consul.read_lock end: (None, None)
2017/08/09 14:10:26 DEBUG manage [on_change] node.is_primary end: False
2017/08/09 14:10:26 DEBUG manage [on_change] node.is_primary start
2017/08/09 14:10:26 DEBUG manage [on_change] mysql.get_primary start
2017/08/09 14:10:26 DEBUG manage [on_change] mysql.query start
2017/08/09 14:10:26 DEBUG manage show slave status ()
2017/08/09 14:10:26 DEBUG manage [on_change] mysql.query end: []
2017/08/09 14:10:26 DEBUG manage [on_change] mysql.query start
2017/08/09 14:10:26 DEBUG manage show slave hosts ()
2017/08/09 14:10:26 DEBUG manage [on_change] mysql.query end: []
2017/08/09 14:10:26 DEBUG manage could not determine primary via mysqld status: no prior replication setup found
2017/08/09 14:10:26 DEBUG manage [on_change] consul.get_primary start
2017/08/09 14:10:26 DEBUG manage http://localhost:8500 "GET /v1/health/service/mysql-primary?passing=1 HTTP/1.1" 200 2
2017/08/09 14:10:26 DEBUG manage []
2017/08/09 14:10:26 DEBUG manage could not determine primary via Consul: No primary found
2017/08/09 14:10:26 DEBUG manage [on_change] consul.read_lock start
2017/08/09 14:10:26 DEBUG manage http://localhost:8500 "GET /v1/kv/mysql-primary HTTP/1.1" 200 119
2017/08/09 14:10:26 DEBUG manage [on_change] consul.read_lock end: (None, None)
2017/08/09 14:10:26 DEBUG manage [on_change] node.is_primary end: False
2017/08/09 14:10:26 ERROR manage [on_change] this node is neither primary or replica after failover; check replication status on cluster.
2017/08/09 14:10:26 onChange exited with error: onChange: exit status 1
2017/08/09 14:10:26 event: {ExitFailed onChange}
2017/08/09 14:10:26 event: {Error onChange: exit status 1}
2017/08/09 14:10:26 onChange.Run end
2017/08/09 14:10:26 onChange.kill
2017/08/09 14:10:26 killing command 'onChange' at pid: 19153
2017/08/09 14:10:30 check.mysql.Run start
2017/08/09 14:10:30 DEBUG manage [health] node.health start
2017/08/09 14:10:30 DEBUG manage [health] node.assert_initialized_for_state start
2017/08/09 14:10:30 DEBUG manage [health] node.assert_initialized_for_state end: True
2017/08/09 14:10:30 DEBUG manage [health] node.is_primary start
2017/08/09 14:10:30 DEBUG manage [health] mysql.get_primary start
2017/08/09 14:10:30 DEBUG manage [health] mysql.query start
2017/08/09 14:10:30 DEBUG manage [health] mysql.wait_for_connection start
2017/08/09 14:10:30 DEBUG manage [health] mysql.wait_for_connection end
2017/08/09 14:10:30 DEBUG manage show slave status ()
2017/08/09 14:10:30 DEBUG manage [health] mysql.query end: []
2017/08/09 14:10:30 DEBUG manage [health] mysql.query start
2017/08/09 14:10:30 DEBUG manage show slave hosts ()
2017/08/09 14:10:30 DEBUG manage [health] mysql.query end: []
2017/08/09 14:10:30 DEBUG manage could not determine primary via mysqld status: no prior replication setup found
2017/08/09 14:10:30 DEBUG manage [health] consul.get_primary start
2017/08/09 14:10:30 DEBUG manage Starting new HTTP connection (1): localhost
2017/08/09 14:10:30 DEBUG manage http://localhost:8500 "GET /v1/health/service/mysql-primary?passing=1 HTTP/1.1" 200 2
2017/08/09 14:10:30 DEBUG manage []
2017/08/09 14:10:30 DEBUG manage could not determine primary via Consul: No primary found
2017/08/09 14:10:30 DEBUG manage [health] consul.read_lock start
2017/08/09 14:10:30 DEBUG manage http://localhost:8500 "GET /v1/kv/mysql-primary HTTP/1.1" 200 119
2017/08/09 14:10:30 DEBUG manage [health] consul.read_lock end: (None, None)
2017/08/09 14:10:30 DEBUG manage [health] node.is_primary end: False
2017/08/09 14:10:30 DEBUG manage [health] node.is_primary start
2017/08/09 14:10:30 DEBUG manage [health] mysql.get_primary start
2017/08/09 14:10:30 DEBUG manage [health] mysql.query start
2017/08/09 14:10:30 DEBUG manage show slave status ()
2017/08/09 14:10:30 DEBUG manage [health] mysql.query end: []
2017/08/09 14:10:30 DEBUG manage [health] mysql.query start
2017/08/09 14:10:30 DEBUG manage show slave hosts ()
2017/08/09 14:10:30 DEBUG manage [health] mysql.query end: []
2017/08/09 14:10:30 DEBUG manage could not determine primary via mysqld status: no prior replication setup found
2017/08/09 14:10:30 DEBUG manage [health] consul.get_primary start
2017/08/09 14:10:30 DEBUG manage http://localhost:8500 "GET /v1/health/service/mysql-primary?passing=1 HTTP/1.1" 200 2
2017/08/09 14:10:30 DEBUG manage []
2017/08/09 14:10:30 DEBUG manage could not determine primary via Consul: No primary found
2017/08/09 14:10:30 DEBUG manage [health] consul.read_lock start
2017/08/09 14:10:30 DEBUG manage http://localhost:8500 "GET /v1/kv/mysql-primary HTTP/1.1" 200 119
2017/08/09 14:10:30 DEBUG manage [health] consul.read_lock end: (None, None)
2017/08/09 14:10:30 DEBUG manage [health] node.is_primary end: False
2017/08/09 14:10:30 ERROR manage Cannot determine MySQL state; failing health check.
2017/08/09 14:10:30 check.mysql exited with error: check.mysql: exit status 1
2017/08/09 14:10:30 event: {ExitFailed check.mysql}
2017/08/09 14:10:30 event: {Error check.mysql: exit status 1}
2017/08/09 14:10:30 check.mysql.Run end
2017/08/09 14:10:30 check.mysql.kill
2017/08/09 14:10:30 killing command 'check.mysql' at pid: 19213
2017/08/09 14:10:30 event: {StatusUnhealthy mysql}
2017/08/09 14:10:35 check.mysql.Run start
2017/08/09 14:10:35 DEBUG manage [health] node.health start
2017/08/09 14:10:35 DEBUG manage [health] node.assert_initialized_for_state start
2017/08/09 14:10:35 DEBUG manage [health] node.assert_initialized_for_state end: True
2017/08/09 14:10:35 DEBUG manage [health] node.is_primary start
2017/08/09 14:10:35 DEBUG manage [health] mysql.get_primary start
2017/08/09 14:10:35 DEBUG manage [health] mysql.query start
2017/08/09 14:10:35 DEBUG manage [health] mysql.wait_for_connection start
2017/08/09 14:10:35 DEBUG manage [health] mysql.wait_for_connection end
2017/08/09 14:10:35 DEBUG manage show slave status ()
2017/08/09 14:10:35 DEBUG manage [health] mysql.query end: []
2017/08/09 14:10:35 DEBUG manage [health] mysql.query start
2017/08/09 14:10:35 DEBUG manage show slave hosts ()
2017/08/09 14:10:35 DEBUG manage [health] mysql.query end: []
2017/08/09 14:10:35 DEBUG manage could not determine primary via mysqld status: no prior replication setup found
2017/08/09 14:10:35 DEBUG manage [health] consul.get_primary start
2017/08/09 14:10:35 DEBUG manage Starting new HTTP connection (1): localhost
2017/08/09 14:10:35 DEBUG manage http://localhost:8500 "GET /v1/health/service/mysql-primary?passing=1 HTTP/1.1" 200 2
2017/08/09 14:10:35 DEBUG manage []
2017/08/09 14:10:35 DEBUG manage could not determine primary via Consul: No primary found
2017/08/09 14:10:35 DEBUG manage [health] consul.read_lock start
2017/08/09 14:10:35 DEBUG manage http://localhost:8500 "GET /v1/kv/mysql-primary HTTP/1.1" 200 119
2017/08/09 14:10:35 DEBUG manage [health] consul.read_lock end: (None, None)
2017/08/09 14:10:35 DEBUG manage [health] node.is_primary end: False
2017/08/09 14:10:35 DEBUG manage [health] node.is_primary start
2017/08/09 14:10:35 DEBUG manage [health] mysql.get_primary start
2017/08/09 14:10:35 DEBUG manage [health] mysql.query start
2017/08/09 14:10:35 DEBUG manage show slave status ()
2017/08/09 14:10:35 DEBUG manage [health] mysql.query end: []
2017/08/09 14:10:35 DEBUG manage [health] mysql.query start
2017/08/09 14:10:35 DEBUG manage show slave hosts ()
2017/08/09 14:10:35 DEBUG manage [health] mysql.query end: []
2017/08/09 14:10:35 DEBUG manage could not determine primary via mysqld status: no prior replication setup found
2017/08/09 14:10:35 DEBUG manage [health] consul.get_primary start
2017/08/09 14:10:35 DEBUG manage http://localhost:8500 "GET /v1/health/service/mysql-primary?passing=1 HTTP/1.1" 200 2
2017/08/09 14:10:35 DEBUG manage []
2017/08/09 14:10:35 DEBUG manage could not determine primary via Consul: No primary found
2017/08/09 14:10:35 DEBUG manage [health] consul.read_lock start
2017/08/09 14:10:35 DEBUG manage http://localhost:8500 "GET /v1/kv/mysql-primary HTTP/1.1" 200 119
2017/08/09 14:10:35 DEBUG manage [health] consul.read_lock end: (None, None)
2017/08/09 14:10:35 DEBUG manage [health] node.is_primary end: False
2017/08/09 14:10:35 ERROR manage Cannot determine MySQL state; failing health check.
2017/08/09 14:10:35 check.mysql exited with error: check.mysql: exit status 1
2017/08/09 14:10:35 event: {ExitFailed check.mysql}
2017/08/09 14:10:35 event: {Error check.mysql: exit status 1}
2017/08/09 14:10:35 check.mysql.Run end
2017/08/09 14:10:35 check.mysql.kill
2017/08/09 14:10:35 killing command 'check.mysql' at pid: 19251
2017/08/09 14:10:35 event: {StatusUnhealthy mysql}
...
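For what it's worth, the log makes the decision order visible: after the failover the handler first asks mysqld itself (show slave status / show slave hosts), then Consul's health catalog for a passing mysql-primary, then the mysql-primary KV lock, and every step comes back empty on the freshly promoted node. A rough sketch of that decision order (the function and names are illustrative, not the real manage.py API):

# Rough sketch of the role decision visible in the log above. The arguments
# stand in for the results of "show slave status", "show slave hosts", the
# Consul health query, and the mysql-primary KV lock; names are illustrative.
def decide_role(slave_status_rows, slave_hosts_rows, consul_primary, kv_lock_holder, my_name):
    # 1. mysqld's own view of replication
    if slave_status_rows:
        return "replica"      # we are replicating from another node
    if slave_hosts_rows:
        return "primary"      # other nodes are replicating from us

    # 2. Consul health catalog: GET /v1/health/service/mysql-primary?passing=1
    if consul_primary is not None:
        return "primary" if consul_primary == my_name else "replica"

    # 3. The mysql-primary KV lock
    if kv_lock_holder is not None:
        return "primary" if kv_lock_holder == my_name else "replica"

    return "unknown"

# In this log every source is empty ([], "No primary found", (None, None)),
# so the new primary lands on "unknown" and on_change exits with an error.
print(decide_role([], [], None, None, "mysql-191"))   # -> unknown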
dfredell commented 7 years ago

Hello,

I too am having this same issue. I have narrowed it down to a change made on 1/16/17 in 9f5fa6a: it causes node.is_primary() in on_change to return False because kv/mysql-primary still points at the dead node, which then stops the propagation of becoming master. I'm working on a fix and will open a pull request when (if) I fix it.

Steps: configure the root password in the environment, spin up one node that restores from a Minio backup, add a secondary node, then 'docker rm -f' the first node.

In the log below you can see that this node still thinks mysql-3ff489835d9b is the master, even though it should be itself.

2017/09/28 18:56:50 WARNING: Using a password on the command line interface can be insecure.
2017/09/28 18:56:50 # Checking privileges.
2017/09/28 18:56:50 # Checking privileges on candidates.
2017/09/28 18:56:50 # Performing failover.
2017/09/28 18:56:50 # Candidate slave 192.168.128.219:3306 will become the new master.
2017/09/28 18:56:50 # Checking slaves status (before failover).
2017/09/28 18:56:50 # Preparing candidate for failover.
2017/09/28 18:56:50 # Creating replication user if it does not exist.
2017/09/28 18:56:50 # Stopping slaves.
2017/09/28 18:56:50 # Performing STOP on all slaves.
2017/09/28 18:56:50 # Switching slaves to new master.
2017/09/28 18:56:50 # Disconnecting new master as slave.
2017/09/28 18:56:50 # Starting slaves.
2017/09/28 18:56:50 # Performing START on all slaves.
2017/09/28 18:56:50 # Checking slaves for errors.
2017/09/28 18:56:50 # Failover complete.
2017/09/28 18:56:50 #
2017/09/28 18:56:50 # Replication Topology Health:
2017/09/28 18:56:50 +------------------+-------+---------+--------+------------+---------+
2017/09/28 18:56:50 | host             | port  | role    | state  | gtid_mode  | health  |
2017/09/28 18:56:50 +------------------+-------+---------+--------+------------+---------+
2017/09/28 18:56:50 | 192.168.128.219  | 3306  | MASTER  | UP     | ON         | OK      |
2017/09/28 18:56:50 +------------------+-------+---------+--------+------------+---------+
2017/09/28 18:56:50 # ...done.
2017/09/28 18:56:50 DEBUG manage [on_change] mysql.failover end
2017/09/28 18:56:50 DEBUG manage [on_change] node.is_primary start
2017/09/28 18:56:50 DEBUG manage state: UNASSIGNED
2017/09/28 18:56:50 DEBUG manage [on_change] mysql.get_primary start
2017/09/28 18:56:50 DEBUG manage [on_change] mysql.query start
2017/09/28 18:56:50 DEBUG manage show slave status ()
2017/09/28 18:56:50 DEBUG manage [on_change] mysql.query end: []
2017/09/28 18:56:50 DEBUG manage [on_change] mysql.query start
2017/09/28 18:56:50 DEBUG manage show slave hosts ()
2017/09/28 18:56:50 DEBUG manage [on_change] mysql.query end: []
2017/09/28 18:56:50 DEBUG manage could not determine primary via mysqld status: no prior replication setup found
2017/09/28 18:56:50 DEBUG manage [on_change] consul.get_primary start
2017/09/28 18:56:50 DEBUG manage http://consul:8500 "GET /v1/health/service/mysql-primary?passing=1 HTTP/1.1" 200 2
2017/09/28 18:56:50 DEBUG manage []
2017/09/28 18:56:50 DEBUG manage could not determine primary via Consul: No primary found
2017/09/28 18:56:50 DEBUG manage [on_change] consul.read_lock start
2017/09/28 18:56:50 DEBUG manage http://consul:8500 "GET /v1/kv/mysql-primary HTTP/1.1" 200 173
2017/09/28 18:56:50 DEBUG manage [on_change] consul.read_lock end: (u'47848a05-e8cf-c172-6f0f-2e8464171a25', 'mysql-3ff489835d9b')
2017/09/28 18:56:50 DEBUG manage primary_name: mysql-3ff489835d9b
2017/09/28 18:56:50 DEBUG manage [on_change] node.is_primary end: False
2017/09/28 18:56:50 DEBUG manage [on_change] node.is_primary start
2017/09/28 18:56:50 DEBUG manage state: UNASSIGNED
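For reference, the check that trips here is the KV fallback: consul.read_lock comes back with the old session and mysql-3ff489835d9b, that name is compared against this node's own name, and is_primary() returns False. A minimal sketch of reading that lock and making the comparison (assumes a Consul agent at http://consul:8500 as in the log; the name construction is illustrative):

# Minimal sketch of the KV-lock fallback that returns the stale primary name.
# Assumes a Consul agent at http://consul:8500 as in the log above.
import base64
import socket
import requests

CONSUL = "http://consul:8500"

def read_primary_lock():
    """Return (session, value) for the mysql-primary KV entry, as consul.read_lock does."""
    resp = requests.get(CONSUL + "/v1/kv/mysql-primary")
    if resp.status_code == 404:
        return None, None
    resp.raise_for_status()
    entry = resp.json()[0]
    value = base64.b64decode(entry["Value"]).decode() if entry.get("Value") else None
    return entry.get("Session"), value

session, lock_holder = read_primary_lock()
my_name = "mysql-" + socket.gethostname()   # how the service IDs in the logs look (illustrative)

# With the dead node still holding mysql-primary in the KV store, this is False
# on the newly promoted node, which is exactly the condition described above.
print(lock_holder == my_name)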