autopilotpattern / mysql

Implementation of the autopilot pattern for MySQL
Mozilla Public License 2.0

Failover lost master #89

Open dfredell opened 7 years ago

dfredell commented 7 years ago

I found a scenario where the cluster loses its master.

It occurred when:

  1. I had 3 nodes running healthily, remote consul, static root password
  2. I killed the master
  3. Failover started on node 37
  4. mysqlrpladmin on 37 decided that 36 should be the master
  5. 36 detected that it is the new master
  6. 36 created a new containerpilot.json with the service 'mysql-primary'
  7. 36 then ran containerpilot -reload
  8. The reload caused mysql to stop and start
  9. When mysql came back up, it had no record of which node is the primary
  10. Reading /v1/kv/mysql-primary from consul also returns no result
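
For anyone trying to reproduce the last step, here is a minimal sketch of checking the key over Consul's HTTP KV API. It assumes Consul is reachable at http://localhost:8500 (adjust for a remote agent) and that the key's value, when present, is a plain UTF-8 string such as the primary's hostname. The function names `decode_kv_response` and `read_primary` are made up for illustration and are not part of this repo.

```python
import base64
import json
import urllib.error
import urllib.request


def decode_kv_response(body):
    """Return the decoded value from a Consul KV GET response body, or None.

    Consul returns a JSON array of entries whose "Value" is base64-encoded.
    """
    entries = json.loads(body) if body else []
    if not entries or entries[0].get("Value") is None:
        return None
    return base64.b64decode(entries[0]["Value"]).decode("utf-8")


def read_primary(consul_url="http://localhost:8500", key="mysql-primary"):
    """Fetch the key from Consul; a 404 means the key does not exist.

    consul_url is an assumption, point it at your own agent.
    """
    url = f"{consul_url}/v1/kv/{key}"
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return decode_kv_response(resp.read())
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return None  # key is missing, which matches step 10 above
        raise
```

Calling `read_primary()` right after step 8 and getting `None` back would confirm that the key is gone rather than just stale.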

Servers in failover.log:

  1. docker compose name: mysql_4 hostname: mysql-37f99a0a7a84 IP:192.168.128.236
  2. docker compose name: mysql_5 hostname: mysql-363deb257281 IP:192.168.128.235

Failover works fine when the node that acquires the failover lock is also the node that mysqlrpladmin picks as the new master.

dfredell commented 7 years ago

Node 36 is assigned service/mysql-primary, then restarts because of containerpilot -reload. After the restart it no longer remembers which node is supposed to be master or where the other nodes are.