chusiang / mysql-master-master

Automatically exported from code.google.com/p/mysql-master-master
GNU General Public License v2.0

Shutdown of active master results in irrecoverable state #4

Closed. GoogleCodeExporter closed this issue 9 years ago.

GoogleCodeExporter commented 9 years ago
There seems to be a problem leading to a stuck state.  First, the cluster
is up and stable:

# ./mmm_control show
Config file: mmm_mon.conf
Daemon is running!
Servers status:
  test01(10.0.1.91): master/ONLINE. Roles: writer(10.0.1.15;)
  test02(10.0.1.92): master/ONLINE. Roles: reader(10.0.1.11;)
  test03(10.0.1.93): slave/ONLINE. Roles: reader(10.0.1.12;)

Then I perform an orderly shutdown on test01.  A few moments later:

# ./mmm_control show
Config file: mmm_mon.conf
Daemon is running!
Servers status:
  test01(10.0.1.91): master/HARD_OFFLINE. Roles: None
  test02(10.0.1.92): master/REPLICATION_FAIL. Roles: None
  test03(10.0.1.93): slave/ONLINE. Roles: reader(10.0.1.11;), reader(10.0.1.12;)

Trying to manually recover:

# ./mmm_control set_online test02
Config file: mmm_mon.conf
Daemon is running!
Command sent to monitoring host. Result: ERROR: This server is 'REPLICATION_FAIL' now. It can't be switched to online.

This state cannot be changed without either restarting mmmd_mon or rebooting test01 and setting it online. From the mmm-traps.log:

[2007-04-11 09:13:05]: 3563: Check: CHECK_FAIL('test01', 'mysql')
[2007-04-11 09:13:06]: 3563: Check: CHECK_FAIL('test03', 'rep_threads')
[2007-04-11 09:13:06]: 3563: Check: CHECK_FAIL('test02', 'rep_threads')
[2007-04-11 09:13:06]: 3563: Daemon: State change(test02): ONLINE -> REPLICATION_FAIL
[2007-04-11 09:13:06]: 3563: Daemon: State change(test01): ONLINE -> HARD_OFFLINE

I can recreate this every time so far.

Original issue reported on code.google.com by bn-s...@nesbitt.org on 11 Apr 2007 at 7:44

GoogleCodeExporter commented 9 years ago
Correction: a restart of mmmd_mon alone does not correct the problem. The state only clears after either bringing test01 back online, or removing the var/mmmd.status file, restarting mmmd_mon, and setting all hosts online.

Original comment by bn-s...@nesbitt.org on 11 Apr 2007 at 7:56
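
For reference, a rough automation of the manual recovery described in the comment above. This is only a sketch: it assumes mmmd_mon and mmm_control are run from the MMM install directory (the path below is an assumption), that the monitor can simply be killed and restarted, and it uses the host names from this thread.

# recovery sketch -- paths, process handling, and timings are assumptions
import os
import subprocess
import time

MMM_DIR = "/usr/local/mmm"                      # assumed install directory
HOSTS = ["test01", "test02", "test03"]

os.chdir(MMM_DIR)
subprocess.call(["killall", "mmmd_mon"])        # stop the stuck monitor
status_file = os.path.join("var", "mmmd.status")
if os.path.exists(status_file):
    os.remove(status_file)                      # discard the stale state file
subprocess.Popen(["./mmmd_mon"])                # start a fresh monitor
time.sleep(5)                                   # give it a moment to come up
for host in HOSTS:
    subprocess.call(["./mmm_control", "set_online", host])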

GoogleCodeExporter commented 9 years ago
It should ignore replication errors if the peer server is down and we know about its status. Please show me the hosts section of your config file.

Original comment by kovy...@gmail.com on 11 Apr 2007 at 8:02

GoogleCodeExporter commented 9 years ago
Is this what you need to see?

# Cluster hosts addresses and access params
host test01
    ip 10.0.1.91
    port 3306
    user root
    password
    mode master
    peer test02

host test02
    ip 10.0.1.92
    port 3306
    user root
    password
    mode master
    peer test01

host test03
    ip 10.0.1.93
    port 3306
    user root
    password
    mode slave

#
# Define roles
#

active_master_role writer

# Mysql Reader role
role reader
    mode balanced
    servers test01, test02, test03
    ip 10.0.1.11, 10.0.1.12

# Mysql Writer role
role writer
    mode exclusive
    servers test01, test02
    ip 10.0.1.15

Original comment by bn-s...@nesbitt.org on 11 Apr 2007 at 8:09

GoogleCodeExporter commented 9 years ago
You've found a race condition in the state-management code:
1) the mmm monitoring code noticed the replication failure on test02 and put it into the REPLICATION_FAIL state (a reasonable action)
2) then it noticed that test01 was dead and put it into HARD_OFFLINE

If these events had occurred in the reverse order, test02 would not have been turned off.

In revision 15 I've added a small fix (it still needs testing) that switches a server back out of the REPLICATION failure states when its peer is down. It is a slightly controversial solution, because if something else happened and server A went offline while server B was already in a replication failure state, then server B would be switched to online, which is not a good decision.

I need to think about how to deal with this problem correctly. Maybe it would be better to add such verification to mmm_control and let the admin set a server's status to ONLINE from a REPLICATION_* state when its peer is down? What do you think?

Original comment by kovy...@gmail.com on 11 Apr 2007 at 9:08
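
To make the fix described above concrete, here is a minimal sketch (in Python, not MMM's actual Perl code) of the peer-aware rule from r15: a master stuck in one of the REPLICATION_* states is moved back to ONLINE only when the monitor already knows that its peer is HARD_OFFLINE. The state names and host/peer layout come from this thread; everything else is illustrative.

# Minimal model of the r15 behaviour: clear a REPLICATION_* state when the
# peer master is already known to be HARD_OFFLINE.
ONLINE = "ONLINE"
HARD_OFFLINE = "HARD_OFFLINE"
REPLICATION_FAIL = "REPLICATION_FAIL"
REPLICATION_DELAY = "REPLICATION_DELAY"

def recover_replication_states(state, peers):
    """Return the hosts whose REPLICATION_* state was cleared because
    their peer is HARD_OFFLINE; mutates `state` in place."""
    recovered = []
    for host, peer in peers.items():
        if state.get(host) in (REPLICATION_FAIL, REPLICATION_DELAY) \
                and state.get(peer) == HARD_OFFLINE:
            state[host] = ONLINE
            recovered.append(host)
    return recovered

# The states from the original report, with test01 <-> test02 as peers:
state = {"test01": HARD_OFFLINE, "test02": REPLICATION_FAIL, "test03": ONLINE}
peers = {"test01": "test02", "test02": "test01"}
print(recover_replication_states(state, peers))   # ['test02']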

GoogleCodeExporter commented 9 years ago
Ultimately, such scenarios have always been a stumbling block. Once the master is down, the cluster is effectively offline. The point of the management software is to take action immediately instead of waiting for an administrator. Making the wrong decision is bad, but "punting" on a common failure mode is almost as bad. Perhaps look deeper and determine why replication is down: if it is due to a failure to connect to the master *and* the master is HARD_OFFLINE, then the solution is obvious; otherwise there might be a need to consult an administrator. This may be another argument for having additional, alternate masters (or slaves which could be promoted to master) available to take over in such a situation.

Original comment by bn-s...@nesbitt.org on 12 Apr 2007 at 2:27
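
The stricter rule suggested in the comment above can be sketched the same way: only auto-recover when the replication failure is specifically a failure to connect to the peer and that peer is already HARD_OFFLINE; anything else is left for an administrator. How the real MMM checks report the failure reason is not shown in this thread, so the string-based classification below is purely an assumption.

def should_auto_recover(server_state, peer_state, replication_error):
    """Auto-clear a REPLICATION_* state only in the unambiguous case:
    the server cannot connect to its master AND the monitor already
    considers that master HARD_OFFLINE. Everything else needs a human."""
    if server_state not in ("REPLICATION_FAIL", "REPLICATION_DELAY"):
        return False
    cannot_connect = "connect" in replication_error.lower()   # assumed classification
    return cannot_connect and peer_state == "HARD_OFFLINE"

# The scenario from the original report: test02 cannot reach test01,
# which the monitor has just declared HARD_OFFLINE.
print(should_auto_recover("REPLICATION_FAIL", "HARD_OFFLINE",
                          "error connecting to master '10.0.1.91:3306'"))   # True
# A genuine SQL error on an otherwise healthy pair should not be auto-cleared.
print(should_auto_recover("REPLICATION_FAIL", "ONLINE",
                          "Duplicate entry '42' for key 'PRIMARY'"))        # False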

GoogleCodeExporter commented 9 years ago
I have upgraded to rev 15 and the behavior has changed. Now, the detection of a failure has taken a minute and a half. Same scenario as before, I perform a shutdown on test01. MMM immediately detects a replication problem on 02 & 03, but takes another 1.5 mins to declare a failure on 01. It seems the 'mysql' check is not timing out quickly:

[2007-04-11 09:44:13]: 11887: Check: CHECK_FAIL('test03', 'rep_threads')
[2007-04-11 09:44:13]: 11887: Check: CHECK_FAIL('test02', 'rep_threads')
[2007-04-11 09:45:49]: 11887: Check: CHECK_FAIL('test01', 'mysql')
[2007-04-11 09:45:50]: 11887: Daemon: State change(test02): ONLINE -> REPLICATION_FAIL
[2007-04-11 09:45:51]: 11887: Daemon: State change(test01): ONLINE -> HARD_OFFLINE
[2007-04-11 09:45:58]: 11887: Daemon: State change(test02): REPLICATION_FAIL -> ONLINE
[2007-04-11 09:46:06]: 11887: Check: CHECK_OK('test03', 'rep_threads')

Original comment by bn-s...@nesbitt.org on 13 Apr 2007 at 3:30
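
One plausible explanation for the delay is that the 'mysql' check blocks on a TCP connection attempt with no short, explicit timeout, so it only fails once the operating system gives up. The real check is Perl code inside mmmd_mon and is not shown in this thread; the sketch below only illustrates a reachability probe with an explicit timeout, using test01's address from the config above.

import socket

def mysql_port_reachable(host, port=3306, timeout=2.0):
    """True if host:port accepts a TCP connection within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

print(mysql_port_reachable("10.0.1.91"))   # False within ~2s once test01 is down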

GoogleCodeExporter commented 9 years ago
The original issue was fixed in r15. The issue from comment 6 should be fixed in r23 and r27.

Original comment by m...@pascalhofmann.de on 1 Nov 2008 at 9:17