Status: Closed (by GoogleCodeExporter)
Correction: a restart of mmmd_mon alone does not correct the problem. Only bringing test01 back online, or removing the var/mmmd.status file, restarting mmmd_mon, and setting all hosts online, will clear the state.
Original comment by bn-s...@nesbitt.org on 11 Apr 2007 at 7:56
It should ignore replication errors if the peer server is down and we know about its status. Please show me the section of your config file with the host definitions.
Original comment by kovy...@gmail.com on 11 Apr 2007 at 8:02
Is this what you need to see?
# Cluster hosts addresses and access params
host test01
    ip        10.0.1.91
    port      3306
    user      root
    password
    mode      master
    peer      test02

host test02
    ip        10.0.1.92
    port      3306
    user      root
    password
    mode      master
    peer      test01

host test03
    ip        10.0.1.93
    port      3306
    user      root
    password
    mode      slave

#
# Define roles
#
active_master_role  writer

# Mysql Reader role
role reader
    mode      balanced
    servers   test01, test02, test03
    ip        10.0.1.11, 10.0.1.12

# Mysql Writer role
role writer
    mode      exclusive
    servers   test01, test02
    ip        10.0.1.15
Original comment by bn-s...@nesbitt.org on 11 Apr 2007 at 8:09
You've found a race condition in the state-management code:
1) the mmm monitoring code noticed the replication failure on test02 and put it into the REPLICATION_FAIL state (a reasonable action)
2) then it noticed the test01 death and put it into HARD_OFFLINE
Had these events occurred in reverse order, test02 would not have been turned off.
In revision 15 I've added a small fix (it still needs testing) that switches a server back out of the REPLICATION_* failure states if its peer is down. It is a little bit controversial: if something happened and server A went offline while server B was in a replication failure state, then server B would be switched to ONLINE, which is not a good decision.
I need to think about how to deal with this problem correctly. Maybe it would be better to add such verification to mmm_control and let the admin set a server's status to ONLINE from a REPLICATION_* state when the peer is down? What do you think?
Original comment by kovy...@gmail.com on 11 Apr 2007 at 9:08
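The order-dependent behavior described above can be sketched in illustrative Python (this is not MMM's actual Perl code; the function and state names are assumptions based on the states mentioned in the thread). The key point of the r15-style fix is that a replication-check failure is discounted when the peer is already HARD_OFFLINE:

```python
# Hypothetical sketch of the state decision with the "peer is down" guard.
# States follow the names used in the thread: ONLINE, REPLICATION_FAIL,
# HARD_OFFLINE. Check names follow the log: 'mysql', 'rep_threads'.

def next_state(checks, peer_state):
    """Decide a host's next state.

    checks     -- dict of check name -> bool (True = passing)
    peer_state -- current state of the host's replication peer
    """
    if not checks.get("mysql", True):
        # The host itself is unreachable.
        return "HARD_OFFLINE"
    if not checks.get("rep_threads", True):
        if peer_state == "HARD_OFFLINE":
            # Replication failure is fully explained by the dead peer,
            # so do not demote this host (the controversial part).
            return "ONLINE"
        return "REPLICATION_FAIL"
    return "ONLINE"

# The race: the same two check failures give different outcomes depending
# on whether the peer's HARD_OFFLINE state is known yet.
print(next_state({"rep_threads": False}, "ONLINE"))        # REPLICATION_FAIL
print(next_state({"rep_threads": False}, "HARD_OFFLINE"))  # ONLINE
```

Re-evaluating this decision whenever the peer's state changes is what removes the dependence on event ordering.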
Ultimately, such scenarios have always been a stumbling block. Once the master is down, the cluster is effectively offline. The point of the management software is to take action immediately instead of waiting for an administrator. Making the wrong decision is bad, but "punting" on a common failure mode is almost as bad. Perhaps look deeper and determine why replication is down: if it is due to a failure to connect to the master *and* the master is HARD_OFFLINE, then the solution is obvious; otherwise there might be a need to consult an administrator. This may be another argument for having additional, alternate masters (or slaves which could be promoted to master) available to take over in such a situation.
Original comment by bn-s...@nesbitt.org on 12 Apr 2007 at 2:27
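The stricter rule proposed above can be sketched in illustrative Python (again, not MMM's Perl; the function name is an assumption). Auto-recovery is allowed only when the slave's replication error is a failure to connect *and* the master is already known to be HARD_OFFLINE; any other error is left for an administrator:

```python
# Hypothetical sketch: auto-online a REPLICATION_FAIL host only when the
# failure is explained by the dead master. The codes are the standard MySQL
# client connect errors: 2003 = CR_CONN_HOST_ERROR, 2013 = CR_SERVER_LOST.
CONNECT_ERRORS = {2003, 2013}

def should_auto_online(last_io_errno, master_state):
    """Return True if the host may be returned to ONLINE automatically."""
    if master_state != "HARD_OFFLINE":
        # Master looks alive, so this is a real replication problem.
        return False
    return last_io_errno in CONNECT_ERRORS

print(should_auto_online(2003, "HARD_OFFLINE"))  # True: connect error, master down
print(should_auto_online(1062, "HARD_OFFLINE"))  # False: duplicate-key, needs an admin
```

On a real slave the error number would come from the replication status (e.g. the I/O thread's last error), but how MMM would obtain it is outside the scope of this sketch.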
I have upgraded to rev 15 and the behavior has changed. Detection of a failure now takes a minute and a half. Same scenario as before: I perform a shutdown on test01. MMM immediately detects a replication problem on 02 & 03, but takes another 1.5 minutes to declare a failure on 01. It seems the 'mysql' check is not timing out quickly:
[2007-04-11 09:44:13]: 11887: Check: CHECK_FAIL('test03', 'rep_threads')
[2007-04-11 09:44:13]: 11887: Check: CHECK_FAIL('test02', 'rep_threads')
[2007-04-11 09:45:49]: 11887: Check: CHECK_FAIL('test01', 'mysql')
[2007-04-11 09:45:50]: 11887: Daemon: State change(test02): ONLINE -> REPLICATION_FAIL
[2007-04-11 09:45:51]: 11887: Daemon: State change(test01): ONLINE -> HARD_OFFLINE
[2007-04-11 09:45:58]: 11887: Daemon: State change(test02): REPLICATION_FAIL -> ONLINE
[2007-04-11 09:46:06]: 11887: Check: CHECK_OK('test03', 'rep_threads')
Original comment by bn-s...@nesbitt.org on 13 Apr 2007 at 3:30
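The ~96-second gap between the rep_threads failures and the 'mysql' CHECK_FAIL in the log above is consistent with a connect attempt blocking until the OS-level TCP timeout. A liveness probe with an explicit short deadline fails fast instead; here is a minimal illustrative sketch in Python (not MMM's actual checker) using a plain TCP connect to the MySQL port:

```python
# Hypothetical sketch of a fast 'mysql' liveness probe: a TCP connect with
# a short explicit timeout, so a dead host is declared failed in seconds
# rather than after the default OS connect timeout.
import socket

def mysql_port_alive(host, port=3306, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable host, and timeout.
        return False
```

A full check would additionally complete the MySQL handshake, but even this TCP-level probe bounds the detection delay at the configured timeout.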
The original issue was fixed in r15. The issue from comment 6 should be fixed in r23 and r27.
Original comment by m...@pascalhofmann.de on 1 Nov 2008 at 9:17
Original issue reported on code.google.com by bn-s...@nesbitt.org on 11 Apr 2007 at 7:44