Closed by GoogleCodeExporter 8 years ago
This is expected behavior. The ignore_fail parameter works during master failover
(via masterha_manager, or when running masterha_master_switch
--master_state=dead manually), but does not work when *starting*
masterha_manager. In the scenarios below, ignore_fail should work.
- Start masterha_manager when all servers (including the master) are alive. After
MHA enters its steady state (pinging the master), kill both the ignore_fail-marked
slave and the master.
- Run masterha_master_switch --master_state=dead when the master and the
ignore_fail-marked slave are down.
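For reference, ignore_fail is set per server section in the MHA application
config. A minimal sketch, with section names and non-ignore_fail values being
illustrative assumptions rather than the reporter's actual config:

```ini
[server default]
manager_workdir=/home/mha_manager_data/app1
user=mha

[server1]
hostname=master-host

[server2]
hostname=slave-host-1
candidate_master=1
# Failover proceeds even if this slave is down at failover time.
ignore_fail=1

[server3]
hostname=slave-host-2
candidate_master=1
```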
Original comment by Yoshinor...@gmail.com
on 20 Nov 2012 at 8:25
Well, let me try to explain the question I have in mind.
Suppose there is a small cluster of 5 machines,
s1-s2-s3-s4-s5.
One of them is the master (read/write) and the others are slaves (read-only):
s1
+--s2
+--s3
+--s4
+--s5
s1 - master
I am thinking about scripting and automating the startup of masterha_manager. I
plan to use Pacemaker: I can tell Pacemaker to start masterha_manager on a
machine that is not the master, for example s2. If s1 dies,
masterha_manager performs the failover.
If s2 dies, Pacemaker sees it and starts masterha_manager on another
machine, e.g. s3, which continues monitoring the master and can fail over when
necessary. All machines in the cluster have an identical mha.conf file.
But with the current behavior of masterha_manager it is not possible to work in
that scenario: when s2 is dead and Pacemaker tries to start MHA on s3, MHA
cannot start, with the reason 'one of the slaves listed in the conf isn't alive'.
For me it does not matter that s2 is dead; there are still 3 fully working slaves.
I would want to start MHA monitoring and failover processing anyway in that
case.
s2 can be repaired and added back to the cluster without any difficulty a little later.
One more case: after MHA has handled the death of s1 and moved the master role
to s2, in the current implementation MHA terminates and exits to the console. So I
should start MHA again on another machine, e.g. s3, but it cannot start because s1 is
dead (all machines in the mha.conf file must be alive).
I think it would be good to have a variable such as 'ignore_fail_onstart' for handling
the death of one of the slaves. Or perhaps a variable like
'count_alive_slaves' that tells MHA how many slaves must be alive for it to keep
working.
PS: I also understand that I could change the mha.conf file after the death of any
machine, but that makes operating the cluster more complex: I would have to
monitor all machines and correct the conf every time one of them dies or comes back.
Original comment by obric...@balakam.com
on 21 Nov 2012 at 10:56
I can easily add either a command line argument or a conf file parameter to skip
checking failed servers on masterha_manager start. I think adding a command line
argument (i.e. --ignore-fail-on-start) makes more sense.
You can try the patch below if you want. It skips checking ignore_fail-marked
instances on startup.
--- lib/MHA/MasterMonitor.pm.old	2012-11-14 18:28:20.000000000 -0800
+++ lib/MHA/MasterMonitor.pm	2012-11-21 18:56:32.000000000 -0800
@@ -359,7 +359,7 @@ sub wait_until_master_is_unreachable() {
       sprintf( "Identified master is %s.", $current_master->get_hostinfo() )
     );
   }
-  $_server_manager->validate_num_alive_servers( $current_master, 0 );
+  $_server_manager->validate_num_alive_servers( $current_master, 1 );
   if ( check_master_ssh_env($current_master) ) {
     if ( check_master_binlog($current_master) ) {
       $log->error("Master configuration failed.");
Original comment by Yoshinor...@gmail.com
on 22 Nov 2012 at 3:01
Thanks for the patch, it works.
Now I have a new problem :)
I tested the following case. There are three machines with these roles:
172.16.50.14 (master)
+--172.16.50.11 (slave)
+--172.16.50.13 (slave)
In this test, before starting masterha_manager, I shut down MySQL on .13
and then started masterha_manager.
It starts fine and sees that .13 is dead.
It writes:
###############
###############
Tue Nov 27 17:35:28 2012 - [warning] Global configuration file
/etc/masterha_default.cnf not found. Skipping.
Tue Nov 27 17:35:28 2012 - [info] Reading application default configurations
from /etc/mha_manager/app1.cnf..
Tue Nov 27 17:35:28 2012 - [info] Reading server configurations from
/etc/mha_manager/app1.cnf..
Tue Nov 27 17:35:28 2012 - [info] MHA::MasterMonitor version 0.54.
Tue Nov 27 17:35:28 2012 - [info] Dead Servers:
Tue Nov 27 17:35:28 2012 - [info] 172.16.50.13(172.16.50.13:3306)
Tue Nov 27 17:35:28 2012 - [info] Alive Servers:
Tue Nov 27 17:35:28 2012 - [info] 172.16.50.11(172.16.50.11:3306)
Tue Nov 27 17:35:28 2012 - [info] 172.16.50.14(172.16.50.14:3306)
Tue Nov 27 17:35:28 2012 - [info] Alive Slaves:
Tue Nov 27 17:35:28 2012 - [info] 172.16.50.11(172.16.50.11:3306)
Version=5.5.28-MariaDB-log (oldest major version between slaves) log-bin:enabled
Tue Nov 27 17:35:28 2012 - [info] Replicating from
172.16.50.14(172.16.50.14:3306)
Tue Nov 27 17:35:28 2012 - [info] Primary candidate for the new Master
(candidate_master is set)
Tue Nov 27 17:35:28 2012 - [info] Current Alive Master:
172.16.50.14(172.16.50.14:3306)
Tue Nov 27 17:35:28 2012 - [info] Checking slave configurations..
Tue Nov 27 17:35:28 2012 - [info] read_only=1 is not set on slave
172.16.50.11(172.16.50.11:3306).
Tue Nov 27 17:35:28 2012 - [warning] relay_log_purge=0 is not set on slave
172.16.50.11(172.16.50.11:3306).
Tue Nov 27 17:35:28 2012 - [info] Checking replication filtering settings..
Tue Nov 27 17:35:28 2012 - [info] binlog_do_db= testdb, binlog_ignore_db=
Tue Nov 27 17:35:28 2012 - [info] Replication filtering check ok.
Tue Nov 27 17:35:28 2012 - [info] Starting SSH connection tests..
Tue Nov 27 17:35:29 2012 - [info] All SSH connection tests passed successfully.
Tue Nov 27 17:35:29 2012 - [info] Checking MHA Node version..
Tue Nov 27 17:35:30 2012 - [info] Version check ok.
Tue Nov 27 17:35:30 2012 - [info] Checking SSH publickey authentication
settings on the current master..
Tue Nov 27 17:35:30 2012 - [info] HealthCheck: SSH to 172.16.50.14 is reachable.
Tue Nov 27 17:35:30 2012 - [info] Master MHA Node version is 0.54.
Tue Nov 27 17:35:30 2012 - [info] Checking recovery script configurations on
the current master..
Tue Nov 27 17:35:30 2012 - [info] Executing command: save_binary_logs
--command=test --start_pos=4 --binlog_dir=/home/mysqldata/
--output_file=/home/mha_manager_data/app1/save_binary_logs_test
--manager_version=0.54 --start_file=mysql-bin.000013
Tue Nov 27 17:35:30 2012 - [info] Connecting to
mha4mysql@172.16.50.14(172.16.50.14)..
Creating /home/mha_manager_data/app1 if not exists.. ok.
Checking output directory is accessible or not..
ok.
Binlog found at /home/mysqldata/, up to mysql-bin.000013
Tue Nov 27 17:35:30 2012 - [info] Master setting check done.
Tue Nov 27 17:35:30 2012 - [info] Checking SSH publickey authentication and
checking recovery script configurations on all alive slave servers..
Tue Nov 27 17:35:30 2012 - [info] Executing command : apply_diff_relay_logs
--command=test --slave_user='mha' --slave_host=172.16.50.11
--slave_ip=172.16.50.11 --slave_port=3306 --workdir=/home/mha_manager_data/app1
--target_version=5.5.28-MariaDB-log --manager_version=0.54
--relay_log_info=/home/mysqldata/relay-log.info --relay_dir=/home/mysqldata/
--slave_pass=xxx
Tue Nov 27 17:35:30 2012 - [info] Connecting to
mha4mysql@172.16.50.11(172.16.50.11:22)..
Checking slave recovery environment settings..
Opening /home/mysqldata/relay-log.info ... ok.
Relay log found at /home/mysqldata, up to mysql-relay-bin.000005
Temporary relay log file is /home/mysqldata/mysql-relay-bin.000005
Testing mysql connection and privileges.. done.
Testing mysqlbinlog output.. done.
Cleaning up test file(s).. done.
Tue Nov 27 17:35:31 2012 - [info] Slaves settings check done.
Tue Nov 27 17:35:31 2012 - [info]
172.16.50.14 (current master)
+--172.16.50.11
Tue Nov 27 17:35:31 2012 - [warning] master_ip_failover_script is not defined.
Tue Nov 27 17:35:31 2012 - [warning] shutdown_script is not defined.
Tue Nov 27 17:35:31 2012 - [info] Set master ping interval 3 seconds.
Tue Nov 27 17:35:31 2012 - [warning] secondary_check_script is not defined. It
is highly recommended setting it to check master reachability from two or more
routes.
Tue Nov 27 17:35:31 2012 - [info] Starting ping health check on
172.16.50.14(172.16.50.14:3306)..
Tue Nov 27 17:35:31 2012 - [info] Ping(SELECT) succeeded, waiting until MySQL
doesn't respond..
########################
########################
While it was monitoring the master, I started MySQL on .13 and shut it down on .11.
Then I shut down the master and watched what happened.
I expected masterha_manager to start a failover: to check again which slaves
are working and then choose the best master candidate among them.
But it doesn't:
it wants to use only .11 as the candidate and does not consider .13.
#########
#########
Tue Nov 27 17:36:07 2012 - [warning] Got error on MySQL select ping: 2006
(MySQL server has gone away)
Tue Nov 27 17:36:07 2012 - [info] Executing SSH check script: save_binary_logs
--command=test --start_pos=4 --binlog_dir=/home/mysqldata/
--output_file=/home/mha_manager_data/app1/save_binary_logs_test
--manager_version=0.54 --binlog_prefix=mysql-bin
Creating /home/mha_manager_data/app1 if not exists.. ok.
Checking output directory is accessible or not..
ok.
Binlog found at /home/mysqldata/, up to mysql-bin.000013
Tue Nov 27 17:36:07 2012 - [info] HealthCheck: SSH to 172.16.50.14 is reachable.
Tue Nov 27 17:36:10 2012 - [warning] Got error on MySQL connect: 2003 (Can't
connect to MySQL server on '172.16.50.14' (111))
Tue Nov 27 17:36:10 2012 - [warning] Connection failed 1 time(s)..
Tue Nov 27 17:36:13 2012 - [warning] Got error on MySQL connect: 2003 (Can't
connect to MySQL server on '172.16.50.14' (111))
Tue Nov 27 17:36:13 2012 - [warning] Connection failed 2 time(s)..
Tue Nov 27 17:36:16 2012 - [warning] Got error on MySQL connect: 2003 (Can't
connect to MySQL server on '172.16.50.14' (111))
Tue Nov 27 17:36:16 2012 - [warning] Connection failed 3 time(s)..
Tue Nov 27 17:36:16 2012 - [warning] Master is not reachable from health
checker!
Tue Nov 27 17:36:16 2012 - [warning] Master 172.16.50.14(172.16.50.14:3306) is
not reachable!
Tue Nov 27 17:36:16 2012 - [warning] SSH is reachable.
Tue Nov 27 17:36:16 2012 - [info] Connecting to a master server failed. Reading
configuration file /etc/masterha_default.cnf and /etc/mha_manager/app1.cnf
again, and trying to connect to all servers to check server status..
Tue Nov 27 17:36:16 2012 - [warning] Global configuration file
/etc/masterha_default.cnf not found. Skipping.
Tue Nov 27 17:36:16 2012 - [info] Reading application default configurations
from /etc/mha_manager/app1.cnf..
Tue Nov 27 17:36:16 2012 - [info] Reading server configurations from
/etc/mha_manager/app1.cnf..
Tue Nov 27 17:36:16 2012 - [info] Dead Servers:
Tue Nov 27 17:36:16 2012 - [info] 172.16.50.11(172.16.50.11:3306)
Tue Nov 27 17:36:16 2012 - [info] 172.16.50.14(172.16.50.14:3306)
Tue Nov 27 17:36:16 2012 - [info] Alive Servers:
Tue Nov 27 17:36:16 2012 - [info] 172.16.50.13(172.16.50.13:3306)
Tue Nov 27 17:36:16 2012 - [info] Alive Slaves:
Tue Nov 27 17:36:16 2012 - [info] 172.16.50.13(172.16.50.13:3306)
Version=5.5.28-MariaDB-log (oldest major version between slaves) log-bin:enabled
Tue Nov 27 17:36:16 2012 - [info] Replicating from
172.16.50.14(172.16.50.14:3306)
Tue Nov 27 17:36:16 2012 - [info] Primary candidate for the new Master
(candidate_master is set)
Tue Nov 27 17:36:16 2012 - [info] Checking slave configurations..
Tue Nov 27 17:36:16 2012 - [info] read_only=1 is not set on slave
172.16.50.13(172.16.50.13:3306).
Tue Nov 27 17:36:16 2012 - [warning] relay_log_purge=0 is not set on slave
172.16.50.13(172.16.50.13:3306).
Tue Nov 27 17:36:16 2012 - [info] Checking replication filtering settings..
Tue Nov 27 17:36:16 2012 - [info] Replication filtering check ok.
Tue Nov 27 17:36:16 2012 - [info] Master is down!
Tue Nov 27 17:36:16 2012 - [info] Terminating monitoring script.
Tue Nov 27 17:36:16 2012 - [info] Got exit code 20 (Master dead).
Tue Nov 27 17:36:16 2012 - [warning] Global configuration file
/etc/masterha_default.cnf not found. Skipping.
Tue Nov 27 17:36:16 2012 - [info] Reading application default configurations
from /etc/mha_manager/app1.cnf..
Tue Nov 27 17:36:16 2012 - [info] Reading server configurations from
/etc/mha_manager/app1.cnf..
Tue Nov 27 17:36:16 2012 - [info] MHA::MasterFailover version 0.54.
Tue Nov 27 17:36:16 2012 - [info] Starting master failover.
Tue Nov 27 17:36:16 2012 - [info]
Tue Nov 27 17:36:16 2012 - [info] * Phase 1: Configuration Check Phase..
Tue Nov 27 17:36:16 2012 - [info]
Tue Nov 27 17:36:16 2012 - [info] Dead Servers:
Tue Nov 27 17:36:16 2012 - [info] 172.16.50.11(172.16.50.11:3306)
Tue Nov 27 17:36:16 2012 - [info] 172.16.50.14(172.16.50.14:3306)
Tue Nov 27 17:36:16 2012 - [info] Checking master reachability via mysql(double
check)..
Tue Nov 27 17:36:16 2012 - [info] ok.
Tue Nov 27 17:36:16 2012 - [info] Alive Servers:
Tue Nov 27 17:36:16 2012 - [info] 172.16.50.13(172.16.50.13:3306)
Tue Nov 27 17:36:16 2012 - [info] Alive Slaves:
Tue Nov 27 17:36:16 2012 - [info] 172.16.50.13(172.16.50.13:3306)
Version=5.5.28-MariaDB-log (oldest major version between slaves) log-bin:enabled
Tue Nov 27 17:36:16 2012 - [info] Replicating from
172.16.50.14(172.16.50.14:3306)
Tue Nov 27 17:36:16 2012 - [info] Primary candidate for the new Master
(candidate_master is set)
Tue Nov 27 17:36:16 2012 - [error][/usr/share/perl/5.14/MHA/ServerManager.pm,
ln443] Server 172.16.50.11(172.16.50.11:3306) is dead, but must be alive!
Check server settings.
Tue Nov 27 17:36:16 2012 - [error][/usr/share/perl/5.14/MHA/ManagerUtil.pm,
ln178] Got ERROR: at /usr/share/perl/5.14/MHA/MasterFailover.pm line 258
#########
#########
And it also does not perform the failover in that case.
Questions:
1. As I understand it, I should restart masterha_manager every time any of the
slaves starts or stops, shouldn't I? Because MHA builds the list of candidates only at startup...
2. If I do as in point 1, it becomes very hard to implement in Pacemaker (I'm not an ace
at Pacemaker yet).
3. Maybe you could change the behavior and build the list of candidates (listed
in the conf file) at the moment failover happens? That would be more logical in my
opinion, because slaves can stop and start many times in their lives, and nobody
knows what state they will be in at any point in time :)
Original comment by obric...@balakam.com
on 27 Nov 2012 at 2:11
Does slave 172.16.50.11 have ignore_fail=1 set? Otherwise MHA does not start a
failover.
You don't need to restart MHA just because a slave starts or stops. You need to
restart it when you add or remove slaves.
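Based on this answer, the fix in the reporter's setup would be adding
ignore_fail=1 to the server section for 172.16.50.11 in app1.cnf. A sketch
(the section name and other parameters are assumptions):

```ini
[server2]
hostname=172.16.50.11
candidate_master=1
# Without this, a failover aborts if this slave is dead at failover time.
ignore_fail=1
```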
Original comment by Yoshinor...@gmail.com
on 28 Nov 2012 at 7:42
You're right, I forgot to set ignore_fail=1 on 172.16.50.11.
Now I have set it and tested again: MHA works well and makes .13 the new master.
Original comment by obric...@balakam.com
on 28 Nov 2012 at 9:15
Original comment by Yoshinor...@gmail.com
on 16 Sep 2013 at 6:41
Original issue reported on code.google.com by
obric...@balakam.com
on 20 Nov 2012 at 2:37