acassen / keepalived

Keepalived
https://www.keepalived.org
GNU General Public License v2.0

Keepalived doesn't add real server balancing #649

Closed igroost closed 6 years ago

igroost commented 7 years ago

We have run into a problem: a real server is not added to the balancing pool for a VIP.

Our configuration: we have several VIPs that balance database servers (master/replica), with two real servers behind each VIP. At any moment only one real server is active behind each VIP; when the DB role is switched, the second real server should be added to the balancing and the first removed. In practice we observe the following. When the DB role is switched, the check on the first real server fails and it is removed from the VIP; the check on the second real server passes, but it is not added to the VIP, although the logs show that its check passes. If we switch the DB role back, the first real server is added to the VIP again. The problem can also be worked around: before switching the DB role, comment out in the configuration the real server that is about to take over the balancing, reload keepalived, then uncomment that real server and reload keepalived again.

The problem appears at different times; we have not noticed any pattern. It occurs, for example, on VIPs 10.9.200.55/56.

The problem was first observed with keepalived version 1.2.24; we now use version 1.3.5.

The log and the configuration file are attached:

keepalived-log.txt keepalived-conf.txt

fenice2 commented 7 years ago

I'm not an expert on keepalived but I would have thought that it's better to use something like proxysql for achieving your goal, I do that and use keepalived to failover to a second or third proxysql if one of them fails.

pqarmitage commented 7 years ago

I'm not sure exactly what you are wanting to achieve, but first of all some comments on your configuration.

The following addresses are used by more than 1 virtual server (the second number on each line is the number of virtual servers using the address):

10.2.200.3 2
10.9.200.1 2
10.9.200.11 3
10.9.200.15 2
10.9.200.20 2
10.9.200.41 2
10.9.200.42 2
10.9.200.6 2
10.9.200.62 2
10.9.200.7 2
10.9.200.70 2
10.9.200.73 2
10.9.200.82 3

The problem with this is the quorum_up/quorum_down scripts which will each be adding/removing the same IP address depending on whether the service they are checking is up or down.
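A list like the one above can be produced directly from the configuration file. The following is a minimal sketch, assuming each virtual server is declared on one line as "virtual_server ADDRESS PORT {"; the function name duplicate_vs_addresses is hypothetical, and fwmark or group based virtual servers are skipped.

```shell
# duplicate_vs_addresses CONFIG
# Prints "ADDRESS COUNT" for every address used by more than one
# virtual_server block in the given keepalived configuration file.
duplicate_vs_addresses() {
    # Field 2 of a "virtual_server" line is the address, unless the virtual
    # server is defined by fwmark or by group.
    awk '$1 == "virtual_server" && $2 != "fwmark" && $2 != "group" { print $2 }' "$1" |
        sort | uniq -c | awk '$1 > 1 { print $2, $1 }'
}
```

For example, duplicate_vs_addresses /etc/keepalived/keepalived.conf would print one line per shared address, which makes it easier to see which quorum_up/quorum_down scripts may act on the same IP address.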

The virtualhost statements are unnecessary since they are only used for HTTP_GET and SSL_GET checkers.

You state that the problem is observed with v1.2.24, but you are now using v1.3.5. Do you still observe the problems with v1.3.5?

Could you run keepalived with the '-d' option and post the logs it produces? It would also be helpful to see the complete logs of a run of keepalived from start to finish, including several examples of the problem you are experiencing.

igroost commented 7 years ago

The duplicate addresses in our configuration are deliberate, because we need to balance different ports on the same IP address. As a rule, we configure quorum conditions for only one of the ports of a VIP.

Yes, that's right: the problem was first noticed on version 1.2.24 and we still see it. As a temporary workaround, we comment out the entire section of the affected VIP, reload keepalived, then restore the VIP section and reload keepalived again.

Because the LVS is currently under load, we can only send a log taken with the -d option. We cannot send a full log from startup right now. For some reason we see the problem only on production servers; on the test LVS it does not reproduce.

pqarmitage commented 7 years ago

I have now had a chance to look at your log file in some detail. It appears that there is something wrong with the log file, since there is a log entry timestamped 12:18:00 that appears AFTER entries timestamped 12:18:08. Since the sequencing of events is essential to diagnosing the problem, with the entries out of sequence, or extra entries inserted, it is not possible to see what is happening.

At 12:17:28 HC-DB-replica 10.1.18.21 reports success, and the next report for that is at 12:17:53 which is again reporting success. We see similar for HC-DB-master 10.2.18.21, so it appears that some log entries are missing too.

My understanding is that, after HC-DB-master 10.1.18.21 fails and the 10.9.200.55 ip address is removed, after HC-DB-master 10.2.18.21 succeeds you would expect the quorum for [10.9.200.55]:3306 to be regained and hence the 10.9.200.55 ip address to be added back. The problem is that there is no entry in the log file that shows whether HC-DB-master 10.2.18.21 was previously failed. That success for the script is reported would suggest that the script had been previously failed, but we then see two reports of success for HC-DB-replica 10.1.18.21 without a failure in between, so unfortunately without the full log file entries for the keepalived process pid 1681 from when it started, it isn't possible to diagnose what is happening, and indeed whether there is an actual problem.
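Out-of-sequence entries such as the 12:18:00 one mentioned above can be located mechanically. This is a minimal sketch, assuming the usual syslog line layout seen in this thread ("Sep 27 12:18:08 host ...") and that the whole file falls within one month; the function name out_of_order_entries is hypothetical.

```shell
# out_of_order_entries LOGFILE
# Prints "LINE_NUMBER: ENTRY" for every entry whose timestamp is earlier
# than the latest timestamp already seen in the file.
out_of_order_entries() {
    awk '
        NF < 3 { next }    # skip blank or truncated lines
        {
            # Zero-padded day of month followed by HH:MM:SS, so that a plain
            # string comparison follows chronological order within a month.
            stamp = sprintf("%02d %s", $2, $3)
            if (stamp < latest) print NR ": " $0
            else latest = stamp
        }
    ' "$1"
}
```

Any line it prints is a candidate for a reordered or inserted entry and is worth checking before the sequence of events is interpreted.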

igroost commented 7 years ago

> My understanding is that, after HC-DB-master 10.1.18.21 fails and the 10.9.200.55 ip address is removed, after HC-DB-master 10.2.18.21 succeeds you would expect the quorum for [10.9.200.55]:3306 to be regained and hence the 10.9.200.55 ip address to be added back. The problem is that there is no entry in the log file that shows whether HC-DB-master 10.2.18.21 was previously failed. That success for the script is reported would suggest that the script had been previously failed, but we then see two reports of success for HC-DB-replica 10.1.18.21 without a failure in between, so unfortunately without the full log file entries for the keepalived process pid 1681 from when it started, it isn't possible to diagnose what is happening, and indeed whether there is an actual problem.

Yes, that's right. While the problem is occurring, we see log messages reporting successful checks for the affected real servers, but we do not see any message reporting that the check failed.

It does seem that there are duplicate messages in the log at 12:18.

Tomorrow morning we plan to restart keepalived on the servers with the -d option and reproduce the problem. We will then send the full log, from keepalived startup until the problem is reproduced.

igroost commented 7 years ago

Today I restarted keepalived with the -d option, and the problem stopped reproducing. We will keep trying to reproduce it for a couple of days, because apart from enabling the -d option no configuration changes were made.

igroost commented 7 years ago

Hello. Today I reproduced the problem described above. It was observed with real server 10.1.72.32. Below is an excerpt of the log from the time of the problem. The log shows that the check for the real server failed, but the server was not removed from the balancing for the VIP. Similar behaviour was observed on both of the IPVS servers that we use for balancing. The problem was resolved only by commenting out the whole section of the affected VIP, reloading keepalived, uncommenting the section and reloading keepalived again.

Sep 27 17:34:00 ipvs-lan-102 Keepalived_healthcheckers[26495]: Remote SMTP server [91.207.59.201]:25 connected.
Sep 27 17:34:00 ipvs-lan-102 Keepalived_healthcheckers[26495]: SMTP alert successfully sent.
Sep 27 18:30:54 ipvs-lan-102 Keepalived_healthcheckers[26495]: Misc check to [10.1.72.32] for [/etc/keepalived/HC/HC-DB-replica 10.1.72.32] success.
Sep 27 18:30:54 ipvs-lan-102 Keepalived_healthcheckers[26495]: Remote SMTP server [91.207.59.201]:25 connected.
Sep 27 18:30:54 ipvs-lan-102 Keepalived_healthcheckers[26495]: SMTP alert successfully sent.
Sep 27 18:37:34 ipvs-lan-102 Keepalived_healthcheckers[26495]: Misc check to [10.1.72.32] for [/etc/keepalived/HC/HC-DB-replica 10.1.72.32] failed.
Sep 27 18:37:34 ipvs-lan-102 Keepalived_healthcheckers[26495]: Remote SMTP server [91.207.59.201]:25 connected.
Sep 27 18:37:34 ipvs-lan-102 Keepalived_healthcheckers[26495]: SMTP alert successfully sent.
Sep 27 18:53:15 ipvs-lan-102 Keepalived_healthcheckers[26495]: Misc check to [10.2.18.2] for [/etc/keepalived/HC/HC-DB-replica 10.2.18.27] failed.
Sep 27 18:53:15 ipvs-lan-102 Keepalived_healthcheckers[26495]: Removing service [10.2.18.2]:3306 from VS [10.9.200.68]:3306
Sep 27 18:53:15 ipvs-lan-102 Keepalived_healthcheckers[26495]: Lost quorum 1-0=1 > 0 for VS [10.9.200.68]:3306

igroost commented 7 years ago

Today I reproduced the problem with VIPs 10.9.200.2 and 10.9.200.55/56. The faulty behaviour occurs on CentOS 6. For testing we also run IPVS on CentOS 7, where the problem is not observed: real servers are added and removed correctly. The logs from the time of the problem are below.

`10.9.200.2

centos7

Sep 29 06:27:54 ipvs-test-centos7 Keepalived_healthcheckers[8521]: Misc check to [10.1.72.31] for [/etc/keepalived/HC/HC-DB-master 10.1.72.31] success.
Sep 29 06:27:54 ipvs-test-centos7 Keepalived_healthcheckers[8521]: Adding service [10.1.72.31]:3306 to VS [10.9.200.24]:3306
Sep 29 06:27:54 ipvs-test-centos7 Keepalived_healthcheckers[8521]: Gained quorum 1+0=1 <= 1 for VS [10.9.200.24]:3306
Sep 29 06:27:54 ipvs-test-centos7 Keepalived_healthcheckers[8521]: Executing [/sbin/ip -4 addr add 10.9.200.24/32 dev lo] for VS [10.9.200.24]:3306
Sep 29 06:27:54 ipvs-test-centos7 Keepalived_healthcheckers[8521]: Remote SMTP server [91.207.59.201]:25 connected.
Sep 29 06:27:54 ipvs-test-centos7 Keepalived_healthcheckers[8521]: SMTP alert successfully sent.
Sep 29 10:28:13 ipvs-test-centos7 Keepalived_healthcheckers[8521]: Misc check to [10.1.72.32] for [/etc/keepalived/HC/HC-DB-replica 10.1.72.32] success.
Sep 29 10:28:13 ipvs-test-centos7 Keepalived_healthcheckers[8521]: Adding service [10.1.72.32]:3306 to VS [10.9.200.2]:3306
Sep 29 10:28:13 ipvs-test-centos7 Keepalived_healthcheckers[8521]: Remote SMTP server [91.207.59.201]:25 connected.
Sep 29 10:28:13 ipvs-test-centos7 Keepalived_healthcheckers[8521]: SMTP alert successfully sent.
Sep 29 10:34:03 ipvs-test-centos7 Keepalived_healthcheckers[8521]: Misc check to [10.1.72.32] for [/etc/keepalived/HC/HC-DB-replica 10.1.72.32] failed.
Sep 29 10:34:03 ipvs-test-centos7 Keepalived_healthcheckers[8521]: Removing service [10.1.72.32]:3306 from VS [10.9.200.2]:3306
Sep 29 10:34:03 ipvs-test-centos7 Keepalived_healthcheckers[8521]: Remote SMTP server [91.207.59.201]:25 connected.
Sep 29 10:34:03 ipvs-test-centos7 Keepalived_healthcheckers[8521]: SMTP alert successfully sent.
Sep 29 10:34:13 ipvs-test-centos7 Keepalived_healthcheckers[8521]: Misc check to [10.1.72.32] for [/etc/keepalived/HC/HC-DB-replica 10.1.72.32] success.
Sep 29 10:34:13 ipvs-test-centos7 Keepalived_healthcheckers[8521]: Adding service [10.1.72.32]:3306 to VS [10.9.200.2]:3306
Sep 29 10:34:13 ipvs-test-centos7 Keepalived_healthcheckers[8521]: Remote SMTP server [91.207.59.201]:25 connected.
Sep 29 10:34:13 ipvs-test-centos7 Keepalived_healthcheckers[8521]: SMTP alert successfully sent.

centos6

Sep 29 06:29:57 ipvs-lan-202 Keepalived_healthcheckers[6509]: SMTP alert successfully sent.
Sep 29 10:30:10 ipvs-lan-202 Keepalived_healthcheckers[6509]: Misc check to [10.1.72.32] for [/etc/keepalived/HC/HC-DB-replica 10.1.72.32] success.
Sep 29 10:30:10 ipvs-lan-202 Keepalived_healthcheckers[6509]: Remote SMTP server [91.207.59.201]:25 connected.
Sep 29 10:30:10 ipvs-lan-202 Keepalived_healthcheckers[6509]: SMTP alert successfully sent.
Sep 29 10:30:20 ipvs-lan-202 Keepalived_healthcheckers[6509]: Misc check to [10.1.72.32] for [/etc/keepalived/HC/HC-DB-replica 10.1.72.32] success.
Sep 29 10:30:20 ipvs-lan-202 Keepalived_healthcheckers[6509]: Remote SMTP server [91.207.59.201]:25 connected.
Sep 29 10:30:20 ipvs-lan-202 Keepalived_healthcheckers[6509]: SMTP alert successfully sent.
Sep 29 10:30:30 ipvs-lan-202 Keepalived_healthcheckers[6509]: Misc check to [10.1.72.32] for [/etc/keepalived/HC/HC-DB-replica 10.1.72.32] success.
Sep 29 10:30:30 ipvs-lan-202 Keepalived_healthcheckers[6509]: Remote SMTP server [91.207.59.201]:25 connected.
Sep 29 10:30:30 ipvs-lan-202 Keepalived_healthcheckers[6509]: SMTP alert successfully sent.
Sep 29 10:36:00 ipvs-lan-202 Keepalived_healthcheckers[6509]: Misc check to [10.1.72.32] for [/etc/keepalived/HC/HC-DB-replica 10.1.72.32] failed.
Sep 29 10:36:00 ipvs-lan-202 Keepalived_healthcheckers[6509]: Remote SMTP server [91.207.59.201]:25 connected.
Sep 29 10:36:00 ipvs-lan-202 Keepalived_healthcheckers[6509]: SMTP alert successfully sent.
Sep 29 10:36:10 ipvs-lan-202 Keepalived_healthcheckers[6509]: Misc check to [10.1.72.32] for [/etc/keepalived/HC/HC-DB-replica 10.1.72.32] success.
Sep 29 10:36:10 ipvs-lan-202 Keepalived_healthcheckers[6509]: Remote SMTP server [91.207.59.201]:25 connected.
Sep 29 10:36:10 ipvs-lan-202 Keepalived_healthcheckers[6509]: SMTP alert successfully sent.

10.9.200.55/56

centos6

Sep 29 10:41:28 ipvs-lan-102 Keepalived_healthcheckers[26495]: SMTP alert successfully sent.
Sep 29 10:51:51 ipvs-lan-102 Keepalived_healthcheckers[26495]: Misc check to [10.1.18.2] for [/etc/keepalived/HC/HC-DB-master 10.1.18.21] failed.
Sep 29 10:51:51 ipvs-lan-102 Keepalived_healthcheckers[26495]: Removing service [10.1.18.2]:3306 from VS [10.9.200.55]:3306
Sep 29 10:51:51 ipvs-lan-102 Keepalived_healthcheckers[26495]: Lost quorum 1-0=1 > 0 for VS [10.9.200.55]:3306
Sep 29 10:51:51 ipvs-lan-102 Keepalived_healthcheckers[26495]: Executing [/sbin/ip -4 addr del 10.9.200.55/32 dev lo] for VS [10.9.200.55]:3306
Sep 29 10:51:51 ipvs-lan-102 Keepalived_healthcheckers[26495]: Remote SMTP server [91.207.59.201]:25 connected.
Sep 29 10:51:51 ipvs-lan-102 Keepalived_healthcheckers[26495]: SMTP alert successfully sent.
Sep 29 10:51:58 ipvs-lan-102 Keepalived_healthcheckers[26495]: Misc check to [10.1.18.2] for [/etc/keepalived/HC/HC-DB-replica 10.1.18.21] success.
Sep 29 10:51:58 ipvs-lan-102 Keepalived_healthcheckers[26495]: Remote SMTP server [91.207.59.201]:25 connected.
Sep 29 10:51:58 ipvs-lan-102 Keepalived_healthcheckers[26495]: SMTP alert successfully sent.
Sep 29 10:52:01 ipvs-lan-102 Keepalived_healthcheckers[26495]: Misc check to [10.2.18.2] for [/etc/keepalived/HC/HC-DB-replica 10.2.18.21] failed.
Sep 29 10:52:01 ipvs-lan-102 Keepalived_healthcheckers[26495]: Removing service [10.2.18.2]:3306 from VS [10.9.200.56]:3306
Sep 29 10:52:01 ipvs-lan-102 Keepalived_healthcheckers[26495]: Lost quorum 1-0=1 > 0 for VS [10.9.200.56]:3306
Sep 29 10:52:01 ipvs-lan-102 Keepalived_healthcheckers[26495]: Executing [/sbin/ip -4 addr del 10.9.200.56/32 dev lo] for VS [10.9.200.56]:3306
Sep 29 10:52:01 ipvs-lan-102 Keepalived_healthcheckers[26495]: Remote SMTP server [91.207.59.201]:25 connected.
Sep 29 10:52:01 ipvs-lan-102 Keepalived_healthcheckers[26495]: SMTP alert successfully sent.
Sep 29 10:52:07 ipvs-lan-102 Keepalived_healthcheckers[26495]: Misc check to [10.2.18.2] for [/etc/keepalived/HC/HC-DB-master 10.2.18.21] success.
Sep 29 10:52:07 ipvs-lan-102 Keepalived_healthcheckers[26495]: Remote SMTP server [91.207.59.201]:25 connected.
Sep 29 10:52:07 ipvs-lan-102 Keepalived_healthcheckers[26495]: SMTP alert successfully sent.
Sep 29 10:52:08 ipvs-lan-102 Keepalived_healthcheckers[26495]: Misc check to [10.1.18.2] for [/etc/keepalived/HC/HC-DB-replica 10.1.18.21] success.
Sep 29 10:52:08 ipvs-lan-102 Keepalived_healthcheckers[26495]: Remote SMTP server [91.207.59.201]:25 connected.
Sep 29 10:52:08 ipvs-lan-102 Keepalived_healthcheckers[26495]: SMTP alert successfully sent.
Sep 29 10:52:17 ipvs-lan-102 Keepalived_healthcheckers[26495]: Misc check to [10.2.18.2] for [/etc/keepalived/HC/HC-DB-master 10.2.18.21] success.
Sep 29 10:52:17 ipvs-lan-102 Keepalived_healthcheckers[26495]: Remote SMTP server [91.207.59.201]:25 connected.
Sep 29 10:52:17 ipvs-lan-102 Keepalived_healthcheckers[26495]: SMTP alert successfully sent.
Sep 29 10:52:18 ipvs-lan-102 Keepalived_healthcheckers[26495]: Misc check to [10.1.18.2] for [/etc/keepalived/HC/HC-DB-replica 10.1.18.21] success.
Sep 29 10:52:18 ipvs-lan-102 Keepalived_healthcheckers[26495]: Remote SMTP server [91.207.59.201]:25 connected.
Sep 29 10:52:18 ipvs-lan-102 Keepalived_healthcheckers[26495]: SMTP alert successfully sent.
Sep 29 10:52:28 ipvs-lan-102 Keepalived_healthcheckers[26495]: Misc check to [10.1.18.2] for [/etc/keepalived/HC/HC-DB-replica 10.1.18.21] success.
Sep 29 10:52:28 ipvs-lan-102 Keepalived_healthcheckers[26495]: Remote SMTP server [91.207.59.201]:25 connected.
Sep 29 10:52:28 ipvs-lan-102 Keepalived_healthcheckers[26495]: Misc check to [10.2.18.2] for [/etc/keepalived/HC/HC-DB-master 10.2.18.21] success.
Sep 29 10:52:28 ipvs-lan-102 Keepalived_healthcheckers[26495]: Remote SMTP server [91.207.59.201]:25 connected.
Sep 29 10:52:28 ipvs-lan-102 Keepalived_healthcheckers[26495]: SMTP alert successfully sent.
Sep 29 10:52:28 ipvs-lan-102 Keepalived_healthcheckers[26495]: SMTP alert successfully sent.
Sep 29 10:52:37 ipvs-lan-102 Keepalived_healthcheckers[26495]: Misc check to [10.2.18.2] for [/etc/keepalived/HC/HC-DB-master 10.2.18.21] success.
Sep 29 10:52:37 ipvs-lan-102 Keepalived_healthcheckers[26495]: Remote SMTP server [91.207.59.201]:25 connected.
Sep 29 10:52:37 ipvs-lan-102 Keepalived_healthcheckers[26495]: SMTP alert successfully sent.
Sep 29 10:54:30 ipvs-lan-102 Keepalived_healthcheckers[26495]: Misc check to [10.2.18.2] for [/etc/keepalived/HC/HC-DB-replica 10.2.18.21] success.
Sep 29 10:54:30 ipvs-lan-102 Keepalived_healthcheckers[26495]: Adding service [10.2.18.2]:3306 to VS [10.9.200.56]:3306
Sep 29 10:54:30 ipvs-lan-102 Keepalived_healthcheckers[26495]: Gained quorum 1+0=1 <= 1 for VS [10.9.200.56]:3306
Sep 29 10:54:30 ipvs-lan-102 Keepalived_healthcheckers[26495]: Executing [/sbin/ip -4 addr add 10.9.200.56/32 dev lo] for VS [10.9.200.56]:3306
Sep 29 10:54:30 ipvs-lan-102 Keepalived_healthcheckers[26495]: Remote SMTP server [91.207.59.201]:25 connected.
Sep 29 10:54:30 ipvs-lan-102 Keepalived_healthcheckers[26495]: SMTP alert successfully sent.
Sep 29 10:54:37 ipvs-lan-102 Keepalived_healthcheckers[26495]: Misc check to [10.2.18.2] for [/etc/keepalived/HC/HC-DB-master 10.2.18.21] failed.
Sep 29 10:54:37 ipvs-lan-102 Keepalived_healthcheckers[26495]: Remote SMTP server [91.207.59.201]:25 connected.
Sep 29 10:54:37 ipvs-lan-102 Keepalived_healthcheckers[26495]: SMTP alert successfully sent.
Sep 29 10:54:38 ipvs-lan-102 Keepalived_healthcheckers[26495]: Misc check to [10.1.18.2] for [/etc/keepalived/HC/HC-DB-replica 10.1.18.21] failed.
Sep 29 10:54:38 ipvs-lan-102 Keepalived_healthcheckers[26495]: Remote SMTP server [91.207.59.201]:25 connected.
Sep 29 10:54:38 ipvs-lan-102 Keepalived_healthcheckers[26495]: SMTP alert successfully sent.
Sep 29 10:54:40 ipvs-lan-102 Keepalived_healthcheckers[26495]: Misc check to [10.1.18.2] for [/etc/keepalived/HC/HC-DB-master 10.1.18.21] success.
Sep 29 10:54:40 ipvs-lan-102 Keepalived_healthcheckers[26495]: Adding service [10.1.18.2]:3306 to VS [10.9.200.55]:3306
Sep 29 10:54:40 ipvs-lan-102 Keepalived_healthcheckers[26495]: Gained quorum 1+0=1 <= 1 for VS [10.9.200.55]:3306
Sep 29 10:54:40 ipvs-lan-102 Keepalived_healthcheckers[26495]: Executing [/sbin/ip -4 addr add 10.9.200.55/32 dev lo] for VS [10.9.200.55]:3306
Sep 29 10:54:40 ipvs-lan-102 Keepalived_healthcheckers[26495]: Remote SMTP server [91.207.59.201]:25 connected.
Sep 29 10:54:40 ipvs-lan-102 Keepalived_healthcheckers[26495]: SMTP alert successfully sent.

centos7

Sep 29 10:34:13 ipvs-test-centos7 Keepalived_healthcheckers[8521]: SMTP alert successfully sent.
Sep 29 10:49:53 ipvs-test-centos7 Keepalived_healthcheckers[8521]: Misc check to [10.1.18.2] for [/etc/keepalived/HC/HC-DB-master 10.1.18.21] failed.
Sep 29 10:49:53 ipvs-test-centos7 Keepalived_healthcheckers[8521]: Removing service [10.1.18.2]:3306 from VS [10.9.200.55]:3306
Sep 29 10:49:53 ipvs-test-centos7 Keepalived_healthcheckers[8521]: Lost quorum 1-0=1 > 0 for VS [10.9.200.55]:3306
Sep 29 10:49:53 ipvs-test-centos7 Keepalived_healthcheckers[8521]: Executing [/sbin/ip -4 addr del 10.9.200.55/32 dev lo] for VS [10.9.200.55]:3306
Sep 29 10:49:53 ipvs-test-centos7 Keepalived_healthcheckers[8521]: Remote SMTP server [91.207.59.201]:25 connected.
Sep 29 10:49:53 ipvs-test-centos7 Keepalived_healthcheckers[8521]: SMTP alert successfully sent.
Sep 29 10:49:58 ipvs-test-centos7 Keepalived_healthcheckers[8521]: Misc check to [10.1.18.2] for [/etc/keepalived/HC/HC-DB-replica 10.1.18.21] success.
Sep 29 10:49:58 ipvs-test-centos7 Keepalived_healthcheckers[8521]: Adding service [10.1.18.2]:3306 to VS [10.9.200.56]:3306
Sep 29 10:49:58 ipvs-test-centos7 Keepalived_healthcheckers[8521]: Remote SMTP server [91.207.59.201]:25 connected.
Sep 29 10:49:59 ipvs-test-centos7 Keepalived_healthcheckers[8521]: SMTP alert successfully sent.
Sep 29 10:50:01 ipvs-test-centos7 Keepalived_healthcheckers[8521]: Misc check to [10.2.18.2] for [/etc/keepalived/HC/HC-DB-replica 10.2.18.21] failed.
Sep 29 10:50:01 ipvs-test-centos7 Keepalived_healthcheckers[8521]: Removing service [10.2.18.2]:3306 from VS [10.9.200.56]:3306
Sep 29 10:50:01 ipvs-test-centos7 Keepalived_healthcheckers[8521]: Remote SMTP server [91.207.59.201]:25 connected.
Sep 29 10:50:01 ipvs-test-centos7 Keepalived_healthcheckers[8521]: SMTP alert successfully sent.
Sep 29 10:50:01 ipvs-test-centos7 Keepalived_healthcheckers[8521]: Misc check to [10.2.18.2] for [/etc/keepalived/HC/HC-DB-master 10.2.18.21] success.
Sep 29 10:50:01 ipvs-test-centos7 Keepalived_healthcheckers[8521]: Adding service [10.2.18.2]:3306 to VS [10.9.200.55]:3306
Sep 29 10:50:01 ipvs-test-centos7 Keepalived_healthcheckers[8521]: Gained quorum 1+0=1 <= 1 for VS [10.9.200.55]:3306
Sep 29 10:50:01 ipvs-test-centos7 Keepalived_healthcheckers[8521]: Executing [/sbin/ip -4 addr add 10.9.200.55/32 dev lo] for VS [10.9.200.55]:3306
Sep 29 10:50:01 ipvs-test-centos7 Keepalived_healthcheckers[8521]: Remote SMTP server [91.207.59.201]:25 connected.
Sep 29 10:50:01 ipvs-test-centos7 Keepalived_healthcheckers[8521]: SMTP alert successfully sent.
Sep 29 10:52:31 ipvs-test-centos7 Keepalived_healthcheckers[8521]: Misc check to [10.2.18.2] for [/etc/keepalived/HC/HC-DB-replica 10.2.18.21] success.
Sep 29 10:52:31 ipvs-test-centos7 Keepalived_healthcheckers[8521]: Adding service [10.2.18.2]:3306 to VS [10.9.200.56]:3306
Sep 29 10:52:31 ipvs-test-centos7 Keepalived_healthcheckers[8521]: Remote SMTP server [91.207.59.201]:25 connected.
Sep 29 10:52:31 ipvs-test-centos7 Keepalived_healthcheckers[8521]: SMTP alert successfully sent.
Sep 29 10:52:31 ipvs-test-centos7 Keepalived_healthcheckers[8521]: Misc check to [10.2.18.2] for [/etc/keepalived/HC/HC-DB-master 10.2.18.21] failed.
Sep 29 10:52:31 ipvs-test-centos7 Keepalived_healthcheckers[8521]: Removing service [10.2.18.2]:3306 from VS [10.9.200.55]:3306
Sep 29 10:52:31 ipvs-test-centos7 Keepalived_healthcheckers[8521]: Lost quorum 1-0=1 > 0 for VS [10.9.200.55]:3306
Sep 29 10:52:31 ipvs-test-centos7 Keepalived_healthcheckers[8521]: Executing [/sbin/ip -4 addr del 10.9.200.55/32 dev lo] for VS [10.9.200.55]:3306
Sep 29 10:52:31 ipvs-test-centos7 Keepalived_healthcheckers[8521]: Remote SMTP server [91.207.59.201]:25 connected.
Sep 29 10:52:31 ipvs-test-centos7 Keepalived_healthcheckers[8521]: SMTP alert successfully sent.
Sep 29 10:52:38 ipvs-test-centos7 Keepalived_healthcheckers[8521]: Misc check to [10.1.18.2] for [/etc/keepalived/HC/HC-DB-replica 10.1.18.21] failed.
Sep 29 10:52:38 ipvs-test-centos7 Keepalived_healthcheckers[8521]: Removing service [10.1.18.2]:3306 from VS [10.9.200.56]:3306
Sep 29 10:52:38 ipvs-test-centos7 Keepalived_healthcheckers[8521]: Remote SMTP server [91.207.59.201]:25 connected.
Sep 29 10:52:38 ipvs-test-centos7 Keepalived_healthcheckers[8521]: SMTP alert successfully sent.
Sep 29 10:52:43 ipvs-test-centos7 Keepalived_healthcheckers[8521]: Misc check to [10.1.18.2] for [/etc/keepalived/HC/HC-DB-master 10.1.18.21] success.
Sep 29 10:52:43 ipvs-test-centos7 Keepalived_healthcheckers[8521]: Adding service [10.1.18.2]:3306 to VS [10.9.200.55]:3306
Sep 29 10:52:43 ipvs-test-centos7 Keepalived_healthcheckers[8521]: Gained quorum 1+0=1 <= 1 for VS [10.9.200.55]:3306
Sep 29 10:52:43 ipvs-test-centos7 Keepalived_healthcheckers[8521]: Executing [/sbin/ip -4 addr add 10.9.200.55/32 dev lo] for VS [10.9.200.55]:3306
Sep 29 10:52:43 ipvs-test-centos7 Keepalived_healthcheckers[8521]: Remote SMTP server [91.207.59.201]:25 connected.
Sep 29 10:52:44 ipvs-test-centos7 Keepalived_healthcheckers[8521]: SMTP alert successfully sent.
`

igroost commented 7 years ago

@pqarmitage Hello. We are still seeing the problem. Do you need any more information from us to diagnose it? Is this a bug in the program logic, or is something set incorrectly in our configuration?

pqarmitage commented 7 years ago

If I understand correctly, your problem is as follows.

On the Centos7 system:

  1. A check fails
  2. The real server is removed from the virtual server
     2a. If a quorum_down script is configured, an IP address is removed
  3. The check later passes
  4. The real server is added back to the virtual server
     4a. If a quorum_up script is configured, an IP address is added

This is the behaviour that you are expecting to happen.

On the Centos6 system:

  1. A check fails
  2. The real server is removed from the virtual server
     2a. If a quorum_down script is configured, an IP address is removed
  3. The check later passes
  4. The real server is NOT added back to the virtual server (this is the problem)
     4a. If a quorum_up script is configured, an IP address is NOT added (this is also part of the problem)

The problem is that when a check passes after having been failed, the up actions (i.e. adding the real server, and possibly an ip address) are not executed.
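The missing up (or down) action can be spotted in a log without reading it entry by entry. The sketch below is only a heuristic: it assumes, as the CentOS 7 excerpts in this thread suggest, that every reported "success" is immediately followed by an "Adding service" entry and every reported "failed" by a "Removing service" entry. The function name find_unmatched_checks is hypothetical.

```shell
# find_unmatched_checks LOGFILE
# Prints the line number and text of every MISC_CHECK result that is not
# immediately followed by the matching add/remove entry.
find_unmatched_checks() {
    awk '
        # The previous line reported a check result, so this line should be
        # the matching "Adding service" or "Removing service" entry.
        expect != "" {
            if (index($0, expect) == 0)
                printf "%d: expected \"%s\" after: %s\n", NR - 1, expect, last
            expect = ""
        }
        /Misc check to .* success\.[ \t]*$/ { expect = "Adding service";   last = $0 }
        /Misc check to .* failed\.[ \t]*$/  { expect = "Removing service"; last = $0 }
        END {
            # A result on the very last line has nothing following it.
            if (expect != "")
                printf "%d: expected \"%s\" after: %s\n", NR, expect, last
        }
    ' "$1"
}
```

Run against the CentOS 6 excerpts above, a check like this would flag the "success" entries that are followed only by SMTP alert messages.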

Can you provide the output of keepalived -v from both your Centos6 system and Centos7 system?

Did you build keepalived yourselves, or are you using the standard Centos packages?

It would also be helpful to have an understanding of your network topology in terms of what network interfaces exist, what addresses are configured on those interfaces, and what routes exist; the output of ip addr show and ip route show would be helpful.

This is a very big configuration, and so understanding exactly what setup you have would be really helpful.

Could you also please provide copies of your misc_scripts:

/etc/keepalived/HC/HC-Crossdomain
/etc/keepalived/HC/HC-DB-master
/etc/keepalived/HC/HC-DB-replica
/etc/keepalived/HC/HC-DNS
/etc/keepalived/HC/HC-Ping-Pong
/etc/keepalived/HC/HC-Redis-master

If you can provide the information and feedback based on the above points, I will have a look at it further.

pqarmitage commented 7 years ago

Could you also confirm whether the configuration file is exactly the same on the various systems, or whether there are some differences? If there are some differences, could you please either provide copies of the other configurations, or diffs if that is simpler.

igroost commented 7 years ago

@pqarmitage Yes, the problem we are seeing is only on CentOS 6. On CentOS 7 the behavior is correct.

We compiled keepalived ourselves. The keepalived config files on CentOS 6 and CentOS 7 are the same; the only difference is that the CentOS 7 server does not serve live traffic, since we use it for testing.

Attached are the files with the command output and the misc_scripts:

centos6_ip_addr_and_route.txt
centos7_ip_addr_and_route.txt
HC-Crossdomain.txt
HC-DB-master.txt
HC-DB-replica.txt
HC-DNS.txt
HC-Ping-Pong.txt
HC-Redis-master.txt

pqarmitage commented 7 years ago

Many thanks for the above. Could you please also provide the output of keepalived -v from both the CentOS6 and CentOS7 systems. The results will be different due to the kernel and libraries supporting different levels of functionality. This will allow me to build identical versions for my CentOS6 and CentOS7 VMs so that I can test to see if I get the same results.

I think it would also be useful to see if manually forcing a MISC_CHECK failure exhibits the problem you are experiencing on CentOS6. If it does, then we will have a reproducible problem that should make it much simpler to resolve.

Could you configure one of your MISC_CHECKs on the CentOS6 system to use vs.sh (attached as vs.sh.txt), with a line something like:

misc_path "/etc/keepalived/HC/vs.sh mc1"

This should create a file /tmp/mc1.ret that contains 0. If you then run

echo 1 >/tmp/mc1.ret

this should cause the MISC_CHECK to fail. After keepalived has processed the failure, run

echo 0 >/tmp/mc1.ret

The MISC_CHECK should then come back up, and we want to see whether the real server and IP addresses are added.

vs.sh.txt

You could also try the above test on a CentOS7 system to make sure that behaves as expected.
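The attached vs.sh is not reproduced in this thread. For readers without the attachment, the following is a hypothetical stand-in with the behaviour described above, not the actual script: it creates NAME.ret containing 0 on first use and returns whatever status the file holds. The function name misc_check_from_file and the VS_DIR override (default /tmp) are assumptions added for illustration.

```shell
# misc_check_from_file NAME
# Returns the status stored in ${VS_DIR:-/tmp}/NAME.ret, creating the file
# with 0 (check passes) if it does not exist yet.
misc_check_from_file() {
    ret_file="${VS_DIR:-/tmp}/$1.ret"
    [ -f "$ret_file" ] || echo 0 > "$ret_file"
    status=$(cat "$ret_file")
    case "$status" in
        # Empty or non-numeric content is treated as a failed check.
        ''|*[!0-9]*) return 1 ;;
    esac
    return "$status"
}
```

Saved as a script whose last line is misc_check_from_file "$1", the exit status becomes the MISC_CHECK result, so echo 1 >/tmp/mc1.ret forces a failure and echo 0 >/tmp/mc1.ret restores the check.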

igroost commented 7 years ago

@pqarmitage The output of keepalived -v on CentOS 6:

Keepalived v1.3.5 (03/19,2017), git commit v1.3.5-6-g6fa32f2
Copyright(C) 2001-2017 Alexandre Cassen, acassen@gmail.com
Build options: PIPE2 IPV4_DEVCONF LIBNL3 RTA_ENCAP RTA_EXPIRES RTA_NEWDST RTA_PREF RTA_VIA FRA_OIFNAME FRA_SUPPRESS_PREFIXLEN FRA_SUPPRESS_IFGROUP FRA_TUN_ID RTAX_CC_ALGO RTAX_QUICKACK LWTUNNEL_ENCAP_MPLS LWTUNNEL_ENCAP_ILA LINUX_NET_IF_H_COLLISION LVS LIBIPVS_NETLINK IPVS_DEST_ATTR_ADDR_FAMILY IPVS_SYNCD_ATTRIBUTES IPVS_64BIT_STATS VRRP VRRP_VMAC SOCK_NONBLOCK SOCK_CLOEXEC FIB_ROUTING INET6_ADDR_GEN_MODE SNMP_V3_FOR_V2 SNMP SNMP_KEEPALIVED SNMP_CHECKER SNMP_RFC SNMP_RFCV2 SNMP_RFCV3 SO_MARK

The output of keepalived -v on CentOS 7:

Keepalived v1.3.5 (03/19,2017), git commit v1.3.5-6-g6fa32f2
Copyright(C) 2001-2017 Alexandre Cassen, acassen@gmail.com
Build options: PIPE2 RTA_ENCAP RTA_EXPIRES FRA_OIFNAME FRA_TUN_ID RTAX_CC_ALGO RTAX_QUICKACK LINUX_NET_IF_H_COLLISION LVS VRRP VRRP_VMAC SOCK_NONBLOCK SOCK_CLOEXEC FIB_ROUTING INET6_ADDR_GEN_MODE SNMP_V3_FOR_V2 SNMP SNMP_KEEPALIVED SNMP_CHECKER SNMP_RFC SNMP_RFCV2 SNMP_RFCV3 SO_MARK

We will only be able to run this test on the production server on Friday morning.

I tried to reproduce the problem by replacing the MISC_CHECKs for VIP 10.9.200.55 with vs.sh, using misc_path "/etc/keepalived/HC/vs.sh mc1" for real server 1 and misc_path "/etc/keepalived/HC/vs.sh mc2" for real server 2. Using echo, I wrote 0 to mc1.ret and 1 to mc2.ret, and VIP 10.9.200.55 came up with real server 1; I then wrote 1 to mc1.ret and 0 to mc2.ret, and real server 2 came up. Switching back also worked correctly.

While analyzing the problem we noticed that it does not start to reproduce immediately, but only about 4-5 days after keepalived is started. I started the test keepalived on CentOS 6 today. On CentOS 7 the problem has not reproduced.

igroost commented 7 years ago

Are there any ways, besides viewing the logs, to diagnose the problem and gather more data without restarting keepalived?

pqarmitage commented 7 years ago

keepalived -v can be run while other instances of keepalived are running; it simply outputs what you have posted above and exits, so there is no need to wait to be able to take down the operational keepalived before running keepalived -v (it can even be run as a non root user).

Have you possibly got the output of keepalived -v on CentOS6 and CentOS7 above the wrong way round? What is listed as the CentOS6 output indicates a much newer kernel than the CentOS7 output. The presence of LIBIPVS_NETLINK in one output and not the other indicates that keepalived is updating the IPVS configuration in completely different ways on the two systems, so that is really helpful.

Now that I've got all the above information I'll try and see what happens on my systems.
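The difference between the two build-option lists, such as the LIBIPVS_NETLINK option noted above, can be extracted mechanically. This is a minimal sketch, assuming the output of keepalived -v from each system has been saved to a file and contains a single "Build options:" line as in the output posted above; the function names and file names are hypothetical.

```shell
# build_options FILE
# Prints the build options from saved "keepalived -v" output, one per line,
# sorted.
build_options() {
    sed -n 's/^.*Build options:[ ]*//p' "$1" | tr -s ' ' '\n' | grep . | sort -u
}

# diff_build_options FILE_A FILE_B
# Lists the options present in only one of the two builds.
diff_build_options() {
    a=$(mktemp)
    b=$(mktemp)
    build_options "$1" > "$a"
    build_options "$2" > "$b"
    echo "only in $1:"
    comm -23 "$a" "$b"
    echo "only in $2:"
    comm -13 "$a" "$b"
    rm -f "$a" "$b"
}
```

For example, diff_build_options centos6-v.txt centos7-v.txt would list options such as LIBIPVS_NETLINK under the file in which they appear.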

igroost commented 7 years ago

@pqarmitage Yes, we use different kernels on the CentOS 6 and CentOS 7 servers. Here is the output of uname -r on CentOS 6 and CentOS 7:

CentOS 6 4.9.34-1.1.x86_64

CentOS 7 3.10.0-514.26.2.el7.x86_64

pqarmitage commented 7 years ago

At the moment there isn't enough information to work out why you are experiencing the problem. I have produced a patch that will log more information whenever there is a change in the state of checkers so we can attempt to see what is the cause of the real_server not being added back in.

If you can apply the patch and rebuild keepalived, then once the patched version is running we will get more information about what is causing keepalived not to add back the real_server. It would be helpful if this could also be run on CentOS7 as a comparison.

Unfortunately once you experience the problem again, I will need to see the full keepalived logs from the time that keepalived was started, since I think it might be possible that some variables are being overwritten, and I will need to be able to track through from the time keepalived started to check this.

The patch is log_misc_changes.patch.txt

igroost commented 7 years ago

@pqarmitage Is this patch for version 1.3.5 or for the current master version from git?

pqarmitage commented 7 years ago

@igroost - that patch is for v1.3.5, or specifically for git commit 6fa32f2 which is the exact version of the code that you are using.

pqarmitage commented 7 years ago

@igroost - with apologies, I hadn't realised that files other than ipwrapper.c were included in the patch.

Attached is a simplified version of the patch that only patches ipwrapper.c

log_misc_changes.patch.txt

igroost commented 7 years ago

Thank you. We will probably install the patch next week, and as soon as the problem occurs I will send the keepalived logs.

igroost commented 7 years ago

@pqarmitage Hello. Today we installed the patched version on a production server; so far the problem has not reproduced. According to our observations, the problem starts to reproduce after about a week of continuous keepalived operation. As soon as I reproduce the problem I will let you know and attach the logs.

pqarmitage commented 6 years ago

@igroost - Has the problem recurred? If not, or there is no further update by the end of the year I will close this issue. If it subsequently recurs, then please add a further comment and we can reopen it.

igroost commented 6 years ago

@pqarmitage We updated the server kernel to 4.9.58-1.1.x86_64 and the problem stopped reproducing. If the problem suddenly recurs before the end of the year, I will write here; if not, you can close the ticket.