Tencent / phxpaxos

The Paxos library implemented in C++ that has been used in the WeChat production environment.

Deployed to production, some nodes cannot join #43

Closed coolcgp closed 7 years ago

coolcgp commented 7 years ago

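Problem description: After building and installing phxpaxos, I compiled the sample/phxecho example. On a single node, running it via run_echo.sh works fine. In the production environment, however, I ran the same commands as in run_echo.sh:

    ./phxecho 192.168.102.157:11111 192.168.102.157:11111,192.168.102.158:11112,192.168.102.159:11113
    ./phxecho 192.168.102.158:11112 192.168.102.157:11111,192.168.102.158:11112,192.168.102.159:11113
    ./phxecho 192.168.102.159:11113 192.168.102.157:11111,192.168.102.158:11112,192.168.102.159:11113

Nodes 158 and 159 echo output, but 157 only shows:

    run paxos ok
    echo server start, ip 192.168.102.157 port 11111
    please input: <echo req value>

Node 157 prints nothing after that, which suggests it never connected. Passwordless ssh is already set up. How can I solve this, or which part of the logs or code should I look at? Could it be a problem with my own protobuf/gmock setup?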

lynncui00 commented 7 years ago

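Machine 157 probably has no network connectivity to the other two machines, but you need to look at the logs to be sure. Set the pLogFunc function in options to emit logs, or simply set eLogLevel to LogLevel::LogLevel_Verbose to print logs directly to standard output.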

coolcgp commented 7 years ago

I can ping 157, and passwordless ssh to 157 works. I'll try printing the logs as you suggested right away. Thanks for your reply.

2017-01-10

coolcgp


coolcgp commented 7 years ago

2017-01-10_14-30-28

The log output is:

    Log file created at: 2017/01/09 22:29:13
    Running on machine: master
    Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
    E0109 22:29:13.471472 8566 logger_google.cpp:99] init_glog_warning_file
    E0109 22:29:13.476591 8567 logger_google.cpp:99] ^[[41;37m ERR(0): PN8phxpaxos13AcceptorStateE::Load empty database ^[[0m
    E0109 22:29:13.476608 8567 logger_google.cpp:99] ^[[41;37m ERR(0): PN8phxpaxos8DatabaseE::GetMinChosenInstanceID no min chosen instanceid ^[[0m
    E0109 22:29:13.477138 8570 logger_google.cpp:99] STATUS(0): PN8phxpaxos7CleanerE::run sleep a while, max deleted instanceid 0 checkpoint instanceid (no checkpoint) now instanceid 0
    E0109 22:29:14.471451 8570 logger_google.cpp:99] STATUS(0): PN8phxpaxos7CleanerE::run sleep a while, max deleted instanceid 0 checkpoint instanceid (no checkpoint) now instanceid 0
    E0109 22:29:15.425976 8570 logger_google.cpp:99] STATUS(0): PN8phxpaxos7CleanerE::run sleep a while, max deleted instanceid 0 checkpoint instanceid (no checkpoint) now instanceid 0
    E0109 22:29:15.972389 8570 logger_google.cpp:99] STATUS(0): PN8phxpaxos7CleanerE::run sleep a while, max deleted instanceid 0 checkpoint instanceid (no checkpoint) now instanceid 0
    E0109 22:29:16.479308 8573 logger_google.cpp:99] ^[[41;37m ERR: PN8phxpaxos9EventLoopE::OnError event error, events 28 socketfd 22 socket ip 192.168.102.158 errno 0 ^[[0m
    E0109 22:29:16.479523 8573 logger_google.cpp:99] ^[[41;37m ERR: PN8phxpaxos9EventLoopE::OnError event error, events 28 socketfd 23 socket ip 192.168.102.159 errno 0 ^[[0m
    E0109 22:29:16.681579 8573 logger_google.cpp:99] ^[[41;37m ERR: PN8phxpaxos12MessageEventE::ReConnect start, ip 192.168.102.158 ^[[0m
    E0109 22:29:16.683951 8573 logger_google.cpp:99] ^[[41;37m ERR: PN8phxpaxos12MessageEventE::ReConnect start, ip 192.168.102.159 ^[[0m
    E0109 22:29:16.684058 8573 logger_google.cpp:99] ^[[41;37m ERR: PN8phxpaxos9EventLoopE::OnError event error, events 28 socketfd 22 socket ip 192.168.102.158 errno 115 ^[[0m
    E0109 22:29:16.684108 8573 logger_google.cpp:99] ^[[41;37m ERR: PN8phxpaxos9EventLoopE::OnError event error, events 28 socketfd 23 socket ip 192.168.102.159 errno 115 ^[[0m
    E0109 22:29:16.770907 8570 logger_google.cpp:99] STATUS(0): PN8phxpaxos7CleanerE::run sleep a while, max deleted instanceid 0 checkpoint instanceid (no checkpoint) now instanceid 0
    E0109 22:29:16.883750 8573 logger_google.cpp:99] ^[[41;37m ERR: PN8phxpaxos12MessageEventE::ReConnect start, ip 192.168.102.158 ^[[0m
    E0109 22:29:16.883973 8573 logger_google.cpp:99] ^[[41;37m ERR: PN8phxpaxos12MessageEventE::ReConnect start, ip 192.168.102.159 ^[[0m
    E0109 22:29:16.884245 8573 logger_google.cpp:99] ^[[41;37m ERR: PN8phxpaxos9EventLoopE::OnError event error, events 28 socketfd 22 socket ip 192.168.102.158 errno 115 ^[[0m
    E0109 22:29:16.884270 8573 logger_google.cpp:99] ^[[41;37m ERR: PN8phxpaxos9EventLoopE::OnError event error, events 28 socketfd 23 socket ip 192.168.102.159 errno 115 ^[[0m

I also checked connections and ports with netstat (screenshots):

Problem node 157: [image: 157]
Working node 158: [image: 158]
Working node 159: [image: 159]

From 157, ping to 158 and 159 works and ssh to 158 and 159 works; the same tests in the reverse direction are fine as well. How do I pin down this problem? The problem machine is still unusable. Do I really have to reinstall everything, set up one machine, and clone the others from it one by one? 159 was cloned from 158 and works; I cloned one more machine, 160, and it works too. Only 157 is not a clone. Could that be the cause? Double-click the first image to enlarge it. In echo_server.cpp I set oOptions.eLogLevel=LogLevel::LogLevel_Verbose; Waiting online for a reply.

lynncui00 commented 7 years ago

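Judging from the netstat output on 159, the connection 159->157 is in the SYN_SENT state, which means something really is wrong with machine 157. What exactly is wrong is for you to track down. Neither ping nor ssh can confirm that the network is OK.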
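A connection stuck in SYN_SENT means the SYN went out but no reply came back, which is typical when a firewall on the target blocks that port. As a rough illustration of spotting such rows automatically, here is a minimal Python sketch. It assumes the usual Linux `netstat -tn` column layout; the function name `stuck_connections` and the sample text are made up for this example and are not taken from the screenshots in this thread.

```python
def stuck_connections(netstat_text):
    """Return (local, remote) address pairs whose TCP state is SYN_SENT."""
    stuck = []
    for line in netstat_text.splitlines():
        fields = line.split()
        # Assumed row layout: Proto Recv-Q Send-Q Local-Address Foreign-Address State
        if len(fields) >= 6 and fields[0].startswith("tcp") and fields[5] == "SYN_SENT":
            stuck.append((fields[3], fields[4]))
    return stuck


# Hypothetical sample in the shape of `netstat -tn` output (illustrative only).
SAMPLE = """\
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 192.168.102.159:11113   192.168.102.158:40112   ESTABLISHED
tcp        0      1 192.168.102.159:52044   192.168.102.157:11111   SYN_SENT
"""

for local, remote in stuck_connections(SAMPLE):
    print(local, "->", remote, "is stuck in SYN_SENT")
```

Feeding it the real output of `netstat -tn` on each node would list which peers never answer the handshake.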

coolcgp commented 7 years ago

It works now. It was indeed a network problem: 157 is the master node, and its settings made port 11111 inaccessible. Sharing the solution here:

First, test whether a TCP connection can be established.

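On the server side, listen on port 11111:

    nc -l 11111

On the client side, connect to 192.168.102.157:11111:

    nc 192.168.102.157 11111

Type something on the client and check whether the server prints it. If you get "Ncat: No route to host.", the TCP connection cannot be established. ping only exercises ICMP, and ssh only uses TCP port 22, so a successful ping and a successful ssh login do not guarantee that a TCP connection to another port can be established.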
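The nc check above can also be scripted so that every entry in the phxecho node list is probed the same way. The following is a minimal Python sketch under stated assumptions: `parse_nodes` and `probe` are hypothetical helper names, the 3-second timeout is an arbitrary choice, and the target node must already be listening (phxecho or `nc -l`) for the result to mean anything.

```python
import socket


def parse_nodes(node_list):
    """Split 'ip:port,ip:port,...' (the list format passed to phxecho) into (ip, port) pairs."""
    nodes = []
    for item in node_list.split(","):
        ip, port = item.strip().rsplit(":", 1)
        nodes.append((ip, int(port)))
    return nodes


def probe(ip, port, timeout=3.0):
    """Attempt a plain TCP connect; return 'ok' or a short description of the failure."""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return "ok"
    except socket.timeout:
        return "timed out"
    except ConnectionRefusedError:
        return "connection refused"
    except OSError as exc:
        # Covers errors such as "No route to host".
        return "failed: %s" % exc
```

Running `probe(ip, port)` for each pair returned by `parse_nodes("192.168.102.157:11111,192.168.102.158:11112,192.168.102.159:11113")` from every machine shows which direction and which port is blocked, independent of ping and ssh.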

Second, turn off the firewall and open the required port.

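1. On CentOS 7, stop the firewall first, then test again right away:

    systemctl stop firewalld.service #stop firewalld
    systemctl disable firewalld.service #keep firewalld from starting at boot

2. Install and configure iptables-services:

    sudo yum -y install iptables-services

   Add port 11111 to the ports the firewall allows by editing the rules file:

    sudo vim /etc/sysconfig/iptables
    -A INPUT -m state --state NEW -m tcp -p tcp --dport 11111 -j ACCEPT

3. After saving the settings above, restart the firewall:

    systemctl restart iptables.service #restart the firewall so the configuration takes effect
    systemctl enable iptables.service #start the firewall at boot

4. Rebooting the system is not recommended, but if it still does not work, a reboot should sort it out.

Thanks @lynncui00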
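The rule above opens only port 11111, which is what node 157 listens on. In this deployment 158 and 159 listen on 11112 and 11113, so a node with a similar firewall would need its own port opened instead. As a small sketch (the function name `accept_rules` is hypothetical, and the rule text is simply the one quoted above with the port substituted), the matching lines can be derived from the node list:

```python
RULE_TEMPLATE = "-A INPUT -m state --state NEW -m tcp -p tcp --dport {port} -j ACCEPT"


def accept_rules(node_list, local_ip):
    """Build one ACCEPT rule per port that local_ip listens on in 'ip:port,ip:port,...'."""
    rules = []
    for item in node_list.split(","):
        ip, port = item.strip().rsplit(":", 1)
        if ip == local_ip:
            rules.append(RULE_TEMPLATE.format(port=int(port)))
    return rules


nodes = "192.168.102.157:11111,192.168.102.158:11112,192.168.102.159:11113"
for rule in accept_rules(nodes, "192.168.102.158"):
    print(rule)
```

The printed line would still have to be added to /etc/sysconfig/iptables by hand, followed by the restart in step 3.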
gogobody commented 4 years ago

I ran into the same problem, but nc and socket connections between the nodes all work fine.