coolcgp closed this issue 7 years ago
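This may be because machine 157 cannot reach the other two machines over the network, but you will need to look at the logs to be sure.
You can enable logging by setting the pLogFunc function in options, or simply set eLogLevel to LogLevel::LogLevel_Verbose to print the logs straight to standard output.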
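As a rough illustration of the pLogFunc route, the sketch below shows a callback that formats each record and writes it to stderr. The (level, format, va_list) parameter list is an assumption about the callback type phxpaxos expects, so check it against the project's options header. StderrLogFunc, FormatLogRecord and FormatForTest are hypothetical names made up for this example.

```cpp
#include <cstdarg>
#include <cstdio>
#include <string>

// Format one log record into a string. The (level, format, va_list) shape is
// an assumption about what phxpaxos passes to pLogFunc, not a confirmed API.
std::string FormatLogRecord(int level, const char* format, va_list args) {
    char buf[1024];
    int n = std::vsnprintf(buf, sizeof(buf), format, args);
    if (n < 0) {
        return "[level " + std::to_string(level) + "] <format error>";
    }
    // Records longer than the buffer are truncated, not dropped.
    return "[level " + std::to_string(level) + "] " + buf;
}

// Hypothetical callback: writes every record to stderr, one per line.
void StderrLogFunc(const int level, const char* format, va_list args) {
    std::fprintf(stderr, "%s\n", FormatLogRecord(level, format, args).c_str());
}

// Variadic helper so the formatter can be exercised without phxpaxos.
std::string FormatForTest(int level, const char* format, ...) {
    va_list args;
    va_start(args, format);
    std::string out = FormatLogRecord(level, format, args);
    va_end(args);
    return out;
}
```

If the signature matches, the callback would presumably be installed with something like oOptions.pLogFunc = StderrLogFunc; before the node is started. The simpler alternative from the reply above is oOptions.eLogLevel = LogLevel::LogLevel_Verbose;.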
I can ping 157 and I can ssh into 157 without a password. I'll try the logging you suggested right away. Thanks for your reply.
2017-01-10
coolcgp
From: Haochuan Cui notifications@github.com Sent: 2017-01-10 12:38 Subject: Re: [tencent-wechat/phxpaxos] Some nodes cannot join after deploying to production (#43) To: "tencent-wechat/phxpaxos" phxpaxos@noreply.github.com Cc: "coolcgp" coolcgp@163.com, "Author" author@noreply.github.com
The log output is:
Log file created at: 2017/01/09 22:29:13
Running on machine: master
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
E0109 22:29:13.471472 8566 logger_google.cpp:99] init_glog_warning_file
E0109 22:29:13.476591 8567 logger_google.cpp:99] ERR(0): PN8phxpaxos13AcceptorStateE::Load empty database
E0109 22:29:13.476608 8567 logger_google.cpp:99] ERR(0): PN8phxpaxos8DatabaseE::GetMinChosenInstanceID no min chosen instanceid
E0109 22:29:13.477138 8570 logger_google.cpp:99] STATUS(0): PN8phxpaxos7CleanerE::run sleep a while, max deleted instanceid 0 checkpoint instanceid (no checkpoint) now instanceid 0
E0109 22:29:14.471451 8570 logger_google.cpp:99] STATUS(0): PN8phxpaxos7CleanerE::run sleep a while, max deleted instanceid 0 checkpoint instanceid (no checkpoint) now instanceid 0
E0109 22:29:15.425976 8570 logger_google.cpp:99] STATUS(0): PN8phxpaxos7CleanerE::run sleep a while, max deleted instanceid 0 checkpoint instanceid (no checkpoint) now instanceid 0
E0109 22:29:15.972389 8570 logger_google.cpp:99] STATUS(0): PN8phxpaxos7CleanerE::run sleep a while, max deleted instanceid 0 checkpoint instanceid (no checkpoint) now instanceid 0
E0109 22:29:16.479308 8573 logger_google.cpp:99] ERR: PN8phxpaxos9EventLoopE::OnError event error, events 28 socketfd 22 socket ip 192.168.102.158 errno 0
E0109 22:29:16.479523 8573 logger_google.cpp:99] ERR: PN8phxpaxos9EventLoopE::OnError event error, events 28 socketfd 23 socket ip 192.168.102.159 errno 0
E0109 22:29:16.681579 8573 logger_google.cpp:99] ERR: PN8phxpaxos12MessageEventE::ReConnect start, ip 192.168.102.158
E0109 22:29:16.683951 8573 logger_google.cpp:99] ERR: PN8phxpaxos12MessageEventE::ReConnect start, ip 192.168.102.159
E0109 22:29:16.684058 8573 logger_google.cpp:99] ERR: PN8phxpaxos9EventLoopE::OnError event error, events 28 socketfd 22 socket ip 192.168.102.158 errno 115
E0109 22:29:16.684108 8573 logger_google.cpp:99] ERR: PN8phxpaxos9EventLoopE::OnError event error, events 28 socketfd 23 socket ip 192.168.102.159 errno 115
E0109 22:29:16.770907 8570 logger_google.cpp:99] STATUS(0): PN8phxpaxos7CleanerE::run sleep a while, max deleted instanceid 0 checkpoint instanceid (no checkpoint) now instanceid 0
E0109 22:29:16.883750 8573 logger_google.cpp:99] ERR: PN8phxpaxos12MessageEventE::ReConnect start, ip 192.168.102.158
E0109 22:29:16.883973 8573 logger_google.cpp:99] ERR: PN8phxpaxos12MessageEventE::ReConnect start, ip 192.168.102.159
E0109 22:29:16.884245 8573 logger_google.cpp:99] ERR: PN8phxpaxos9EventLoopE::OnError event error, events 28 socketfd 22 socket ip 192.168.102.158 errno 115
E0109 22:29:16.884270 8573 logger_google.cpp:99] ERR: PN8phxpaxos9EventLoopE::OnError event error, events 28 socketfd 23 socket ip 192.168.102.159 errno 115
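The numbers in these log lines can be decoded with a few lines of C++. The sketch below rests on two assumptions that the thread does not confirm: that "events 28" is an epoll event bitmask, and that strings such as PN8phxpaxos9EventLoopE are mangled C++ type names. It targets Linux with GCC or Clang. DescribeEpollEvents, DescribeErrno and Demangle are hypothetical helper names.

```cpp
#include <cerrno>     // EINPROGRESS, which is 115 on common Linux targets
#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <cxxabi.h>
#include <string>
#include <sys/epoll.h>

// Turn an epoll event bitmask into a readable "A|B|C" string.
std::string DescribeEpollEvents(uint32_t events) {
    struct Flag { uint32_t bit; const char* name; };
    static const Flag kFlags[] = {
        {EPOLLIN, "EPOLLIN"},   {EPOLLPRI, "EPOLLPRI"}, {EPOLLOUT, "EPOLLOUT"},
        {EPOLLERR, "EPOLLERR"}, {EPOLLHUP, "EPOLLHUP"},
    };
    std::string out;
    for (const Flag& f : kFlags) {
        if (events & f.bit) {
            if (!out.empty()) out += "|";
            out += f.name;
        }
    }
    return out.empty() ? "none" : out;
}

// Human-readable text for an errno value, as reported by the C library.
std::string DescribeErrno(int err) {
    return std::strerror(err);
}

// Demangle an Itanium-ABI name; fall back to the input if that fails.
std::string Demangle(const char* mangled) {
    int status = 0;
    char* p = abi::__cxa_demangle(mangled, nullptr, nullptr, &status);
    std::string out = (status == 0 && p != nullptr) ? p : mangled;
    std::free(p);
    return out;
}
```

Under those assumptions, 28 decodes to EPOLLOUT|EPOLLERR|EPOLLHUP and errno 115 is EINPROGRESS, the value a non-blocking connect reports while the handshake is still pending. That would be consistent with node 157 repeatedly failing to complete its connections to 158 and 159, but it does not by itself say which side is at fault.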
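I also checked the connections and ports with netstat:
Problem node 157: [netstat screenshot]
Healthy node 158: [netstat screenshot]
Healthy node 159: [netstat screenshot]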
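From 157 I can ping 158 and 159 without problems, and ssh to 158 and 159 also works; testing in the reverse direction is fine too. How can I track this problem down? The problem machine is still unusable. Do I really have to reinstall, set up one machine, and then clone the others from it one by one? 159 was cloned from 158 and works, and I cloned another machine, 160, which also works. Only 157 was not cloned. Could that be the cause? Double-click the first image to enlarge it. In echo_server.cpp I set oOptions.eLogLevel=LogLevel::LogLevel_Verbose; I'm online and waiting for a reply.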
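Judging from the netstat output on 159, the 159->157 connection is in the SYN_SENT state, which shows that machine 157 really does have a problem. Exactly what the problem is, you will have to track down yourself. Neither ping nor ssh can confirm that the network is OK.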
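For anyone who wants to check for stuck SYN_SENT connections without netstat, here is a minimal sketch that scans text in the format of Linux's /proc/net/tcp, where state code 02 means SYN_SENT and addresses are printed as hexadecimal. It assumes a little-endian Linux machine and IPv4 only. DecodeEndpoint and FindSynSent are hypothetical helper names.

```cpp
#include <cstdio>
#include <sstream>
#include <string>
#include <vector>

// Convert "9D66A8C0:2B67" (the hex form used by /proc/net/tcp on a
// little-endian machine) into "192.168.102.157:11111".
std::string DecodeEndpoint(const std::string& hex) {
    unsigned int ip = 0, port = 0;
    if (std::sscanf(hex.c_str(), "%8X:%4X", &ip, &port) != 2) return "";
    char buf[32];
    std::snprintf(buf, sizeof(buf), "%u.%u.%u.%u:%u",
                  ip & 0xFF, (ip >> 8) & 0xFF, (ip >> 16) & 0xFF,
                  (ip >> 24) & 0xFF, port);
    return buf;
}

// Return the remote endpoints of all sockets in state 02 (SYN_SENT).
// The argument is the full text of /proc/net/tcp, header row included.
std::vector<std::string> FindSynSent(const std::string& proc_net_tcp) {
    std::vector<std::string> out;
    std::istringstream lines(proc_net_tcp);
    std::string line;
    std::getline(lines, line);  // skip the header row
    while (std::getline(lines, line)) {
        std::istringstream fields(line);
        std::string slot, local, remote, state;
        if (fields >> slot >> local >> remote >> state && state == "02") {
            out.push_back(DecodeEndpoint(remote));
        }
    }
    return out;
}
```

Reading /proc/net/tcp into a string on 159 and passing it to FindSynSent should list 192.168.102.157:11111 for as long as the connection attempt is hanging. A connection that stays in SYN_SENT usually means the SYN packets are going unanswered, for example because a firewall on the target silently drops them.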
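On the server side, listen on port 11111:
nc -l 11111
On the client side, connect to 192.168.102.157:11111:
nc 192.168.102.157 11111
Type something on the client and check whether the server prints it. If the client reports Ncat: No route to host., the TCP connection cannot be established. ping only shows that the IP is reachable over ICMP, and ssh only uses TCP port 22, so a successful ping and a successful ssh login do not guarantee that a TCP connection to another port can be established.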
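The same check can be done from C++ when nc is not installed. The sketch below attempts a non-blocking TCP connect to a given IPv4 address and port and reports the resulting errno. It is only an illustration using standard POSIX socket calls, not code from phxpaxos, and ProbeTcp is a hypothetical name.

```cpp
#include <arpa/inet.h>
#include <cerrno>
#include <cstdint>
#include <fcntl.h>
#include <netinet/in.h>
#include <poll.h>
#include <sys/socket.h>
#include <unistd.h>

// Try to open a TCP connection to ip:port within timeout_ms.
// Returns 0 on success, otherwise an errno value (ETIMEDOUT on timeout).
int ProbeTcp(const char* ip, uint16_t port, int timeout_ms) {
    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    if (inet_pton(AF_INET, ip, &addr.sin_addr) != 1) return EINVAL;

    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) return errno;
    int flags = fcntl(fd, F_GETFL, 0);
    fcntl(fd, F_SETFL, flags | O_NONBLOCK);

    int result = 0;
    if (connect(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) != 0) {
        if (errno != EINPROGRESS) {
            result = errno;  // failed immediately
        } else {
            // Handshake pending: wait until the socket becomes writable.
            pollfd pfd{fd, POLLOUT, 0};
            int n = poll(&pfd, 1, timeout_ms);
            if (n == 0) {
                result = ETIMEDOUT;
            } else if (n < 0) {
                result = errno;
            } else {
                // SO_ERROR holds the final outcome of the connect.
                socklen_t len = sizeof(result);
                if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &result, &len) != 0) {
                    result = errno;
                }
            }
        }
    }
    close(fd);
    return result;
}
```

Calling ProbeTcp("192.168.102.157", 11111, 3000) from 158 or 159 gives a quick read on the situation: 0 means the port is reachable, ECONNREFUSED means the host answered but nothing is listening, EHOSTUNREACH corresponds to the "No route to host" message (often a firewall reject rule), and ETIMEDOUT matches a connection stuck in SYN_SENT.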
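1. On CentOS 7, stop the firewall first, then try again right away.
systemctl stop firewalld.service # stop firewalld
systemctl disable firewalld.service # keep firewalld from starting at boot
2. Install and configure iptables-services.
sudo yum -y install iptables-services
Add port 11111 to the ports the firewall allows:
sudo vim /etc/sysconfig/iptables
-A INPUT -m state --state NEW -m tcp -p tcp --dport 11111 -j ACCEPT
systemctl restart iptables.service # restart the firewall so the change takes effect
systemctl enable iptables.service # start the firewall at boot
I ran into the same problem, but nc and socket connections between the nodes all work.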
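Problem description: after building and installing phxpaxos, I compiled the sample/phxecho example. On a single node it works fine when started via run_echo.sh. In the production environment, however, I again ran it the way run_echo.sh does:
./phxecho 192.168.102.157:11111 192.168.102.157:11111,192.168.102.158:11112,192.168.102.159:11113
./phxecho 192.168.102.158:11112 192.168.102.157:11111,192.168.102.158:11112,192.168.102.159:11113
./phxecho 192.168.102.159:11113 192.168.102.157:11111,192.168.102.158:11112,192.168.102.159:11113
158 and 159 produce output, but 157 only shows:
run paxos ok
echo server start, ip 192.168.102.157 port 11111
please input: <echo req value>
Node 157 prints nothing further, which suggests it did not connect successfully. Passwordless ssh is already set up. How can I solve this, or which part of the logs or code should I look at? Could it be a problem with how I configured protobuf/gmock?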