Closed ykgarfield closed 5 years ago
Thank you for your careful analysis. This is the first time we have run into this problem; let us take a look first. Thanks.
Judging from the symptoms, changing the IP of the same machine changed the group node information stored in BDB, so BDB mistakenly believes there are multiple nodes and waits for information from other nodes that never arrives, which causes the exception. Try to avoid switching IPs on a single machine. We run multi-node masters distributed across different physical machines, so we had not encountered this problem before.
@ykgarfield, I think the following sequence of operations caused this problem:
Your test machine has 2 IPs, a and b. First configuration: hostName is a and bdbHelperHost is a:9001; startup succeeds. Second configuration: hostName is b, but bdbHelperHost is still a:9001; the startup log then reports ADD HELP HOST.
You need to change the bdbHelperHost value from a to b. bdbHelperHost is the node used for data synchronization when a BDB cluster node starts for the first time. If hostName is b but bdbHelperHost is still a, BDB assumes the cluster has multiple nodes and waits for bdbHelperHost a to start communicating.
@ykgarfield
Thanks for your test and report.
This problem was caused by Oracle Berkeley DB Java Edition (BDB-JE) not being configured correctly.
In brief, BDB-JE can provide a replication environment among several nodes. When these nodes are started for the first time, they communicate with each other through the IP and port configured by bdbHelperHost in master.ini. The first node, e.g. node1, should configure its own IP and port as bdbHelperHost, and the following nodes should use the ip:port of node1 (or of any node that has already been started) as bdbHelperHost.
For example, suppose there are 3 nodes:
192.168.0.1 node1
192.168.0.2 node2
192.168.0.3 node3
You can first start node1 with bdbHelperHost=192.168.0.1:9001, then start node2 with bdbHelperHost=192.168.0.1:9001, and finally start node3 with bdbHelperHost=192.168.0.1:9001.
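The start-up rule above can be sketched in a few lines of Java. This is only an illustration: the class HelperHostPlan and its plan method are hypothetical names and are not part of TubeMQ or BDB-JE. It assumes the nodes are started in list order and that every node points at the first one.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not part of TubeMQ): derives the bdbHelperHost value
// each master would use when the nodes are started in the given order.
class HelperHostPlan {

    // Maps each node IP to the bdbHelperHost it should be configured with.
    // Assumption: the first node in the list is started first and acts as
    // the helper for itself and for every later node.
    static Map<String, String> plan(List<String> nodeIps, int port) {
        Map<String, String> result = new LinkedHashMap<>();
        if (nodeIps.isEmpty()) {
            return result;
        }
        String helper = nodeIps.get(0) + ":" + port;
        for (String ip : nodeIps) {
            result.put(ip, helper);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> nodes = Arrays.asList("192.168.0.1", "192.168.0.2", "192.168.0.3");
        plan(nodes, 9001).forEach((ip, helper) ->
                System.out.println("hostName=" + ip + " bdbHelperHost=" + helper));
    }
}
```

Any node that is already running could serve as the helper for later nodes; pointing everything at the first node is just the simplest choice.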
In your case, you first ran the TubeMQ master on the wired IP, say 192.168.0.1, with bdbHelperHost=192.168.0.1:9001. Then you ran the master on the wireless IP, say 192.168.0.2, still with bdbHelperHost=192.168.0.1:9001. The BDB-JE node on 192.168.0.1 no longer exists, so the master blocks.
For more information about BDB-JE, see BDB-JE.
I think changing bdbHelperHost and hostName together when you switch from the wired network to the wireless one will avoid this problem.
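A start-up check along these lines could catch the mismatch before the master blocks. This is a minimal sketch, not TubeMQ code: the class HelperHostCheck and its methods are hypothetical, and it assumes hostName and bdbHelperHost have already been read from master.ini as plain strings.

```java
import java.net.InetAddress;
import java.net.NetworkInterface;
import java.net.SocketException;
import java.net.UnknownHostException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical pre-flight check (not part of TubeMQ).
class HelperHostCheck {

    // Extracts the host part of a "host:port" value.
    static String hostOf(String bdbHelperHost) {
        int idx = bdbHelperHost.lastIndexOf(':');
        return idx < 0 ? bdbHelperHost.trim() : bdbHelperHost.substring(0, idx).trim();
    }

    // Returns true if the address is a loopback/wildcard address or is
    // bound to one of this machine's network interfaces.
    static boolean isLocal(String host) {
        try {
            InetAddress addr = InetAddress.getByName(host);
            if (addr.isAnyLocalAddress() || addr.isLoopbackAddress()) {
                return true;
            }
            return NetworkInterface.getByInetAddress(addr) != null;
        } catch (UnknownHostException | SocketException e) {
            return false;
        }
    }

    // Collects human-readable warnings; an empty list means nothing suspicious.
    static List<String> check(String hostName, String bdbHelperHost) {
        List<String> warnings = new ArrayList<>();
        String helper = hostOf(bdbHelperHost);
        if (!isLocal(hostName)) {
            warnings.add("hostName " + hostName + " is not bound to any local interface");
        }
        if (!helper.equals(hostName.trim())) {
            // Legitimate in a multi-node group, but wrong for a single master.
            warnings.add("bdbHelperHost host " + helper + " differs from hostName " + hostName
                    + "; a master must already be running at " + bdbHelperHost);
        }
        return warnings;
    }

    public static void main(String[] args) {
        check("192.168.0.2", "192.168.0.1:9001").forEach(System.out::println);
    }
}
```

The second warning is deliberately only a warning, because a helper on another machine is the normal case for the second and later nodes of a real cluster.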
@aloyszhang @gosonzhang Yes, it is better to do this. But sometimes we may forget to change the configuration or be unaware that it needs changing, and then the problem can be hard to troubleshoot. Should this be mentioned in the documentation to remind users?
@ykgarfield, you are right, we should elaborate on these details in the manual.
Problem description
While working on a wired network, I set hostName in master.ini to the wired network IP. I then switched to a wireless network, so the IP changed, and I set hostName to the wireless network IP. When I start the TubeMQ master, it blocks indefinitely and the log prints nothing further.
(optional) Reproducer snippet
Analyzing and debugging the source code shows that, if the machine IP has changed, RepUtils.ExceptionAwareCountDownLatch#awaitOrException() blocks. Tracing the source code further leads through the following calls, in order:
RepNode#run()
Elections#initiateElection()
Elections.ElectionThread#run()
Proposer#issueProposal()
Proposer#phase1()
Proposer#tallyPhase1Results()
TextProtocol.MessageExchange#messageExchange()
Note that in Proposer#issueProposal() the call phase1(quorumPolicy, proposal) keeps being retried, because this method always returns null. In Proposer#tallyPhase1Results(), the thing to focus on is MessageExchange, which is a task. In TextProtocol.MessageExchange#messageExchange() the connection fails and throws java.net.ConnectException: Connection refused: no further information. The exception is caught, but no error message is logged.
Although I set hostName to the wireless network IP, the target still uses the wired network IP. My guess is that it uses the metadata stored in bdbEnvHome.
(optional) Suggestions for an improvement
We should introduce some mechanism for inspection. In addition, the behavior of the sleepycat (Berkeley DB) library seems unreasonable to me.
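One possible form of such an inspection is a connection probe with a timeout that reports the failure instead of swallowing it. The sketch below is an assumption about how this could look, not existing TubeMQ or BDB-JE code; HelperProbe and probe are hypothetical names, and it uses only java.net.Socket from the standard library.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical connectivity probe (not part of TubeMQ or BDB-JE).
class HelperProbe {

    // Tries to open a TCP connection to host:port within timeoutMs.
    // Returns null on success, or a short description of the failure.
    static String probe(String host, int port, int timeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return null;
        } catch (IOException e) {
            // Surface the cause (e.g. ConnectException) rather than hiding it.
            return e.getClass().getSimpleName() + ": " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        String error = probe("192.168.0.1", 9001, 2000);
        if (error != null) {
            System.err.println("Cannot reach bdbHelperHost 192.168.0.1:9001 -> " + error);
        }
    }
}
```

Running such a probe before joining the replication group, and logging its result, would turn the silent retry loop described above into a visible error message.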