TubeMQ master startup blocks indefinitely if the machine's IP changes

ykgarfield commented 5 years ago

Problem description: While on a wired network, I set hostName in master.ini to the wired network's IP. I then switched to a wireless network, which changed the IP, so I updated hostName to the wireless IP and started the TubeMQ master. Startup blocks indefinitely, and the log stops after the following line:

(main) [INFO - com.tencent.tubemq.server.master.bdbstore.DefaultBdbStoreService.initEnvConfig(DefaultBdbStoreService.java:996)] ADD HELP HOST

// Startup blocks here; no further output

(optional) Reproducer snippet: After analyzing and debugging the source code, I found that when the machine's IP has changed, RepUtils.ExceptionAwareCountDownLatch#awaitOrException() blocks:

public boolean awaitOrException(long timeout, TimeUnit unit)
    throws InterruptedException,
           DatabaseException {
    // blocks here
    boolean done = super.await(timeout, unit);
    ...
}

Finally, I traced the source code to RepNode#run():

public void run() {
    ...
    if (nameIdPair.hasNullId() || !nodeType.isElectable()) {
        queryGroupForMembership();
    } else {
        // blocks here
        elections.initiateElection(group, electionQuorumPolicy);
        ...
    }
    ...
}

Next Elections#initiateElection():

public synchronized void initiateElection(RepGroupImpl newGroup, QuorumPolicy quorumPolicy, int maxRetries) {
    RetryPredicate retryPredicate =
        new RetryPredicate(repNode, maxRetries, countDownLatch);
    electionThread = new ElectionThread(quorumPolicy, retryPredicate,
                                        envImpl,
                                        (envImpl == null) ? null :
                                        envImpl.getName());
    electionThread.start();
    try {
        // blocks here
        /* Wait until we hear of some "new" election result */
        countDownLatch.await();
        ...
    }
}
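
The unbounded countDownLatch.await() above is why startup hangs without any message. As a minimal sketch of one alternative, assuming the caller is free to choose a timeout, a bounded wait could report the situation instead of hanging. BoundedLatchWait and awaitElection are hypothetical names used only for illustration; they are not BDB JE or TubeMQ APIs.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration; not code from BDB JE or TubeMQ.
final class BoundedLatchWait {

    /**
     * Waits for the latch for at most timeoutMs milliseconds.
     * Returns true if the latch was released in time, false on timeout
     * or if the waiting thread was interrupted.
     */
    static boolean awaitElection(CountDownLatch latch, long timeoutMs) {
        try {
            return latch.await(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            // Preserve the interrupt status for the caller.
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        // Nothing ever counts this latch down, mimicking an election
        // that never produces a result.
        CountDownLatch neverReleased = new CountDownLatch(1);
        if (!awaitElection(neverReleased, 200)) {
            System.err.println("No election result within 200 ms; "
                    + "check that every configured helper host is reachable.");
        }
    }
}
```

Whether a timeout is appropriate inside the election itself is a design question for the library; the sketch only shows that a bounded wait gives the operator something to act on.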

Next Elections.ElectionThread#run():

public void run() {
    ...
    winningProposal =
        proposer.issueProposal(quorumPolicy, retryPredicate);
    ...
}

Next Proposer#issueProposal():

public WinningProposal issueProposal(QuorumPolicy quorumPolicy, RetryPredicate retryPredicate) {
    while (retryPredicate.retry()) {
        try {
            final Proposal proposal = nextProposal();
            // Keep retrying
            final Phase1Result result1 = phase1(quorumPolicy, proposal);
            if (result1 == null) {
                continue;
            }
            ...
        }
    }
}

Note that phase1(quorumPolicy, proposal) is retried indefinitely, because the method always returns null.

Next Proposer#phase1():

private Phase1Result phase1(QuorumPolicy quorumPolicy, Proposal proposal) {
    ...
    Phase1Result result = tallyPhase1Results(proposal, compService);
    // always false
    if (haveQuorum(quorumPolicy, result.promisories.size())) {
        return result;
    }
    phase1NoQuorum.increment();

    // always return null
    return null;
}

Next Proposer#tallyPhase1Results():

private Phase1Result tallyPhase1Results(Proposal currentProposal, final FutureTrackingCompService<MessageExchange> compService) {
    ...
    new Utils.WithFutureExceptionHandler<MessageExchange>
        (compService, 2 * elections.getProtocol().getReadTimeout(),
         TimeUnit.MILLISECONDS, logger, elections.getRepImpl(), null) {
    ...
}

Focus on MessageExchange, which is a task:

public void run() {
    messageExchange();
}

Next TextProtocol.MessageExchange#messageExchange():

public void messageExchange() {

    DataChannel dataChannel = null;
    BufferedReader in = null;
    PrintWriter out = null;
    try {
        // When the node is on the wireless network but target still holds the wired network's IP,
        // the connection fails with java.net.ConnectException: Connection refused: no further information
        dataChannel =
            channelFactory.connect(
                target,
                new ConnectOptions().
                setTcpNoDelay(true).
                setOpenTimeout(openTimeoutMs).
                setReadTimeout(readTimeoutMs).
                setBlocking(true).
                setReuseAddr(true));
        ...
    } catch (java.net.SocketTimeoutException e){
        this.exception = e;
    } catch (SocketException e) {
        this.exception = e;
    } catch (IOException e) {
        this.exception = e;
    } catch (TextProtocol.InvalidMessageException ime) {
        ...
        this.exception = ime;
    } catch (ServiceConnectFailedException e) {
        this.exception = e;
    } catch (Exception e) {
        ...
    } finally {
        Utils.cleanup(logger, repImpl, formatter, dataChannel, in, out);
    }
}

Here the connection fails with java.net.ConnectException: Connection refused: no further information. The exception is caught and stored, but no error message is logged.

Although I changed hostName to the wireless network's IP, target still uses the wired network's IP. I guess it comes from the metadata stored in bdbEnvHome.

(optional) Suggestions for an improvement: We should introduce some mechanism for detecting this situation. In addition, I feel this behavior of the sleepycat (Berkeley DB) library seems unreasonable.
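
As a rough sketch of what such a detection mechanism might look like, the following pre-flight check tries a plain TCP connection to a host and port before the master is started and reports the failure instead of staying silent. HelperHostProbe and probe are hypothetical names, and the default of 127.0.0.1:9001 in main is only a placeholder; nothing here is an existing TubeMQ or BDB JE API.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical pre-flight check; not part of TubeMQ or BDB JE.
final class HelperHostProbe {

    /**
     * Tries to open a TCP connection to host:port within timeoutMs.
     * Returns null on success, otherwise a short description of the failure.
     */
    static String probe(String host, int port, int timeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return null;
        } catch (IOException e) {
            // Covers connection refused, timeouts, and unresolvable hosts.
            return "cannot reach " + host + ":" + port + " (" + e + ")";
        }
    }

    public static void main(String[] args) {
        // Placeholder defaults; pass the real host and port as arguments.
        String host = args.length > 0 ? args[0] : "127.0.0.1";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 9001;
        String failure = probe(host, port, 2000);
        System.out.println(failure == null ? "helper host reachable" : failure);
    }
}
```

A failed probe is only a hint: a node that is being started for the first time and lists itself as the helper will not be listening yet, so the check is mainly useful when the address is expected to belong to a node that is already running.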

gosonzhang commented 5 years ago

Thank you for your careful analysis. This is the first time we have encountered this problem; let us look into it first. Thanks!

gosonzhang commented 5 years ago

Judging from the symptoms, changing the IP of the same machine changed the group node information stored in BDB, so BDB mistakenly believes the group has multiple nodes and waits for information from other nodes that never arrives, which causes the hang. Please try to avoid switching the IP of a single machine. We run multi-node masters distributed across different physical machines, so we had not encountered this problem before.
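
To make that reasoning concrete, here is a toy model of the quorum check. It assumes a simple-majority rule (more than half of the electable nodes must respond), which is an assumption made for illustration and is not confirmed by this thread. QuorumModel and its methods are hypothetical names, not BDB JE code.

```java
// Toy model only; the simple-majority rule is an assumption, not BDB JE code.
final class QuorumModel {

    /** Number of promises needed for a simple majority of the electable nodes. */
    static int simpleMajority(int electableNodes) {
        return electableNodes / 2 + 1;
    }

    /** True if the promises received are enough for a simple majority. */
    static boolean haveQuorum(int electableNodes, int promises) {
        return promises >= simpleMajority(electableNodes);
    }

    public static void main(String[] args) {
        // One real node, and the group metadata lists one node: quorum is reached.
        System.out.println("group of 1, 1 promise: " + haveQuorum(1, 1));
        // One real node, but the metadata also lists an address that no longer
        // exists: only one promise can ever arrive, so quorum is never reached.
        System.out.println("group of 2, 1 promise: " + haveQuorum(2, 1));
    }
}
```

Under this assumption, a proposer that sees two members but can reach only itself would fail the quorum check on every attempt, which matches the endless retry described in the report.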

gosonzhang commented 5 years ago

@ykgarfield, I think you ran into this problem through the following sequence of operations:

Your test machine has two IPs, a and b. First configuration: hostName is a and bdbHelperHost is a:9001; startup succeeds. Second configuration: hostName is b, but bdbHelperHost is still a:9001; the startup log then stops at ADD HELP HOST.

You need to change bdbHelperHost from a to b. bdbHelperHost is the node that a BDB cluster node synchronizes data with on its initial startup. If hostName is b but bdbHelperHost is still a, BDB believes the cluster has multiple nodes and waits for bdbHelperHost a to start communicating.

aloyszhang commented 5 years ago

@ykgarfield Thanks for testing and reporting this. The problem is caused by Oracle Berkeley DB Java Edition (BDB-JE) not being configured correctly.

In brief, BDB-JE provides a replicated environment across several nodes. When these nodes start for the first time, they communicate with each other using the IP and port configured as bdbHelperHost in master.ini. The first node, e.g. node1, should configure its own IP and port as bdbHelperHost, and each subsequent node should use the ip:port of node1 (or of any node that has already started) as bdbHelperHost.

For example, with 3 nodes:

192.168.0.1 node1
192.168.0.2 node2
192.168.0.3 node3

you can first start node1 with bdbHelperHost=192.168.0.1:9001, then start node2 with bdbHelperHost=192.168.0.1:9001, and finally start node3 with bdbHelperHost=192.168.0.1:9001.

In your case, you first ran the TubeMQ master on the wired IP, say 192.168.0.1, with bdbHelperHost=192.168.0.1:9001. You then ran the master on the wireless IP, say 192.168.0.2, still with bdbHelperHost=192.168.0.1:9001. No BDB-JE node actually exists at 192.168.0.1 any more, so startup blocks. For more information about BDB-JE, see BDB-JE.

I think that changing bdbHelperHost and hostName together when switching from the wired to the wireless network will avoid this problem.
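
As a small sketch of how this rule could be checked before startup, the helper below compares the host part of bdbHelperHost with hostName. HelperHostConfig, hostOf, and pointsToSelf are hypothetical names; the sketch assumes the two values have already been read from master.ini as plain strings and that bdbHelperHost has the host:port form used in this thread.

```java
// Hypothetical configuration check; not part of TubeMQ.
final class HelperHostConfig {

    /** Extracts the host part of a "host:port" string. */
    static String hostOf(String hostPort) {
        int idx = hostPort.lastIndexOf(':');
        if (idx <= 0) {
            throw new IllegalArgumentException("expected host:port, got " + hostPort);
        }
        return hostPort.substring(0, idx).trim();
    }

    /** True if bdbHelperHost names the same host as hostName. */
    static boolean pointsToSelf(String hostName, String bdbHelperHost) {
        return hostOf(bdbHelperHost).equalsIgnoreCase(hostName.trim());
    }

    public static void main(String[] args) {
        // Values from the scenario described above.
        String hostName = "192.168.0.2";
        String bdbHelperHost = "192.168.0.1:9001";
        if (!pointsToSelf(hostName, bdbHelperHost)) {
            System.err.println("bdbHelperHost " + bdbHelperHost
                    + " does not match hostName " + hostName
                    + "; make sure a master is already running at that address.");
        }
    }
}
```

A mismatch is not an error in itself, because in a multi-node group the second and third nodes legitimately point at another node. It is a warning worth printing on a single-node setup, where the helper is expected to be the node itself.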

ykgarfield commented 5 years ago

@aloyszhang @gosonzhang Yes, that is the better approach. But sometimes we may forget to change the configuration, or be unaware that it is needed, and then the problem can be hard to troubleshoot. Should this be mentioned in the documentation to remind users?

gosonzhang commented 5 years ago

@ykgarfield, you are right; we should elaborate on these details in the manual.

Tencent / TubeMQ

TubeMQ master startup blocks indefinitely if the machine's IP changes #37