Why doesn't Replicator need to set a timeout when sending AppendEntriesRequest?

Slontia commented 2 years ago

int Replicator::start(const ReplicatorOptions& options, ReplicatorId *id) {
    if (options.log_manager == NULL || options.ballot_box == NULL
            || options.node == NULL) {
        LOG(ERROR) << "Invalid arguments, group " << options.group_id;
        return -1;
    }
    Replicator* r = new Replicator();
    brpc::ChannelOptions channel_opt;
    //channel_opt.connect_timeout_ms = *options.heartbeat_timeout_ms;
    channel_opt.timeout_ms = -1; // We don't need RPC timeout (Why?)
    if (r->_sending_channel.Init(options.peer_id.addr, &channel_opt) != 0) {
        LOG(ERROR) << "Fail to init sending channel"
                   << ", group " << options.group_id;
        delete r;
        return -1;
    }
    ...
}

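While reading the braft code, I noticed that channel_opt.timeout_ms is deliberately set to -1 here. As a result, if a follower never replies after receiving an AppendEntriesRequest (for example because of a network error), Replicator::_on_rpc_returned is never called back on the leader, and the follower's log falls further and further behind.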

What was the reasoning behind disabling the RPC timeout outright, rather than using a retry mechanism to improve fault tolerance?

chenzhangyi commented 2 years ago

Retries can cause an avalanche. When the network fails, the connection is broken, and fewer RPCs are sent.
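To make the avalanche concern concrete, here is a toy single-threaded simulation. It is not braft or brpc code: `SimResult`, `simulate`, and every number are made up for illustration. It assumes the follower serves queued requests strictly one at a time, and that the leader ignores the reply to any request it has already timed out and resent.

```cpp
#include <algorithm>
#include <cstddef>
#include <deque>

// Outcome of one simulated run (hypothetical type, for illustration only).
struct SimResult {
    int requests_sent;            // requests the leader put on the wire
    std::size_t max_queue_depth;  // worst backlog seen on the follower
    int replies_accepted;         // replies that matched the leader's latest send
};

// Simulate one leader replicating to one slow follower, in 1 ms ticks.
// The follower needs service_ms (>= 1) per request and works in FIFO order.
// timeout_ms < 0 means "no RPC timeout": the leader just waits for the reply.
// Otherwise the leader resends whenever timeout_ms elapses without a reply.
SimResult simulate(int duration_ms, int service_ms, int timeout_ms) {
    SimResult res = {0, 0, 0};
    std::deque<int> follower_queue;  // ids of requests queued on the follower
    int busy_left = 0;               // ms of work left on the queue head
    int latest_id = -1;              // id of the leader's most recent send
    int waited = 0;                  // ms since that send
    bool waiting = false;            // leader is waiting for a reply

    for (int t = 0; t < duration_ms; ++t) {
        // Leader: send when idle, or resend when the timeout has expired.
        bool timed_out = waiting && timeout_ms >= 0 && waited >= timeout_ms;
        if (!waiting || timed_out) {
            follower_queue.push_back(++latest_id);
            ++res.requests_sent;
            waiting = true;
            waited = 0;
        }
        res.max_queue_depth =
            std::max(res.max_queue_depth, follower_queue.size());

        // Follower: spend this tick on the request at the head of the queue.
        if (!follower_queue.empty()) {
            if (busy_left == 0) busy_left = service_ms;
            if (--busy_left == 0) {
                int done_id = follower_queue.front();
                follower_queue.pop_front();
                // Replies to requests that were already resent are stale.
                if (done_id == latest_id) {
                    ++res.replies_accepted;
                    waiting = false;
                }
            }
        }
        ++waited;
    }
    return res;
}
```

Under these assumptions, `simulate(10000, 500, -1)` (no timeout) keeps at most one request queued on the follower, while `simulate(10000, 500, 100)` (resend every 100 ms) sends five times as many requests, piles them up on the follower, and never receives a reply it can use. A real system has more moving parts, so treat this only as a picture of why retrying against a slow peer can make things worse.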

Slontia commented 2 years ago

Got it, thanks~

baidu / braft

Why doesn't Replicator need to set a timeout when sending AppendEntriesRequest? #331