baidu / braft

An industrial-grade C++ implementation of RAFT consensus algorithm based on brpc, widely used inside Baidu to build highly-available distributed systems.
Apache License 2.0

Why doesn't Replicator need to set a timeout when sending AppendEntriesRequest? #331

Closed: Slontia closed this issue 2 years ago

Slontia commented 2 years ago
int Replicator::start(const ReplicatorOptions& options, ReplicatorId *id) {
    if (options.log_manager == NULL || options.ballot_box == NULL
            || options.node == NULL) {
        LOG(ERROR) << "Invalid arguments, group " << options.group_id;
        return -1;
    }
    Replicator* r = new Replicator();
    brpc::ChannelOptions channel_opt;
    //channel_opt.connect_timeout_ms = *options.heartbeat_timeout_ms;
    channel_opt.timeout_ms = -1; // We don't need RPC timeout (Why?)
    if (r->_sending_channel.Init(options.peer_id.addr, &channel_opt) != 0) {
        LOG(ERROR) << "Fail to init sending channel"
                   << ", group " << options.group_id;
        delete r;
        return -1;
    }
    ...
}

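While reading the braft code, I noticed that channel_opt.timeout_ms is deliberately set to -1 here. As a result, if a follower receives an AppendEntriesRequest but never replies (for example, because of a network error), the leader never invokes Replicator::_on_rpc_returned, and the follower's log falls further and further behind.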

What was the reasoning behind disabling the RPC timeout entirely, rather than using a retry mechanism to improve fault tolerance?

chenzhangyi commented 2 years ago

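Retries can cause an avalanche. When the network fails, the connection is broken, and not retrying keeps the number of RPCs down.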
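
To make the avalanche point concrete, here is a minimal sketch of how timeout-driven retries multiply the load on a peer that has stopped answering. It is a toy model and not braft or brpc code: the function `offered_load`, its parameters, and the "fewest retries served first" rule are all assumptions made up for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Toy model (hypothetical, not braft/brpc code). Each tick, `fresh` new
// requests arrive and the server can answer at most `capacity` requests.
// Anything unanswered "times out" and, if it has retries left, is sent
// again on the next tick. Returns the requests offered to the server per tick.
std::vector<long> offered_load(long fresh, long capacity, int max_retry, int ticks) {
    assert(fresh >= 0 && capacity >= 0 && max_retry >= 0 && ticks >= 0);
    // in_flight[k] = requests arriving this tick that were already retried k times.
    std::vector<long> in_flight(max_retry + 1, 0);
    std::vector<long> offered;
    for (int t = 0; t < ticks; ++t) {
        in_flight[0] = fresh;
        long total = 0;
        for (long n : in_flight) total += n;
        offered.push_back(total);

        // Assumption: the server answers requests with the fewest retries first.
        long budget = capacity;
        std::vector<long> next(max_retry + 1, 0);
        for (int k = 0; k <= max_retry; ++k) {
            long served = std::min(in_flight[k], budget);
            budget -= served;
            long timed_out = in_flight[k] - served;
            if (k < max_retry) {
                next[k + 1] = timed_out;  // timed out, retry on the next tick
            }
            // Otherwise retries are exhausted and the request is dropped.
        }
        in_flight = next;
    }
    return offered;
}
```

In this model, with 100 new requests per tick and a server that answers nothing, `offered_load(100, 0, 3, 5)` yields 100, 200, 300, 400, 400, whereas `offered_load(100, 0, 0, 5)` stays flat at 100. Under these assumptions a retry budget of 3 quadruples the traffic aimed at a peer that is already in trouble.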

Slontia commented 2 years ago

Got it, thanks!