baidu / braft

An industrial-grade C++ implementation of RAFT consensus algorithm based on brpc, widely used inside Baidu to build highly-available distributed systems.

bootstrap always reports an error #122

Closed hjk41 closed 5 years ago

hjk41 commented 5 years ago

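Data already exists at initialization, so I call bootstrap, but bootstrap always reports an error. The program can keep running afterwards, but it then hits all sorts of strange bugs, as if the stack had been corrupted.

I don't know whether the stack corruption is related to this error, or why the error occurs in the first place.

The error output is as follows: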

[braft/src/braft/log.cpp:629]: Use crc32c as the checksum type of appending entries
[braft/src/braft/log.cpp:1015]: load open segment, path: ./ha_log/log first_index: 1025
[braft/src/braft/snapshot.cpp:432]: Deleting ./ha_log/snapshot/temp
[braft/src/braft/snapshot_executor.cpp:242]: node :0.0.0.0:0:0 snapshot_load_done, last_included_index: 1024 last_included_term: 1
[braft/src/braft/node.cpp:384]: Check failed: _log_manager->last_log_index() == options.last_log_index (1027 vs 1024).
20190412000338.456: #0 0x000000f0f54f braft::NodeImpl::bootstrap()
20190412000338.456: #1 0x000000eb2cff braft::bootstrap()
20190412000338.456: #2 0x000000b979ba lgraph::HaStateMachine::Start()
20190412000338.456: #3 0x000000b9c660 LGraphService::Run()
20190412000338.456: #4 0x000000b5c913 main
20190412000338.456: #5 0x7fde41fbf830 __libc_start_main
20190412000338.456: #6 0x000000b584e9 _start

The code is as follows:

// Describe the initial group membership and storage locations for bootstrap.
braft::BootstrapOptions options;
if (options.group_conf.parse_from(config_.ha_init_config) != 0) {
    ERR_STREAM(logger_) << "Fail to parse configuration `" << config_.ha_init_config << "'";
    ::ns::StateMachine::Stop();
    return;
}
// The state machine that dumps the first snapshot; it stays owned by the caller.
options.fsm = this;
options.node_owns_fsm = false;
std::string prefix = "local://" + config_.ha_dir;
options.log_uri = prefix + "/log";
options.raft_meta_uri = prefix + "/raft_meta";
options.snapshot_uri = prefix + "/snapshot";
// Index of the last log entry covered by the imported data.
options.last_log_index = 1024;
// A non-zero return value means bootstrap failed.
if (braft::bootstrap(options)) {
    ERR_STREAM(logger_) << "Fail to init raft node";
    ::ns::StateMachine::Stop();
    return;
}

ipconfigme commented 5 years ago

[braft/src/braft/node.cpp:384]: Check failed: _log_manager->last_log_index() == options.last_log_index (1027 vs 1024).

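bootstrap checks whether last_log_index matches the value passed in, and reports an error if the two differ. Why do you need to call bootstrap at all? Also, are the arguments set correctly?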
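
To make the failed check concrete, here is a minimal standalone sketch of the comparison reported at node.cpp:384. It does not use braft: `check_bootstrap_index` and its parameter names are hypothetical, and the numbers come from the log in the original report (the log manager reported 1027, while bootstrap was given 1024).

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>

// Hypothetical stand-in for the CHECK that failed in the report. It models
// only the comparison and is not braft's implementation.
// Returns an empty string when the indexes agree, otherwise a message in the
// same "(actual vs expected)" shape as the log line.
std::string check_bootstrap_index(int64_t log_last_index,
                                  int64_t requested_last_log_index) {
    if (log_last_index == requested_last_log_index) {
        return "";
    }
    std::ostringstream msg;
    msg << "last_log_index mismatch (" << log_last_index << " vs "
        << requested_last_log_index << ")";
    return msg.str();
}
```

With the values from the report, `check_bootstrap_index(1027, 1024)` returns a non-empty message. Together with the "load open segment ... first_index: 1025" line, this suggests the log directory already held entries beyond the index passed to bootstrap.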

hjk41 commented 5 years ago

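My load_snapshot implementation has a problem; I suspect that is what caused braft to fail.

As for why I use bootstrap: my initial data is imported with a tool, so every replica already has data when it starts. Without bootstrap, a blank machine that joins later would not automatically receive the initial data. Then again, maybe I should just require users to make sure every machine they add already has the initial data...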

ipconfigme commented 5 years ago

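Wouldn't it be enough to use bootstrap only for the initial import? Adding and removing replicas afterwards would probably be better handled with braft's add_peer/remove_peer.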
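
One way to picture this split is a small startup decision: bootstrap only when imported data exists and the raft storage has never been initialized, and otherwise start normally and leave membership changes to add_peer/remove_peer. The sketch below is a standalone model under that assumption; `StartupAction`, `choose_startup_action`, and both flags are hypothetical names, not part of braft.

```cpp
#include <cassert>

// Hypothetical startup choices for a replica; these are not braft types.
enum class StartupAction {
    kBootstrapThenStart,  // first run on tool-imported data
    kStartOnly            // raft storage already exists, or the node is blank
};

// Decide whether this process should call bootstrap before starting its node.
// has_imported_data:        the initial data set was loaded by the import tool
// raft_storage_initialized: log/meta/snapshot storage already exists on disk
StartupAction choose_startup_action(bool has_imported_data,
                                    bool raft_storage_initialized) {
    if (has_imported_data && !raft_storage_initialized) {
        return StartupAction::kBootstrapThenStart;
    }
    // Restarts and blank nodes skip bootstrap; a blank node would be added to
    // the group later through add_peer instead.
    return StartupAction::kStartOnly;
}
```

Under this model, a restart of an already bootstrapped replica maps to `kStartOnly`, so bootstrap is never repeated against storage that already contains log entries.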

hjk41 commented 5 years ago

Yes, that is what I'm doing now. The problem was indeed in the snapshot; once I fixed that, everything worked.