apache / incubator-pegasus

Apache Pegasus - A horizontally scalable, strongly consistent and high-performance key-value store
https://pegasus.apache.org/
Apache License 2.0
1.99k stars 312 forks source link

data loss after restarting #719

Open ZhongChaoqiang opened 3 years ago

ZhongChaoqiang commented 3 years ago

Bug Report

Please answer these questions before submitting your issue. Thanks!

  1. What did you do? after restarting, data in replica my be loss. There are some messages in the log. E2020-12-03 21:27:57.929 (1607002077929057014 3b14) replica.replica0.04010000000000d9: replica_init.cpp:313:init_app_and_prepare_list(): 35.24@10.32.82.225:34801: open replica failed, err = ERR_INCOMPLETE_DATA

W2020-12-03 21:27:57.951 (1607002077951756963 3b14) replica.replica0.04010000000000d9: replica_init.cpp:188:load(): load_replica: {replica_dir_op} succeed to move directory '/opt/huawei/data1/lemondb_data/replica/reps/35.24.pegasus' to '/opt/huawei/data1/lemondb_data/replica/reps/35.24.pegasus.1607002077951725.err'

  1. What did you expect to see? restart Successfully

  2. What did you see instead? data loss after restarting

  3. What version of Pegasus are you using? 1.12.0

ZhongChaoqiang commented 3 years ago

下面是分析过程:

  1. last_durable_decree为上一次持久化的点,打开replica时从rocksdb的数据目录中读取出来,而init_durable_decree是从.init-info文件中读取出来。.init-info文件一般是在learn的时候生成,我们继续分析日志,发现replica在发生问题前是有发生learn的操作的,并且有如下日志: D2020-12-03 21:27:09.499 (1607002029499473085 7754) replica.rep_long3.0404000d000021bd: replica_learn.cpp:1075:on_copy_remote_state_completed(): 35.24@10.32.82.225:34801: on_copy_remote_state_completed[000001a600000002]: learnee = 10.32.82.16:34801, learn_duration = 5 ms, apply checkpoint/log done, err = ERR_OK, last_prepared_decree = (16369 => 16371), last_committed_decree = (16368 => 16371), app_committed_decree = (16368 => 16371), app_durable_decree = (16368 => 16368), remote_committed_decree = 16371, prepare_start_decree = -1, current_learning_status = replication::learner_status::LearningWithoutPrepare

可以发现经过learn后,last_committed_decree变成了16371,而app_durable_decree还是16368。我们继续分析代码,发现代码中learn后写入.init-info文件中的init_durable_decree的值并不是last_durable_decree的值,而是last_committed_decree的值。

err = _app->update_init_info(
      this,
      _stub->_log->on_partition_reset(get_gpid(), _app->last_committed_decree()),
      _private_log->on_partition_reset(get_gpid(), _app->last_committed_decree()),
      _app->last_committed_decree());

init_durable_decree应该是记录last_durable_decree的值,而不是last_committed_decree的值;last_committed_decree表示上一次已经提交的修改,但不一定已经持久化了。所以last_committed_decree总是大于或等于last_durable_decree的值的。

zhangyifan27 commented 3 years ago

@ZhongChaoqiang 能具体描述一下这个问题要怎么复现吗?我测试了下learn之后重启,并没有复现出ERR_INCOMPLETE_DATA相关的log。 另外出现这个错误之后,open replica失败,replica及相关数据会被删除,replica server会重新learn一份数据,应该是不会丢数据的,实际中你们发现丢数据的现象了吗?

看了下replica start这块的逻辑:https://github.com/apache/incubator-pegasus/blob/a948e89b180b6a5c82d298d0dcc65f7bb770a8be/src/server/pegasus_server_impl.cpp#L1706-L1752 _last_committed_decree被初始化为rocksdb的last_flushed_decree,随后_last_durable_decree被更新成和_last_committed_decree一样的值,如果出现last_durable_decree() < _info.init_durable_decree,说明replica重启之前可能没有正常flush,导致没有更新last_flushed_decree:https://github.com/apache/incubator-pegasus/blob/a948e89b180b6a5c82d298d0dcc65f7bb770a8be/src/server/pegasus_server_impl.cpp#L1805-L1812 我觉得你可以看下是不是replica server进程退出时没有执行flush导致的这个问题?

_last_committed_decree在数据成功写入log和memtable之后会更新,而_last_durable_decree好像是打checkpoint时才会更新(这部分不太确定, @neverchanje 可以解答下),不太确定.init_info里面记录_last_durable_decree有没有问题。

另外update_init_info 这块有个注释是说在learn之后需要再一次打checkpoint否则app::open_internal会出错(好像就是你提的这个问题): https://github.com/XiaoMi/rdsn/blob/master/src/replica/replica_learn.cpp#L769-L779 所以我觉得可能本身就是这么设计的,init_info里面需要写入_last_committed_decree,至于为什么可能需要其他committer解答下:)

ZhongChaoqiang commented 3 years ago

@zhangyifan27 谢谢你的恢复!不好意思啊,没有及时看到你的信息! 这个问题比较久远了,我们是在现网碰到的这个问题,具体怎么操作导致的这个问题当时也没有具体说明,但是应该是有多次的重启操作的。 由于我们把err的清理时间(gc_disk_error_replica_interval_seconds)设置过短了,当时的确是丢数据了。(primary的机器下线了,secondary的节点open的时候把数据移到了err目录),所以用户读不到数据了。

还有一个关键日志:在open replica的时候,打印了如下日志。

E2020-12-03 21:27:57.929 (1607002077929018092 3b14) replica.replica0.04010000000000d9: replication_app_base.cpp:347:open_internal(): 35.24@10.32.82.225:34801: replica data is not complete coz last_durable_decree(16368) < init_durable_decree(16371)
E2020-12-03 21:27:57.929 (1607002077929048291 3b14) replica.replica0.04010000000000d9: replication_app_base.cpp:353:open_internal(): 35.24@10.32.82.225:34801: open replica app return ERR_INCOMPLETE_DATA

这里表明,open replica的时候,我们记录的last_durable_decree是不能少于从.init-info读取出来的init_durable_decree的。因为这样有可能是以为丢失了数据,此时的replica数据应该是不完整的了,所以需要把数据移到err目录下。

怎么样保证last_durable_decree不少于init_durable_decree?由于我们对degree的机制了解的不是很深,目前我们想到的办法是last_committed_decree改成last_durable_decree,应该可以优化这个问题。

怎么重现这个问题我好像没有想到好的方法,所以也是从代码上去分析这个问题的。麻烦你们再帮忙看看是不是会存在这样的问题?谢谢!