baidu / braft

An industrial-grade C++ implementation of RAFT consensus algorithm based on brpc, widely used inside Baidu to build highly-available distributed systems.

Data corruption detected in on_apply on a follower node #346

Open niukuo opened 2 years ago

niukuo commented 2 years ago

Version: v1.1.2. Environment: a single raft group in the process, 5 nodes in total, under fairly heavy load. Symptom: one follower detected corrupted data in on_apply. No leader switch occurred in the hours before or after the apply, and the data on the leader and the other 3 nodes is correct. Analysis shows that the data in the segment is consistent with its checksum, so this is not disk corruption; the data must already have been wrong when the checksum was computed before the entry was written to disk. The raft log entry is 27533 bytes. A total of 7 spots in the data were corrupted, with modifications of 1, 2, and 4 bytes. The modified data is listed below; the three columns are offset, correct data, corrupted data.

std::map<int64_t, std::pair<std::string, std::string> > reps = {
    { 0x2fb1, { "\x6f\x72\x2f\x73" /* "or/s" */, "\x21\x5c\x8d\x76" } },
    { 0x2fb7, { "\x5f\x61" /* "_a" */, "\x07\x62" } },
    { 0x2fc1, { "\x73", "\x74" } },
    { 0x2fd3, { "\x0a", "\x0c" } },
    { 0x3039, { "\x61\x0a\x5f\x2f", "\x2e\x98\xbc\x32" } },
    { 0x3051, { "\x31", "\x32" } },
    { 0x306b, { "\x6e", "\x70" } },
};
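For readers following along, here is a small self-contained sketch (my own, not part of the original report; the function and buffer names are made up) of how the table above can be checked against a known-good copy of the entry: every byte that differs between the good and the bad copy must fall inside one of the listed ranges and carry the listed values.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <utility>

// Hypothetical helper: `good` is a known-good copy of the 27533-byte entry
// (e.g. taken from the leader), `bad` is what was read back from the follower's
// segment, and `reps` is the offset -> {correct bytes, corrupted bytes} map above.
bool diff_matches_reps(
        const std::string& good, const std::string& bad,
        const std::map<int64_t, std::pair<std::string, std::string> >& reps) {
    if (good.size() != bad.size()) {
        return false;
    }
    for (size_t i = 0; i < good.size(); ++i) {
        if (good[i] == bad[i]) {
            continue;
        }
        // Find the listed range that should contain this mismatching byte.
        auto it = reps.upper_bound(static_cast<int64_t>(i));
        if (it == reps.begin()) {
            return false;  // mismatch before the first listed offset
        }
        --it;
        const size_t off = static_cast<size_t>(it->first);
        const std::string& correct = it->second.first;
        const std::string& corrupted = it->second.second;
        if (i >= off + correct.size() ||                        // outside the range
            good.compare(off, correct.size(), correct) != 0 ||  // wrong "good" bytes
            bad.compare(off, corrupted.size(), corrupted) != 0) // wrong "bad" bytes
        {
            return false;
        }
    }
    return true;
}
```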

Has anyone run into a similar problem, or can anyone suggest directions to investigate? Some additional verification: the segment file is not yet closed (inprogress) and is about 64 MB in size, covering log indices from , 6567 entries in total. Our application carries its own checksum, so we can confirm that only one log entry is corrupted, index 27900350452; the entries after it are intact.

Analysis of possible causes

  1. The leader sent corrupted data. Unlikely: the data on the leader and the other 3 followers has been verified against checksums and is correct.

  2. On-disk data corruption. Ruled out: the segment carries a checksum that verifies correctly on read, and manually flipping any single byte causes the read to fail.

  3. On the follower, the data's reference count was dropped to 0 prematurely somewhere between receiving the RPC request and writing the entry to disk. This needs a closer look at the braft code to see whether it is possible. Of the 7 modifications above, 2 are 4-byte overwrites and 4 are single bytes incremented by 1 or 2 (a contrived sketch of this failure class follows after this list).

  4. A bug in our own program, e.g. an illegal write through a wild pointer. Unlikely: the program has been in production for several years, with dozens of groups spread over more than 100 processes and half of the groups under very heavy load; we have never seen this problem before and have not had a coredump in a long time.

  5. Something else. If anyone has run into a similar problem, any pointers would be appreciated.
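To make hypothesis 3 concrete, here is a deliberately contrived sketch of the suspected failure class (plain C++, not braft code; every name in it is made up): if the last reference to the payload is dropped before the write path has computed the checksum and flushed the entry, the freed memory can be reused and partially overwritten, and the checksum is then computed over already-corrupted bytes, which is consistent with the on-disk data matching its checksum.

```cpp
#include <cstddef>
#include <memory>
#include <string>

// Contrived illustration of "reference count dropped to 0 too early".
// The write path keeps only a raw view of the payload; if the owning
// shared_ptr is released before write_to_disk() runs, the allocator may hand
// the memory to someone else, who scribbles a few bytes into it. The checksum
// is then computed over the corrupted buffer, so data and checksum still match
// on disk -- the symptom described above.
struct PendingWrite {
    const char* data;  // raw view, does not own the buffer
    size_t len;
};

PendingWrite enqueue(const std::shared_ptr<std::string>& payload) {
    return PendingWrite{payload->data(), payload->size()};
}

void write_to_disk(const PendingWrite& w) {
    // uint32_t checksum = crc32(w.data, w.len);  // computed AFTER the corruption
    // append_to_segment(w.data, w.len, checksum);
    (void)w;
}

int main() {
    auto payload = std::make_shared<std::string>(27533, 'x');
    PendingWrite w = enqueue(payload);
    payload.reset();                   // BUG: last reference released too early
    std::string intruder(27533, 'y');  // may reuse the freed memory and
                                       // overwrite a few bytes behind w.data
    write_to_disk(w);                  // persists corrupted bytes + matching checksum
    return 0;
}
```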

PFZheng commented 2 years ago

Is the data on disk intact?

niukuo commented 2 years ago

> Is the data on disk intact?

The data on disk is consistent with the on-disk checksum, but both reflect the corrupted data.

niukuo commented 2 years ago

Another observation: during normal operation, even when no leader switch happens, we occasionally see reject term_unmatched logs for AppendEntries requests coming from the leader.

[2022-01-03 11:22:19.361114] [WARNING] [110087] [external/com_github_brpc_braft/src/braft/node.cpp:2443] :node kv:11.53.255.151:7764:0 reject term_unmatched AppendEntries from 11.53.255.85:7764:0 in term 120 prev_log_index 98017631735 prev_log_term 120 local_prev_log_term 0 last_log_index 98017631733 entries_size 0 from_append_entries_cache: 0
[2022-01-03 11:22:22.985227] [WARNING] [111154] [external/com_github_brpc_braft/src/braft/node.cpp:1298] :node kv:11.53.255.151:7764:0 got error={type=StateMachineError, error_code=10002, error_text=`StateMachine meet critical error when applying one or more tasks since index=98017633802, none'}

niukuo commented 2 years ago

> Another observation: during normal operation, even when no leader switch happens, we occasionally see reject term_unmatched logs for AppendEntries requests coming from the leader.
>
> [2022-01-03 11:22:19.361114] [WARNING] [110087] [external/com_github_brpc_braft/src/braft/node.cpp:2443] :node kv:11.53.255.151:7764:0 reject term_unmatched AppendEntries from 11.53.255.85:7764:0 in term 120 prev_log_index 98017631735 prev_log_term 120 local_prev_log_term 0 last_log_index 98017631733 entries_size 0 from_append_entries_cache: 0
> [2022-01-03 11:22:22.985227] [WARNING] [111154] [external/com_github_brpc_braft/src/braft/node.cpp:1298] :node kv:11.53.255.151:7764:0 got error={type=StateMachineError, error_code=10002, error_text=`StateMachine meet critical error when applying one or more tasks since index=98017633802, none'}

We are on v1.1.2, which may be a fairly old version. This log message appears to have been fixed in a later commit, so please ignore it.

PFZheng commented 2 years ago

> Another observation: during normal operation, even when no leader switch happens, we occasionally see reject term_unmatched logs for AppendEntries requests coming from the leader.
>
> [2022-01-03 11:22:19.361114] [WARNING] [110087] [external/com_github_brpc_braft/src/braft/node.cpp:2443] :node kv:11.53.255.151:7764:0 reject term_unmatched AppendEntries from 11.53.255.85:7764:0 in term 120 prev_log_index 98017631735 prev_log_term 120 local_prev_log_term 0 last_log_index 98017631733 entries_size 0 from_append_entries_cache: 0
> [2022-01-03 11:22:22.985227] [WARNING] [111154] [external/com_github_brpc_braft/src/braft/node.cpp:1298] :node kv:11.53.255.151:7764:0 got error={type=StateMachineError, error_code=10002, error_text=`StateMachine meet critical error when applying one or more tasks since index=98017633802, none'}

It looks like the data in the follower's memory is already wrong.

niukuo commented 2 years ago

> > Another observation: during normal operation, even when no leader switch happens, we occasionally see reject term_unmatched logs for AppendEntries requests coming from the leader.
> >
> > [2022-01-03 11:22:19.361114] [WARNING] [110087] [external/com_github_brpc_braft/src/braft/node.cpp:2443] :node kv:11.53.255.151:7764:0 reject term_unmatched AppendEntries from 11.53.255.85:7764:0 in term 120 prev_log_index 98017631735 prev_log_term 120 local_prev_log_term 0 last_log_index 98017631733 entries_size 0 from_append_entries_cache: 0
> > [2022-01-03 11:22:22.985227] [WARNING] [111154] [external/com_github_brpc_braft/src/braft/node.cpp:1298] :node kv:11.53.255.151:7764:0 got error={type=StateMachineError, error_code=10002, error_text=`StateMachine meet critical error when applying one or more tasks since index=98017633802, none'}
>
> It looks like the data in the follower's memory is already wrong.

Yes. I have added an application-level checksum check in handle_append_entries_request; once the problem reproduces, the first step will be to confirm whether the data sent by the leader is correct.
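For context, here is a minimal sketch of the kind of application-level verification described (not the actual patch; how the checksum is embedded in each entry is application-specific, and the trailing-CRC32 layout below is only an assumption):

```cpp
#include <zlib.h>  // crc32()
#include <cstdint>
#include <cstring>
#include <string>

// Hypothetical layout: each application payload ends with a 4-byte CRC32 of the
// bytes that precede it. Returns true when the stored and recomputed CRCs match.
// A check like this can be run on the follower right before the entry is handed
// to the log storage, so a corrupted entry is caught before it is persisted and
// later rejected in on_apply.
bool verify_entry_checksum(const std::string& payload) {
    if (payload.size() < sizeof(uint32_t)) {
        return false;
    }
    const size_t data_len = payload.size() - sizeof(uint32_t);
    uint32_t stored = 0;
    std::memcpy(&stored, payload.data() + data_len, sizeof(stored));
    const uint32_t computed = static_cast<uint32_t>(
        crc32(0L, reinterpret_cast<const Bytef*>(payload.data()),
              static_cast<uInt>(data_len)));
    return stored == computed;
}
```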

vpshastry commented 5 months ago

Environment: version v1.1.1, DB: RocksDB

I have seen data inconsistency between replicas quite frequently. The (possibly) corrupted data matches on a couple of replicas, so I think this issue is more likely to be in braft than in RocksDB. We have not yet been able to pinpoint the cause; I'll provide more information when I have something.


ehds commented 5 months ago

> Environment: version v1.1.1, DB: RocksDB
>
> I have seen data inconsistency between replicas quite frequently. The (possibly) corrupted data matches on a couple of replicas, so I think this issue is more likely to be in braft than in RocksDB. We have not yet been able to pinpoint the cause; I'll provide more information when I have something.

Based on my previous experience, I think you should troubleshoot the problem from the following aspects:

  1. Can the on_apply function you implemented guarantee that all nodes produce the same data for the same log entry? For example, if something goes wrong while applying a log entry, do you revert to the correct log index? If not, some log entries may be applied twice (see the sketch after this list).
  2. Can the on_snapshot_save you implemented guarantee that the complete state of the database at the snapshot index has been persisted?
  3. Can the on_snapshot_load you implemented guarantee that the correct state is loaded from the snapshot?
  4. Besides, did you make any changes to the braft source code?
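As an illustration of point 1, here is a minimal sketch of an idempotent on_apply (my own example, not code from this thread; in practice the applied index would be persisted atomically with the data, e.g. in the same RocksDB write batch):

```cpp
#include <braft/raft.h>
#include <cstdint>

// Sketch of an idempotent StateMachine::on_apply: entries at or below the last
// index already applied to the local store are skipped, so replaying the log
// after a restart or an error cannot apply the same entry twice.
class MyStateMachine : public braft::StateMachine {
public:
    void on_apply(braft::Iterator& iter) override {
        for (; iter.valid(); iter.next()) {
            if (iter.index() <= _applied_index) {
                // Already applied; entries replayed from the log carry no done
                // closure, so skipping them is safe and keeps apply idempotent.
                continue;
            }
            if (!apply_to_store(iter.data(), iter.index())) {
                // Stop and report the error instead of silently moving on;
                // braft will put the node into an error state.
                iter.set_error_and_rollback();
                return;
            }
            _applied_index = iter.index();
            if (iter.done()) {
                iter.done()->Run();  // respond to the client on the leader
            }
        }
    }

private:
    // Writes `data` and the new applied index in one atomic batch (not shown).
    bool apply_to_store(const butil::IOBuf& data, int64_t index) {
        (void)data; (void)index;
        return true;
    }

    int64_t _applied_index = 0;  // recovered from the store on startup in practice
};
```
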
vpshastry commented 5 months ago

Thanks for your response.


> Based on my previous experience, I think you should troubleshoot the problem from the following aspects:
>
> 1. Can the on_apply function you implemented guarantee that all nodes produce the same data for the same log entry? For example, if something goes wrong while applying a log entry, do you revert to the correct log index? If not, some log entries may be applied twice.

We exit the VNode / replica group if we hit any issue while applying the log. We haven't seen any such exits so far, so it would be a surprise if this were the cause. The only other place where there might be a problem is the network layer.


> 2. Can the on_snapshot_save you implemented guarantee that the complete state of the database at the snapshot index has been persisted?
> 3. Can the on_snapshot_load you implemented guarantee that the correct state is loaded from the snapshot?

It's pretty straightforward code that takes a RocksDB snapshot, and we don't have any additional logic there, so I wouldn't suspect issues here. We also take the snapshot inline, within the context of the FSM call, which should largely rule out any inconsistency introduced by the snapshot / restore logic.
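For reference, a minimal sketch of the "inline snapshot" pattern described here, assuming on_snapshot_save checkpoints RocksDB directly into the snapshot writer's directory and completes synchronously before returning (my own illustration, not the poster's actual code):

```cpp
#include <braft/raft.h>
#include <brpc/closure_guard.h>
#include <rocksdb/db.h>
#include <rocksdb/utilities/checkpoint.h>
#include <cerrno>
#include <filesystem>
#include <string>

// Synchronous ("inline") snapshot: the whole checkpoint is produced inside the
// on_snapshot_save call itself and done->Run() fires before we return, so the
// snapshot reflects exactly the state at the snapshot index.
class MyStateMachine : public braft::StateMachine {
public:
    void on_apply(braft::Iterator& iter) override { /* ... */ }

    void on_snapshot_save(braft::SnapshotWriter* writer,
                          braft::Closure* done) override {
        brpc::ClosureGuard done_guard(done);  // runs done->Run() on scope exit
        const std::string dir = writer->get_path() + "/rocksdb_checkpoint";

        rocksdb::Checkpoint* checkpoint = nullptr;
        rocksdb::Status s = rocksdb::Checkpoint::Create(_db, &checkpoint);
        if (s.ok()) {
            s = checkpoint->CreateCheckpoint(dir);  // hard-links the SST files
        }
        delete checkpoint;
        if (!s.ok()) {
            done->status().set_error(EIO, "checkpoint failed: %s",
                                     s.ToString().c_str());
            return;
        }
        // Register every checkpointed file so braft ships it to lagging peers.
        for (const auto& entry : std::filesystem::directory_iterator(dir)) {
            writer->add_file("rocksdb_checkpoint/" +
                             entry.path().filename().string());
        }
    }

private:
    rocksdb::DB* _db = nullptr;  // assumed to be opened elsewhere
};
```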

(使用在线工具翻译) 调用 RocksDB 快照的代码非常简单,我们在那里没有任何额外的逻辑。所以,我不会怀疑这里的问题。我们还在 FSM 调用的上下文中以内联方式执行快照。这应该有助于消除快照/恢复逻辑产生的任何不一致。

> 4. Besides, did you make any changes to the braft source code?

Our only changes are to integrate with our build system, so there are no source code changes.
