baidu / braft

An industrial-grade C++ implementation of RAFT consensus algorithm based on brpc, widely used inside Baidu to build highly-available distributed systems.

Data corruption detected in on_apply on a follower node #346

Open niukuo opened 2 years ago

niukuo commented 2 years ago

Version: v1.1.2. Environment: a single raft group in the process, 5 nodes in total, under fairly heavy load. Symptom: one follower detected corrupted data in on_apply. No leader switch occurred in the hours before or after the apply, and the data on the leader and the other 3 nodes is correct. Analysis shows that the data in the segment is consistent with its checksum, so this is not disk corruption; the data must already have been wrong when the checksum was computed before the entry was written to disk. The raft log entry is 27533 bytes. A total of 7 spots in the data were corrupted, with modifications of 1, 2, and 4 bytes. The modified data is listed below; the three columns are offset, correct data, corrupted data.

std::map<int64_t, std::pair<std::string, std::string> > reps = {
    { 0x2fb1, { "\x6f\x72\x2f\x73" /* "or/s" */, "\x21\x5c\x8d\x76" } },
    { 0x2fb7, { "\x5f\x61" /* "_a" */, "\x07\x62" } },
    { 0x2fc1, { "\x73", "\x74" } },
    { 0x2fd3, { "\x0a", "\x0c" } },
    { 0x3039, { "\x61\x0a\x5f\x2f", "\x2e\x98\xbc\x32" } },
    { 0x3051, { "\x31", "\x32" } },
    { 0x306b, { "\x6e", "\x70" } },
};
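For readers following along, here is a small self-contained sketch (my own, not part of the original report; the function and buffer names are made up) of how the table above can be checked against a known-good copy of the entry: every byte that differs between the good and the bad copy must fall inside one of the listed ranges and carry the listed values.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <utility>

// Hypothetical helper: `good` is a known-good copy of the 27533-byte entry
// (e.g. taken from the leader), `bad` is what was read back from the follower's
// segment, and `reps` is the offset -> {correct bytes, corrupted bytes} map above.
bool diff_matches_reps(
        const std::string& good, const std::string& bad,
        const std::map<int64_t, std::pair<std::string, std::string> >& reps) {
    if (good.size() != bad.size()) {
        return false;
    }
    for (size_t i = 0; i < good.size(); ++i) {
        if (good[i] == bad[i]) {
            continue;
        }
        // Find the listed range that should contain this mismatching byte.
        auto it = reps.upper_bound(static_cast<int64_t>(i));
        if (it == reps.begin()) {
            return false;  // mismatch before the first listed offset
        }
        --it;
        const size_t off = static_cast<size_t>(it->first);
        const std::string& correct = it->second.first;
        const std::string& corrupted = it->second.second;
        if (i >= off + correct.size() ||                        // outside the range
            good.compare(off, correct.size(), correct) != 0 ||  // wrong "good" bytes
            bad.compare(off, corrupted.size(), corrupted) != 0) // wrong "bad" bytes
        {
            return false;
        }
    }
    return true;
}
```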

Has anyone run into a similar problem, or can anyone suggest directions to investigate? Some additional verification: the segment file is not yet closed (inprogress) and is about 64 MB in size, covering log indices from , 6567 entries in total. Our application carries its own checksum, so we can confirm that only one log entry is corrupted, index 27900350452; the entries after it are intact.

Analysis of possible causes

  1. The leader sent corrupted data. Unlikely: the data on the leader and the other 3 followers has been verified against checksums and is correct.

  2. On-disk data corruption. Ruled out: the segment carries a checksum that verifies correctly on read, and manually flipping any single byte causes the read to fail.

  3. On the follower, the data's reference count was dropped to 0 prematurely somewhere between receiving the RPC request and writing the entry to disk. This needs a closer look at the braft code to see whether it is possible. Of the 7 modifications above, 2 are 4-byte overwrites and 4 are single bytes incremented by 1 or 2 (a contrived sketch of this failure class follows after this list).

  4. A bug in our own program, e.g. an illegal write through a wild pointer. Unlikely: the program has been in production for several years, with dozens of groups spread over more than 100 processes and half of the groups under very heavy load; we have never seen this problem before and have not had a coredump in a long time.

  5. Something else. If anyone has run into a similar problem, any pointers would be appreciated.
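To make hypothesis 3 concrete, here is a deliberately contrived sketch of the suspected failure class (plain C++, not braft code; every name in it is made up): if the last reference to the payload is dropped before the write path has computed the checksum and flushed the entry, the freed memory can be reused and partially overwritten, and the checksum is then computed over already-corrupted bytes, which is consistent with the on-disk data matching its checksum.

```cpp
#include <cstddef>
#include <memory>
#include <string>

// Contrived illustration of "reference count dropped to 0 too early".
// The write path keeps only a raw view of the payload; if the owning
// shared_ptr is released before write_to_disk() runs, the allocator may hand
// the memory to someone else, who scribbles a few bytes into it. The checksum
// is then computed over the corrupted buffer, so data and checksum still match
// on disk -- the symptom described above.
struct PendingWrite {
    const char* data;  // raw view, does not own the buffer
    size_t len;
};

PendingWrite enqueue(const std::shared_ptr<std::string>& payload) {
    return PendingWrite{payload->data(), payload->size()};
}

void write_to_disk(const PendingWrite& w) {
    // uint32_t checksum = crc32(w.data, w.len);  // computed AFTER the corruption
    // append_to_segment(w.data, w.len, checksum);
    (void)w;
}

int main() {
    auto payload = std::make_shared<std::string>(27533, 'x');
    PendingWrite w = enqueue(payload);
    payload.reset();                   // BUG: last reference released too early
    std::string intruder(27533, 'y');  // may reuse the freed memory and
                                       // overwrite a few bytes behind w.data
    write_to_disk(w);                  // persists corrupted bytes + matching checksum
    return 0;
}
```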

PFZheng commented 2 years ago

Is the data on disk intact?

niukuo commented 2 years ago

> Is the data on disk intact?

The data on disk is consistent with the on-disk checksum, but both reflect the corrupted data.

niukuo commented 2 years ago

Another observation: during normal operation, even when no leader switch happens, we occasionally see reject term_unmatched logs for AppendEntries requests coming from the leader.

[2022-01-03 11:22:19.361114] [WARNING] [110087] [external/com_github_brpc_braft/src/braft/node.cpp:2443] :node kv:11.53.255.151:7764:0 reject term_unmatched AppendEntries from 11.53.255.85:7764:0 in term 120 prev_log_index 98017631735 prev_log_term 120 local_prev_log_term 0 last_log_index 98017631733 entries_size 0 from_append_entries_cache: 0
[2022-01-03 11:22:22.985227] [WARNING] [111154] [external/com_github_brpc_braft/src/braft/node.cpp:1298] :node kv:11.53.255.151:7764:0 got error={type=StateMachineError, error_code=10002, error_text=`StateMachine meet critical error when applying one or more tasks since index=98017633802, none'}

niukuo commented 2 years ago

> Another observation: during normal operation, even when no leader switch happens, we occasionally see reject term_unmatched logs for AppendEntries requests coming from the leader.
>
> [2022-01-03 11:22:19.361114] [WARNING] [110087] [external/com_github_brpc_braft/src/braft/node.cpp:2443] :node kv:11.53.255.151:7764:0 reject term_unmatched AppendEntries from 11.53.255.85:7764:0 in term 120 prev_log_index 98017631735 prev_log_term 120 local_prev_log_term 0 last_log_index 98017631733 entries_size 0 from_append_entries_cache: 0
> [2022-01-03 11:22:22.985227] [WARNING] [111154] [external/com_github_brpc_braft/src/braft/node.cpp:1298] :node kv:11.53.255.151:7764:0 got error={type=StateMachineError, error_code=10002, error_text=`StateMachine meet critical error when applying one or more tasks since index=98017633802, none'}

We are on v1.1.2, which may be a fairly old version. This log message appears to have been fixed in a later commit, so please ignore it.

PFZheng commented 2 years ago

> Another observation: during normal operation, even when no leader switch happens, we occasionally see reject term_unmatched logs for AppendEntries requests coming from the leader.
>
> [2022-01-03 11:22:19.361114] [WARNING] [110087] [external/com_github_brpc_braft/src/braft/node.cpp:2443] :node kv:11.53.255.151:7764:0 reject term_unmatched AppendEntries from 11.53.255.85:7764:0 in term 120 prev_log_index 98017631735 prev_log_term 120 local_prev_log_term 0 last_log_index 98017631733 entries_size 0 from_append_entries_cache: 0
> [2022-01-03 11:22:22.985227] [WARNING] [111154] [external/com_github_brpc_braft/src/braft/node.cpp:1298] :node kv:11.53.255.151:7764:0 got error={type=StateMachineError, error_code=10002, error_text=`StateMachine meet critical error when applying one or more tasks since index=98017633802, none'}

It looks like the data in the follower's memory is already wrong.

niukuo commented 2 years ago

> > Another observation: during normal operation, even when no leader switch happens, we occasionally see reject term_unmatched logs for AppendEntries requests coming from the leader.
> >
> > [2022-01-03 11:22:19.361114] [WARNING] [110087] [external/com_github_brpc_braft/src/braft/node.cpp:2443] :node kv:11.53.255.151:7764:0 reject term_unmatched AppendEntries from 11.53.255.85:7764:0 in term 120 prev_log_index 98017631735 prev_log_term 120 local_prev_log_term 0 last_log_index 98017631733 entries_size 0 from_append_entries_cache: 0
> > [2022-01-03 11:22:22.985227] [WARNING] [111154] [external/com_github_brpc_braft/src/braft/node.cpp:1298] :node kv:11.53.255.151:7764:0 got error={type=StateMachineError, error_code=10002, error_text=`StateMachine meet critical error when applying one or more tasks since index=98017633802, none'}
>
> It looks like the data in the follower's memory is already wrong.

Yes. I have added an application-level checksum check in handle_append_entries_request; once the problem reproduces, the first step will be to confirm whether the data sent by the leader is correct.
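For context, here is a minimal sketch of the kind of application-level verification described (not the actual patch; how the checksum is embedded in each entry is application-specific, and the trailing-CRC32 layout below is only an assumption):

```cpp
#include <zlib.h>  // crc32()
#include <cstdint>
#include <cstring>
#include <string>

// Hypothetical layout: each application payload ends with a 4-byte CRC32 of the
// bytes that precede it. Returns true when the stored and recomputed CRCs match.
// A check like this can be run on the follower right before the entry is handed
// to the log storage, so a corrupted entry is caught before it is persisted and
// later rejected in on_apply.
bool verify_entry_checksum(const std::string& payload) {
    if (payload.size() < sizeof(uint32_t)) {
        return false;
    }
    const size_t data_len = payload.size() - sizeof(uint32_t);
    uint32_t stored = 0;
    std::memcpy(&stored, payload.data() + data_len, sizeof(stored));
    const uint32_t computed = static_cast<uint32_t>(
        crc32(0L, reinterpret_cast<const Bytef*>(payload.data()),
              static_cast<uInt>(data_len)));
    return stored == computed;
}
```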

vpshastry commented 5 months ago

Environment: version v1.1.1, DB: RocksDB

I have seen data inconsistency between replicas quite frequently. The (possibly) corrupted data matches on a couple of replicas, so I think this issue is more likely to be in braft than in RocksDB. We have not yet been able to pinpoint the cause; I'll provide more information when I have something.


ehds commented 5 months ago

> Environment: version v1.1.1, DB: RocksDB
>
> I have seen data inconsistency between replicas quite frequently. The (possibly) corrupted data matches on a couple of replicas, so I think this issue is more likely to be in braft than in RocksDB. We have not yet been able to pinpoint the cause; I'll provide more information when I have something.

Based on my previous experience, I think you should troubleshoot the problem from the following aspects:

  1. Can the on_apply function you implemented guarantee that all nodes produce the same data for the same log entry? For example, if something goes wrong while applying a log entry, do you revert to the correct log index? If not, some log entries may be applied twice (see the sketch after this list).
  2. Can the on_snapshot_save you implemented guarantee that the complete state of the database at the snapshot index has been persisted?
  3. Can the on_snapshot_load you implemented guarantee that the correct state is loaded from the snapshot?
  4. Besides, did you make any changes to the braft source code?
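As an illustration of point 1, here is a minimal sketch of an idempotent on_apply (my own example, not code from this thread; in practice the applied index would be persisted atomically with the data, e.g. in the same RocksDB write batch):

```cpp
#include <braft/raft.h>
#include <cstdint>

// Sketch of an idempotent StateMachine::on_apply: entries at or below the last
// index already applied to the local store are skipped, so replaying the log
// after a restart or an error cannot apply the same entry twice.
class MyStateMachine : public braft::StateMachine {
public:
    void on_apply(braft::Iterator& iter) override {
        for (; iter.valid(); iter.next()) {
            if (iter.index() <= _applied_index) {
                // Already applied; entries replayed from the log carry no done
                // closure, so skipping them is safe and keeps apply idempotent.
                continue;
            }
            if (!apply_to_store(iter.data(), iter.index())) {
                // Stop and report the error instead of silently moving on;
                // braft will put the node into an error state.
                iter.set_error_and_rollback();
                return;
            }
            _applied_index = iter.index();
            if (iter.done()) {
                iter.done()->Run();  // respond to the client on the leader
            }
        }
    }

private:
    // Writes `data` and the new applied index in one atomic batch (not shown).
    bool apply_to_store(const butil::IOBuf& data, int64_t index) {
        (void)data; (void)index;
        return true;
    }

    int64_t _applied_index = 0;  // recovered from the store on startup in practice
};
```
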
vpshastry commented 5 months ago

Thanks for your response.


> Based on my previous experience, I think you should troubleshoot the problem from the following aspects:
>
> 1. Can the on_apply function you implemented guarantee that all nodes produce the same data for the same log entry? For example, if something goes wrong while applying a log entry, do you revert to the correct log index? If not, some log entries may be applied twice.

We exit the VNode / replica group if we hit any issue while applying the log. We haven't seen any such exits so far, so it would be a surprise if this were the cause. The only other place where there might be a problem is the network layer.


> 2. Can the on_snapshot_save you implemented guarantee that the complete state of the database at the snapshot index has been persisted?
> 3. Can the on_snapshot_load you implemented guarantee that the correct state is loaded from the snapshot?

It's pretty straightforward code that takes a RocksDB snapshot, and we don't have any additional logic there, so I wouldn't suspect issues here. We also take the snapshot inline, within the context of the FSM call, which should largely rule out any inconsistency introduced by the snapshot / restore logic.
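For reference, a minimal sketch of the "inline snapshot" pattern described here, assuming on_snapshot_save checkpoints RocksDB directly into the snapshot writer's directory and completes synchronously before returning (my own illustration, not the poster's actual code):

```cpp
#include <braft/raft.h>
#include <brpc/closure_guard.h>
#include <rocksdb/db.h>
#include <rocksdb/utilities/checkpoint.h>
#include <cerrno>
#include <filesystem>
#include <string>

// Synchronous ("inline") snapshot: the whole checkpoint is produced inside the
// on_snapshot_save call itself and done->Run() fires before we return, so the
// snapshot reflects exactly the state at the snapshot index.
class MyStateMachine : public braft::StateMachine {
public:
    void on_apply(braft::Iterator& iter) override { /* ... */ }

    void on_snapshot_save(braft::SnapshotWriter* writer,
                          braft::Closure* done) override {
        brpc::ClosureGuard done_guard(done);  // runs done->Run() on scope exit
        const std::string dir = writer->get_path() + "/rocksdb_checkpoint";

        rocksdb::Checkpoint* checkpoint = nullptr;
        rocksdb::Status s = rocksdb::Checkpoint::Create(_db, &checkpoint);
        if (s.ok()) {
            s = checkpoint->CreateCheckpoint(dir);  // hard-links the SST files
        }
        delete checkpoint;
        if (!s.ok()) {
            done->status().set_error(EIO, "checkpoint failed: %s",
                                     s.ToString().c_str());
            return;
        }
        // Register every checkpointed file so braft ships it to lagging peers.
        for (const auto& entry : std::filesystem::directory_iterator(dir)) {
            writer->add_file("rocksdb_checkpoint/" +
                             entry.path().filename().string());
        }
    }

private:
    rocksdb::DB* _db = nullptr;  // assumed to be opened elsewhere
};
```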

(使用在线工具翻译) 调用 RocksDB 快照的代码非常简单,我们在那里没有任何额外的逻辑。所以,我不会怀疑这里的问题。我们还在 FSM 调用的上下文中以内联方式执行快照。这应该有助于消除快照/恢复逻辑产生的任何不一致。

> 4. Besides, did you make any changes to the braft source code?

Our only changes are to integrate with our build system, so there are no source code changes.
