fix: unsigned int overflow when master-slave sync data

这个PR修复了Issue #2742

Issue 问题表述：线上针对一个Pika实例进行扩容时，短时间内让多个slave节点连接到该实例的Master节点上时，出现了从节点没有做全量同步，数据还没拉过来，后面就成功建立增量连接的情况。
具体梳理：主节点日志上发现这几次建联请求触发了多次bgSave，从节点日志梳理发现在Slave侧的RsyncClient在拉取文件的时候拉到一半会出现主返回的RsyncResp的code为非kOK，且从节点会出现提示Master端有了新snapshot uuid的WARNING日志：W20240618` 14:57:06.052378 19221 rsync_client.cc:218] receive newer dump, reset state to STOP...
反常的地方:
- 3.1 即使短时间内多个全量请求过来，应该做一份bgsave即可，不应该有多份snapshot产生
- 3.2 为什么从节点全量到一半失败了，后面还进入了增量同步
原因以及修复

4.1 针对3.1，排查到问题在于PikaServer::TryDBSync中是否做Bgsave的判断条件上，具体代码如下：

if (pstd::IsDir(bgsave_info.path) != 0 ||
  !pstd::FileExists(NewFileName(logger_filename, bgsave_info.offset.b_offset.filenum)) ||
  top - bgsave_info.offset.b_offset.filenum > kDBSyncMaxGap) {
// Need Bgsave first
db->BgSaveDB();
}

问题在于有符号数和无符号数运算时发生了溢出。具体地：top - bgsave_info.offset.b_offset.filenum > kDBSyncMaxGap， top是个int32_t，是slave的DBSync请求携带过来的Slave的最新binlog filenum, 而bgsave_info.offset.b_offset.filenum是个uint32_t, 是上一次bgsave产生时所对应的binlog filenum。这里一个是有符号数，一个是无符号数，运算时，双方都转成了无符号数来算，计算结果也是个无符号数。以复现的某次case举例：由于Slave是空的，所以top为0，而bgsave_info.offset.b_offset.filenum为1644，实际上发生的计算是0 - 1644，结果是负数，但结果类型是无符号类型，所以实际返回了接近uint32_t上限的一个正数。而KDBSyncMaxGap值为50， 所以top - bgsave_info.offset.b_offset.filenum > kDBSyncMaxGap这个判断一直会为true，也就是后面后面陆续到达的全量同步请求都会触发bgsave，产生的snapshot会把前面一份snapshot覆盖，而那份snapshot此时是正在被使用的，全量请求先到达的slave已经正在基于前面这份snapshot做全量，所以从节点日志会打印很多相关错误。

该PR中进行的修复: 这里计算时都转为int64_t。

4.2 针对3.2，一方面，RsyncClient如果异常退出，对于上层主从状态机来说没有别的信号告知，所以才会走了继续尝试增量连接的链路。另一方面，按理说如果全量失败，那么使用全量文件打开一个新RocksDB实例应当也是失败的（RocksDB Apply完manifest文件后会检查内存中的current version中的fileset是否和磁盘上的一致）。但这次case中，全量同步发生中断，恰好只拉取了部分SST文件，RocksDB的CURRENT, MANIFEST文件都没有拉过来，于是在Replace DB的阶段，RocksDB打开新实例时，因为找不到CURRENT文件，会直接起一个空实例，所以没有报错。 另提PR进行修复: #2756 。

This PR fixes Issue #2742

Issue Description: When expanding a Pika instance online, multiple slave nodes are connected to the Master node of the instance in a short period of time, resulting in the slave nodes not performing a full synchronization. The data is not fully pulled over before an incremental connection is successfully established.
Detailed Analysis: The master node logs show that these connection requests triggered multiple bgSaves. The slave node logs revealed that the RsyncClient on the slave side failed to pull files correctly, with the RsyncResp code from the master being non-kOK halfway through the process. Additionally, there were WARNING logs indicating that a new snapshot uuid had appeared on the master: W20240618 14:57:06.052378 19221 rsync_client.cc:218] receive newer dump, reset state to STOP...
Abnormal Points:
- 3.1 Even if multiple full synchronization requests occur in a short period, only one bgsave should be performed, and multiple snapshots should not be created.
- 3.2 Why did the slave nodes fail in the middle of full synchronization but still proceed to incremental synchronization?
Cause and Fixes

4.1 For point 3.1, the problem was found in the condition for performing Bgsave in PikaServer::TryDBSync. The specific code is as follows:
```
if (pstd::IsDir(bgsave_info.path) != 0 ||
  !pstd::FileExists(NewFileName(logger_filename, bgsave_info.offset.b_offset.filenum)) ||
  top - bgsave_info.offset.b_offset.filenum > kDBSyncMaxGap) {
// Need Bgsave first
db->BgSaveDB();
}
```
The issue is caused by an overflow when performing operations between signed and unsigned numbers. Specifically: top - bgsave_info.offset.b_offset.filenum > kDBSyncMaxGap, top is an int32_t, representing the latest binlog filenum from the slave's DBSync request, and bgsave_info.offset.b_offset.filenum is a uint32_t, representing the binlog filenum corresponding to the last bgsave. In this case, the operation converts both numbers to unsigned integers, and the result is also an unsigned integer. For example, in a reproduced case: since the slave is empty, top is 0, and bgsave_info.offset.b_offset.filenum is 1644. The actual calculation is 0 - 1644, which results in a negative number, but since the result type is unsigned, it returns a large positive value close to the uint32_t limit. The KDBSyncMaxGap value is 50, so the condition top - bgsave_info.offset.b_offset.filenum > kDBSyncMaxGap would always be true, meaning every subsequent full synchronization request would trigger a bgsave, and the snapshots generated would overwrite the previous one. Since the previous snapshot is already being used, and the earlier full synchronization slave is based on this snapshot, a lot of related errors would be logged on the slave node.

Fix: Convert both values to int64_t during the calculation.

4.2 For point 3.2, the cause is that the RsyncClient behaves the same for both abnormal stops and normal exits from the perspective of the upper-level master-slave state machine. Another PR is proposed to fix this problem:https://github.com/OpenAtomFoundation/pika/pull/2756

Summary by CodeRabbit

New Features
- Implemented a method to check if the Rsync error state has stopped in the replication process.
Improvements
- Enhanced error handling during the Rsync process, improving replication reliability.
- Adjusted comparison logic in database synchronization to ensure correct value types are used, enhancing stability.

Walkthrough

The recent update introduces error handling enhancements and synchronization improvements for the Rsync process in the database replication system. A new method, IsRsyncErrorStopped, has been added to check for errors during Rsync operations. Adjustments were also made in logging and error state management in the replication manager and Rsync client, ensuring more robust replication and error handling.

Changes

Files	Change Summary
`include/pika_rm.h`	Added `IsRsyncErrorStopped` method in `SyncSlaveDB` to check Rsync error state.
`include/rsync_client.h`	Introduced `IsErrorStopped` method and `error_stopped_` member in `RsyncClient` to handle Rsync error states.
`src/pika_rm.cc`	Integrated Rsync error state checks in `PikaReplicaManager`, updating replication state and logging warnings.
`src/pika_server.cc`	Modified comparison logic in `TryDBSync` function using `static_cast<int32_t>` and `static_cast<int64_t>` for consistency.
`src/rsync_client.cc`	Updated `Copy`, `ThreadMain`, and `Init` functions to manage and log Rsync error states.

Poem

In the realm where bytes do dance,
A sync of slaves gets a chance.
Errors now shall halt in cue,
Checking states both tried and true.
Logs will tell a tale so fine,
For syncing flows on every line.
🌟

Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

- [X](https://twitter.com/intent/tweet?text=I%20just%20used%20%40coderabbitai%20for%20my%20code%20review%2C%20and%20it%27s%20fantastic%21%20It%27s%20free%20for%20OSS%20and%20offers%20a%20free%20trial%20for%20the%20proprietary%20code.%20Check%20it%20out%3A&url=https%3A//coderabbit.ai) - [Mastodon](https://mastodon.social/share?text=I%20just%20used%20%40coderabbitai%20for%20my%20code%20review%2C%20and%20it%27s%20fantastic%21%20It%27s%20free%20for%20OSS%20and%20offers%20a%20free%20trial%20for%20the%20proprietary%20code.%20Check%20it%20out%3A%20https%3A%2F%2Fcoderabbit.ai) - [Reddit](https://www.reddit.com/submit?title=Great%20tool%20for%20code%20review%20-%20CodeRabbit&text=I%20just%20used%20CodeRabbit%20for%20my%20code%20review%2C%20and%20it%27s%20fantastic%21%20It%27s%20free%20for%20OSS%20and%20offers%20a%20free%20trial%20for%20proprietary%20code.%20Check%20it%20out%3A%20https%3A//coderabbit.ai) - [LinkedIn](https://www.linkedin.com/sharing/share-offsite/?url=https%3A%2F%2Fcoderabbit.ai&mini=true&title=Great%20tool%20for%20code%20review%20-%20CodeRabbit&summary=I%20just%20used%20CodeRabbit%20for%20my%20code%20review%2C%20and%20it%27s%20fantastic%21%20It%27s%20free%20for%20OSS%20and%20offers%20a%20free%20trial%20for%20proprietary%20code)

Tips

### Chat There are 3 ways to chat with [CodeRabbit](https://coderabbit.ai): - Review comments: Directly reply to a review comment made by CodeRabbit. Example: - `I pushed a fix in commit .` - `Generate unit testing code for this file.` - `Open a follow-up GitHub issue for this discussion.` - Files and specific lines of code (under the "Files changed" tab): Tag `@coderabbitai` in a new review comment at the desired location with your query. Examples: - `@coderabbitai generate unit testing code for this file.` - `@coderabbitai modularize this function.` - PR comments: Tag `@coderabbitai` in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples: - `@coderabbitai generate interesting stats about this repository and render them as a table.` - `@coderabbitai show all the console.log statements in this repository.` - `@coderabbitai read src/utils.ts and generate unit testing code.` - `@coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.` - `@coderabbitai help me debug CodeRabbit configuration file.` Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. ### CodeRabbit Commands (invoked as PR comments) - `@coderabbitai pause` to pause the reviews on a PR. - `@coderabbitai resume` to resume the paused reviews. - `@coderabbitai review` to trigger an incremental review. This is useful when automatic reviews are disabled for the repository. - `@coderabbitai full review` to do a full review from scratch and review all the files again. - `@coderabbitai summary` to regenerate the summary of the PR. - `@coderabbitai resolve` resolve all the CodeRabbit review comments. - `@coderabbitai configuration` to show the current CodeRabbit configuration for the repository. - `@coderabbitai help` to get help. Additionally, you can add `@coderabbitai ignore` anywhere in the PR description to prevent this PR from being reviewed. ### CodeRabbit Configration File (`.coderabbit.yaml`) - You can programmatically configure CodeRabbit by adding a `.coderabbit.yaml` file to the root of your repository. - Please see the [configuration documentation](https://docs.coderabbit.ai/guides/configure-coderabbit) for more information. - If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: `# yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json` ### Documentation and Community - Visit our [Documentation](https://coderabbit.ai/docs) for detailed information on how to use CodeRabbit. - Join our [Discord Community](https://discord.com/invite/GsXnASn26c) to get help, request features, and share feedback. - Follow us on [X/Twitter](https://twitter.com/coderabbitai) for updates and announcements.

当初实现rsync_client时，master在给某一个slave同步数据的过程中，生成了新的dump，期望的执行流程如下：

slave通过比对snapshot_uuid，发现snapshot_uuid不匹配，退出rsyncclient拉取数据的流程。
外层状态机状态还处于waitdbsync状态，当检测到rsyncclient已经执行完数据拉取之后，执行TryUpdateMasterOffset尝试renamedb以及更新binlog offset。
TryUpdateMasterOffset发现info文件还没有传输完，认为主从同步没有全部结束。流转状态机到trysync。
trysync状态，重新执行rsyncclient拉取数据的流程。

期望的流程现在有几个问题需要修改：

需要保证最后拉取info文件，这样的话，TryUpdateMasterOffset中如果发现有info文件，那么其他文件就都拉取完成了。
TryUpdateMasterOffset如果发现没有info文件，需要将状态改出，状态机流转到tryconnect，重新来一把全量复制。

Bot detected the issue body's language is not English, translate it automatically.

When rsync_client was first implemented, the master generated a new dump during the process of synchronizing data to a slave. The expected execution process is as follows:

The slave compares the snapshot_uuid and finds that the snapshot_uuid does not match, and exits the process of rsyncclient pulling data.

The outer state machine is still in the waitdbsync state. When it is detected that rsyncclient has completed the data pull, execute TryUpdateMasterOffset to try to renamedb and update the binlog offset.

TryUpdateMasterOffset finds that the info file has not been transferred yet, and thinks that the master-slave synchronization has not been completed. Transfer the state machine to trysync.

In the trysync state, re-execute the process of rsyncclient pulling data.

The desired process now has several issues that need to be modified:

It is necessary to ensure that the info file is pulled last. In this case, if an info file is found in TryUpdateMasterOffset, then all other files will be pulled.
If TryUpdateMasterOffset finds that there is no info file, the state needs to be changed, the state machine flows to tryconnect, and the full copy is made again.

当初实现rsync_client时，master在给某一个slave同步数据的过程中，生成了新的dump，期望的执行流程如下：

slave通过比对snapshot_uuid，发现snapshot_uuid不匹配，退出rsyncclient拉取数据的流程。

外层状态机状态还处于waitdbsync状态，当检测到rsyncclient已经执行完数据拉取之后，执行TryUpdateMasterOffset尝试renamedb以及更新binlog offset。

TryUpdateMasterOffset发现info文件还没有传输完，认为主从同步没有全部结束。流转状态机到trysync。

trysync状态，重新执行rsyncclient拉取数据的流程。

期望的流程现在有几个问题需要修改：

需要保证最后拉取info文件，这样的话，TryUpdateMasterOffset中如果发现有info文件，那么其他文件就都拉取完成了。

TryUpdateMasterOffset如果发现没有info文件，需要将状态改出，状态机流转到tryconnect，重新来一把全量复制。

OK, 状态流转另提PR处理

Bot detected the issue body's language is not English, translate it automatically.

When rsync_client was originally implemented, the master generated a new dump during the process of synchronizing data to a slave. The expected execution process is as follows:

The slave compares the snapshot_uuid and finds that the snapshot_uuid does not match, and exits the process of rsyncclient pulling data.

The outer state machine is still in the waitdbsync state. After detecting that rsyncclient has completed data pull, execute TryUpdateMasterOffset to try to renamedb and update the binlog offset.

TryUpdateMasterOffset finds that the info file has not been transferred completely, and thinks that the master-slave synchronization has not been completed. Transfer the state machine to trysync.

In trysync state, re-execute the process of rsyncclient pulling data.

The desired process now has several issues that need to be modified:

You need to ensure that the info file is pulled last. In this case, if an info file is found in TryUpdateMasterOffset, then all other files will be pulled.

If TryUpdateMasterOffset finds that there is no info file, the state needs to be changed, the state machine flows to tryconnect, and the full copy is made again.

OK, status transfer will be dealt with separately by PR.

OpenAtomFoundation / pika