Closed cheniujh closed 2 weeks ago
The recent update introduces error-handling enhancements and synchronization improvements for the Rsync process in the database replication system. A new method, IsRsyncErrorStopped, has been added to check for errors during Rsync operations. Adjustments were also made to logging and error-state management in the replication manager and Rsync client, ensuring more robust replication and error handling.
Files | Change Summary |
---|---|
include/pika_rm.h | Added IsRsyncErrorStopped method in SyncSlaveDB to check Rsync error state. |
include/rsync_client.h | Introduced IsErrorStopped method and error_stopped_ member in RsyncClient to handle Rsync error states. |
src/pika_rm.cc | Integrated Rsync error state checks in PikaReplicaManager, updating replication state and logging warnings. |
src/pika_server.cc | Modified comparison logic in TryDBSync function using static_cast<int32_t> and static_cast<int64_t> for consistency. |
src/rsync_client.cc | Updated Copy, ThreadMain, and Init functions to manage and log Rsync error states. |
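To make the summary above concrete, here is a minimal sketch of how an error-stopped flag could gate the final step of a full sync. Only IsErrorStopped and error_stopped_ are names from the change summary. The class name RsyncClientSketch, the helpers MarkErrorStopped, Reset and ShouldFinalizeSync, and the overall shape are assumptions for illustration, not Pika's actual code.

```cpp
#include <atomic>

// Hypothetical stand-in for the rsync client. Only IsErrorStopped and
// error_stopped_ are names taken from the change summary.
class RsyncClientSketch {
 public:
  // Assumed hook: called by the copy path when the master answers with a
  // non-OK code, for example because a newer dump replaced the old one.
  void MarkErrorStopped() { error_stopped_.store(true); }

  // Assumed hook: clears the flag before a fresh pull attempt.
  void Reset() { error_stopped_.store(false); }

  // True if the last pull ended because of an error rather than completion.
  bool IsErrorStopped() const { return error_stopped_.load(); }

 private:
  // Atomic because the copy threads and the state machine may run on
  // different threads (an assumption of this sketch).
  std::atomic<bool> error_stopped_{false};
};

// Hypothetical check for the outer state machine: only finalize the sync
// (rename the DB, update the binlog offset) when the client is idle AND it
// did not stop because of an error.
bool ShouldFinalizeSync(const RsyncClientSketch& client, bool rsync_idle) {
  return rsync_idle && !client.IsErrorStopped();
}
```

The point of the extra flag is that "the client has stopped" and "the client has finished" are different conditions. A check based only on the first one cannot tell an aborted pull from a completed one.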
When rsync_client was originally implemented, the following flow was expected for the case where the master generates a new dump while it is syncing data to a slave:
- The slave compares snapshot_uuid values, finds that they do not match, and exits the rsyncclient data-pull flow.
- The outer state machine is still in the waitdbsync state. Once it detects that rsyncclient has finished pulling data, it calls TryUpdateMasterOffset to try to rename the DB and update the binlog offset.
- TryUpdateMasterOffset finds that the info file has not been fully transferred, concludes that master-slave synchronization has not finished, and moves the state machine to trysync.
- In the trysync state, the rsyncclient data-pull flow runs again.
This expected flow currently has several problems that need to be fixed:
- The info file must be pulled last. That way, if TryUpdateMasterOffset finds an info file, all other files have already been pulled.
- If TryUpdateMasterOffset finds no info file, it needs to change the state so that the state machine moves to tryconnect and the full sync starts over.
OK, the state transition will be handled in a separate PR.
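The two requirements listed in the issue can be sketched as follows. This is an illustration under assumptions, not the project's implementation: the file name "info", the state names, and the helpers OrderInfoLast and AfterPull are hypothetical, chosen to mirror the states mentioned in the discussion.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical states, named after the ones mentioned in the issue.
enum class SyncState { kWaitDBSync, kTryConnect, kReadyToFinalize };

// Moves the info entry (file name assumed to be "info") to the end of the
// pull list while keeping the relative order of all other files. If the
// info file is always transferred last, its presence implies that every
// other file has already arrived.
std::vector<std::string> OrderInfoLast(std::vector<std::string> files,
                                       const std::string& info_name = "info") {
  std::stable_partition(files.begin(), files.end(),
                        [&](const std::string& f) { return f != info_name; });
  return files;
}

// Decides what the outer state machine does once the pull has stopped:
// keep waiting, finalize, or fall back to tryconnect for a new full sync.
SyncState AfterPull(bool pull_finished, bool info_present) {
  if (!pull_finished) return SyncState::kWaitDBSync;
  return info_present ? SyncState::kReadyToFinalize : SyncState::kTryConnect;
}
```

With this ordering, a missing info file after the pull stops is an unambiguous signal that the transfer was cut short, which is what lets the second rule restart the full sync safely.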
This PR fixes Issue #2742.
Issue description: While scaling out a Pika instance in production, several slave nodes were connected to the instance's master node within a short period. Some slaves then established an incremental connection successfully without having performed a full sync, that is, before the data had been pulled over.
Detailed analysis: The master node's logs show that these connection requests triggered multiple bgSaves. The slave node's logs show that, partway through pulling files, the slave-side RsyncClient received an RsyncResp from the master whose code was not kOK. The slave also logged a WARNING indicating that the master had a new snapshot uuid:
W20240618 14:57:06.052378 19221 rsync_client.cc:218] receive newer dump, reset state to STOP...
Abnormal Points:
Cause and Fixes
Fix made in this PR: both values are converted to int64_t for the calculation.
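The thread does not show the affected expression, so the following is only a generic illustration of why widening both operands to int64_t before a calculation matters. It assumes the original values were unsigned 32-bit integers, and every function name here is hypothetical.

```cpp
#include <cstdint>

// Difference computed in unsigned arithmetic: wraps around when a < b
// (on platforms where unsigned int is 32 bits wide).
uint64_t UnsignedGap(uint32_t a, uint32_t b) { return a - b; }

// Difference computed after widening both operands to int64_t: the sign
// of the result is preserved.
int64_t SignedGap(uint32_t a, uint32_t b) {
  return static_cast<int64_t>(a) - static_cast<int64_t>(b);
}

// Hypothetical check "is a more than `limit` ahead of b?", written the
// risky way: the subtraction wraps before the comparison happens.
bool AheadByMoreThanUnsafe(uint32_t a, uint32_t b, int64_t limit) {
  return (a - b) > limit;
}

// Same check with both operands converted to int64_t first.
bool AheadByMoreThanSafe(uint32_t a, uint32_t b, int64_t limit) {
  return SignedGap(a, b) > limit;
}
```

For a = 3 and b = 5, the unsigned subtraction yields a very large positive number instead of -2, so the unsafe check reports that a is far ahead of b. Converting both sides first keeps the comparison meaningful.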