Closed cheniujh closed 2 months ago
The main changes enhance error handling and state management in `DB::TryUpdateMasterOffset` and the `RsyncClient` class. Specifically, logging levels were adjusted, new error conditions were added, and additional actions were taken to improve the synchronization and stability of these components.
| Files | Change Summary |
| --- | --- |
| src/pika_db.cc | Adjusted logging levels, error handling, and synchronization actions in `DB::TryUpdateMasterOffset` |
| src/rsync_client.cc | Enhanced error handling, logging, and state management in `Copy`, `ThreadMain`, `CopyRemoteFile`, and `ComparisonUpdate` |
In code's vast plains, changes fly high,
Where errors once whispered, now logs loudly cry.
Sync's gentle dance with states refined,
By clever tweaks, stability defined.
Oh, code so bright, in structure tight,
Rabbits cheer as bugs take flight!
🐇✨
This PR fixes the second issue in Issue #2742:

Problem description: after a full sync failed and RsyncClient exited abnormally, the slave node's subsequent attempt at an incremental connection unexpectedly succeeded, which is incorrect.

Cause: On one hand, when RsyncClient exits abnormally, no signal reaches the upper-layer replication state machine, so the slave proceeds down the incremental-connection path. On the other hand, if the full sync failed, opening a new RocksDB instance from the full-sync files should also fail (after applying the MANIFEST file, RocksDB checks whether the file set of the current version in memory matches what is on disk). In this case, however, the full sync was interrupted after only part of the SST files had been fetched; the RocksDB CURRENT and MANIFEST files were never fetched. During the Replace DB phase, RocksDB could not find the CURRENT file when opening the new instance, so it simply started an empty instance and reported no error.

Solution:
1. Add an `error_stop_` flag inside RsyncClient. If RsyncClient exits abnormally (i.e., the full sync aborted and the files were not completely fetched), the directory corresponding to the snapshot (./dbsync/dbx) is deleted directly.
2. Deleting the directory in step 1 propagates the error state to the upper-layer slave state machine, without increasing the coupling between RsyncClient and that layer, in the form of a missing directory: when the state machine finds that the snapshot directory does not exist, it switches the SlaveDB state to TryConnect and retries the connection.
Summary by CodeRabbit
- **Bug Fixes**
- **New Features**