alibaba / MongoShake

MongoShake is a universal data replication platform based on MongoDB's oplog. Redundant replication and active-active replication are its two most important functions. It is a cluster replication tool built on the MongoDB oplog that covers migration and synchronization needs, and further enables disaster recovery and active-active deployments.
GNU General Public License v3.0

Incremental sync with MongoShake 2.6.5 #669

Closed BcnJcaicai closed 2 years ago

BcnJcaicai commented 3 years ago

Running MongoShake 2.6.5 with both the source and the target being mongos, sync mode `all`. After entering incremental mode, memory on the MongoShake host grows rapidly, even reaching 200 GB, and the logs show chunk merge operations:

```
[2021/10/14 07:58:13 CST] [INFO] Collector-worker-25 transfer retransmit:false send [32] logs. reply_acked [7018397541473452038[1634098017, 6]], list_unack [0]
[2021/10/14 07:58:13 CST] [INFO] Replayer-39 Executor-39 doSync oplogRecords received[1] merged[1]. merge to 100.00% chunks
[2021/10/14 07:58:13 CST] [INFO] Collector-worker-39 transfer retransmit:false send [1] logs. reply_acked [7018494715108528178[1634120642, 4146]], list_unack [0]
[2021/10/14 07:58:13 CST] [INFO] Replayer-37 Executor-37 doSync oplogRecords received[134] merged[5]. merge to 3.73% chunks
[2021/10/14 07:58:13 CST] [INFO] Collector-worker-37 transfer retransmit:false send [134] logs. reply_acked [7018388526337097804[1634095918, 76]], list_unack [0]
[2021/10/14 07:58:13 CST] [INFO] worker offset [7018411671915855940] use lowest 7018411671915855940[1634101307, 68]
[2021/10/14 07:58:13 CST] [INFO] worker offset [7018418462259150886] use lowest 7018418462259150886[1634102888, 38]
[2021/10/14 07:58:13 CST] [INFO] Replayer-28 Executor-28 doSync oplogRecords received[67] merged[3]. merge to 4.48% chunks
[2021/10/14 07:58:13 CST] [INFO] Collector-worker-28 transfer retransmit:false send [67] logs. reply_acked [7018409034805936130[1634100693, 2]], list_unack [0]
[2021/10/14 07:58:13 CST] [INFO] worker offset [7018388187034681390] use lowest 7018388187034681390[1634095839, 46]
```

This has happened twice. The first time the balancer was disabled on the target; the second time it was enabled. The outcome was the same both times: memory on the MongoShake host rises sharply once incremental sync starts.

The incremental configuration is as follows:

```
incr_sync.mongo_fetch_method = oplog
incr_sync.change_stream.watch_full_document = false
incr_sync.oplog.gids =
incr_sync.shard_key = collection
incr_sync.shard_by_object_id_whitelist =
incr_sync.worker = 16
incr_sync.tunnel.write_thread = 16
incr_sync.target_delay = 0
incr_sync.worker.batch_queue_size = 64
incr_sync.adaptive.batching_max_size = 1024
incr_sync.fetcher.buffer_capacity = 256
incr_sync.executor.upsert = false
incr_sync.executor.insert_on_dup_update = false
incr_sync.conflict_write_to = none
incr_sync.executor.majority_enable = false
```

zhangst commented 3 years ago

Please check your system monitoring to confirm whether the 200 GB of memory is all consumed by MongoShake. Since both ends are mongos, we recommend using change streams for the incremental migration.
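For reference, switching the fetch method corresponds to changing one line in the collector configuration. The option name is taken from the OP's config above; `change_stream` is the alternative value to `oplog`:

```
# fetch incremental changes via change streams instead of tailing each shard's oplog
incr_sync.mongo_fetch_method = change_stream
```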

BcnJcaicai commented 3 years ago

After switching the incremental sync to change streams, it falls far behind; is there any way to speed it up? (screenshot attached) Also, my source has 50 shards. The earlier oplog mode did cause extremely high memory usage on the MongoShake host during incremental sync, which I suspect comes from pulling incremental oplogs from all 50 shards. Change streams probably avoid this only because the throughput I have configured is low.

zhangst commented 3 years ago

If the load on the shake host and on the source and target DBs is not high, and there is no backlog, you can try raising the number of incremental sync threads to improve the speed.

BcnJcaicai commented 3 years ago

By incr_sync.tunnel.write_thread, do you mean changing just that parameter, or should incr_sync.worker.batch_queue_size = 64, incr_sync.adaptive.batching_max_size = 1024, and incr_sync.fetcher.buffer_capacity = 256 be changed along with it? The wiki says the FAQ covers these three, but I could not find a detailed description there.

zhangst commented 3 years ago

Modify this parameter: incr_sync.worker = 8 (it is the number of worker threads).
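As a sketch, raising the worker count in the collector config might look like this; the value 32 is only an illustrative assumption (the OP's config used 16, and a suitable value depends on the load headroom mentioned above):

```
# number of incremental sync worker threads (OP's config had 16)
incr_sync.worker = 32
```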

BcnJcaicai commented 2 years ago

```
[2021/11/19 17:02:28 CST] [WARN] insert docs with length[581] into ns[{xxx xxxxxx_2020-05-01}] of dest mongo failed[index[207], msg[Write results unavailable from xxx.xxx.xxx.xxx:port :: caused by :: Couldn't get a connection within the time limit], dup[false]]
```

I would like to ask what causes this; the data was indeed not written to the target. It was reported during full sync.
I also checked the target's port and machine, and both are normal.

zhangst commented 2 years ago

If nothing can connect at all, try connecting with the mongo shell; you need to resolve the network problem first.
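Alongside a mongo shell test, a plain TCP reachability check toward the target mongos can help separate network problems from driver or server issues. A minimal sketch using only the Python standard library (the host and port in the usage comment are placeholders, not values from this thread):

```python
import socket

def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # covers refused connections, timeouts, and DNS failures
        return False

# Usage (placeholder address -- substitute your target mongos):
# can_connect("10.0.0.1", 27017)
```

If this returns False, the "Couldn't get a connection within the time limit" error is most likely a network or firewall issue rather than a MongoShake problem.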