alibaba / MongoShake

MongoShake is a universal data replication platform based on MongoDB's oplog. Redundant replication and active-active replication are two most important functions. 基于mongodb oplog的集群复制工具,可以满足迁移和同步的需求,进一步实现灾备和多活功能。
GNU General Public License v3.0
1.72k stars 441 forks source link

release-v2.4.0-20200326-问题-通过kafka传输的raw数据只有op为c的操作 #317

Closed chaoxxx closed 4 years ago

chaoxxx commented 4 years ago

软件环境

MongoShake【release-v2.4.0-20200326】 Mongodb 【v4.0.13】

配置文件

  1. collector.conf.txt
  2. receiver.conf.txt

问题描述

1、本来是打算使用json数据格式的,但是发现速率是在太慢。(ps:在我虚拟机里边差不多一秒只能采集4-5条数据,这个正常吗?)

2、所以打算使用raw类型传入kafka消息队列。然后用receiver解析出来数据。但是发现解析出来的数据只有两种类型的。 a. 空消息

[14:45:09 CST 2020/03/31] [INFO] (mongoshake/receiver.(*ExampleReplayer).handler:130) {"ts":0,"op":"","g":"","ns":"","o":null,"o2":null,"uk":null,"lsid":null,"fromMigrate":false}

b. op:c 的消息

[14:45:10 CST 2020/03/31] [INFO] (mongoshake/receiver.(*ExampleReplayer).handler:130) {"ts":6809953312490651655,"op":"c","g":"","ns":"admin.$cmd","o":[{"Name":"applyOps","Value":[[{"Name":"op"

预期结果

应该和json类型一样,有insert和update的消息。

chaoxxx commented 4 years ago

@vinllen

chaoxxx commented 4 years ago

我之前使用的是feature-2.1-unstable版本。 刚才又分别跑了一下这两个版本。发现release-v2.4.0-20200326向kafka压数的速度明显低于feature-2.1-unstable。1万条数据feature-2.1-unstable版本可以在1分钟内同步完成。release-v2.4.0-20200326需要5分钟以上。

架构

release-v2.4.0-20200326 版本的架构

MongoDB->MongoShake->Kafka->下游组件 ,传输的数据类型为json,选取的数据获取方式为change_stream。

feature-2.1-unstable 版本的架构

MongoDB->MongoShake->receiver->Kafka->下游组件,传输的数据类型为raw,选取的数据获取方式为oplog。

因为之前使用 feature-2.1-unstable 有报错 #314 ,所以准备切到 release-v2.4.0-20200326。但是现在看起来,release-v2.4.0-20200326性能明显低于feature-2.1-unstable。请问我的配置参数或者其他地方还有什么可以优化的吗?

vinllen commented 4 years ago

如果不是通过receiver写入,本身性能是很低的,2.4版本也可以选型采用receiver进行对接,raw格式快一点。关于性能的差异这里有介绍: https://github.com/alibaba/MongoShake/wiki/%E5%A6%82%E4%BD%95%E4%BD%BF%E7%94%A8kafka-tcp-rpc-file%E7%AD%89%E5%BC%82%E6%AD%A5%E9%80%9A%E9%81%93%E8%BF%9B%E8%A1%8C%E6%95%B0%E6%8D%AE%E7%9A%84%E5%8F%91%E9%80%81%E5%92%8C%E6%B6%88%E8%B4%B9 另外一个问题,op只有c和n是因为这段时间拉取只有这两种Oplog。

314 的报错是move chunk问题,目前拉取move chunk是直接报错的,2.4可以启用change stream进行拉取,change stream的性能本身就是低于oplog拉取,参考:

https://github.com/alibaba/MongoShake/wiki/v2.4%E7%89%88%E6%9C%AC%E5%AF%B9%E6%8E%A5change-stream%E6%9E%B6%E6%9E%84%E8%AE%BE%E8%AE%A1

不过4~5条是不可能的,你是虚机cpu打满了吧?qps到k级别应该没问题的

chaoxxx commented 4 years ago

感谢您的回复.

今天我用release-v2.4.0-20200326重新试了一下。

关于第一个问题

【本来是打算使用json数据格式的,但是发现速率是在太慢】,这个今天我重新试了一下,发现collector日志里有记录tps。以下是我测试中的详细配置

测试背景

每秒向mongodb shard集群插入50条数据。

incr_sync.mongo_fetch_method = change_stream
incr_sync.tunnel.message = json

配置文件

collector.conf.txt

collector部分日志

[2020/04/01 18:02:00 CST] [INFO] [common.(*ReplicationMetric).startup.func1:137] [name=mongos, filter=62449, get=83550, consume=3462, apply=3462, failed_times=0, success=3462, tps=5, ckpt_times=1845, retransimit_times=0, tunnel_traffic=14MB, lsn_ckpt={1585735175,2020-04-01 17:59:35}, lsn_ack={1585735175,2020-04-01 17:59:35}]
[2020/04/01 18:02:05 CST] [INFO] [common.(*ReplicationMetric).startup.func1:137] [name=mongos, filter=62586, get=83550, consume=3506, apply=3506, failed_times=0, success=3506, tps=4, ckpt_times=1862, retransimit_times=0, tunnel_traffic=14MB, lsn_ckpt={1585735175,2020-04-01 17:59:35}, lsn_ack={1585735175,2020-04-01 17:59:35}]
[2020/04/01 18:02:10 CST] [INFO] [common.(*ReplicationMetric).startup.func1:137] [name=mongos, filter=68932, get=83598, consume=3562, apply=3562, failed_times=0, success=3562, tps=3, ckpt_times=1877, retransimit_times=0, tunnel_traffic=14MB, lsn_ckpt={1585735185,2020-04-01 17:59:45}, lsn_ack={1585735185,2020-04-01 17:59:45}]
[2020/04/01 18:02:15 CST] [INFO] [common.(*ReplicationMetric).startup.func1:137] [name=mongos, filter=68948, get=83598, consume=3578, apply=3578, failed_times=0, success=3578, tps=3, ckpt_times=1894, retransimit_times=0, tunnel_traffic=15MB, lsn_ckpt={1585735185,2020-04-01 17:59:45}, lsn_ack={1585735185,2020-04-01 17:59:45}]
[2020/04/01 18:02:20 CST] [INFO] [common.(*ReplicationMetric).startup.func1:137] [name=mongos, filter=68992, get=83627, consume=3613, apply=3613, failed_times=0, success=3613, tps=5, ckpt_times=1910, retransimit_times=0, tunnel_traffic=15MB, lsn_ckpt={1585735185,2020-04-01 17:59:45}, lsn_ack={1585735185,2020-04-01 17:59:45}]
[2020/04/01 18:02:25 CST] [INFO] [common.(*ReplicationMetric).startup.func1:137] [name=mongos, filter=68997, get=83627, consume=3625, apply=3625, failed_times=0, success=3625, tps=1, ckpt_times=1922, retransimit_times=0, tunnel_traffic=15MB, lsn_ckpt={1585735185,2020-04-01 17:59:45}, lsn_ack={1585735185,2020-04-01 17:59:45}]
[2020/04/01 18:02:30 CST] [INFO] [common.(*ReplicationMetric).startup.func1:137] [name=mongos, filter=69082, get=83627, consume=3640, apply=3640, failed_times=0, success=3640, tps=3, ckpt_times=1936, retransimit_times=0, tunnel_traffic=15MB, lsn_ckpt={1585735185,2020-04-01 17:59:45}, lsn_ack={1585735185,2020-04-01 17:59:45}]
[2020/04/01 18:02:35 CST] [INFO] [common.(*ReplicationMetric).startup.func1:137] [name=mongos, filter=69168, get=83627, consume=3677, apply=3677, failed_times=0, success=3677, tps=8, ckpt_times=1951, retransimit_times=0, tunnel_traffic=15MB, lsn_ckpt={1585735185,2020-04-01 17:59:45}, lsn_ack={1585735185,2020-04-01 17:59:45}]
[2020/04/01 18:02:40 CST] [INFO] [common.(*ReplicationMetric).startup.func1:137] [name=mongos, filter=72523, get=83636, consume=3736, apply=3736, failed_times=0, success=3736, tps=47, ckpt_times=1966, retransimit_times=0, tunnel_traffic=15MB, lsn_ckpt={1585735185,2020-04-01 17:59:45}, lsn_ack={1585735185,2020-04-01 17:59:45}]
[2020/04/01 18:02:45 CST] [INFO] [common.(*ReplicationMetric).startup.func1:137] [name=mongos, filter=72583, get=83636, consume=3778, apply=3778, failed_times=0, success=3778, tps=3, ckpt_times=1982, retransimit_times=0, tunnel_traffic=15MB, lsn_ckpt={1585735195,2020-04-01 17:59:55}, lsn_ack={1585735195,2020-04-01 17:59:55}]
[2020/04/01 18:02:50 CST] [INFO] [common.(*ReplicationMetric).startup.func1:137] [name=mongos, filter=72601, get=83636, consume=3798, apply=3798, failed_times=0, success=3798, tps=4, ckpt_times=1999, retransimit_times=0, tunnel_traffic=15MB, lsn_ckpt={1585735195,2020-04-01 17:59:55}, lsn_ack={1585735195,2020-04-01 17:59:55}]
[2020/04/01 18:02:55 CST] [INFO] [common.(*ReplicationMetric).startup.func1:137] [name=mongos, filter=72611, get=83636, consume=3814, apply=3814, failed_times=0, success=3814, tps=3, ckpt_times=2015, retransimit_times=0, tunnel_traffic=16MB, lsn_ckpt={1585735195,2020-04-01 17:59:55}, lsn_ack={1585735195,2020-04-01 17:59:55}]
[2020/04/01 18:03:00 CST] [INFO] [common.(*ReplicationMetric).startup.func1:137] [name=mongos, filter=72674, get=83636, consume=3863, apply=3863, failed_times=0, success=3863, tps=3, ckpt_times=2031, retransimit_times=0, tunnel_traffic=16MB, lsn_ckpt={1585735195,2020-04-01 17:59:55}, lsn_ack={1585735195,2020-04-01 17:59:55}]
[2020/04/01 18:03:05 CST] [INFO] [common.(*ReplicationMetric).startup.func1:137] [name=mongos, filter=72753, get=83636, consume=3896, apply=3896, failed_times=0, success=3896, tps=3, ckpt_times=2047, retransimit_times=0, tunnel_traffic=16MB, lsn_ckpt={1585735195,2020-04-01 17:59:55}, lsn_ack={1585735195,2020-04-01 17:59:55}]
[2020/04/01 18:03:10 CST] [INFO] [common.(*ReplicationMetric).startup.func1:137] [name=mongos, filter=72871, get=83636, consume=3939, apply=3939, failed_times=0, success=3939, tps=20, ckpt_times=2063, retransimit_times=0, tunnel_traffic=16MB, lsn_ckpt={1585735195,2020-04-01 17:59:55}, lsn_ack={1585735195,2020-04-01 17:59:55}]
[2020/04/01 18:03:15 CST] [INFO] [common.(*ReplicationMetric).startup.func1:137] [name=mongos, filter=72885, get=83636, consume=3955, apply=3955, failed_times=0, success=3955, tps=3, ckpt_times=2080, retransimit_times=0, tunnel_traffic=16MB, lsn_ckpt={1585735195,2020-04-01 17:59:55}, lsn_ack={1585735195,2020-04-01 17:59:55}]
[2020/04/01 18:03:20 CST] [INFO] [common.(*ReplicationMetric).startup.func1:137] [name=mongos, filter=73677, get=83636, consume=4005, apply=4005, failed_times=0, success=4005, tps=3, ckpt_times=2095, retransimit_times=0, tunnel_traffic=16MB, lsn_ckpt={1585735195,2020-04-01 17:59:55}, lsn_ack={1585735195,2020-04-01 17:59:55}]
[2020/04/01 18:03:25 CST] [INFO] [common.(*ReplicationMetric).startup.func1:137] [name=mongos, filter=73770, get=83636, consume=4036, apply=4036, failed_times=0, success=4036, tps=4, ckpt_times=2112, retransimit_times=0, tunnel_traffic=17MB, lsn_ckpt={1585735205,2020-04-01 18:00:05}, lsn_ack={1585735205,2020-04-01 18:00:05}]
[2020/04/01 18:03:30 CST] [INFO] [common.(*ReplicationMetric).startup.func1:137] [name=mongos, filter=75596, get=83636, consume=4056, apply=4056, failed_times=0, success=4056, tps=4, ckpt_times=2128, retransimit_times=0, tunnel_traffic=17MB, lsn_ckpt={1585735385,2020-04-01 18:03:05}, lsn_ack={1585735385,2020-04-01 18:03:05}]
[2020/04/01 18:03:35 CST] [INFO] [common.(*ReplicationMetric).startup.func1:137] [name=mongos, filter=75645, get=83636, consume=4095, apply=4095, failed_times=0, success=4095, tps=3, ckpt_times=2142, retransimit_times=0, tunnel_traffic=17MB, lsn_ckpt={1585735415,2020-04-01 18:03:35}, lsn_ack={1585735415,2020-04-01 18:03:35}]
[2020/04/01 18:03:40 CST] [INFO] [common.(*ReplicationMetric).startup.func1:137] [name=mongos, filter=75693, get=83636, consume=4118, apply=4118, failed_times=0, success=4118, tps=1, ckpt_times=2152, retransimit_times=0, tunnel_traffic=17MB, lsn_ckpt={1585735435,2020-04-01 18:03:55}, lsn_ack={1585735435,2020-04-01 18:03:55}]
[2020/04/01 18:03:45 CST] [INFO] [common.(*ReplicationMetric).startup.func1:137] [name=mongos, filter=75716, get=83636, consume=4127, apply=4127, failed_times=0, success=4127, tps=1, ckpt_times=2162, retransimit_times=0, tunnel_traffic=17MB, lsn_ckpt={1585735435,2020-04-01 18:03:55}, lsn_ack={1585735435,2020-04-01 18:03:55}]
[2020/04/01 18:03:50 CST] [INFO] [common.(*ReplicationMetric).startup.func1:137] [name=mongos, filter=75716, get=83636, consume=4129, apply=4129, failed_times=0, success=4129, tps=1, ckpt_times=2164, retransimit_times=0, tunnel_traffic=17MB, lsn_ckpt={1585735435,2020-04-01 18:03:55}, lsn_ack={1585735435,2020-04-01 18:03:55}]

机器负载

image

结论

tps确实低,但机器负载也不高,不清楚瓶颈在哪儿

关于第二个问题

【raw解析出来的数据op只有n和c】。golang不是太熟,所以没法具体定位问题。不过我单独向mongodb插入了一条数据。得到了下面的记录。

测试背景

incr_sync.mongo_fetch_method = change_stream
incr_sync.tunnel.message = raw

日志

  1. 从kafka通道里拿出来的raw数据
    ԭ¤ %%  ts^opignssync.realtime_airo_id342F88D20-33EF-4074-9D45-7325025D600120200401154144   create_time܃°4q data_time󿿖°4qgd_id%96a8c1fa-ba0f-4e86-8a49-f8930c66da91mn0152316J333001mp_id%42F88D20-33EF-4074-9D45-7325025D6001protocol1  update_time܃°4qvalues²0pollutant_code901cou_valuemin_valueflagNmax_zsrtd_zsrtd_value19.79min_zsmax_value1pollutant_code902cou_valuemin_valueflagNmax_zsrtd_zsrtd_value17.51min_zsmax_value2pollutant_codeB02cou_valuemin_valueflagNmax_zsrtd_zsrtd_value88.4min_zsmax_value3pollutant_code008cou_valuemin_valueflagNmax_zsrtd_zsrtd_value23.7min_zsmax_value4pollutant_code001cou_valuemin_valueflagNmax_zsrtd_zsrtd_value26min_zsmax_value5pollutant_codeS05cou_valuemin_valueflagNmax_zsrtd_zsrtd_value22.3min_zsmax_value6pollutant_codeS02cou_valuemin_valueflagNmax_zsrtd_zsrtd_value28.3min_zsmax_value7pollutant_codeS03cou_valuemin_valueflagNmax_zsrtd_zsrtd_value26min_zsmax_value8pollutant_codeS01cou_valuemin_valueflagNmax_zsrtd_zsrtd_value17.4min_zsmax_value9pollutant_code002cou_valuemin_valueflagNmax_zsrtd_zsrtd_value70.07min_zsmax_value10pollutant_codeS08cou_valuemin_valueflagNmax_zsrtd_zsrtd_value12.3min_zsmax_value11pollutant_code004cou_valuemin_valueflagNmax_zsrtd_zsrtd_value21.6min_zsmax_value12pollutant_code003cou_valuemin_valueflagNmax_zsrtd_zsrtd_value24.8min_zsmax_valueo2uk
    lsifromMigrate
  2. 对应receiver.log
    [15:43:17 CST 2020/04/01] [INFO] (mongoshake/receiver.(*ExampleReplayer).handler:143){"ts":0,"op":"","g":"","ns":"","o":null,"o2":null,"uk":null,"lsid":null,"fromMigrate":false}

    结论

    感觉好像receiver接受的raw处理的不太对。

@vinllen

vinllen commented 4 years ago

op字段为空,这里的确是有问题的,版本发布没有覆盖到receiver的测试,这个2.4.1版本我修复一下。 性能负载可以打下restful接口看下,curl一下9100的端口,看看各个队列的情况,能知道大概瓶颈。可以钉钉私聊我一下,添加请说明一下,readme上扫码加群能看到我。

vinllen commented 4 years ago

2.4.1版本已经修复解析的问题。