alibaba / MongoShake

MongoShake is a universal data replication platform based on MongoDB's oplog. Redundant replication and active-active replication are two most important functions. 基于mongodb oplog的集群复制工具,可以满足迁移和同步的需求,进一步实现灾备和多活功能。
GNU General Public License v3.0
1.72k stars 441 forks source link

get next document failed: (CursorNotFound) Cursor not found #451

Closed pinkyu1123 closed 3 years ago

pinkyu1123 commented 4 years ago

想請問一下,源端跟目的端都是分片集的架構,嘗試了full or all sync 遇上五億多筆的大表,都會出現Cursor not found的錯誤,然後停止sync

vinllen commented 4 years ago

同步了多久?报错cursor not found的时候有看过源端负载情况如何吗?mongodb日志层面有没有什么报错。 另外,给一下MongoDB跟MongoShake的版本号

vinllen commented 4 years ago

完整的错误日志也给一下

pinkyu1123 commented 4 years ago

logs/collector.log

[2020/09/21 08:01:40 UTC] [INFO] [name=mongos, stage=full, get=1646818, tps=695]
[2020/09/21 08:01:45 UTC] [INFO] [name=mongos, stage=full, get=1650245, tps=691]
[2020/09/21 08:01:50 UTC] [INFO] [name=mongos, stage=full, get=1653740, tps=680]
[CRIT] splitter reader[DocumentReader src[mongodb://admin:***@host:port] ns[{DB coll}] query[map[]]] get next document failed: (CursorNotFound) Cursor not found (namespace: 'DB.collection', id: 1694252915531455415).
pinkyu1123 commented 4 years ago

Source : MongoDB atlas Version 3.6.20 Target: MongoDB atlas Version 4.4.10 Mongo-Shake Version mongo-shake-v2.4.11

pinkyu1123 commented 4 years ago

調整三次關於 full sync configuration,線程,併發數以及batch大小,每次同樣都是某張五億筆大表開始sync的時間,往後算30min 之後就噴err,猜測是Source Mongo 某個 time out機制影響?

vinllen commented 4 years ago

试一下2.4.13版本,你这个问题应该是cursor timeout导致被Kill了,2.4.13中设置了NoCursorTimeout

pinkyu1123 commented 4 years ago

試過2.4.13版,一樣半小時準時噴錯

[CRIT] splitter reader[DocumentReader src[mongodb://admin:***@host:port] ns[{DB coll}] query[map[]]] get next document failed: (CursorNotFound) Cursor not found (namespace: 'DB.collection', id: 1694252915531455415).
vinllen commented 4 years ago

能否给一下对应时间点前后source mongodb的日志

pinkyu1123 commented 4 years ago

查過源端所有shard mongod的log ,在發生cursor timeout前都是正常如下sync的紀錄

2020-09-22T11:23:51.550+0000 I COMMAND  [conn21538] command DB.xxxxxxx command: getMore { getMore: 1290686147562, collection: "xxxxxxx", batchSize: 8192, lsid: { id: UUID("xxxxxxx"), uid: BinData(0, 3B408CB48548B5037822C10EB0976B3CBF2CEE3BF9C708796BF03941FBECD80F) }, $readPreference: { mode: "secondaryPreferred" }, $clusterTime: { clusterTime: Timestamp(1600773831, 1730), signature: { hash: BinData(0, 2F3CC426EBF54F0A0B138D03A437FAD2A850B851), keyId: 6820119147892965560 } }, $client: { driver: { name: "mongo-go-driver", version: "v1.4.0+prerelease" }, os: { type: "linux", architecture: "amd64" }, platform: "go1.10.3", mongos: { host: "xxxxxxx", client: "xxxxxxx", version: "3.6.20" } }, $configServerState: { opTime: { ts: Timestamp(1600773831, 252), t: 44 } }, $db: "xxxxxxx" } originatingCommand: { find: "xxxxxxx", projection: { $sortKey: { $meta: "sortKey" } }, sort: { _id: 1 }, hint: { _id: 1 }, readConcern: { level: "local" }, batchSize: 8192, noCursorTimeout: true, shardVersion: [ Timestamp(357, 1), ObjectId('5ec4ff7d0ece43747db41c0b') ], lsid: { id: UUID("xxxxxxx"), uid: BinData(0, 3B408CB48548B5037822C10EB0976B3CBF2CEE3BF9C708796BF03941FBECD80F) }, $readPreference: { mode: "secondaryPreferred" }, $clusterTime: { clusterTime: Timestamp(1600772032, 929), signature: { hash: BinData(0, A694CA7331360C053B81E7C29A585C7C43078C87), keyId: 6820119147892965560 } }, $client: { driver: { name: "mongo-go-driver", version: "v1.4.0+prerelease" }, os: { type: "linux", architecture: "amd64" }, platform: "go1.10.3", mongos: { host: "xxxxxxx", client: "xxxxxxx", version: "3.6.20" } }, $configServerState: { opTime: { ts: Timestamp(1600772032, 7), t: 44 } }, $db: "xxxxxxx" } planSummary: IXSCAN { _id: 1 } cursorid:1290686147562 keysExamined:8192 docsExamined:8192 numYields:64 nreturned:8192 reslen:1694978 locks:{ Global: { acquireCount: { r: 130 }, acquireWaitCount: { r: 33 }, timeAcquiringMicros: { r: 53920 } }, Database: { acquireCount: { r: 65 } }, Collection: { acquireCount: { r: 65 } } } protocol:op_msg 136ms

然後一到start sync開始時間的30分鐘後,準時collector.log 就會出現Cursor not found

pinkyu1123 commented 4 years ago

對了,還有我們有透過stunnel的服務傳輸SSL的連線,打到altas mongodb做 mongo-shake sync,因為mongo-shake還不支援SSL,所以只好出此下策,不過先前測試raw data 10G內的mongodb all sync是完全沒任何問題的。

vinllen commented 4 years ago

SSL后面会支持。 cursor nout found这个问题我定位一下。

pinkyu1123 commented 4 years ago

https://docs.mongodb.com/manual/reference/method/cursor.noCursorTimeout/#session-idle-timeout-overrides-nocursortimeout

根據官方的文檔,如果隱式對話空閒了三十分鐘,它將會終止所有正在進行的操作並打開與該會話關聯的游標。這包括配置了noCursorTimeout或maxTimeMS大於30分鐘的游標。

Hisoka-J commented 4 years ago

我也遇到了找个问题

jianhuihi commented 3 years ago

MongoDB版本4.0 shake版本 2.4.17 开启full同步报错

query[map[]]] get next document failed: (BadValue) cannot set maxTimeMS on getMore command for a non-awaitData cursor

vinllen commented 3 years ago

官方driver存在bug导致,mgo的driver不存在问题,深层原因还在定位中

vinllen commented 3 years ago

2.4.19版本已发布,修复这个问题,感谢各位反馈和配合