Closed pinkyu1123 closed 3 years ago
同步了多久?报错cursor not found的时候有看过源端负载情况如何吗?mongodb日志层面有没有什么报错。 另外,给一下MongoDB跟MongoShake的版本号
完整的错误日志也给一下
logs/collector.log
[2020/09/21 08:01:40 UTC] [INFO] [name=mongos, stage=full, get=1646818, tps=695]
[2020/09/21 08:01:45 UTC] [INFO] [name=mongos, stage=full, get=1650245, tps=691]
[2020/09/21 08:01:50 UTC] [INFO] [name=mongos, stage=full, get=1653740, tps=680]
[CRIT] splitter reader[DocumentReader src[mongodb://admin:***@host:port] ns[{DB coll}] query[map[]]] get next document failed: (CursorNotFound) Cursor not found (namespace: 'DB.collection', id: 1694252915531455415).
Source : MongoDB atlas Version 3.6.20 Target: MongoDB atlas Version 4.4.10 Mongo-Shake Version mongo-shake-v2.4.11
調整三次關於 full sync configuration,線程,併發數以及batch大小,每次同樣都是某張五億筆大表開始sync的時間,往後算30min 之後就噴err,猜測是Source Mongo 某個 time out機制影響?
试一下2.4.13版本,你这个问题应该是cursor timeout导致被Kill了,2.4.13中设置了NoCursorTimeout
試過2.4.13版,一樣半小時準時噴錯
[CRIT] splitter reader[DocumentReader src[mongodb://admin:***@host:port] ns[{DB coll}] query[map[]]] get next document failed: (CursorNotFound) Cursor not found (namespace: 'DB.collection', id: 1694252915531455415).
能否给一下对应时间点前后source mongodb的日志
查過源端所有shard mongod的log ,在發生cursor timeout前都是正常如下sync的紀錄
2020-09-22T11:23:51.550+0000 I COMMAND [conn21538] command DB.xxxxxxx command: getMore { getMore: 1290686147562, collection: "xxxxxxx", batchSize: 8192, lsid: { id: UUID("xxxxxxx"), uid: BinData(0, 3B408CB48548B5037822C10EB0976B3CBF2CEE3BF9C708796BF03941FBECD80F) }, $readPreference: { mode: "secondaryPreferred" }, $clusterTime: { clusterTime: Timestamp(1600773831, 1730), signature: { hash: BinData(0, 2F3CC426EBF54F0A0B138D03A437FAD2A850B851), keyId: 6820119147892965560 } }, $client: { driver: { name: "mongo-go-driver", version: "v1.4.0+prerelease" }, os: { type: "linux", architecture: "amd64" }, platform: "go1.10.3", mongos: { host: "xxxxxxx", client: "xxxxxxx", version: "3.6.20" } }, $configServerState: { opTime: { ts: Timestamp(1600773831, 252), t: 44 } }, $db: "xxxxxxx" } originatingCommand: { find: "xxxxxxx", projection: { $sortKey: { $meta: "sortKey" } }, sort: { _id: 1 }, hint: { _id: 1 }, readConcern: { level: "local" }, batchSize: 8192, noCursorTimeout: true, shardVersion: [ Timestamp(357, 1), ObjectId('5ec4ff7d0ece43747db41c0b') ], lsid: { id: UUID("xxxxxxx"), uid: BinData(0, 3B408CB48548B5037822C10EB0976B3CBF2CEE3BF9C708796BF03941FBECD80F) }, $readPreference: { mode: "secondaryPreferred" }, $clusterTime: { clusterTime: Timestamp(1600772032, 929), signature: { hash: BinData(0, A694CA7331360C053B81E7C29A585C7C43078C87), keyId: 6820119147892965560 } }, $client: { driver: { name: "mongo-go-driver", version: "v1.4.0+prerelease" }, os: { type: "linux", architecture: "amd64" }, platform: "go1.10.3", mongos: { host: "xxxxxxx", client: "xxxxxxx", version: "3.6.20" } }, $configServerState: { opTime: { ts: Timestamp(1600772032, 7), t: 44 } }, $db: "xxxxxxx" } planSummary: IXSCAN { _id: 1 } cursorid:1290686147562 keysExamined:8192 docsExamined:8192 numYields:64 nreturned:8192 reslen:1694978 locks:{ Global: { acquireCount: { r: 130 }, acquireWaitCount: { r: 33 }, timeAcquiringMicros: { r: 53920 } }, Database: { acquireCount: { r: 65 } }, Collection: { acquireCount: { r: 65 } } } protocol:op_msg 136ms
然後一到start sync開始時間的30分鐘後,準時collector.log 就會出現Cursor not found
對了,還有我們有透過stunnel的服務傳輸SSL的連線,打到altas mongodb做 mongo-shake sync,因為mongo-shake還不支援SSL,所以只好出此下策,不過先前測試raw data 10G內的mongodb all sync是完全沒任何問題的。
SSL后面会支持。 cursor nout found这个问题我定位一下。
根據官方的文檔,如果隱式對話空閒了三十分鐘,它將會終止所有正在進行的操作並打開與該會話關聯的游標。這包括配置了noCursorTimeout或maxTimeMS大於30分鐘的游標。
我也遇到了找个问题
MongoDB版本4.0 shake版本 2.4.17 开启full同步报错
query[map[]]] get next document failed: (BadValue) cannot set maxTimeMS on getMore command for a non-awaitData cursor
官方driver存在bug导致,mgo的driver不存在问题,深层原因还在定位中
2.4.19版本已发布,修复这个问题,感谢各位反馈和配合
想請問一下,源端跟目的端都是分片集的架構,嘗試了full or all sync 遇上五億多筆的大表,都會出現Cursor not found的錯誤,然後停止sync