cayasso / mongo-oplog

Watch mongodb oplog in a simple way
380 stars 91 forks source link

MongoError: operation exceeded time limit #61

Open doktoroblivion opened 6 years ago

doktoroblivion commented 6 years ago

Anyone seeing the above error on the oplog cursor? My configuration is I have a remote machine which is my oplog source, and my oplog application is running locally pointing to a local mongoDB instance. After some time and many insert, update and delete events, I will eventually get the following. I am using mongoDB 3.4.x locally and my remote machine is 3.2.8.

The following appears to occur sometime when questions of connection and query length reach the internal MongoDB limit cursor.maxTimeMS. Can anyone tell me or suggest a sufficient reason that the code would get this? stream.js appears to be correctly coded, is there an issue elsewhere?

    tailable: true,
    awaitData: true,
    oplogReplay: true,
    noCursorTimeout: true,

Error is:

  startOplog 0] ------- Error ------- +10s
  startOplog-details 0] Error object: { MongoError: operation exceeded time limit
    at Function.MongoError.create (/home/e/src/mongo-oplog-2.1.0/node_modules/mongodb-core/lib/error.js:31:11)
    at /home/e/src/mongo-oplog-2.1.0/node_modules/mongodb-core/lib/connection/pool.js:497:72
    at authenticateStragglers (/home/e/src/mongo-oplog-2.1.0/node_modules/mongodb-core/lib/connection/pool.js:443:16)
    at Connection.messageHandler (/home/e/src/mongo-oplog-2.1.0/node_modules/mongodb-core/lib/connection/pool.js:477:5)
    at TLSSocket.<anonymous> (/home/e/src/mongo-oplog-2.1.0/node_modules/mongodb-core/lib/connection/connection.js:331:22)
    at emitOne (events.js:115:13)
    at TLSSocket.emit (events.js:210:7)
    at addChunk (_stream_readable.js:252:12)
    at readableAddChunk (_stream_readable.js:239:11)
    at TLSSocket.Readable.push (_stream_readable.js:197:10)
  name: 'MongoError',
  message: 'operation exceeded time limit',
  ok: 0,
  errmsg: 'operation exceeded time limit',
  code: 50 } +10s
  startOplog 0] ------- Error ------- +1ms
  mongo-oplog stream ended +1ms
  startOplog 0] ------- End ------- +0ms
  startOplog-details 0] End object: undefined +0ms
  startOplog 0] ------- End ------- +0ms
  mongo-oplog oplog error {"name":"MongoError","message":"operation exceeded time limit","ok":0,"errmsg":"operation exceeded time limit","code":50} +1ms
  startOplog 2] ------- Error ------- +1ms
doktoroblivion commented 6 years ago

I have made the following changes to stream.js which have made the symptoms better, but not perfect.

  async function time() {
    if (ts) return (typeof ts !== 'number') ? ts : Timestamp(0, ts)

    const doc = await coll
      .find({ }, { ts: 1 })
      .sort({ $natural: -1 })
      .limit(1)
      .nextObject()

    //return doc ? doc.ts : Timestamp(0, (Date.now() / 1000 | 0))
    return doc ? doc.ts : Timestamp(0, 0)
  }

  if (ns) query.ns = { $regex: regex(ns) }
  query.ts = { $gt: await time() }

  return (await coll.find(query, {
    "tailable": true,
    "awaitData": true,
    "oplogReplay": true,
    "noCursorTimeout": true,
    "maxTimeMS": 200,
    "numberOfRetries": Number.MAX_VALUE
  })).stream()

I am also trying the following, but do not have enough data points on how that will do yet.

  return (await coll.find(query, {
    "tailable": true,
    "awaitData": true,
    "oplogReplay": true,
    "maxTimeMS": 100,
    "numberOfRetries": Number.MAX_VALUE
  }).addCursorFlag('noCursorTimeout',true)).stream()
doktoroblivion commented 6 years ago

So I chased this far enough to know that in the mongodb npm driver, collection.js checks for and translates timeout, ie.

  // Translate to new command option noCursorTimeout
  if(typeof newOptions.timeout == 'boolean') newOptions.noCursorTimeout = newOptions.timeout;

Therefore I made the following change to see if that would help at all.

    const doc = await coll
      .find({ }, { ts: 1, timeout: false })
      .sort({ $natural: -1 })
      .limit(1)
      .nextObject()

    //return doc ? doc.ts : Timestamp(0, (Date.now() / 1000 | 0))
    return doc ? doc.ts : Timestamp(0, 0)
  }

  if (ns) query.ns = { $regex: regex(ns) }
  query.ts = { $gt: await time() }

  return (await coll.find(query, {
    "tailable": true,
    "awaitData": true,
    "oplogReplay": true,
    "maxTimeMS": 100,
    "timeout": false,
    "numberOfRetries": Number.MAX_VALUE
  })).stream()

After some time it still throws the error, not on just one of my oplog streams, but all.

  startOplog 1] ------- End ------- +0ms
  startOplog-details +0ms] NO End object data
  startOplog 1] ------- End ------- +0ms
  mongo-oplog oplog error {"name":"MongoError","message":"operation exceeded time limit","ok":0,"errmsg":"operation exceeded time limit","code":50} +0ms
  startOplog 0] ------- Error ------- +0ms
  startOplog-details 0] Error object: { MongoError: operation exceeded time limit
    at Function.MongoError.create (/home/egriffin/mesos-infrastructure/src/mongo-oplog-2.1.0/node_modules/mongodb-core/lib/error.js:31:11)
    at /home/egriffin/mesos-infrastructure/src/mongo-oplog-2.1.0/node_modules/mongodb-core/lib/connection/pool.js:497:72
    at authenticateStragglers (/home/egriffin/mesos-infrastructure/src/mongo-oplog-2.1.0/node_modules/mongodb-core/lib/connection/pool.js:443:16)
    at Connection.messageHandler (/home/egriffin/mesos-infrastructure/src/mongo-oplog-2.1.0/node_modules/mongodb-core/lib/connection/pool.js:477:5)
    at TLSSocket.<anonymous> (/home/egriffin/mesos-infrastructure/src/mongo-oplog-2.1.0/node_modules/mongodb-core/lib/connection/connection.js:333:22)
    at emitOne (events.js:115:13)
    at TLSSocket.emit (events.js:210:7)
    at addChunk (_stream_readable.js:252:12)
    at readableAddChunk (_stream_readable.js:239:11)
    at TLSSocket.Readable.push (_stream_readable.js:197:10)
  name: 'MongoError',
  message: 'operation exceeded time limit',
  ok: 0,
  errmsg: 'operation exceeded time limit',
  code: 50 } +0ms
doktoroblivion commented 6 years ago

Next steps I hacked mongodb npm module, lib/collection.js

  // Translate to new command option noCursorTimeout
  if(typeof newOptions.timeout == 'boolean') newOptions.noCursorTimeout = newOptions.timeout;
  newOptions.noCursorTimeout = false;

  // Merge in options to command
  for(var name in newOptions) {
    if(newOptions[name] != null) {
 console.log("coll.js opt[%s]: %s", name, newOptions[name]);
      findCommand[name] = newOptions[name];
    }
  }

I see its getting set, not sure about its use yet:

coll.js opt[readPreference]: [object Object]
coll.js opt[tailable]: true
coll.js opt[awaitData]: true
coll.js opt[oplogReplay]: true
coll.js opt[maxTimeMS]: 100
coll.js opt[timeout]: false
coll.js opt[numberOfRetries]: 1.7976931348623157e+308
coll.js opt[skip]: 0
coll.js opt[limit]: 0
coll.js opt[slaveOk]: true
coll.js opt[noCursorTimeout]: false
doktoroblivion commented 6 years ago

I think I found the real issue, the index.js code is not restarting the tail on timeout. So I am testing the following.

  function onerror(err) {
    if (/cursor (killed or )?timed out/.test(err.message)) {
      debug('cursor timeout - re-tailing %j', err)
      tail()
    } else if (/exceeded time limit/.test(err.message)) {
      debug('cursor exceeded timeout - re-tailing %j', err)
      tail()
    } else {
      debug('oplog error %j', err)
      oplog.emit('error', err)
    }
  }
doktoroblivion commented 6 years ago

The above appears to work, I will create a PR to contrib this back. Since I am using TLS connections across the wire this may have something to do with you missing this particular cursor error, as well as perhaps the versions of node and npm I am using.

heipei commented 6 years ago

Hey, did you ever create a PR? Was this issue solved? Seeing some problems with oplog tailing here as well, and just want to make sure which path to chase down on.

doktoroblivion commented 6 years ago

@heipei No, I never created a PR for the above, but the above does appear to work for us. Its strange since I am not sure what triggers the other message and have not been able to track down the cause, other than possible internal private network contstraints (we do not use external IPs with oplog).