Automattic / mongoose

MongoDB object modeling designed to work in an asynchronous environment.
https://mongoosejs.com
MIT License

Rare failure to auto reconnect #8547

Closed. leon-unity closed this issue 4 years ago

leon-unity commented 4 years ago

Do you want to request a feature or report a bug? bug

What is the current behavior? Normally the Mongoose driver auto-reconnects if a MongoDB server within the cluster goes down (as was temporarily the case during the update). However, during the recent update from MongoDB 4.2.1 to 4.2.2, two of our pods failed to reconnect and lost their connections. Restarting the pods resolved it, but I'm reporting it on the off chance that it may be resolvable.

If the current behavior is a bug, please provide the steps to reproduce.

All we did was update the MongoDB version (via cloud.mongodb.com) from 4.2.1 to 4.2.2; we've done patch-level upgrades before without issue.

What is the expected behavior? The Mongoose driver should automatically fail over and reconnect (and the majority of the pods did).

What are the versions of Node.js, Mongoose and MongoDB you are using? Note that "latest" is not a version.

Failing pods:

  * Mongoose 5.8.9, Node.js 12.14.1, MongoDB 4.2.2
  * Mongoose 5.4.6, Node.js 10.15.3, MongoDB 4.2.2

Note: not all the replicated pods actually failed (roughly 1/4 of them did).

The non-failing pods were running a mix of versions (none matching the above).

vkarpov15 commented 4 years ago

A few questions:

  1. What versions do the other pods use?
  2. It looks like you're using MongoDB Atlas - is that the case?
  3. Do you use useUnifiedTopology?
  4. Do you use +srv in your Atlas connection string, or not?
  5. Can you clarify what "two of our pods failed to reconnect and lost their connections" means in practice? Do you see some sort of error message?

jussikuosa commented 4 years ago

@leon-unity, can you please provide @vkarpov15 the details?

leon-unity commented 4 years ago

Apologies - the notification got swamped.

  1. The failing service was replicated in Kubernetes a number of times - everything the same, with load distributed evenly across the replicas - and only about 1/4 of them failed; the rest handled it without issue, so it looks like a timing/state issue at the time of the upgrade.

There are a number of other services using mongoose/mongodb, but there were no failures on any of those Kubernetes pods/containers (each was also replicated). Their versions were:

"mongoose": "5.7.4" "mongoose": "5.5.15" "mongoose": "5.4.19" "mongoose": "5.3.16" "mongoose": "5.5.12" "mongodb": "3.5.2" "mongodb": "3.3.1", "mongodb": "3.2.2"

  2. Yes - we're using Atlas for staging and production.

  3. Not on the failing services - some of the services that didn't have any issues do use it, but not all.

  4. Not in the two services that showed the failure.

  5. Not really - which I know isn't helpful, sorry. Our external monitoring system reported the traffic issue those pods were having, and an internal Mongoose monitor reported that the client wasn't connected (our message) based on mongoose.connection.readyState being 0 (disconnected) and never recovering (it's connecting to a replica set, so it should have failed over, I believe). After some investigation we restarted the failing pods to resolve the issue in production.

We also log mongoose.connection.on('error') events but didn't see anything there (though it's possible something failed on our end and we never received the message).
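
Roughly, the internal monitor looks something like the following simplified sketch (the 30-second interval and log messages here are illustrative, not our exact code):

```js
// Simplified sketch of the internal monitor: log connection events and
// periodically check readyState (0 = disconnected, 1 = connected).
const mongoose = require('mongoose');

mongoose.connection.on('error', (err) => {
  console.error('mongoose connection error:', err);
});

mongoose.connection.on('disconnected', () => {
  console.warn('mongoose connection lost');
});

setInterval(() => {
  if (mongoose.connection.readyState !== 1) {
    console.warn('client is not connected, readyState =', mongoose.connection.readyState);
  }
}, 30 * 1000);
```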

I would love it if this could be resolved, but at the same time I appreciate that I'm unable to provide much information: digging into the failure didn't show anything else in the logs, and attempts to replicate it in staging weren't successful.

vkarpov15 commented 4 years ago

I took a look, and realistically I don't have enough information to make any headway on this. I tried running some replica set failovers locally, but Mongoose maintained connectivity correctly.

If this continues to be an issue, I'd recommend enabling useUnifiedTopology, which gives Mongoose better insights into the state of the connection to Atlas.
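
A minimal sketch of enabling it, assuming an Atlas-style +srv connection string (the URI below is a placeholder, not a real cluster address):

```js
// Enable the unified topology (MongoDB Node driver 3.3+) on a Mongoose 5.x connection.
const mongoose = require('mongoose');

mongoose.connect('mongodb+srv://user:pass@cluster0.example.mongodb.net/mydb', {
  useNewUrlParser: true,
  useUnifiedTopology: true
});
```

With useUnifiedTopology enabled, the driver uses a single server-monitoring engine for standalone, replica set, and sharded deployments, which generally makes its behaviour during failovers easier to reason about.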