Automattic / mongoose

MongoDB object modeling designed to work in an asynchronous environment.
https://mongoosejs.com
MIT License

No primary server available #3634

Closed ChrisZieba closed 8 years ago

ChrisZieba commented 8 years ago

I have an issue that is rather difficult to debug, and was wondering if anyone sees anything wrong with my configuration.

Error no primary server available

Node.js version 4.2.1 and MongoDB version 3.0.7 with mongoose 4.2.8.

This seems to happen randomly, and the app will open many connections until I finally restart the node process. The cluster is healthy at all times while this error occurs, and the error happens hundreds of times per hour. There does not seem to be any consistency as to when it begins; for example, it occurs when the cluster is operating normally and no changes to the primary have been made.

This is what the db stats look like. As you can see the number of connections will steadily increase. If I kill the node process and start a new one everything is fine.

[Screenshot: db stats, 2015-11-30 5:21 PM]

Config

  // Connect
  mongoose.connect(config.mongo.connectionString, {
    server: {
      socketOptions: {
        socketTimeoutMS: 5 * 60 * 1000,
        keepAlive: 1
      }
    },
    replset: {
      socketOptions: {
        socketTimeoutMS: 5 * 60 * 1000,
        keepAlive: 1
      }
    }
  });

Connection String

mongodb://username:password@mongo-1.cz.0200.mongodbdns.com:27000,mongo-2.cz.0200.mongodbdns.com:27000,mongo-3.cz.0200.mongodbdns.com:27000/dbase

Stack trace

node_modules/mongoose/node_modules/mongodb/node_modules/mongodb-core/lib/topologies/replset.js:860 pickServer
node_modules/mongoose/node_modules/mongodb/node_modules/mongodb-core/lib/topologies/replset.js:437 command
node_modules/mongoose/node_modules/mongodb/lib/replset.js:392 command
node_modules/mongoose/node_modules/mongodb/lib/db.js:281 executeCommand
node_modules/mongoose/node_modules/mongodb/lib/db.js:305 command
node_modules/newrelic/lib/instrumentation/mongodb.js:177 wrapped
node_modules/mongoose/node_modules/mongodb/lib/collection.js:2327 findAndModify
node_modules/mongoose/node_modules/mongodb/lib/collection.js:2265 findAndModify
node_modules/newrelic/lib/instrumentation/mongodb.js:177 wrapped [as findAndModify]
node_modules/mongoose/lib/drivers/node-mongodb-native/collection.js:136 (anonymous function) [as findAndModify]
node_modules/mongoose/node_modules/mquery/lib/collection/node.js:79 findAndModify
node_modules/mongoose/lib/query.js:1833 _findAndModify
node_modules/mongoose/lib/query.js:1621 _findOneAndUpdate
node_modules/mongoose/node_modules/kareem/index.js:156 none
node_modules/mongoose/node_modules/kareem/index.js:18 none

amit777 commented 8 years ago

Do you think it is possible or likely that the massive number of disconnects/reconnects can intermittently cause the "no primary server available" issue?

christkv commented 8 years ago

Yes, as there will be a brief period where there might not be any servers in the set.

ChrisZieba commented 8 years ago

@christkv I've been waiting until this happens again to send you some logs in that other ticket. Our cluster has actually been stable for the last few weeks and we have not seen this error.

christkv commented 8 years ago

@ChrisZieba Funny how that always seems to happen, lol :+1: I'll leave the ticket open in Jira for now and see what we can figure out.

amit777 commented 8 years ago

@christkv Hi Christian, I'm just curious if you have any pointers on workarounds in the case of lower traffic. I was thinking of just reducing the pool size as well as increasing the timeouts.

amit777 commented 8 years ago

If it helps anyone else: I removed the socket timeout, increased keepAlive to 200, and reduced the poolSize to 3. I seem to have far fewer disconnects/reconnects; however, it does still occasionally happen.
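
A rough sketch of that combination, using the same mongoose 4.x server/replset option shape as the original report (the connection string and exact values are placeholders; adjust for your own setup):

  // Sketch of the workaround described above: no explicit socketTimeoutMS,
  // keepAlive raised to 200, poolSize reduced to 3. Values are illustrative.
  var mongoose = require('mongoose');

  mongoose.connect(config.mongo.connectionString, {
    server: {
      poolSize: 3,
      socketOptions: { keepAlive: 200 }
    },
    replset: {
      poolSize: 3,
      socketOptions: { keepAlive: 200 }
    }
  });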

refaelos commented 8 years ago

If it helps anyone, we removed almost all mongoose settings, including socketTimeout, connectionTimeout, and keepAlive, and connections became stable. Our poolSize is 200. I'm not sure it's the recommended approach, but it works now. We're still monitoring it to make sure it holds.

mongoose v4.4.2, Node 4, MongoDB 3.0

christkv commented 8 years ago

Do you have a huge number of slow operations? If you don't, I don't think you will notice any difference between a pool of 20 sockets vs. 500.

refaelos commented 8 years ago

Sorry... it's 200. Fixed the comment.

And yeah, you're right. We don't notice much difference, but we'd rather have the pool size larger than smaller.

The real problem is when connections keep opening and aren't closed. This used to happen until we removed all mongoose timeout and keepAlive settings. I wonder why these are handled by mongoose/the mongo driver instead of letting the OS do it?

christkv commented 8 years ago

Well, 2.1.7 and higher has a redesigned pool that avoids this. If you set socketTimeout to 0 you delegate it to the OS, but that might mean as much as 10 minutes of hanging connections.
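
For reference, a minimal sketch of what delegating the socket timeout to the OS could look like, reusing the option shape from earlier in this thread (the connection string is a placeholder):

  // Sketch: socketTimeoutMS of 0 leaves idle-socket cleanup to the OS,
  // which, as noted above, can leave connections hanging for ~10 minutes.
  var mongoose = require('mongoose');

  mongoose.connect(config.mongo.connectionString, {
    server:  { socketOptions: { socketTimeoutMS: 0 } },
    replset: { socketOptions: { socketTimeoutMS: 0 } }
  });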

refaelos commented 8 years ago

OK, interesting. So now that I removed the keepAlive and socketTimeout settings, what are the default settings?

christkv commented 8 years ago

It depends; I'm not sure if mongoose sets any specific settings by default. If you use the MongoClient.connect method in the driver, it's 30 seconds for both the connect and socket timeouts.

refaelos commented 8 years ago

We do use connect, but when we set 30 seconds manually, connections start to pile up.

christkv commented 8 years ago

Well, with 500 connections you need at least 500 ops inside the socketTimeout period to keep the pool open; otherwise it will close down and force a reconnect. This changes in 2.1.7, however, as the pool uses a growing/shrinking model.
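
A quick back-of-the-envelope check of that relationship (all numbers below are illustrative, not measured):

  // Illustrative arithmetic only: with N pooled sockets and a socket
  // timeout of T, each socket must see roughly one op per T window or it
  // is closed and later reconnected (pre-2.1.7 pool behaviour described above).
  var poolSize = 500;           // sockets in the pool
  var socketTimeoutMS = 30000;  // 30 second socket timeout
  var opsPerSecond = 10;        // observed traffic

  var opsPerWindow = opsPerSecond * (socketTimeoutMS / 1000); // 300
  console.log('ops per timeout window:', opsPerWindow);
  console.log('enough to keep the pool busy?', opsPerWindow >= poolSize); // false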

15astro commented 8 years ago

I am having the same issue with MongoDB 3.2.6 and mongoose 4.3.4. Any help on this?

refaelos commented 8 years ago

@15astro try removing the socketTimeout and connectionTimeout settings and see if it helps.

15astro commented 8 years ago

@refaelos OK, will try that. I tried with keepAlive=6000 but that didn't help. Just wanted to know how removing socketTimeout and connectionTimeout will help?

refaelos commented 8 years ago

Yeah, we tried it with different values, and only when we completely removed these settings did things start to work well.

15astro commented 8 years ago

@refaelos: I had no luck with removing these settings. Is there anything else I am missing?

refaelos commented 8 years ago

@15astro No, man. Sorry. This is what our settings look like today:

  mongo: {
    uri: process.env.MNG_URL || 'mongodb://localhost/myDB',
    options: {
      user: process.env.MNG_USER,
      pass: process.env.MNG_PASS,
      replset: {
        poolSize: 200
      }
    }
  }

adriank commented 8 years ago

In my case it was related to a lack of IP-to-name mappings in /etc/hosts.

If you have set up the replica set with names instead of IPs and you have something like this in /etc/hosts on the MongoDB nodes:

10.10.10.10 mongodb-2gb-fra1-02
10.10.10.11 mongodb-2gb-fra1-01
10.10.10.12 mongodb-2gb-fra1-03

Then you also need to put it in /etc/hosts of all your app servers.

I thought that node-mongo connects according to whatever I put in the URI, but that's not the case.

It seems that node-mongo connects by IP or name from the Mongo URI, then gets the hostnames of the other replica members from the first MongoDB node that responds to the request. It gets, for example, mongodb-2gb-fra1-03 and passes it to the OS for resolving. If the OS doesn't know anything about mongodb-2gb-fra1-03, it throws "Error no primary server available".
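
One way to sanity-check this from an app server is to try resolving each member hostname exactly as the replica set reports it (the hostnames below are just the examples from this comment):

  // Verify the app server can resolve the replica set members' hostnames.
  // Replace the list with the names from your own replica set config.
  var dns = require('dns');

  ['mongodb-2gb-fra1-01', 'mongodb-2gb-fra1-02', 'mongodb-2gb-fra1-03'].forEach(function (host) {
    dns.lookup(host, function (err, address) {
      if (err) {
        console.error(host + ' does not resolve; add it to /etc/hosts or DNS');
      } else {
        console.log(host + ' -> ' + address);
      }
    });
  });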

Hope that helps.

christkv commented 8 years ago

@adriank Yes, that's correct: it bases its connections on the ones it gets back from the replica set config. The reason is that this is the canonical source of truth about a replica set. This is also why all addresses in the replica set configuration must be resolvable by the driver, both for the driver to fail over correctly and for it to detect servers being added to and removed from the set. Previous drivers did not implement the SDAM spec and were more lax; that, however, would cause problems in production environments.

adriank commented 8 years ago

@christkv However, it is a nightmare for tools like our MongoSpector. Because of it we have problems connecting securely to more than one replica from one host. DigitalOcean auto-generates names for droplets that almost nobody changes, and the effect is that many clients have mongodb-2gb-fra1-01 as their PRIMARY. :) I hope we can figure something out.

christkv commented 8 years ago

We are tracking a server ticket here https://jira.mongodb.org/browse/SERVER-1889. I would love for something like this to be possible.

We should also file a ticket with DigitalOcean pointing out the mistake they are making and how it's affecting their users.

christkv commented 8 years ago

By the way, you can remove and re-add the replica set members with their new names being IPs.

ArinCantCode commented 6 years ago

Having a similar issue: after around 12-24 hours of being connected, we get an error "No primary server available".

Restarting usually fixes the issue.

connection: {
  "url": "mongodb://user:password@cluser-shard-00-00, cluser-shard-00-01, cluster-shard-00-02/settings?ssl=true&replicaSet=primarycluster-shard-0&authSource=admin&retryWrites=true",
  "options": {
    "db": { "w": 1, "wtimeout": 3000, "fsync": true },
    "authSource": "admin",
    "server": {
      "poolSize": 3,
      "socketOptions": {
        "autoReconnect": true,
        "keepAlive": 60000,
        "connectTimeoutMS": 7000,
        "socketTimeoutMS": 15000
      }
    }
  },
  "password": "password",
  "username": "username"
}

vishald2509 commented 1 year ago

We added &readPreference=secondaryPreferred

This reduced the operations on the primary, and we didn't get the "no primary found" error anymore. This could be a temporary fix for us; once there is a larger number of write operations, I believe this might happen again.

MongoClient version used: 3.5
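
For anyone wanting to try the same thing, a minimal sketch with the 3.x Node driver (the URI, hostnames, and database name are placeholders; readPreference can live in the URI, as above, or be passed as an option):

  // Sketch: prefer secondaries for reads; writes still need a primary.
  var MongoClient = require('mongodb').MongoClient;

  var uri = 'mongodb://user:password@host-0,host-1,host-2/mydb' +
            '?replicaSet=rs0&readPreference=secondaryPreferred';

  MongoClient.connect(uri, { useUnifiedTopology: true }, function (err, client) {
    if (err) throw err;
    var db = client.db('mydb');
    db.collection('settings').find({}).toArray(function (err, docs) {
      if (err) throw err;
      console.log('read ' + docs.length + ' docs, preferring a secondary');
      client.close();
    });
  });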