apostrophecms / apostrophe

A full-featured, open-source content management framework built with Node.js that empowers organizations by combining in-context editing and headless architecture in a full-stack JS environment.
https://apostrophecms.com
MIT License

MongoDB cluster issues: retry on disconnect does not fail over, and write errors do not fail over #1508

Closed (boutell closed this issue 5 years ago)

boutell commented 6 years ago

In the event that a member of the replica set becomes unable to respond during a particular find() or insert() or similar operation, there is no automatic failover. The operation fails with an error. Our preference would be to automatically retry it.

According to notes by Thomas Chraibi based on input from Charles Sarrazin of MongoDB, one approach to achieve that is to simply re-attempt the find() or insert() call in question. This will result in the MongoDB driver discovering a functioning replica set node to reconnect to, as in the following pseudocode:

const client = ...
const coll = client.db('cool').collection('llamas');

function attemptInsert(doc, callback) {
  coll.insert(doc, (err, result) => {
    if (err) {
      // you might want to check for some errors, as they might be unrecoverable
      // if (err.msg === 'SomeImportantErrorType') return callback(new Error('unrecoverable!'));

      // now recursively call `attemptInsert` to perform server selection again
      attemptInsert(doc, callback);
      return;
    }

    callback(); // success
  });
}

In addition, there is apparently some sort of issue with our autoReconnect configuration:

        autoReconnect: true,
        // retry forever
        reconnectTries: Number.MAX_VALUE,
        reconnectInterval: 1000

Apparently this will keep retrying the connection to a node that is down for as long as that node is down, which is not ideal.
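
These options are passed as part of the MongoClient.connect call; roughly like this (a sketch only, assuming the 3.x driver's connect signature, with a placeholder URI rather than our actual configuration):

// Sketch: where the reconnect options above are passed, assuming the 3.x
// driver's MongoClient.connect(uri, options, callback) signature.
const { MongoClient } = require('mongodb');

MongoClient.connect('mongodb://localhost:27017/cool', {
  autoReconnect: true,
  // retry forever
  reconnectTries: Number.MAX_VALUE,
  reconnectInterval: 1000
}, (err, client) => {
  if (err) {
    return console.error(err);
  }
  // Once connected, the pool keeps reconnecting to this same node after
  // network errors rather than failing over to another replica set member.
  // ... use client.db('cool').collection('llamas') as usual
});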

However, it is unclear to me why this should occur when, per the retry pseudocode earlier in this issue, find() and insert() operations will apparently make new connections to other nodes as needed.

So, more clarification is needed on the following points before implementation can be completed:

  • In what situation does the autoReconnect behavior come into play?
  • If it is undesirable, what approach would ensure we eventually get connected again to an appropriate node?
  • If new find() and insert() operations already reconnect as needed, is there any value in using autoReconnect at all? What value would that be?
  • What MongoDB errors can be safely classed as "this node is ill or unavailable," as opposed to "you did something you should not have" (examples: oversize document, illegal operation, unique key, etc.)?

boutell commented 6 years ago

cc @mbroadst @csarrazi

boutell commented 6 years ago

Hello @mbroadst @csarrazi, any word on this? We do need your input in order to help our mutual customer with this. Thanks!

mbroadst commented 6 years ago

Hi @boutell, sorry, I lost track of this issue in the mix. Here are the answers to your questions:

  • In what situation does the autoReconnect behavior come into play?

autoReconnect is specifically related to the connection pool, and whether it will be resilient to network failures by attempting to reconnect connections to the server associated with that connection pool.

  • If it is undesirable, what approach would ensure we eventually get connected again to an appropriate node?

autoReconnect does provide the resiliency you desire in this case, reconnecting to the same node you were once connected to.

  • If new find() and insert() operations already reconnect as needed, is there any value in using autoReconnect at all? What value would that be?

The find and insert operations themselves do not presently reconnect.

  • What MongoDB errors can be safely classed as "this node is ill or unavailable," as opposed to "you did something you should not have" (examples: oversize document, illegal operation, unique key, etc.)?

A budding list is growing here (https://github.com/mongodb-js/mongodb-core/blob/master/lib/error.js#L101), based on a number of our specs.
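
Roughly, that translates into a client-side check along these lines (illustrative only; the error name and codes below are examples, not the canonical list from that file):

// Illustrative sketch of classifying "node is ill or unavailable" errors,
// as opposed to application errors that should never be retried.
// The codes are examples of server error codes commonly treated as retryable.
const RETRYABLE_CODES = new Set([
  6,     // HostUnreachable
  7,     // HostNotFound
  89,    // NetworkTimeout
  91,    // ShutdownInProgress
  189,   // PrimarySteppedDown
  10107, // NotMaster
  11600, // InterruptedAtShutdown
  13435  // NotMasterNoSlaveOk
]);

function looksRetryable(err) {
  if (!err) {
    return false;
  }
  // Network-level failures surface as MongoNetworkError in recent 3.x drivers.
  if (err.name === 'MongoNetworkError') {
    return true;
  }
  return RETRYABLE_CODES.has(err.code);
}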

Now to be a little more specific: the connection pool already has support for auto reconnection, and an ability to complete/retry operations once such a connection is reconnected. This is an OK design for a single server connection, but it is not ideal for any more complicated topology, for a few reasons:

  • you lose the benefits of a distributed topology, waiting for the operation to complete on a recovering node
  • there are no guarantees that the node ever recovers, and therefore your operation will be stuck in limbo, or outright fail without having ever been retried.

There are further issues with the way this retryability has been implemented, specifically that there can essentially be no ordering of operations once they are called and fall into this queue; you can queue up five writes, they can each select a different server, and one of them could potentially be stuck for quite some time (and maybe forever), while you think it went through. Very scary indeed.

We have been trying to tackle these issues holistically in core engineering (drivers + server) at MongoDB for the past year, and have come up with some better solutions: retryable writes and retryable reads. Retryable writes (https://docs.mongodb.com/manual/core/retryable-writes/) are already available today in the driver, and are supported by server versions 3.6+. Retryable reads are in the specification process and should be available in the next two quarters.

Now back to "what can we do today." I provided the example code above as an example of something that might end up being a general skeleton for retryable reads in the node driver. When connected to a replica set, each operation called on the driver first performs "server selection," allowing the user to specify things like a read preference. Once a server is selected the operation is queued/written on the connection pool associated with that internal representation of the server. In order to provide the kind of highly available retryability that our existing retryability has led users to believe they have, one would need to move the retryability up a level, and incorporate server selection. Essentially, once the operation fails with a retryable error, the topology should retry server selection and enqueue the operation on an available node for immediate execution.

Unfortunately, we cannot speed up our timetable to implement retryable reads for y'all, but perhaps we can work together on a module that prototypes it, if that suits your needs? I imagine the best place to inject this would be at the topology level (which is unfortunately a bit in flux at the moment); monkey-patching the primitives (https://github.com/mongodb/node-mongodb-native/blob/master/lib/topologies/topology_base.js#L314-L332) to provide operation re-execution would be a good place to start.

Hope this helps, please let me know if I can clarify anything.

boutell commented 6 years ago

Matt, thanks for these clarifications.

Please let me know if this summary of my understanding is correct:

Do I have all of this right? If not, please let me know where any misunderstanding might still lie.

Thanks!

boutell commented 6 years ago

Hi @mbroadst,

Ludo visited us here in Philadelphia and provided some information toward a better understanding. We'd like to run our combined understanding past you.

Is our understanding correct?

Thanks!

mbroadst commented 6 years ago

Hi @boutell, answers inline:

  • Autoreconnect: no good with replica sets for all the reasons I already summarized. Fine here.

Yep!

  • However, everything I was hoping about "the next request" working was completely wrong. (: Specifically, the driver will never start trying requests on another node on its own, even though it knows a list of nodes via the URI (old style) or via mongodb+srv DNS records (new style).

This is incorrect. The driver does do server discovery and monitoring, and keeps an internal up-to-date list of servers in the topology. Each time you execute an operation, a server is selected from the internal topology description based on the provided read preference (or primary by default).

This is what I was referring to before when I provided the initial pseudocode for retryability:

  • call the operation (let's say collection.insert)
  • the operation selects the current primary
  • an error occurs, and collection.insert is called again (per the above pseudocode)
  • the operation selects the new primary

  • This means that we must, at some point, respond to errors relating to the network or a disabled node by forcing the driver to perform server selection. Simply retrying inserts/updates/reads will never do this (except possibly using "retryable writes" in the latest driver and mongodb version).

This is incorrect, see above. I think the rest of your questions are operating on this assumption, so I'll stop here. Retryable writes actually implement something close to the pseudocode I initially provided, but they also use a server-side transaction number to ensure at-most-once write semantics.

  • So we need to choose the right way of forcing server selection with existing drivers.
  • We could simply open a new connection with the original URI. This would work and would be upwardly compatible with mongodb+srv, however it may not be the most efficient option available.
  • Specifically, querying the topology and healing the connection in some way might be a bit more efficient. However, this layer is "in flux" as you earlier mentioned.
  • So the most backwards and forwards compatible approach is to try opening new connections in these scenarios, and then retrying the operations.
  • We can then impose our own time limits on how much we retry and so forth.

boutell commented 6 years ago

Ah, that's a huge help, thank you. So some developers think the driver does more than it does (the autoreconnect misunderstanding), and others think it does less than it does (:

So a reasonable strategy would be:

  • Don't use autoreconnect.
  • If an individual request fails, and the error code smells like it's network-y or broken-node-y, simply try that request again (up to some limit of our choosing).

We can do that.
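
For instance, a minimal sketch of that strategy (bounded retries, applied only when an error looks like a network or node failure; the classifier and the limits below are placeholders we would tune):

// Sketch of the strategy above, with autoReconnect turned off: retry an
// operation a bounded number of times, but only for errors that look like
// network or node failures.
function looksRetryable(err) {
  // Placeholder classification: a real version would also check whatever
  // server error codes we decide count as "node is ill or unavailable".
  return Boolean(err) && err.name === 'MongoNetworkError';
}

function withRetries(operation, callback, attempts = 5, delayMs = 500) {
  operation((err, result) => {
    if (!err) {
      return callback(null, result);
    }
    if (!looksRetryable(err) || attempts <= 1) {
      // Application errors (duplicate key, oversize document, etc.) and
      // exhausted retries go straight back to the caller.
      return callback(err);
    }
    // Calling the operation again makes the driver perform server selection
    // again, so it can pick a healthy replica set member.
    setTimeout(() => withRetries(operation, callback, attempts - 1, delayMs), delayMs);
  });
}

// Usage: wrap the real driver call in a function that takes only a callback.
// withRetries((cb) => coll.findOne({ slug: 'home' }, cb), (err, doc) => { ... });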

Just to confirm, this means that as long as you're connected to a replica set, the driver is capable of eventually getting "back on the air" even if connection is completely lost to all of the nodes for a period of time? To the point where the TCP connection(s) close(s)?

That is, if we connect with this URI, the driver can figure it out and try other nodes for as long as it has to, and eventually even notice the original node is back:

mongodb://localhost:27018,localhost:27019,localhost:27020

(Or mongodb+srv, of course)

But with a single-server URI like this, we would have to use the autoreconnect option, or else reconnect ourselves:

mongodb://localhost:27017

I'm going to try some experiments to test my assumptions here.
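
Probably something along these lines as a first harness (a sketch only; the URI, replica set name, and collection are placeholders):

// Experiment sketch: connect with the full seed list, issue a read every
// second, and watch the output while individual mongod nodes are stopped
// and restarted.
const { MongoClient } = require('mongodb');

const uri = 'mongodb://localhost:27018,localhost:27019,localhost:27020/cool?replicaSet=rs0';

MongoClient.connect(uri, { useNewUrlParser: true }, (err, client) => {
  if (err) {
    throw err;
  }
  const coll = client.db('cool').collection('llamas');
  setInterval(() => {
    coll.findOne({}, (findErr, doc) => {
      // Log success or failure so we can see how the driver behaves while
      // nodes go down and come back.
      console.log(new Date().toISOString(), findErr ? findErr.message : 'ok');
    });
  }, 1000);
});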

Thanks!

mbroadst commented 6 years ago

Ah, that's a huge help, thank you. So some developers think the driver does more than it does (the autoreconnect misunderstanding), and others think it does less than it does (:

Indeed! Unfortunately, it's proven somewhat difficult to find time to properly document all of this in my short tenure here. My improved SDAM layer should also alleviate much of this, and will include design documentation - more on that later.

So a reasonable strategy would be:

  • Don't use autoreconnect.

IMHO, if your goal is resiliency then I would simply never use autoreconnect. There is a marginal case for its use with a single server connection, but only just so.

  • If an individual request fails, and the error code smells like it's network-y or broken-node-y, simply try that request again (up to some limit of our choosing).

Yep! But I want to make some things very clear: this implementation of retryability (all client side) is subject to errors for writes specifically. The design of retryable writes requires a server-side transaction number, which allows us to verify that a write was made to the oplog at most one time. If you implement retryability on the client side using the pseudo-code I provided above, you run the risk of having multiple writes reach the oplog.
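
If you do end up going the retryable-writes route instead, enabling it is just a connection option; a minimal sketch (assumes a MongoDB 3.6+ replica set and a driver version that supports retryWrites; the URI and names are placeholders):

// Sketch: opting into server-side retryable writes rather than client-side
// retries. Requires a replica set running MongoDB 3.6+ and a driver that
// supports the retryWrites option.
const { MongoClient } = require('mongodb');

const uri = 'mongodb://localhost:27018,localhost:27019,localhost:27020/cool?replicaSet=rs0&retryWrites=true';

MongoClient.connect(uri, { useNewUrlParser: true }, (err, client) => {
  if (err) {
    return console.error(err);
  }
  // Supported write operations are retried once by the driver after a
  // retryable error, with a server-side transaction number guaranteeing the
  // write is applied at most once.
  client.db('cool').collection('llamas').insertOne({ some: 'document' }, (insertErr, res) => {
    // handle result
  });
});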

We can do that.

Just to confirm, this means that as long as you're connected to a replica set, the driver is capable of eventually getting "back on the air" even if connection is completely lost to all of the nodes for a period of time? To the point where the TCP connection(s) close(s)?

Yes. If you have a spare afternoon (ha ha), you might want to peruse our Server Discovery and Monitoring specification (https://github.com/mongodb/specifications/blob/master/source/server-discovery-and-monitoring/server-discovery-and-monitoring.rst). The node driver presently implements most of this specification, and will maintain an active monitoring thread for each seed in a seedlist for the duration of your application. During this time, it will continuously update its internal knowledge of the topology, and use up-to-date information each time an operation is executed. This provides high availability, and because the isMaster responses from each node in the replica set contain knowledge about new and removed members, the driver's internal state keeps up even with added nodes that were not in the initial seed list.

That is, if we connect with this URI, the driver can figure it out and try other nodes for as long as it has to, and eventually even notice the original node is back:

mongodb://localhost:27018,localhost:27019,localhost:27020

(Or mongodb+srv, of course)

Unfortunately, mongos instances do not naturally monitor their topology so the plasticity described above does not apply to them. The initially provided seedlist, if they are mongos instances, will be the static list of monitored seeds in the driver for the duration of the application.

But with a single-server URI like this, we would have to use the autoreconnect option, or else reconnect ourselves:

mongodb://localhost:27017

A little more background on the work that is presently going into the driver. The mongo driver right now has a sort of broken concept of "connecting" to a topology (e.g. MongoClient.connect). What's really going on when you provide your connection string is that the driver is parsing it, finding individual seeds, handshaking and starting monitors for each of them, building an internal view of the topology and creating a connection pool for each known node. When you execute an operation, it first does server selection based on your read preference (by default this is primary), selects that server, then requests one connection from the pool associated with the server and sends the request along. The new SDAM layer actually allows us to drop MongoClient.connect completely:

const client = new MongoClient('mongodb://localhost:27017');
client.db('foo').collection('bar').insertOne({ some: 'document' }, (err, res) => {
  // do something
});

And this is a purer form of what we were talking about above - there is no real concept of "connecting". You are simply always asking to execute some operation against some preference of server, and it's up to the driver to figure it out for you.

Finally, I mentioned retryable writes above, but in the next two quarters we will be specifying and implementing retryable reads. This will boil down to the same thing you are likely to be implementing soon for Apostrophe (if that's the path you choose), in that it will retry a read operation (only starting them, not a getMore) if some sort of "retryable error" occurs (network blip, not master, etc).

Hope that helps

boutell commented 6 years ago

That helps a lot! Thank you. We'll discuss and explore.

We are not sharding so it sounds like the mongos caveat doesn't apply.

I take your point that it's possible the write could make it to the oplog twice (i.e. in some scenario be carried out twice) with a retry strategy that doesn't rely on your new retryable writes. Our client may be OK with using retryable writes and the required driver and server versions, given that the apostrophe-db-mongo-3-driver module has shipped.

boutell commented 5 years ago

We already fixed the retry-on-disconnect behavior, and mongo 3+ retryable writes are available via a mongodb URI, so this can be closed.