coinbase / mongobetween

Cancel other requests if server is marked unknown #14

Closed mdehoog closed 4 years ago

mdehoog commented 4 years ago

During maintenance or failovers, there are sometimes extended periods of timeout errors from queries against servers that were already marked unknown by an earlier query. This happens because a number of queries have already started and there is no shared context through which to cancel them.

This PR introduces a shared context.Context for each server. The context is canceled whenever that server's description kind is marked unknown. The shared context is passed to checkoutConnection and roundTrip so that any in-flight calls are canceled for servers that switch to unknown.
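A minimal sketch of that idea, using an illustrative serverContext type rather than the PR's actual code: one shared context per server, replaced and canceled whenever the server is marked unknown, so that everything derived from it is aborted.

```go
package proxy

import (
	"context"
	"sync"
)

// serverContext (illustrative name) holds the shared context for one upstream server.
type serverContext struct {
	mu     sync.Mutex
	ctx    context.Context
	cancel context.CancelFunc
}

func newServerContext() *serverContext {
	ctx, cancel := context.WithCancel(context.Background())
	return &serverContext{ctx: ctx, cancel: cancel}
}

// current returns the context that in-flight checkoutConnection/roundTrip calls derive from.
func (s *serverContext) current() context.Context {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.ctx
}

// markUnknown cancels the shared context, aborting in-flight calls, and installs
// a fresh context for operations issued after the server recovers.
func (s *serverContext) markUnknown() {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.cancel()
	s.ctx, s.cancel = context.WithCancel(context.Background())
}
```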

It also adds a new goroutine for each roundTrip call which monitors the context cancellation and times out the connection using SetDeadline(time.Now()). This is necessary because WriteWireMessage and ReadWireMessage block on connection reads/writes.
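A sketch of that watcher, assuming nc is the connection's underlying net.Conn (which the PR reaches via the driver.Connection's nc field) and do wraps the wire-message write/read; again illustrative rather than the PR's exact code.

```go
package proxy

import (
	"context"
	"net"
	"time"
)

// roundTripWithCancel runs do() while watching ctx; if ctx is canceled mid-call,
// it expires the connection deadline so the blocked Read/Write returns immediately.
func roundTripWithCancel(ctx context.Context, nc net.Conn, do func() error) error {
	done := make(chan struct{})
	defer close(done)

	go func() {
		select {
		case <-ctx.Done():
			// Force any in-flight Read/Write on nc to fail with a timeout error.
			_ = nc.SetDeadline(time.Now())
		case <-done:
			// The round trip finished normally; stop watching.
		}
	}()

	return do() // e.g. write the request message, then read the reply
}
```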

~Perhaps this should be upstreamed into the driver?~

divjotarora commented 4 years ago

@mdehoog I've only taken a cursory look at the code changes so far, but I have a couple of thoughts about driver-specific things:

  1. The connection pooling specification doesn't require drivers to close checked-out connections when a pool is invalidated. Instead, all existing connections are marked invalid and are lazily pruned. Checked-out connections are pruned when they're returned to the pool after the operation is complete (see the sketch after this list).

  2. I've filed https://jira.mongodb.org/browse/GODRIVER-1662 to address the context cancellation semantics in the driver. We can probably go back and clean up some of the code in this PR once that's been done because you won't need to reach into the driver.Connection type to get the nc field and manually set its deadline.
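For point 1, a minimal sketch (hypothetical types, not the driver's pool) of what lazy pruning looks like: clearing the pool only bumps a generation counter, and a checked-out connection is discarded when it comes back with a stale generation.

```go
package proxy

import "sync"

type pooledConn struct {
	generation uint64
	close      func()
}

type lazyPool struct {
	mu         sync.Mutex
	generation uint64
	idle       []*pooledConn
}

// clear invalidates the pool without touching checked-out connections.
func (p *lazyPool) clear() {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.generation++
	for _, c := range p.idle {
		c.close()
	}
	p.idle = nil
}

// checkIn prunes connections created before the last clear instead of reusing them.
func (p *lazyPool) checkIn(c *pooledConn) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if c.generation != p.generation {
		c.close()
		return
	}
	p.idle = append(p.idle, c)
}
```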

-- Divjot

mdehoog commented 4 years ago

I'd love suggestions for fixing the failure mode outlined under the Before heading below, other than the changes made in this PR, which seem to exacerbate the issue.

Before

This PR was designed to address a failure mode that we see during failovers, where a subset of hosts would not detect that the replica set member / mongos was down and would continue to connect, seeing connection refused errors for up to 10 seconds until the monitor thread updated the server to unknown (the x-axis ticks on these charts are 1 second apart):

[Screenshot: 2020-06-23 11.21.41 AM]

Note that these errors were limited to a single application host; the other hosts experienced a single error and marked the server as unknown immediately, as expected:

[Screenshot: 2020-06-23 11.20.41 AM]
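For reference, that 10-second window matches the Go driver's default heartbeat interval (heartbeatFrequencyMS). A plain-driver sketch of shortening it, purely for illustration and separate from anything mongobetween does internally:

```go
package main

import (
	"context"
	"log"
	"time"

	"go.mongodb.org/mongo-driver/mongo"
	"go.mongodb.org/mongo-driver/mongo/options"
)

func main() {
	// Default heartbeat interval is 10s; a shorter interval makes the monitor
	// notice a downed server sooner, at the cost of more heartbeat traffic.
	opts := options.Client().
		ApplyURI("mongodb://localhost:27017").
		SetHeartbeatInterval(2 * time.Second)

	client, err := mongo.Connect(context.Background(), opts)
	if err != nil {
		log.Fatal(err)
	}
	defer client.Disconnect(context.Background())
}
```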

After

Applying this PR's changes introduced a different failure mode: instead we saw extended periods of `dial tcp: operation was canceled` and `auth error: ... context canceled` errors:

[Screenshot: 2020-06-23 11.52.58 AM]
[Screenshot: 2020-06-23 11.53.09 AM]

It also increased the count of errors per instance:

[Screenshot: 2020-06-23 11.53.24 AM]

Because of this, we'll explore other options for fixing the failure mode described above.

divjotarora commented 4 years ago

@mdehoog I believe this issue is caused by https://jira.mongodb.org/browse/GODRIVER-1642, which has already been fixed and will be released in 1.3.5. The issue is that the server description constructed at https://github.com/mongodb/mongo-go-driver/blob/v1.3.4/x/mongo/driver/topology/server.go#L253 doesn't have an address set, so it's ignored by the topology.apply function, which prevents the topology from setting the server to Unknown. GODRIVER-1642 prevents that by introducing new constructors for description.Server that require an address to be passed.
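An illustrative sketch of that mechanism, with made-up types rather than the driver's actual code: an error-derived description with an empty address never matches a known server, so an address-keyed apply step silently drops the update and the server is never recorded as Unknown.

```go
package main

import "fmt"

type serverDescription struct {
	Addr string // empty in the buggy error path described above
	Kind string
}

type topologyState struct {
	servers map[string]serverDescription
}

// apply ignores descriptions whose address doesn't match a known server,
// mirroring the behaviour described in the comment above.
func (t *topologyState) apply(desc serverDescription) {
	if _, ok := t.servers[desc.Addr]; !ok {
		return
	}
	t.servers[desc.Addr] = desc
}

func main() {
	t := &topologyState{servers: map[string]serverDescription{
		"localhost:27017": {Addr: "localhost:27017", Kind: "Standalone"},
	}}
	// Pre-GODRIVER-1642, the error path could produce a description without an address.
	t.apply(serverDescription{Kind: "Unknown"})
	fmt.Println(t.servers["localhost:27017"].Kind) // still "Standalone", not "Unknown"
}
```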

Note that 1.3.5 will also include https://jira.mongodb.org/browse/GODRIVER-1658, so this error handling will be called from connection.connect rather than from Server.Connection. There should be no difference for connections created by an application thread while executing an operation, but it is important for connections created in the background such as the ones created by the maintenance routine for minPoolSize. My testing was on the tip of the release/1.3 branch, which doesn't yet include GODRIVER-1658, but I can test again once it's merged.

In my local testing, I was able to get mongobetween to log that the driver had marked the server Unknown when testing against the tip of release/1.3, so it seems like GODRIVER-1642 will resolve this issue. Because I was testing against a standalone, all subsequent requests resulted in a server selection error (no suitable servers) rather than just connection refused, as expected.

-- Divjot

divjotarora commented 4 years ago

I'm now testing on a sharded cluster with two mongos nodes. For testing, I commented out the Accept and Close logs in mongobetween because they were spamming my console when there was connection churn. Testing against v1.3.4, I see multiple sets of these logs:

{"level":"error","ts":1592955514.7441244,"caller":"proxy/connection.go:53","message":"Error handling message","remote_address":"127.0.0.1:55884","error":"connection() : dial tcp 127.0.0.1:27017: connect: connection refused"}
{"level":"error","ts":1592955517.7456164,"caller":"proxy/connection.go:96","message":"Round trip error","remote_address":"127.0.0.1:55882","error":"connection() : dial tcp 127.0.0.1:27017: connect: connection refused","cursor_id":0,"op_code":2013}

Testing against the tip of release/1.3, I see:

{"level":"error","ts":1592955514.7441244,"caller":"proxy/connection.go:53","message":"Error handling message","remote_address":"127.0.0.1:55884","error":"connection() : dial tcp 127.0.0.1:27017: connect: connection refused"}
{"level":"error","ts":1592955517.7456164,"caller":"proxy/connection.go:96","message":"Round trip error","remote_address":"127.0.0.1:55882","error":"connection() : dial tcp 127.0.0.1:27017: connect: connection refused","cursor_id":0,"op_code":2013}
{"level":"info","ts":1592956017.0037165,"caller":"mongo/topology_monitor.go:29","message":"Topology changed","old_kind":"Sharded","new_kind":"Sharded","old_address_0":"localhost:27017","old_kind_0":"Mongos","old_address_1":"localhost:27018","old_kind_1":"Mongos","new_address_0":"localhost:27017","new_kind_0":"Unknown","new_address_1":"localhost:27018","new_kind_1":"Mongos"}

The connection refused error only happens once and the driver immediately discovers the new topology. From that point on, the requests sent by my testing script succeed and generate no logs in mongobetween.

divjotarora commented 4 years ago

I think I've successfully created a repro test for this that does all of the proxy orchestration. It fails against driver v1.3.4 and passes against the tip of release/1.3: https://github.com/divjotarora/mongobetween/commit/a26a4c1d80c5b2b77322c0d8762275d47b28c7dd.

This assumes a sharded cluster is running with mongos nodes at localhost:27017 and localhost:27018. Likely not something that can run in CI, but running locally is enough to repro the issue. I can also share the steps I was originally using to manually reproduce if anyone's interested in those instead.

mdehoog commented 4 years ago

Nice! Thank you for the excellent in-depth analysis, and for the repro. We may look into the complexity of running a sharded cluster in CI so we can upstream the test.

Closing this PR for now; we'll update the Go driver once the fix is released.