Closed: mdehoog closed this PR 4 years ago
@mdehoog I've only taken a cursory look at the code changes so far, but I have a couple of thoughts about driver-specific things:
The connection pooling specification doesn't require drivers to close checked-out connections when a pool is invalidated. Instead, all existing connections are marked invalid and are lazily pruned. Checked-out connections will be pruned when they're returned to the pool after the operation is complete.
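As a rough illustration of that lazy pruning (a minimal sketch with hypothetical names, not the driver's actual pool implementation): invalidating the pool bumps a generation counter, and a connection is only discarded when it is checked back in with a stale generation.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// conn records the pool generation it was created under.
type conn struct {
	generation uint64
}

type pool struct {
	generation atomic.Uint64 // current pool generation
}

// clear invalidates every existing connection without closing any of
// them; checked-out connections are untouched until they come back.
func (p *pool) clear() { p.generation.Add(1) }

// checkIn prunes a returned connection only if it predates the last clear.
func (p *pool) checkIn(c *conn) (pruned bool) {
	return c.generation < p.generation.Load()
}

func main() {
	p := &pool{}
	c := &conn{generation: p.generation.Load()} // checked out before the failure
	p.clear()                                   // pool invalidated mid-operation
	fmt.Println(p.checkIn(c))                   // stale connection pruned on return
}
```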
I've filed https://jira.mongodb.org/browse/GODRIVER-1662 to address the context cancellation semantics in the driver. We can probably go back and clean up some of the code in this PR once that's been done, because you won't need to reach into the `driver.Connection` type to get the `nc` field and manually set its deadline.
-- Divjot
Would love suggestions for fixing the failure mode outlined under the Before heading below, other than the changes made in this PR, which seem to exacerbate the issue.
This change was designed to address a failure mode we see during failovers, where a subset of hosts would not detect that the replica set member / mongos was down and would continue to connect, seeing `connection refused` errors for up to 10 seconds until the monitor thread updated the server to `Unknown` (x-axis ticks on these charts are 1 second apart):
Note that these errors were limited to a single application host; the other hosts experienced a single error and marked the server as `Unknown` immediately, as expected:
Applying this PR's change introduced a different failure mode: instead, we saw extended periods of `dial tcp: operation was canceled` and `auth error: ... context canceled` errors:
It also increased the count of errors per instance:
Because of this, we'll explore other options for fixing the failure mode described above.
@mdehoog I believe this issue is caused by https://jira.mongodb.org/browse/GODRIVER-1642, which has already been fixed and will be released in 1.3.5. The issue is that the server description constructed at https://github.com/mongodb/mongo-go-driver/blob/v1.3.4/x/mongo/driver/topology/server.go#L253 doesn't have an address set, so it's ignored by the `topology.apply` function, which prevents the topology from setting the server to `Unknown`. GODRIVER-1642 prevents that by introducing new constructors for `description.Server` that require an address to be passed.
Note that 1.3.5 will also include https://jira.mongodb.org/browse/GODRIVER-1658, so this error handling will be called from `connection.connect` rather than from `Server.Connection`. There should be no difference for connections created by an application thread while executing an operation, but it is important for connections created in the background, such as the ones created by the maintenance routine for `minPoolSize`. My testing was on the tip of the `release/1.3` branch, which doesn't yet include GODRIVER-1658, but I can test again once it's merged.
In my local testing against the tip of `release/1.3`, I was able to get mongobetween to log that the driver had marked the server `Unknown`, so it seems like GODRIVER-1642 will resolve this issue. Because I was testing against a standalone, all future requests resulted in a server selection error (rather than just `connection refused`) because there were no suitable servers, as expected.
-- Divjot
I'm now testing on a sharded cluster with two mongos nodes. For testing, I commented out the `Accept` and `Close` logs in mongobetween because they were spamming my console when there was connection churn. Testing against v1.3.4, I see multiple sets of these logs:
```
{"level":"error","ts":1592955514.7441244,"caller":"proxy/connection.go:53","message":"Error handling message","remote_address":"127.0.0.1:55884","error":"connection() : dial tcp 127.0.0.1:27017: connect: connection refused"}
{"level":"error","ts":1592955517.7456164,"caller":"proxy/connection.go:96","message":"Round trip error","remote_address":"127.0.0.1:55882","error":"connection() : dial tcp 127.0.0.1:27017: connect: connection refused","cursor_id":0,"op_code":2013}
```
Testing against the tip of `release/1.3`, I see:
```
{"level":"error","ts":1592955514.7441244,"caller":"proxy/connection.go:53","message":"Error handling message","remote_address":"127.0.0.1:55884","error":"connection() : dial tcp 127.0.0.1:27017: connect: connection refused"}
{"level":"error","ts":1592955517.7456164,"caller":"proxy/connection.go:96","message":"Round trip error","remote_address":"127.0.0.1:55882","error":"connection() : dial tcp 127.0.0.1:27017: connect: connection refused","cursor_id":0,"op_code":2013}
{"level":"info","ts":1592956017.0037165,"caller":"mongo/topology_monitor.go:29","message":"Topology changed","old_kind":"Sharded","new_kind":"Sharded","old_address_0":"localhost:27017","old_kind_0":"Mongos","old_address_1":"localhost:27018","old_kind_1":"Mongos","new_address_0":"localhost:27017","new_kind_0":"Unknown","new_address_1":"localhost:27018","new_kind_1":"Mongos"}
```
The `connection refused` error only happens once, and the driver immediately discovers the new topology. From that point on, the requests sent by my testing script succeed and generate no logs in mongobetween.
I think I've successfully created a repro test for this that does all of the proxy orchestration. It fails against driver v1.3.4 and passes against the tip of `release/1.3`: https://github.com/divjotarora/mongobetween/commit/a26a4c1d80c5b2b77322c0d8762275d47b28c7dd.
This assumes a sharded cluster is running with mongos nodes at `localhost:27017` and `localhost:27018`. It's likely not something that can run in CI, but running it locally is enough to repro the issue. I can also share the steps I was originally using to reproduce manually if anyone's interested in those instead.
Nice! Thank you for the excellent in-depth analysis, and for the repro. We may look into the complexity of running a sharded cluster in CI so we can upstream the test.
Closing this PR for now, will wait to update the Go driver once released.
During maintenance / failovers, there are sometimes extended periods of timeout errors from queries on servers that have been marked `Unknown` by a previous query. This is because a number of queries have already started, and there's no shared context.

This PR introduces a shared `context.Context` for each server. This context is canceled whenever that server's description kind is marked as `Unknown`. The shared context is passed to `checkoutConnection` and `roundTrip` so that any in-flight calls are canceled for servers that switch to `Unknown`.

It also adds a new goroutine for each `roundTrip` call which monitors for context cancelation and times out the connection using `SetDeadline(time.Now())`. This is needed because `WriteWireMessage` and `ReadWireMessage` block on connection reads/writes.

~Perhaps this should be upstreamed into the driver?~