atomix / copycat

A novel implementation of the Raft consensus algorithm
http://atomix.io/copycat
Apache License 2.0

Unhandled `ConnectException` hangs client with large query payload #225

Open yusefnapora opened 8 years ago

yusefnapora commented 8 years ago

Hi, we've been struggling with intermittent connection issues when using copycat, and just managed to trim it down to a small reproducible case.

The problem manifests as an uncaught java.net.ConnectException that seems to cause the client to "hang", and prevents copycat's reconnection logic from kicking in.

I made a test project that triggers the issue by returning a large payload from a Query, although we've hit it intermittently with much smaller payloads. With a smaller payload, it seems somewhat dependent on how much load the CPU is under; higher load will hit the exception sooner.

The test project doesn't use a recovery strategy or listen for client state change events. In our real project, though, we never get a client state change event after the ConnectException hits: immediately after the exception the logs show a "connecting / connected" sequence, but nothing happens after that, and our attempts to retry the query always fail.
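For context, the test client boils down to something like the following Java sketch (the real test app is Scala; LargeQuery and the payload handling are illustrative, and the exact builder/connect signatures and package paths may differ slightly between copycat 1.0.x versions):

import io.atomix.catalyst.transport.Address;
import io.atomix.catalyst.transport.NettyTransport;
import io.atomix.copycat.Query;
import io.atomix.copycat.client.CopycatClient;

import java.util.Collections;

public class RunClientSketch {

  // Illustrative query type; the server's state machine answers it with a large
  // byte[] "block". The real project registers its types with the Catalyst
  // serializer; here we just fall back to Java serialization.
  public static class LargeQuery implements Query<byte[]>, java.io.Serializable {
  }

  public static void main(String[] args) {
    CopycatClient client = CopycatClient.builder()
        .withTransport(new NettyTransport())
        .build();

    // Register a session against the single test server.
    client.connect(Collections.singletonList(new Address("127.0.0.1", 45678))).join();

    // submit() returns a CompletableFuture; with a large enough response payload
    // this is where the ConnectException surfaces.
    byte[] block = client.submit(new LargeQuery()).join();
    System.out.println("received " + block.length + " bytes");
  }
}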

Here's some log output from the test client:

[info] Running io.mediachain.testapps.RunClient 127.0.0.1:45678
[copycat-client-io-1] INFO io.atomix.catalyst.transport.NettyClient - Connecting to /127.0.0.1:45678
[catalyst-event-loop-1] INFO io.atomix.catalyst.transport.NettyClient - Connected to /127.0.0.1:45678
[copycat-client-io-1] INFO io.atomix.copycat.client.session.ClientSession - Registered session 48
[copycat-client-io-1] INFO io.atomix.catalyst.transport.NettyClient - Connecting to /127.0.0.1:45678
[catalyst-event-loop-2] INFO io.atomix.catalyst.transport.NettyClient - Connected to /127.0.0.1:45678
[run-main-0] INFO io.mediachain.testapps.RunClient$ - received ping response from server: Pong()
[run-main-0] INFO io.mediachain.testapps.RunClient$ - got ConnectException: java.net.ConnectException: failed to connect
[copycat-client-io-1] INFO io.atomix.catalyst.transport.NettyClient - Connecting to /127.0.0.1:45678
[catalyst-event-loop-1] INFO io.atomix.catalyst.transport.NettyClient - Connected to /127.0.0.1:45678
[error] (run-main-0) java.net.ConnectException: failed to connect
java.net.ConnectException: failed to connect
        at io.atomix.copycat.client.util.ClientConnection.sendRequest(ClientConnection.java:126)
        at io.atomix.copycat.client.util.ClientConnection.lambda$handleResponse$12(ClientConnection.java:151)
        at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
        at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
        at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
        at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1962)
        at io.atomix.catalyst.transport.NettyConnection.lambda$null$16(NettyConnection.java:398)
        at io.atomix.catalyst.util.concurrent.Runnables.lambda$logFailure$17(Runnables.java:20)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

We've tried debugging in copycat's ClientConnection and catalyst's NettyConnection, but so far haven't been able to sort it out, and figured we should ask for help :)

kuujo commented 8 years ago

I saw you mentioned Query. Does this only happen with Query? Second question: does the exception bubble up to the CompletableFuture returned by submit?

The reason I ask these two questions is that, from Copycat's perspective, it's not necessarily incorrect for this exception to be raised and for a query future to be completed exceptionally by it. Indeed, completing query futures exceptionally is intended behavior of the client, for reasons I'll explain. If an exception is not bubbling up to the future but is instead being eaten, then that's a bug.

The reason clients' queries are allowed to fail on connection issues is that there are problems with attempting to resubmit a client's queries to the cluster. When commands are submitted, they can be resubmitted in the event of a failure, and the leader ensures that duplicate commands are dropped and out-of-sequence commands are reordered when they're applied to the state machine. But because queries can read from followers, and because clients can switch servers, it's much more difficult to maintain sequential consistency for queries if a client is allowed to transparently resubmit them. If Copycat resubmitted a failed query after switching followers, the new follower would have no way to determine where that query falls relative to other operations, since followers don't share that state with one another. Thus, if a query fails once, it is failed completely to force the user to resubmit it themselves, which creates a new operation. So, that's all to say that this exception is part of normal operation; the question is why it's blocking things.
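So the expected pattern on the client side is to treat a failed query future as terminal for that query and submit a fresh one yourself if you want to retry. Just as a sketch (not prescriptive; the retry policy and the ConnectException check are up to you):

import io.atomix.copycat.Query;
import io.atomix.copycat.client.CopycatClient;

import java.net.ConnectException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.function.Supplier;

public class QueryRetry {

  // Copycat completes a query's future exceptionally on connection failure and
  // will not resubmit it; the caller submits a fresh query if it wants to retry.
  public static <T> CompletableFuture<T> submitWithRetry(
      CopycatClient client, Supplier<? extends Query<T>> query, int attemptsLeft) {
    return client.submit(query.get())
        .handle((result, error) -> {
          if (error == null) {
            return CompletableFuture.completedFuture(result);
          }
          Throwable cause = error instanceof CompletionException ? error.getCause() : error;
          if (attemptsLeft > 0 && cause instanceof ConnectException) {
            // Submitting again creates a brand new operation; Copycat itself does
            // not try to preserve ordering for the failed query.
            return submitWithRetry(client, query, attemptsLeft - 1);
          }
          CompletableFuture<T> failed = new CompletableFuture<T>();
          failed.completeExceptionally(error);
          return failed;
        })
        .thenCompose(f -> f);
  }
}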

Fortunately, we recently found a bug that prevented some client-side state from being updated when a query was failed, and that bug is fixed in master. I'll be releasing it today. Have you tried this with the master branch?

yusefnapora commented 8 years ago

It does only seem to happen with Query, and the exception is bubbled up through the CompletableFuture. Thanks for the explanation; it makes sense that you'd want to fail queries and let the client sort it out.

Do you have any intuition about what might be causing the connection to fail in the first place? We can resubmit the failed queries, but our current design involves sending fairly large blocks of data in the query response, and that seems to always kill the connection. We may be hitting a limit somewhere in the netty transport, but I couldn't figure out where it was exactly.

I'll try against master and let you know how it shakes out :)

parkan commented 8 years ago

@kuujo to clarify, the query always fails in this way either when the returned block object exceeds a certain size (~64k), or when a smaller object is requested a specific number of times. After the cluster gets into this state all further requests fail, so manual retries don't seem to help and the entire connection pool is DOA (from the client's point of view)

yusefnapora commented 8 years ago

I just tried with copycat master, and the issue is improved. In our real project we've got our own retry logic in the client, and it was previously failing after an initial failed query. Now it will succeed on the retry after an initial failure, provided the block size isn't too big.

The connection issue is still there, though; once the block gets big enough all the queries fail. Progress though!

Thanks :)

kuujo commented 8 years ago

Cool. Great to see that it's not blocking the client from progressing. From here I think the query failures feel more like a tuning issue than anything else.

A couple things to note:

What ClientConnection does is attempt to connect to all servers in the cluster, and it only fails a request once requests to all servers have failed. If the operation is a command, it will be retried depending on the exception; if it's a query, the query will be failed. The static request timeout can cause this, so I think that needs to be resolved.

But TBH this just seems to be an issue with the frame length, which is way too small right now. It should be more like 64 MB, not 64 KB. I'll commit a fix for the frame length and update the Catalyst dependency before release.

parkan commented 8 years ago

Thank you @kuujo, that sounds like it could be the fix. I looked into message size limits for copycat but didn't see any obvious hard ceiling. We did run into a TooLongFrameException under slightly different circumstances; it's possible this happens elsewhere in the stack (due to headers or something like that) and the informative exception is swallowed or rewritten into an unhelpful channel close.

One thing to note is that we do eventually see this problem after making potentially thousands of requests with payloads much smaller than 64k, which suggests a leak or a buffer filling up somewhere. We'll try to repro that case against the new code and report back.

kuujo commented 8 years ago

Catalyst is updated and the Copycat snapshot has been pushed. You can now configure the frame size and request timeout via the NettyTransport builder. Once we work through this issue and I have a chance to do some more testing with those Catalyst changes I'll push new minor releases.
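For reference, using it looks roughly like this. I'm writing the builder method names from memory, so double-check them against the NettyTransport builder in the Catalyst snapshot (and note that after the OSGi repackaging NettyTransport lives under io.atomix.catalyst.transport.netty):

import io.atomix.catalyst.transport.netty.NettyTransport;
import io.atomix.copycat.client.CopycatClient;

public class TransportConfig {

  // Sketch only: withMaxFrameSize/withRequestTimeout are the new options referred
  // to above; verify the exact method names against the snapshot you pull in.
  public static CopycatClient buildClient() {
    NettyTransport transport = NettyTransport.builder()
        .withMaxFrameSize(64 * 1024 * 1024) // allow large query responses (64 MB)
        .withRequestTimeout(5000)           // request timeout in milliseconds
        .build();

    return CopycatClient.builder()
        .withTransport(transport)
        .build();
  }
}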

If you were using the last Copycat release you might have some compilation issues, since we had to move the packaging around in Catalyst for OSGi bundles.

vyzo commented 8 years ago

Yeah, we noticed the compilation problems: copycat 1.0.1 doesn't compile with catalyst 1.1.0, so we were using catalyst 1.0.8.

yusefnapora commented 8 years ago

@kuujo thanks so much for your help with this. Installing catalyst and copycat from master seems to have resolved our issues. Looking forward to the new releases 😄 Excellent work on copycat, btw.

kuujo commented 8 years ago

Great! Will push it momentito :-)


kuujo commented 8 years ago

Sorry, all the OSGi compatibility and related compilation issues will be resolved today. Catalyst 1.1 and Copycat 1.1 will be compatible with each other.
