XiaoMi / pegasus-java-client

Has been migrated to https://github.com/apache/incubator-pegasus/tree/master/java-client
Apache License 2.0
31 stars 25 forks source link

bug: client can't recover from a replica-server failure. #53

Open neverchanje opened 5 years ago

neverchanje commented 5 years ago
2019-09-11 17:12:48,659 ERROR [nioEventLoopGroup-4-1] [com.xiaomi.message.mixin.fsm.ruleset.SaveAckRuleSet$3$1.operationComplete(SaveAckRuleSet.java:169)] - SaveB2CAckInfo fail, appId=2882303761517479657 msgId=sms-1c33f956-7dad-4327-8986-aa400cb4a8b2, ex:
com.xiaomi.infra.pegasus.client.PException: com.xiaomi.infra.pegasus.rpc.ReplicationException: ERR_TIMEOUT
    at com.xiaomi.infra.pegasus.client.PegasusTable$8.onCompletion(PegasusTable.java:376)
    at com.xiaomi.infra.pegasus.rpc.async.ClientRequestRound.thisRoundCompletion(ClientRequestRound.java:51)
    at com.xiaomi.infra.pegasus.rpc.async.TableHandler.tryDelayCall(TableHandler.java:314)
    at com.xiaomi.infra.pegasus.rpc.async.TableHandler.onRpcReply(TableHandler.java:295)
    at com.xiaomi.infra.pegasus.rpc.async.TableHandler$3.run(TableHandler.java:326)
    at com.xiaomi.infra.pegasus.rpc.async.ReplicaSession.tryNotifyWithSequenceID(ReplicaSession.java:226)
    at com.xiaomi.infra.pegasus.rpc.async.ReplicaSession.access$300(ReplicaSession.java:32)
    at com.xiaomi.infra.pegasus.rpc.async.ReplicaSession$4.run(ReplicaSession.java:270)
    at io.netty.util.concurrent.PromiseTask$RunnableAdapter.call(PromiseTask.java:38)
    at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:123)
    at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:380)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
    at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
    at java.lang.Thread.run(Thread.java:748)
Caused by: com.xiaomi.infra.pegasus.rpc.ReplicationException: ERR_TIMEOUT
    ... 15 more

Client Version

1.11.5-thrift-0.11.0-inlined-release

neverchanje commented 5 years ago
2019-09-11 17:01:18,424 WARN ReplicaSession.tryNotifyWithSequenceID: actively close the session because it's not responding for 10 seconds
...
2019-09-11 17:12:39,607 WARN ReplicaSession.tryNotifyWithSequenceID: actively close the session because it's not responding for 10 seconds

Bugfix https://github.com/XiaoMi/pegasus-java-client/pull/32 was working in the situation, but the session didn't close. The last ERR_SESSION_RESET is at 17:01.

2019-09-11 17:01:06,910 WARN TableHandler.onRpcReply: replica server(rpc_address(10.38.161.207:32801)) doesn't serve gpid(gpid(13.3)), operator(com.xiaomi.infra.pegasus.operator.rrdb_multi_put_operator@41e12d6f), try(3), error_code(ERR_SESSION_RESET), need query meta

The root cause is that the connection was not even established when it tried to close the session.

2019-09-11 17:12:19,575 WARN ReplicaSession.tryNotifyWithSequenceID: actively close the session because it's not responding for 10 seconds
2019-09-11 17:12:19,575 INFO ReplicaSession.closeSession: channel rpc_address(10.38.161.207:32801) not connected, skip the close

The client repeatedly tried to reconnect to this server, but it didn't succeed.

2019-09-11 17:01:05,830 WARN ReplicaSession$2.operationComplete(ReplicaSession.java:153) - rpc_address(10.38.161.207:32801): try to connect to target failed
...
2019-09-11 17:01:06,909 WARN ReplicaSession$2.operationComplete(ReplicaSession.java:153) - rpc_address(10.38.161.207:32801): try to connect to target failed