Open neverchanje opened 5 years ago
2019-09-11 17:01:18,424 WARN ReplicaSession.tryNotifyWithSequenceID: actively close the session because it's not responding for 10 seconds
...
2019-09-11 17:12:39,607 WARN ReplicaSession.tryNotifyWithSequenceID: actively close the session because it's not responding for 10 seconds
The bugfix in https://github.com/XiaoMi/pegasus-java-client/pull/32 took effect in this situation (the client kept trying to actively close the unresponsive session), but the session was never closed. The last ERR_SESSION_RESET was logged at 17:01.
2019-09-11 17:01:06,910 WARN TableHandler.onRpcReply: replica server(rpc_address(10.38.161.207:32801)) doesn't serve gpid(gpid(13.3)), operator(com.xiaomi.infra.pegasus.operator.rrdb_multi_put_operator@41e12d6f), try(3), error_code(ERR_SESSION_RESET), need query meta
The root cause is that the connection had not even been established when the client tried to close the session, so the close was skipped.
2019-09-11 17:12:19,575 WARN ReplicaSession.tryNotifyWithSequenceID: actively close the session because it's not responding for 10 seconds
2019-09-11 17:12:19,575 INFO ReplicaSession.closeSession: channel rpc_address(10.38.161.207:32801) not connected, skip the close
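One way to read the two log lines above is sketched below as a toy model. This is not the real ReplicaSession code: the class `ToySession`, its states, and the methods `closeIfConnected` and `closeAlways` are hypothetical names made up for illustration, and the idea that pending requests stay queued when the close is skipped is an assumption drawn from the logs, not something confirmed against the client source.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model only, NOT the real ReplicaSession. All names are hypothetical. */
class ToySession {
    enum State { DISCONNECTED, CONNECTING, CONNECTED }

    State state = State.DISCONNECTED;
    // Requests still waiting for a reply, identified by sequence ID.
    final List<Integer> pending = new ArrayList<>();
    // Sequence IDs that were failed back to the caller with a "session reset" error.
    final List<Integer> resetNotified = new ArrayList<>();

    /** Queue a request; start a connect attempt if there is no connection yet. */
    void send(int seqId) {
        if (state == State.DISCONNECTED) {
            state = State.CONNECTING; // in this model the attempt never completes
        }
        pending.add(seqId);
    }

    /** Mirrors the logged behavior: "not connected, skip the close". */
    void closeIfConnected() {
        if (state != State.CONNECTED) {
            return; // nothing is reset, pending requests stay queued
        }
        reset();
    }

    /** One possible alternative (an assumption): reset whatever the state is. */
    void closeAlways() {
        reset();
    }

    private void reset() {
        resetNotified.addAll(pending); // callers now see the reset and can react
        pending.clear();
        state = State.DISCONNECTED;
    }
}
```

In this model, calling `send(1)` followed by `closeIfConnected()` leaves the session in `CONNECTING` with request 1 still pending and no reset reported, which matches the pattern of repeated "actively close" warnings with no new ERR_SESSION_RESET. Calling `closeAlways()` instead fails request 1 back to the caller and returns the session to `DISCONNECTED`.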
The client repeatedly tried to reconnect to this server, but never succeeded.
2019-09-11 17:01:05,830 WARN ReplicaSession$2.operationComplete(ReplicaSession.java:153) - rpc_address(10.38.161.207:32801): try to connect to target failed
...
2019-09-11 17:01:06,909 WARN ReplicaSession$2.operationComplete(ReplicaSession.java:153) - rpc_address(10.38.161.207:32801): try to connect to target failed
2019/9/11 17:00. Our SRE stopped one replica-server instance in our staging environment to simulate the problem where the java-client can't recover.
2019/9/11 17:00. Some of our clients recovered right away when the replica-server restarted, but others couldn't reconnect and kept retrying with ERR_TIMEOUT errors.
Client Version
1.11.5-thrift-0.11.0-inlined-release