apache / incubator-pegasus

Apache Pegasus - A horizontally scalable, strongly consistent and high-performance key-value store
https://pegasus.apache.org/
Apache License 2.0
1.99k stars 312 forks source link

zk c client 3.7 does not run well with zk server 3.4 #1092

Open foreverneverer opened 2 years ago

foreverneverer commented 2 years ago

Bug Report

https://github.com/XiaoMi/rdsn/pull/979 upgrade zk c client to 3.7(from 3.4.10) for kerberos, but it sometime can't access zk server 3.4.5

In my test, when I upgrade the meta to latest pegasus server, the meta error log:

E2022-07-27 11:41:16.67 (1658893276067532317 139122)   meta.default0.00001f3100010001: meta_service.cpp:463:start(): initialize server state from remote storage failed, err = ERR_INCONSISTENT_STATE, retry ...
F2022-07-27 11:41:16.69 (1658893276069053850 139122)   meta.default0.00001f3100010001: server_state.cpp:709:sync_apps_from_remote_storage(): assertion expression: ERR_OK == err
F2022-07-27 11:41:16.69 (1658893276069061403 139122)   meta.default0.00001f3100010001: server_state.cpp:709:sync_apps_from_remote_storage(): can't handle this error (ERR_TIMEOUT)

ERR_INCONSISTENT_STATE is from zk error ZRUNTIMEINCONSISTENCY, see the zk error log:

2022-07-28 16:25:15,412:63056(0x7f2f8d699700):ZOO_INFO@log_env@1250: Client environment:zookeeper.version=zookeeper C client 3.7.0
2022-07-28 16:25:15,412:63056(0x7f2f8d699700):ZOO_INFO@log_env@1254: Client environment:host.name=host
2022-07-28 16:25:15,412:63056(0x7f2f8d699700):ZOO_INFO@log_env@1261: Client environment:os.name=Linux
2022-07-28 16:25:15,412:63056(0x7f2f8d699700):ZOO_INFO@log_env@1262: Client environment:os.arch=3.18.6-2.el7.centos.x86_64
2022-07-28 16:25:15,412:63056(0x7f2f8d699700):ZOO_INFO@log_env@1263: Client environment:os.version=#1 SMP Mon Oct 24 13:01:33 CST 2016
2022-07-28 16:25:15,412:63056(0x7f2f8d699700):ZOO_INFO@log_env@1271: Client environment:user.name=(null)
2022-07-28 16:25:15,412:63056(0x7f2f8d699700):ZOO_INFO@log_env@1279: Client environment:user.home=/home/work
2022-07-28 16:25:15,412:63056(0x7f2f8d699700):ZOO_INFO@log_env@1291: Client environment:user.dir=/home/work/app/pegasus/tjwqtst-msg/meta
2022-07-28 16:25:15,412:63056(0x7f2f8d699700):ZOO_INFO@zookeeper_init_internal@1344: Initiating client connection, host=host:21000 sessionTimeout=10000 watcher=0x7f2fa97f1fc0 sessionId=0 sessionPasswd=<null> context=0x2082790 flags=0
2022-07-28 16:25:15,449:63056(0x7f2f7c652700):ZOO_INFO@check_events@2988: initiated connection to server IP:21000
2022-07-28 16:25:15,467:63056(0x7f2f7c652700):ZOO_INFO@finalize_session_establishment@2868: session establishment complete on server IP:21000, sessionId=0xff82392b354205db, negotiated timeout=10000
2022-07-28 16:25:47,814:63056(0x7f2f7c652700):ZOO_WARN@zookeeper_interest@2532: Exceeded deadline by 1021ms
2022-07-28 16:26:13,404:63056(0x7f2f7c652700):ZOO_WARN@zookeeper_interest@2532: Exceeded deadline by 1021ms
2022-07-28 16:26:29,014:63056(0x7f2f7c652700):ZOO_ERROR@resolve_hosts@947: getaddrinfo: Name or service not known 

In server side, zk find some connection was closed by client:

2022-07-28,16:26:07,920 INFO org.apache.zookeeper.server.NIOServerCnxnFactory:[myid:0] Accepted socket connection 
ip:port
2022-07-28,16:26:07,924 INFO org.apache.zookeeper.server.ZooKeeperserver:[myid:0] Client attempting to renew session 0xff82392b30fa062d at ip:port
2022-07-28,16:26:07,924 INFO org.apache.zookeeper.server.quorum.Learner:[myid:@] Revalidating client: 0xff82392b30fa062d
2022-07-28,16:26:07,924 INFO org.apache.zookeeper.server.ZooKeeperServer:[myid:0] Established session 0xff82392b3ofa062d with negotiated timeout 10000 for client ip:port
2022-07-28,16:26:15,675 WARN org.apache.zookeeper.server.NIOServercnxn:[myid:0] caught end of stream exception
EndofstreamException: Unable to read additional data from client sessionid oxff82392b30fa062d, likely client has closed socket 
               at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServercnxn.java:224))
               at org.apache.zookeeper.server.NIoServerCnxnFactory.run(NIOServercnxnFactory.java:213)
               at java.lang.Thread.run(Thread.java:748)
2022-07-28,16:26:15,676 INFO org.apache.zookeeper.server.NIOServercnxn:[myid:0] Closed socket connection for client
ip:port which had session id 0xff82392b30fa062d

The above bug does not always occur, I only find the bug in one cluster of our company, The following resolution will be ok:

Related jira:

https://issues.apache.org/jira/browse/ZOOKEEPER-4603

acelyc111 commented 2 years ago

@foreverneverer have you checked the zookeeper server side log?

acelyc111 commented 2 years ago

meta server log and zk client log are not at the same time, I'm not sure if they are related. Could you paste the 3 parties logs and make sure they are at the same time?

acelyc111 commented 2 years ago

This issuse seems related to this problem, https://issues.apache.org/jira/browse/ZOOKEEPER-1998. We can try to set a delay_ms by API zoo_set_servers_resolution_delay, or add a .patch file for zk thirdparty lib to revert this commit.

foreverneverer commented 2 years ago

meta server log and zk client log are not at the same time, I'm not sure if they are related. Could you paste the 3 parties logs and make sure they are at the same time?

The log was copied from different time, but the log detail is same, I don't re-produce the log

foreverneverer commented 2 years ago

@foreverneverer have you checked the zookeeper server side log?

added

foreverneverer commented 2 years ago

This issuse seems related to this problem, https://issues.apache.org/jira/browse/ZOOKEEPER-1998. We can try to set a delay_ms by API zoo_set_servers_resolution_delay, or add a .patch file for zk thirdparty lib to revert this commit.

The issue seem say getaddrinfo is unconditionally in old version, and fixed in 3.7, that's also mean, it has existed in old version but not added in new version?

acelyc111 commented 2 years ago

This issuse seems related to this problem, https://issues.apache.org/jira/browse/ZOOKEEPER-1998. We can try to set a delay_ms by API zoo_set_servers_resolution_delay, or add a .patch file for zk thirdparty lib to revert this commit.

The issue seem say getaddrinfo is unconditionally in old version, and fixed in 3.7, that's also mean, it has existed in old version but not added in new version?

This issue exists in all zk client versions, this patch just add a new API (i.e. zoo_set_servers_resolution_delay, see https://github.com/apache/zookeeper/pull/1068/files#diff-d7c3594993d19853c9d2b312b783be085bd9f02b3f5fe8a573e7c94e0ec22734R725) to set the delay time, I'm not sure if it can resolve the problem you faced, but you can have a try.

foreverneverer commented 2 years ago

This issue exists in all zk client versions, this patch just add a new API (i.e. zoo_set_servers_resolution_delay, see https://github.com/apache/zookeeper/pull/1068/files#diff-d7c3594993d19853c9d2b312b783be085bd9f02b3f5fe8a573e7c94e0ec22734R725) to set the delay time, I'm not sure if it can resolve the problem you faced, but you can have a try.

OK, how add .patch file? I can try it

GehaFearless commented 2 years ago

This issue exists in all zk client versions, this patch just add a new API (i.e. zoo_set_servers_resolution_delay, see https://github.com/apache/zookeeper/pull/1068/files#diff-d7c3594993d19853c9d2b312b783be085bd9f02b3f5fe8a573e7c94e0ec22734R725) to set the delay time, I'm not sure if it can resolve the problem you faced, but you can have a try.

OK, how add .patch file? I can try it

zk c client 3.7.0 already have this patch . I add config on https://github.com/apache/incubator-pegasus/pull/1100/files, try it

foreverneverer commented 2 years ago

This issue exists in all zk client versions, this patch just add a new API (i.e. zoo_set_servers_resolution_delay, see https://github.com/apache/zookeeper/pull/1068/files#diff-d7c3594993d19853c9d2b312b783be085bd9f02b3f5fe8a573e7c94e0ec22734R725) to set the delay time, I'm not sure if it can resolve the problem you faced, but you can have a try.

OK, how add .patch file? I can try it

zk c client 3.7.0 already have this patch . I add config on https://github.com/apache/incubator-pegasus/pull/1100/files, try it

Great!

But I have test it and it doesn't work well. the log:

Pegasus:

E2022-08-09 16:16:14.503 (1660032974503742256 148273)   meta.THREAD_POOL_META_SERVER0.020200020000034e: server_state.cpp:644:operator()(): get partition node failed, reason(ERR_TIMEOUT)
D2022-08-09 16:16:14.503 (1660032974503750762 148294)   meta.io-thrd.148294: zookeeper_session.cpp:323:global_watcher(): global watcher, type(session event), state(expired_session_state)
E2022-08-09 16:16:14.503 (1660032974503777787 148294)   meta.io-thrd.148294: distributed_lock_service_zookeeper.cpp:282:on_zoo_session_evt(): get zoo state: expired_session_state, which means the session is expired
E2022-08-09 16:16:14.503 (1660032974503805295 148274)   meta.THREAD_POOL_META_SERVER1.0202000000000315: server_state.cpp:644:operator()(): get partition node failed, reason(ERR_TIMEOUT)
E2022-08-09 16:16:14.503 (1660032974503820448 148274)   meta.THREAD_POOL_META_SERVER1.0202000200000350: server_state.cpp:644:operator()(): get partition node failed, reason(ERR_TIMEOUT)
E2022-08-09 16:16:14.503 (1660032974503828983 148274)   meta.THREAD_POOL_META_SERVER1.0202000300000337: server_state.cpp:644:operator()(): get partition node failed, reason(ERR_TIMEOUT)
E2022-08-09 16:16:14.503 (1660032974503836417 148274)   meta.THREAD_POOL_META_SERVER1.0202000300000338: server_state.cpp:644:operator()(): get partition node failed, reason(ERR_TIMEOUT)
E2022-08-09 16:16:14.503 (1660032974503849575 148289)   meta.     fd0.0201000000000005: meta_server_failure_detector.cpp:162:operator()(): leader lock expired callback: err(ERR_EXPIRED), owner(ip:port), version(174)
E2022-08-09 16:16:14.503 (1660032974503877813 148274)   meta.THREAD_POOL_META_SERVER1.0202000300000339: server_state.cpp:644:operator()(): get partition node failed, reason(ERR_TIMEOUT)

Zookeeper:

2022-08-09 16:15:05,844:148195(0x7faadc489700):ZOO_INFO@log_env@1250: Client environment:zookeeper.version=zookeeper C client 3.7.0
2022-08-09 16:15:05,844:148195(0x7faadc489700):ZOO_INFO@log_env@1254: Client environment:host.name=host
2022-08-09 16:15:05,844:148195(0x7faadc489700):ZOO_INFO@log_env@1261: Client environment:os.name=Linux
2022-08-09 16:15:05,844:148195(0x7faadc489700):ZOO_INFO@log_env@1262: Client environment:os.arch=3.18.6-2.el7.centos.x86_64
2022-08-09 16:15:05,844:148195(0x7faadc489700):ZOO_INFO@log_env@1263: Client environment:os.version=#1 SMP Mon Oct 24 13:01:33 CST 2016
2022-08-09 16:15:05,845:148195(0x7faadc489700):ZOO_INFO@log_env@1271: Client environment:user.name=(null)
2022-08-09 16:15:05,845:148195(0x7faadc489700):ZOO_INFO@log_env@1279: Client environment:user.home=/home/work
2022-08-09 16:15:05,845:148195(0x7faadc489700):ZOO_INFO@log_env@1291: Client environment:user.dir=/home/work/app/pegasus/tjwqtst-msg/meta
2022-08-09 16:15:05,845:148195(0x7faadc489700):ZOO_INFO@zookeeper_init_internal@1344: Initiating client connection, host=host:port sessionTimeout=10000 watcher=0x7faaf85e4460 sessionId=0 sessionPasswd=<null> context=0x3ef8790 flags=0
2022-08-09 16:15:05,852:148195(0x7faacb442700):ZOO_INFO@check_events@2988: initiated connection to server ip:port
2022-08-09 16:15:05,858:148195(0x7faacb442700):ZOO_INFO@finalize_session_establishment@2868: session establishment complete on server ip:port, sessionId=0xff82392b30fa55e5, negotiated timeout=10000
2022-08-09 16:15:44,869:148195(0x7faacb442700):ZOO_WARN@zookeeper_interest@2532: Exceeded deadline by 749ms
2022-08-09 16:15:48,962:148195(0x7faacb442700):ZOO_ERROR@resolve_hosts@947: getaddrinfo: Name or service not known

2022-08-09 16:15:48,962:148195(0x7faacb442700):ZOO_WARN@zookeeper_interest@2532: Exceeded deadline by 684ms
2022-08-09 16:16:14,488:148195(0x7faacb442700):ZOO_ERROR@handle_socket_error_msg@3007: Socket ip:port zk retcode=-4, errno=112(Host is down): failed while receiving a server response
2022-08-09 16:16:14,496:148195(0x7faacb442700):ZOO_INFO@check_events@2988: initiated connection to server ip:port
2022-08-09 16:16:14,501:148195(0x7faacb442700):ZOO_ERROR@handle_socket_error_msg@3043: Socket ip:port zk retcode=-112, errno=116(Stale file handle): sessionId=0xff82392b30fa55e5 has expired.
2022-08-09 16:16:15,594:149127(0x7ff5a74d4700):ZOO_INFO@log_env@1250: Client environment:zookeeper.version=zookeeper C client 3.7.0
2022-08-09 16:16:15,594:149127(0x7ff5a74d4700):ZOO_INFO@log_env@1254: Client environment:host.name=host
2022-08-09 16:16:15,594:149127(0x7ff5a74d4700):ZOO_INFO@log_env@1261: Client environment:os.name=Linux
2022-08-09 16:16:15,594:149127(0x7ff5a74d4700):ZOO_INFO@log_env@1262: Client environment:os.arch=3.18.6-2.el7.centos.x86_64
2022-08-09 16:16:15,594:149127(0x7ff5a74d4700):ZOO_INFO@log_env@1263: Client environment:os.version=#1 SMP Mon Oct 24 13:01:33 CST 2016
2022-08-09 16:16:15,594:149127(0x7ff5a74d4700):ZOO_INFO@log_env@1271: Client environment:user.name=(null)
2022-08-09 16:16:15,594:149127(0x7ff5a74d4700):ZOO_INFO@log_env@1279: Client environment:user.home=/home/work
2022-08-09 16:16:15,594:149127(0x7ff5a74d4700):ZOO_INFO@log_env@1291: Client environment:user.dir=/home/work/app/pegasus/tjwqtst-msg/meta
2022-08-09 16:16:15,594:149127(0x7ff5a74d4700):ZOO_INFO@zookeeper_init_internal@1344: Initiating client connection, host=host:port sessionTimeout=10000 watcher=0x7ff5c362f460 sessionId=0 sessionPasswd=<null> context=0x215a790 flags=0
2022-08-09 16:16:15,603:149127(0x7ff59648d700):ZOO_INFO@check_events@2988: initiated connection to server ip:port
2022-08-09 16:16:15,609:149127(0x7ff59648d700):ZOO_INFO@finalize_session_establishment@2868: session establishment complete on server ip:port, sessionId=0xff82392b3542553b, negotiated timeout=10000