Open foreverneverer opened 2 years ago
@foreverneverer have you checked the zookeeper server side log?
meta server log and zk client log are not at the same time, I'm not sure if they are related. Could you paste the 3 parties logs and make sure they are at the same time?
This issuse seems related to this problem, https://issues.apache.org/jira/browse/ZOOKEEPER-1998.
We can try to set a delay_ms by API zoo_set_servers_resolution_delay
, or add a .patch file for zk thirdparty lib to revert this commit.
meta server log and zk client log are not at the same time, I'm not sure if they are related. Could you paste the 3 parties logs and make sure they are at the same time?
The log was copied from different time, but the log detail is same, I don't re-produce the log
@foreverneverer have you checked the zookeeper server side log?
added
This issuse seems related to this problem, https://issues.apache.org/jira/browse/ZOOKEEPER-1998. We can try to set a delay_ms by API
zoo_set_servers_resolution_delay
, or add a .patch file for zk thirdparty lib to revert this commit.
The issue seem say getaddrinfo
is unconditionally in old version, and fixed in 3.7, that's also mean, it has existed in old version but not added in new version?
This issuse seems related to this problem, https://issues.apache.org/jira/browse/ZOOKEEPER-1998. We can try to set a delay_ms by API
zoo_set_servers_resolution_delay
, or add a .patch file for zk thirdparty lib to revert this commit.The issue seem say
getaddrinfo
is unconditionally in old version, and fixed in 3.7, that's also mean, it has existed in old version but not added in new version?
This issue exists in all zk client versions, this patch just add a new API (i.e. zoo_set_servers_resolution_delay
, see https://github.com/apache/zookeeper/pull/1068/files#diff-d7c3594993d19853c9d2b312b783be085bd9f02b3f5fe8a573e7c94e0ec22734R725) to set the delay time, I'm not sure if it can resolve the problem you faced, but you can have a try.
This issue exists in all zk client versions, this patch just add a new API (i.e.
zoo_set_servers_resolution_delay
, see https://github.com/apache/zookeeper/pull/1068/files#diff-d7c3594993d19853c9d2b312b783be085bd9f02b3f5fe8a573e7c94e0ec22734R725) to set the delay time, I'm not sure if it can resolve the problem you faced, but you can have a try.
OK, how add .patch file? I can try it
This issue exists in all zk client versions, this patch just add a new API (i.e.
zoo_set_servers_resolution_delay
, see https://github.com/apache/zookeeper/pull/1068/files#diff-d7c3594993d19853c9d2b312b783be085bd9f02b3f5fe8a573e7c94e0ec22734R725) to set the delay time, I'm not sure if it can resolve the problem you faced, but you can have a try.OK, how add .patch file? I can try it
zk c client 3.7.0 already have this patch . I add config on https://github.com/apache/incubator-pegasus/pull/1100/files, try it
This issue exists in all zk client versions, this patch just add a new API (i.e.
zoo_set_servers_resolution_delay
, see https://github.com/apache/zookeeper/pull/1068/files#diff-d7c3594993d19853c9d2b312b783be085bd9f02b3f5fe8a573e7c94e0ec22734R725) to set the delay time, I'm not sure if it can resolve the problem you faced, but you can have a try.OK, how add .patch file? I can try it
zk c client 3.7.0 already have this patch . I add config on https://github.com/apache/incubator-pegasus/pull/1100/files, try it
Great!
But I have test it and it doesn't work well. the log:
Pegasus:
E2022-08-09 16:16:14.503 (1660032974503742256 148273) meta.THREAD_POOL_META_SERVER0.020200020000034e: server_state.cpp:644:operator()(): get partition node failed, reason(ERR_TIMEOUT)
D2022-08-09 16:16:14.503 (1660032974503750762 148294) meta.io-thrd.148294: zookeeper_session.cpp:323:global_watcher(): global watcher, type(session event), state(expired_session_state)
E2022-08-09 16:16:14.503 (1660032974503777787 148294) meta.io-thrd.148294: distributed_lock_service_zookeeper.cpp:282:on_zoo_session_evt(): get zoo state: expired_session_state, which means the session is expired
E2022-08-09 16:16:14.503 (1660032974503805295 148274) meta.THREAD_POOL_META_SERVER1.0202000000000315: server_state.cpp:644:operator()(): get partition node failed, reason(ERR_TIMEOUT)
E2022-08-09 16:16:14.503 (1660032974503820448 148274) meta.THREAD_POOL_META_SERVER1.0202000200000350: server_state.cpp:644:operator()(): get partition node failed, reason(ERR_TIMEOUT)
E2022-08-09 16:16:14.503 (1660032974503828983 148274) meta.THREAD_POOL_META_SERVER1.0202000300000337: server_state.cpp:644:operator()(): get partition node failed, reason(ERR_TIMEOUT)
E2022-08-09 16:16:14.503 (1660032974503836417 148274) meta.THREAD_POOL_META_SERVER1.0202000300000338: server_state.cpp:644:operator()(): get partition node failed, reason(ERR_TIMEOUT)
E2022-08-09 16:16:14.503 (1660032974503849575 148289) meta. fd0.0201000000000005: meta_server_failure_detector.cpp:162:operator()(): leader lock expired callback: err(ERR_EXPIRED), owner(ip:port), version(174)
E2022-08-09 16:16:14.503 (1660032974503877813 148274) meta.THREAD_POOL_META_SERVER1.0202000300000339: server_state.cpp:644:operator()(): get partition node failed, reason(ERR_TIMEOUT)
Zookeeper:
2022-08-09 16:15:05,844:148195(0x7faadc489700):ZOO_INFO@log_env@1250: Client environment:zookeeper.version=zookeeper C client 3.7.0
2022-08-09 16:15:05,844:148195(0x7faadc489700):ZOO_INFO@log_env@1254: Client environment:host.name=host
2022-08-09 16:15:05,844:148195(0x7faadc489700):ZOO_INFO@log_env@1261: Client environment:os.name=Linux
2022-08-09 16:15:05,844:148195(0x7faadc489700):ZOO_INFO@log_env@1262: Client environment:os.arch=3.18.6-2.el7.centos.x86_64
2022-08-09 16:15:05,844:148195(0x7faadc489700):ZOO_INFO@log_env@1263: Client environment:os.version=#1 SMP Mon Oct 24 13:01:33 CST 2016
2022-08-09 16:15:05,845:148195(0x7faadc489700):ZOO_INFO@log_env@1271: Client environment:user.name=(null)
2022-08-09 16:15:05,845:148195(0x7faadc489700):ZOO_INFO@log_env@1279: Client environment:user.home=/home/work
2022-08-09 16:15:05,845:148195(0x7faadc489700):ZOO_INFO@log_env@1291: Client environment:user.dir=/home/work/app/pegasus/tjwqtst-msg/meta
2022-08-09 16:15:05,845:148195(0x7faadc489700):ZOO_INFO@zookeeper_init_internal@1344: Initiating client connection, host=host:port sessionTimeout=10000 watcher=0x7faaf85e4460 sessionId=0 sessionPasswd=<null> context=0x3ef8790 flags=0
2022-08-09 16:15:05,852:148195(0x7faacb442700):ZOO_INFO@check_events@2988: initiated connection to server ip:port
2022-08-09 16:15:05,858:148195(0x7faacb442700):ZOO_INFO@finalize_session_establishment@2868: session establishment complete on server ip:port, sessionId=0xff82392b30fa55e5, negotiated timeout=10000
2022-08-09 16:15:44,869:148195(0x7faacb442700):ZOO_WARN@zookeeper_interest@2532: Exceeded deadline by 749ms
2022-08-09 16:15:48,962:148195(0x7faacb442700):ZOO_ERROR@resolve_hosts@947: getaddrinfo: Name or service not known
2022-08-09 16:15:48,962:148195(0x7faacb442700):ZOO_WARN@zookeeper_interest@2532: Exceeded deadline by 684ms
2022-08-09 16:16:14,488:148195(0x7faacb442700):ZOO_ERROR@handle_socket_error_msg@3007: Socket ip:port zk retcode=-4, errno=112(Host is down): failed while receiving a server response
2022-08-09 16:16:14,496:148195(0x7faacb442700):ZOO_INFO@check_events@2988: initiated connection to server ip:port
2022-08-09 16:16:14,501:148195(0x7faacb442700):ZOO_ERROR@handle_socket_error_msg@3043: Socket ip:port zk retcode=-112, errno=116(Stale file handle): sessionId=0xff82392b30fa55e5 has expired.
2022-08-09 16:16:15,594:149127(0x7ff5a74d4700):ZOO_INFO@log_env@1250: Client environment:zookeeper.version=zookeeper C client 3.7.0
2022-08-09 16:16:15,594:149127(0x7ff5a74d4700):ZOO_INFO@log_env@1254: Client environment:host.name=host
2022-08-09 16:16:15,594:149127(0x7ff5a74d4700):ZOO_INFO@log_env@1261: Client environment:os.name=Linux
2022-08-09 16:16:15,594:149127(0x7ff5a74d4700):ZOO_INFO@log_env@1262: Client environment:os.arch=3.18.6-2.el7.centos.x86_64
2022-08-09 16:16:15,594:149127(0x7ff5a74d4700):ZOO_INFO@log_env@1263: Client environment:os.version=#1 SMP Mon Oct 24 13:01:33 CST 2016
2022-08-09 16:16:15,594:149127(0x7ff5a74d4700):ZOO_INFO@log_env@1271: Client environment:user.name=(null)
2022-08-09 16:16:15,594:149127(0x7ff5a74d4700):ZOO_INFO@log_env@1279: Client environment:user.home=/home/work
2022-08-09 16:16:15,594:149127(0x7ff5a74d4700):ZOO_INFO@log_env@1291: Client environment:user.dir=/home/work/app/pegasus/tjwqtst-msg/meta
2022-08-09 16:16:15,594:149127(0x7ff5a74d4700):ZOO_INFO@zookeeper_init_internal@1344: Initiating client connection, host=host:port sessionTimeout=10000 watcher=0x7ff5c362f460 sessionId=0 sessionPasswd=<null> context=0x215a790 flags=0
2022-08-09 16:16:15,603:149127(0x7ff59648d700):ZOO_INFO@check_events@2988: initiated connection to server ip:port
2022-08-09 16:16:15,609:149127(0x7ff59648d700):ZOO_INFO@finalize_session_establishment@2868: session establishment complete on server ip:port, sessionId=0xff82392b3542553b, negotiated timeout=10000
Bug Report
https://github.com/XiaoMi/rdsn/pull/979 upgrade zk c client to 3.7(from 3.4.10) for kerberos, but it sometime can't access zk server 3.4.5
In my test, when I upgrade the meta to latest pegasus server, the meta error log:
ERR_INCONSISTENT_STATE
is from zk errorZRUNTIMEINCONSISTENCY
, see the zk error log:In server side, zk find some connection was closed by client:
The above bug does not always occur, I only find the bug in one cluster of our company, The following resolution will be ok:
Related jira:
https://issues.apache.org/jira/browse/ZOOKEEPER-4603