Closed tobegit3hub closed 7 months ago
Error logs:
WARNING: Logging before InitGoogleLogging() is written to STDERR
I20240412 03:12:52.201548 39639 util.cc:58] setting temp path for test in "/tmp/openmldb/new_server_env_test598400"
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from NewServerEnvTest
[ RUN ] NewServerEnvTest.ShowRealEndpoint
I20240412 03:12:52.203043 39639 name_server_impl.cc:1414] zone name ns1/rtidb45250674
2024-04-12 03:12:52,203:39639(0x7f5ce6603500):ZOO_INFO@log_env@753: Client environment:zookeeper.version=zookeeper C client 3.4.14
2024-04-12 03:12:52,203:39639(0x7f5ce6603500):ZOO_INFO@log_env@757: Client environment:host.name=4b60f8b6dfcb
2024-04-12 03:12:52,203:39639(0x7f5ce6603500):ZOO_INFO@log_env@764: Client environment:os.name=Linux
2024-04-12 03:12:52,203:39639(0x7f5ce6603500):ZOO_INFO@log_env@765: Client environment:os.arch=3.10.0-862.el7.x86_64
2024-04-12 03:12:52,203:39639(0x7f5ce6603500):ZOO_INFO@log_env@766: Client environment:os.version=#1 SMP Fri Apr 20 16:44:24 UTC 2018
2024-04-12 03:12:52,203:39639(0x7f5ce6603500):ZOO_INFO@log_env@774: Client environment:user.name=(null)
2024-04-12 03:12:52,203:39639(0x7f5ce6603500):ZOO_INFO@log_env@782: Client environment:user.home=/root
2024-04-12 03:12:52,203:39639(0x7f5ce6603500):ZOO_INFO@log_env@794: Client environment:user.dir=/workspaces/OpenMLDB
2024-04-12 03:12:52,203:39639(0x7f5ce6603500):ZOO_INFO@zookeeper_init@827: Initiating client connection, host=127.0.0.1:6181 sessionTimeout=100000 watcher=0x7e8420 sessionId=0 sessionPasswd=
/workspaces/OpenMLDB/src/nameserver/new_server_env_test.cc:180: Failure
Expected equality of these values:
  ns_real_ep
    Which is: "127.0.0.1:9631"
  it->second
    Which is: ""
I20240412 03:13:01.365576 39639 server.cpp:1194] Server[openmldb::tablet::TabletImpl] is going to quit
2024-04-12 03:13:01,366:39639(0x7f5ce6603500):ZOO_INFO@zookeeper_close@2564: Closing zookeeper sessionId=0x103614baed8000f to [127.0.0.1:6181]
I20240412 03:13:01.369341 39639 server.cpp:1194] Server[openmldb::tablet::TabletImpl] is going to quit
I20240412 03:13:01.369550 39651 zk_client.cc:48] node watcher with event type 4, state 3
I20240412 03:13:01.369913 39651 zk_client.cc:170] handle node changed event with type 4, and state 3, endpoints size 1, callback size 1
I20240412 03:13:01.370280 39651 name_server_impl.cc:1170] healthy tablet with endpoint[tb1]
I20240412 03:13:01.370349 39651 name_server_impl.cc:1176] offline tablet with endpoint[tb2]
W20240412 03:13:01.370465 39641 rpc_client.h:61] error_code is EHOSTDOWN, sleep [1000] ms
2024-04-12 03:13:01,370:39639(0x7f5ce6603500):ZOO_INFO@zookeeper_close@2564: Closing zookeeper sessionId=0x103614baed8000e to [127.0.0.1:6181]
[ FAILED ] NewServerEnvTest.ShowRealEndpoint (9171 ms)
[----------] 1 test from NewServerEnvTest (9171 ms total)
[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (9171 ms total)
I20240412 03:13:01.373283 39651 zk_client.cc:48] node watcher with event type 4, state 3
[ PASSED ] 0 tests.
[ FAILED ] 1 test, listed below:
[ FAILED ] NewServerEnvTest.ShowRealEndpoint
1 FAILED TEST
I20240412 03:13:01.373620 39651 zk_client.cc:170] handle node changed event with type 4, and state 3, endpoints size 0, callback size 1
I20240412 03:13:01.373682 39651 name_server_impl.cc:1176] offline tablet with endpoint[tb1]
I20240412 03:13:01.373965 39639 util.cc:68] removing temp path: "/tmp/openmldb/new_server_env_test598400"
The root cause is `W20240412 03:12:57.252521 39639 zk_client.cc:237] server name:ns1 duplicate`; see https://github.com/4paradigm/OpenMLDB/blob/40eaf505167b98cf82de18aedb17a35d67023845/src/zk/zk_client.cc#L210-L239
The code gets the nodes (leader and tablets) in lines 212-230. If the leader node exists in zk, `sname_vec` will contain `ns1`, so it then checks `GetNodeValue(names_root_path_ + "/" + sname, ep) && ep == real_endpoint_`. But there is no `ns1` under `names_root_path_` in zk because we haven't created it yet, so `GetNodeValue` returns false. As a result, `RegisterName` reports `duplicate` and returns false.
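The failure path described above can be sketched against an in-memory stand-in for ZooKeeper. This is a simplified illustration, not the code in `zk_client.cc`: `FakeZk`, `RegisterNameSketch`, `RegisterResult`, and the path strings are hypothetical, and the branch that creates the name node is an assumption. Only the `GetNodeValue(...) && ep == real_endpoint` check and the `duplicate` outcome come from the description.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

// Hypothetical in-memory stand-in for ZooKeeper: path -> value.
struct FakeZk {
    std::map<std::string, std::string> nodes;

    // Bool-returning getter: false means "could not read", without saying why.
    bool GetNodeValue(const std::string& path, std::string& value) const {
        auto it = nodes.find(path);
        if (it == nodes.end()) return false;
        value = it->second;
        return true;
    }
};

enum class RegisterResult { kAlreadyRegistered, kDuplicate, kCreated };

// Sketch of the check; `seen_names` plays the role of sname_vec.
RegisterResult RegisterNameSketch(FakeZk& zk,
                                  const std::set<std::string>& seen_names,
                                  const std::string& names_root_path,
                                  const std::string& sname,
                                  const std::string& real_endpoint) {
    assert(!sname.empty());
    const std::string path = names_root_path + "/" + sname;
    if (seen_names.count(sname) > 0) {
        std::string ep;
        if (zk.GetNodeValue(path, ep) && ep == real_endpoint) {
            return RegisterResult::kAlreadyRegistered;
        }
        // Reached both when another endpoint owns the name and when the
        // name node was simply never created.
        return RegisterResult::kDuplicate;
    }
    // Assumed: a name that has not been seen is free, so the node is created.
    zk.nodes[path] = real_endpoint;
    return RegisterResult::kCreated;
}
```

With `seen_names` containing `ns1` and an empty `FakeZk`, the function returns `kDuplicate` even though nothing is stored under the names path, which matches the warning in the log.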
I think a better way is to check whether the node exists under `names_root_path_` in zk instead of just getting its value. With a plain get, we can't tell whether the node doesn't exist or the get itself failed (zk failures).
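One way to express this suggestion is a lookup that reports three outcomes instead of a bool. The sketch below is an assumption about how such a check could look, using a hypothetical in-memory `FakeZk` with a simulated failure flag; `Lookup`, `RegisterResult`, and `RegisterNameChecked` are illustrative names, not part of the OpenMLDB `ZkClient` API.

```cpp
#include <cassert>
#include <map>
#include <string>

enum class Lookup { kFound, kNotFound, kError };

// Hypothetical in-memory stand-in for ZooKeeper.
struct FakeZk {
    std::map<std::string, std::string> nodes;
    bool simulate_failure = false;  // pretend the zk session is broken

    // Distinguishes "node is missing" from "the read itself failed".
    Lookup Get(const std::string& path, std::string& value) const {
        if (simulate_failure) return Lookup::kError;
        auto it = nodes.find(path);
        if (it == nodes.end()) return Lookup::kNotFound;
        value = it->second;
        return Lookup::kFound;
    }
};

enum class RegisterResult { kAlreadyRegistered, kCreated, kDuplicate, kZkError };

RegisterResult RegisterNameChecked(FakeZk& zk,
                                   const std::string& names_root_path,
                                   const std::string& sname,
                                   const std::string& real_endpoint) {
    assert(!sname.empty());
    const std::string path = names_root_path + "/" + sname;
    std::string ep;
    switch (zk.Get(path, ep)) {
        case Lookup::kError:
            // A zk failure is reported as such, not as a duplicate name.
            return RegisterResult::kZkError;
        case Lookup::kNotFound:
            // The name is free, so claim it.
            zk.nodes[path] = real_endpoint;
            return RegisterResult::kCreated;
        case Lookup::kFound:
            return ep == real_endpoint ? RegisterResult::kAlreadyRegistered
                                       : RegisterResult::kDuplicate;
    }
    return RegisterResult::kZkError;  // not reached; keeps compilers quiet
}
```

Here `kDuplicate` is returned only when the node exists and holds a different endpoint. A missing node leads to creation, and a failed read surfaces as `kZkError` so the caller can decide whether to retry.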
To inspect the zk data, use `zkCli.sh -server xx:xx`; cheatsheet: https://zookeeper.apache.org/doc/r3.6.0/zookeeperCLI.html
Bug Description
The nameserver may now fail to start after the sleep method was added.
The issue may lie between `Init()` and `RegisterName`. The function `init` is asynchronous and is used to register the zk watch. The function `register` writes the endpoint to zk, but it has to be executed before `init` finishes. This is a weird design, and it causes the nameserver to fail to start.
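The ordering dependency can be illustrated with a deterministic simulation of the two startup orders. Everything here is hypothetical: `FakeZk`, `FinishInit`, `RegisterNameStep`, `StartNameServer`, and the `/leader` and `/names` paths are illustrative, and the sketch assumes that the asynchronous init makes the server's name visible in zk before the name node itself has been written.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical in-memory stand-in for ZooKeeper.
struct FakeZk {
    std::map<std::string, std::string> nodes;
};

// Assumed effect of the asynchronous init completing: the server name
// becomes visible under a leader path.
void FinishInit(FakeZk& zk, const std::string& sname) {
    zk.nodes["/leader/" + sname] = "";
}

// Simplified register step: if the name is already visible, the name node
// must exist and match; otherwise the name node is created.
bool RegisterNameStep(FakeZk& zk, const std::string& sname,
                      const std::string& real_endpoint) {
    assert(!sname.empty());
    const std::string name_path = "/names/" + sname;
    const bool name_seen = zk.nodes.count("/leader/" + sname) > 0;
    if (name_seen) {
        auto it = zk.nodes.find(name_path);
        return it != zk.nodes.end() && it->second == real_endpoint;
    }
    zk.nodes[name_path] = real_endpoint;
    return true;
}

// Runs both steps in the given order and reports whether registration worked.
bool StartNameServer(bool init_finishes_first) {
    FakeZk zk;
    if (init_finishes_first) {
        FinishInit(zk, "ns1");  // e.g. a sleep delayed the register step
        return RegisterNameStep(zk, "ns1", "127.0.0.1:9631");
    }
    const bool ok = RegisterNameStep(zk, "ns1", "127.0.0.1:9631");
    FinishInit(zk, "ns1");
    return ok;
}
```

Under these assumptions, `StartNameServer(false)` succeeds and `StartNameServer(true)` fails, so the outcome depends only on which step wins the race.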
Expected Behavior
The nameserver starts successfully no matter when it sleeps.
Steps To Reproduce
new_server_env_test