apache / incubator-pegasus

Apache Pegasus - A horizontally scalable, strongly consistent and high-performance key-value store
https://pegasus.apache.org/
Apache License 2.0

Question(FQDN): how to connect to Pegasus cluster with FQDN pegasus version #2007

Open ninsmiracle opened 1 month ago

ninsmiracle commented 1 month ago

General Question

When I deploy the master branch of Pegasus to a real cluster, I cannot connect to Pegasus via pegasus_shell.

  1. First, I changed all the IPs to hostnames in the Pegasus config.
  2. Then I deployed it to the machines.
  3. I connected to the Pegasus cluster via admin-cli, for example with ./admin-cli -n aaa:25101,bbb:25101, but it returned fatal: failed to list nodes [context deadline exceeded]
  4. I connected to the Pegasus cluster via pegasus-shell. It works. However, when I type nodes -d, the cluster crashes.
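Since step 1 switches the config from IPs to hostnames, one quick sanity check is whether every host in a list such as aaa:25101,bbb:25101 resolves from the machine running the tool. The sketch below is only an illustration and not part of Pegasus: the function names and the injectable `resolve` parameter are hypothetical, and it uses nothing beyond Python's standard `socket` module.

```python
import socket
from typing import Callable, Dict, List, Optional, Tuple


def parse_host_ports(spec: str) -> List[Tuple[str, int]]:
    """Split a list like 'aaa:25101,bbb:25101' into (host, port) pairs."""
    pairs = []
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue
        host, sep, port = item.rpartition(":")
        if not sep or not host or not port.isdigit():
            raise ValueError(f"expected host:port, got {item!r}")
        pairs.append((host, int(port)))
    return pairs


def check_forward_resolution(
    spec: str, resolve: Callable[[str], str] = socket.gethostbyname
) -> Dict[str, Optional[str]]:
    """Map each host in the list to its resolved IPv4 address, or None on failure."""
    result: Dict[str, Optional[str]] = {}
    for host, _port in parse_host_ports(spec):
        try:
            result[host] = resolve(host)
        except OSError:  # socket.gaierror is a subclass of OSError
            result[host] = None
    return result
```

Calling `check_forward_resolution("aaa:25101,bbb:25101")` on the host that runs admin-cli or the shell would show any name that resolves to None. This only covers name-to-IP resolution, which is a different direction from the IP-to-name lookup in the assertion reported below.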

stdout(error log) in meta server:

I2024-05-08 14:13:57.603 (1715148837603905326 81668) : pegasus server starting, pid(81668), version($Version: Pegasus Server 2.6.0-SNAPSHOT (aea1cfe632d455fcddfe4c92ebbd9d4e89037abb) Release, built by gcc 7.3.1, built on 12180ab51819, built at May  7 2024 12:14:31 $)
F2024-05-08 14:15:26.215 (1715148926215608204 81749)   meta.THREAD_POOL_META_SERVER3.02003f3d00010001: rpc_host_port.cpp:62:from_address(): assertion expression: [utils::hostname_from_ip(__bswap_32 (addr.ip()), &hp._host)] invalid host_port 172.17.0.1

172.17.0.1 is the IP of my pegasus-shell, which runs in a Docker container. It looks like Pegasus cannot resolve this IP correctly. Is this a bug?
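The failing assertion names utils::hostname_from_ip, which suggests the server is trying to turn the caller's IP back into a hostname. Assuming that step behaves like an ordinary reverse DNS lookup (an assumption, not something confirmed in this thread), the same check can be tried from a meta server host with a small sketch like the one below. The function name and the injectable `resolver` parameter are hypothetical; `socket.getnameinfo` and the `NI_NAMEREQD` flag are standard.

```python
import socket
from typing import Callable, Optional


def reverse_lookup(ip: str, resolver: Callable = socket.getnameinfo) -> Optional[str]:
    """Return the hostname registered for an IPv4 address, or None if there is none.

    NI_NAMEREQD makes the lookup fail instead of silently returning the
    numeric address when no name is known for the IP.
    """
    try:
        host, _service = resolver(
            (ip, 0), socket.NI_NAMEREQD | socket.NI_NUMERICSERV
        )
        return host
    except OSError:  # socket.gaierror is a subclass of OSError
        return None
```

If `reverse_lookup("172.17.0.1")` returns None on the meta server host, that would be consistent with the "invalid host_port 172.17.0.1" message: a Docker bridge address usually has no reverse DNS entry visible to other machines.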

I also found these coredumps on the replica servers.

Program terminated with signal SIGABRT, Aborted.
#0  0x00007ffaedff01d7 in raise () from /lib64/libc.so.6
(gdb) bt
#0  0x00007ffaedff01d7 in raise () from /lib64/libc.so.6
#1  0x00007ffaedff18c8 in abort () from /lib64/libc.so.6
#2  0x00007ffaf240ca1e in dsn_coredump () at /home/guoningshen/code/incubator-pegasus/src/runtime/service_api_c.cpp:130
#3  0x00007ffaef3e8134 in process_fatal_log (log_level=<optimized out>) at /home/guoningshen/code/incubator-pegasus/src/utils/simple_logger.cpp:117
#4  dsn::tools::simple_logger::log (this=0x1a38200, file=<optimized out>, function=<optimized out>, line=<optimized out>, log_level=<optimized out>, str=<optimized out>)
    at /home/guoningshen/code/incubator-pegasus/src/utils/simple_logger.cpp:284
#5  0x00007ffaf21ec19b in dsn::replication::replica_stub::open_replica (this=0x1851800, app=..., id=..., group_check=..., configuration_update=...)
    at /home/guoningshen/code/incubator-pegasus/src/replica/replica_stub.cpp:1817
#6  0x00007ffaf2447be1 in dsn::task::exec_internal (this=0x1f50b40) at /home/guoningshen/code/incubator-pegasus/src/runtime/task/task.cpp:173
#7  0x00007ffaf245f257 in dsn::task_worker::loop (this=0x1717290) at /home/guoningshen/code/incubator-pegasus/src/runtime/task/task_worker.cpp:245
#8  0x00007ffaf245fdc0 in dsn::task_worker::run_internal (this=0x1717290) at /home/guoningshen/code/incubator-pegasus/src/runtime/task/task_worker.cpp:225
#9  0x00007ffaf0ed9a3f in execute_native_thread_routine () from /home/work/app/pegasus/c3tst-performance1/replica/package/bin/librocksdb.so.8
#10 0x00007ffaef66edc5 in start_thread () from /lib64/libpthread.so.0
#11 0x00007ffaee0b273d in clone () from /lib64/libc.so.6
(gdb)

ninsmiracle commented 1 month ago

So I want to know what I should do to deploy a Pegasus cluster with FQDN now, and how to use the tools to control this cluster. Thanks a lot. @acelyc111

ninsmiracle commented 1 month ago

Let me add more details:

  1. Deploy the cluster. It works, every node is running.

  2. Use pegasus-shell to connect to the cluster.

  3. Send any RPC command, like nodes -dr or ls -d: TIME_OUT.

  4. A lot of core files on the meta server.

Core like core.meta.THREAD_PO...

Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/home/work/app/pegasus/c3tst-performance1/meta/package/bin/pegasus_server confi'.
Program terminated with signal SIGABRT, Aborted.
#0  0x00007f3c0c8bc1d7 in raise () from /lib64/libc.so.6
(gdb) bt
#0  0x00007f3c0c8bc1d7 in raise () from /lib64/libc.so.6
#1  0x00007f3c0c8bd8c8 in abort () from /lib64/libc.so.6
#2  0x00007f3c10cd8a1e in dsn_coredump () at /home/guoningshen/code/incubator-pegasus/src/runtime/service_api_c.cpp:130
#3  0x00007f3c0dcb4134 in process_fatal_log (log_level=<optimized out>) at /home/guoningshen/code/incubator-pegasus/src/utils/simple_logger.cpp:117
#4  dsn::tools::simple_logger::log (this=0x2e3a200, file=<optimized out>, function=<optimized out>, line=<optimized out>, log_level=<optimized out>, str=<optimized out>)
    at /home/guoningshen/code/incubator-pegasus/src/utils/simple_logger.cpp:284
#5  0x00007f3c10d09ff3 in dsn::host_port::from_address (addr=...) at /home/guoningshen/code/incubator-pegasus/src/runtime/rpc/rpc_host_port.cpp:60
#6  0x00007f3c10d0f0c5 in dsn::message_ex::create_response (this=this@entry=0x327be00) at /home/guoningshen/code/incubator-pegasus/src/runtime/rpc/rpc_message.cpp:358
#7  0x00007f3c10d0638d in dsn::rpc_engine::forward (this=this@entry=0x2c4f180, request=request@entry=0x327be00, address=...) at /home/guoningshen/code/incubator-pegasus/src/runtime/rpc/rpc_engine.cpp:853
#8  0x00007f3c10cd90a3 in dsn_rpc_forward (request=0x327be00, addr=...) at /home/guoningshen/code/incubator-pegasus/src/runtime/service_api_c.cpp:207
#9  0x00007f3c0ffc6196 in forward (addr=..., this=0x7f3bee4e5f20) at /home/guoningshen/code/incubator-pegasus/src/runtime/rpc/rpc_holder.h:224
#10 dsn::replication::meta_service::check_leader<dsn::rpc_holder<dsn::replication::configuration_list_apps_request, dsn::replication::configuration_list_apps_response> > (this=this@entry=0x32ee000, 
    rpc=..., forward_address=<optimized out>) at /home/guoningshen/code/incubator-pegasus/src/meta/meta_service.h:406
#11 0x00007f3c0ffc629a in dsn::replication::meta_service::check_leader_status<dsn::rpc_holder<dsn::replication::configuration_list_apps_request, dsn::replication::configuration_list_apps_response> > (
    this=this@entry=0x32ee000, rpc=..., forward_address=forward_address@entry=0x0) at /home/guoningshen/code/incubator-pegasus/src/meta/meta_service.h:420
#12 0x00007f3c0ff9ef6a in dsn::replication::meta_service::on_list_apps (this=0x32ee000, rpc=...) at /home/guoningshen/code/incubator-pegasus/src/meta/meta_service.cpp:671
#13 0x00007f3c0fff8653 in operator() (request=<optimized out>, __closure=<optimized out>) at /home/guoningshen/code/incubator-pegasus/src/runtime/serverlet.h:201
#14 std::_Function_handler<void (dsn::message_ex*), bool dsn::serverlet<dsn::replication::meta_service>::register_rpc_handler_with_rpc_holder<dsn::rpc_holder<dsn::replication::configuration_list_apps_request, dsn::replication::configuration_list_apps_response> >(dsn::task_code, char const*, void (dsn::replication::meta_service::*)(dsn::rpc_holder<dsn::replication::configuration_list_apps_request, dsn::replication::configuration_list_apps_response>))::{lambda(dsn::message_ex*)#1}>::_M_invoke(std::_Any_data const&, dsn::message_ex*&&) (__functor=..., __args#0=<optimized out>)
    at /opt/rh/devtoolset-7/root/usr/include/c++/7/bits/std_function.h:316
#15 0x00007f3c10d123b2 in operator() (__args#0=<optimized out>, this=0x2b310d0) at /opt/rh/devtoolset-7/root/usr/include/c++/7/bits/std_function.h:706
#16 dsn::rpc_request_task::exec (this=0x2b31000) at /home/guoningshen/code/incubator-pegasus/src/runtime/task/task.h:436
#17 0x00007f3c10d13be1 in dsn::task::exec_internal (this=0x2b31000) at /home/guoningshen/code/incubator-pegasus/src/runtime/task/task.cpp:173
#18 0x00007f3c10d2b257 in dsn::task_worker::loop (this=0x2b19290) at /home/guoningshen/code/incubator-pegasus/src/runtime/task/task_worker.cpp:245
#19 0x00007f3c10d2bdc0 in dsn::task_worker::run_internal (this=0x2b19290) at /home/guoningshen/code/incubator-pegasus/src/runtime/task/task_worker.cpp:225
#20 0x00007f3c0f7a5a3f in execute_native_thread_routine () from /home/work/app/pegasus/c3tst-performance1/meta/package/bin/librocksdb.so.8
#21 0x00007f3c0df3adc5 in start_thread () from /lib64/libpthread.so.0
#22 0x00007f3c0c97e73d in clone () from /lib64/libc.so.6
(gdb) 

Core like core.pegasus_server....

#0  0x0000000000000000 in ?? ()
#1  0x00007f693f83b6c0 in (anonymous namespace)::stacktrace_generic_fp::capture<false, false> (result=result@entry=0xaee010, max_depth=31, skip_count=1, initial_frame=initial_frame@entry=0x7ffd328eae80, 
    initial_pc=initial_pc@entry=0x0, sizes=0x0) at src/stacktrace_generic_fp-inl.h:175
#2  0x00007f693f83b74a in GetStackTrace_generic_fp (result=0xaee010, max_depth=<optimized out>, skip_count=<optimized out>) at src/stacktrace_generic_fp-inl.h:332
#3  0x00007f693f83ba52 in GetStackTrace (result=result@entry=0xaee010, max_depth=max_depth@entry=30, skip_count=skip_count@entry=0) at src/stacktrace.cc:346
#4  0x00007f693f82c37e in tcmalloc::PageHeap::HandleUnlock (this=0x7f693fa56720 <tcmalloc::Static::pageheap_>, context=0x7ffd328eaf10) at src/page_heap.cc:155
#5  0x00007f693f82e07a in ~LockingContext (this=0x7ffd328eaf10, __in_chrg=<optimized out>) at src/page_heap.cc:77
#6  tcmalloc::PageHeap::NewWithSizeClass (this=this@entry=0x7f693fa56720 <tcmalloc::Static::pageheap_>, n=n@entry=1, sizeclass=26) at src/page_heap.cc:161
#7  0x00007f693f82beb7 in tcmalloc::CentralFreeList::Populate (this=this@entry=0x7f693fbe1420 <tcmalloc::Static::central_cache_+31616>) at src/central_freelist.cc:314
#8  0x00007f693f82c088 in tcmalloc::CentralFreeList::FetchFromOneSpansSafe (this=0x7f693fbe1420 <tcmalloc::Static::central_cache_+31616>, N=1, start=0x7ffd328eb020, end=0x7ffd328eb028)
    at src/central_freelist.cc:273
#9  0x00007f693f82c120 in tcmalloc::CentralFreeList::RemoveRange (this=0x7f693fbe1420 <tcmalloc::Static::central_cache_+31616>, start=start@entry=0x7ffd328eb020, end=end@entry=0x7ffd328eb028, N=1)
    at src/central_freelist.cc:253
#10 0x00007f693f82fca3 in tcmalloc::ThreadCache::FetchFromCentralCache (this=this@entry=0xb0e000, cl=cl@entry=26, byte_size=byte_size@entry=576, 
    oom_handler=oom_handler@entry=0x7f693f81d240 <(anonymous namespace)::nop_oom_handler(size_t)>) at src/thread_cache.cc:125
#11 0x00007f693f83f15d in Allocate (oom_handler=0x7f693f81d240 <(anonymous namespace)::nop_oom_handler(size_t)>, cl=26, size=576, this=<optimized out>) at src/thread_cache.h:381
#12 do_malloc (size=568) at src/tcmalloc.cc:1414
#13 do_allocate_full<tcmalloc::malloc_oom> (size=568) at src/tcmalloc.cc:1804
#14 tcmalloc::allocate_full_malloc_oom (size=568) at src/tcmalloc.cc:1820
#15 0x00007f693dfa754d in __fopen_internal () from /lib64/libc.so.6
#16 0x00007f693ca60a16 in selinuxfs_exists () from /lib64/libselinux.so.1
#17 0x00007f693ca58ce8 in init_lib () from /lib64/libselinux.so.1
#18 0x00007f6943dfd1e3 in _dl_init_internal () from /lib64/ld-linux-x86-64.so.2
#19 0x00007f6943def21a in _dl_start_user () from /lib64/ld-linux-x86-64.so.2
#20 0x0000000000000004 in ?? ()
#21 0x00007ffd328ed220 in ?? ()
#22 0x00007ffd328ed26a in ?? ()
#23 0x00007ffd328ed275 in ?? ()
#24 0x00007ffd328ed27f in ?? ()
#25 0x0000000000000000 in ?? ()
(gdb)

6. stdout (error log) in meta-server:
    W2024-05-11 10:33:36.503 (1715394816503732375 36348) : overwrite default thread pool for task RPC_CM_QUERY_PARTITION_CONFIG_BY_INDEX from THREAD_POOL_META_SERVER to THREAD_POOL_DEFAULT
    W2024-05-11 10:33:36.503 (1715394816503775340 36348) : overwrite default thread pool for task RPC_CM_QUERY_PARTITION_CONFIG_BY_INDEX_ACK from THREAD_POOL_META_SERVER to THREAD_POOL_DEFAULT
    I2024-05-11 10:33:36.503 (1715394816503863057 36348) : pegasus server starting, pid(36348), version($Version: Pegasus Server 2.6.0-SNAPSHOT (aea1cfe632d455fcddfe4c92ebbd9d4e89037abb) Release, built by gcc 7.3.1, built on 12180ab51819, built at May  7 2024 12:14:31 $)
    F2024-05-11 10:36:03.558 (1715394963558260142 36428)   meta.THREAD_POOL_META_SERVER2.02008e370001000c: rpc_host_port.cpp:62:from_address(): assertion expression: [utils::hostname_from_ip(__bswap_32 (addr.ip()), &hp._host)] invalid host_port 172.17.0.1

7. By the way, all the replica servers were running during that time.

8. And I cannot connect to the cluster via admin-cli.

acelyc111 commented 1 month ago

Hi, @ninsmiracle !

Is the Pegasus cluster deployed as a onebox in the docker container? Do the Pegasus shell tool and admin-cli run in the same docker container?

ninsmiracle commented 1 month ago

> Hi, @ninsmiracle !
>
> Is the Pegasus cluster deployed as a onebox in the docker container? Do the Pegasus shell tool and admin-cli run in the same docker container?

When I deployed it as a onebox in my Docker container, the cluster ran normally. However, when I deploy it on real nodes, the cluster is running but cannot accept any RPC.
I think the key point is meta.THREAD_POOL_META_SERVER2.02008e370001000c: rpc_host_port.cpp:62:from_address(): assertion expression: [utils::hostname_from_ip(__bswap_32 (addr.ip()), &hp._host)] invalid host_port 172.17.0.1.

acelyc111 commented 1 month ago

> I connected to the Pegasus cluster via admin-cli, for example with ./admin-cli -n aaa:25101,bbb:25101, but it returned fatal: failed to list nodes [context deadline exceeded]

This is because, after the main FQDN patch was merged, a new Thrift structure (i.e. host_port) was introduced, but the admin-cli side doesn't know this type. You can check it in admin-cli's shell.log; the error looks like:

time="2024-05-23T00:30:55+08:00" level=info msg="failed to read response from [127.0.0.1:34601(meta)]: *admin.ListNodesResponse error reading struct: *admin.NodeInfo error reading struct: Unknown data type 57"

The resolution is to update the go-client that admin-cli depends on. However, we have to resolve https://github.com/apache/incubator-pegasus/pull/1917 first.
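To illustrate why a client that predates a new wire type can fail with "Unknown data type 57", here is a toy reader in the style of Thrift's binary protocol. It is a sketch under stated assumptions, not the go-client's actual code: the helper names are hypothetical and only a few standard Thrift type IDs are handled. 57 is not one of the standard type IDs, so a reader that reaches such a byte where it expects a field header cannot know how many bytes to skip and has to give up.

```python
import struct

# Standard Thrift binary-protocol type IDs with fixed-size values:
# BOOL=2, BYTE=3, DOUBLE=4, I16=6, I32=8, I64=10 (value sizes in bytes).
FIXED_SIZES = {2: 1, 3: 1, 4: 8, 6: 2, 8: 4, 10: 8}
T_STOP, T_STRING = 0, 11


def skip_value(buf: bytes, pos: int, ttype: int) -> int:
    """Return the offset just past a value of the given type, or raise if unknown."""
    if ttype in FIXED_SIZES:
        return pos + FIXED_SIZES[ttype]
    if ttype == T_STRING:
        (length,) = struct.unpack_from("!i", buf, pos)  # 4-byte big-endian length
        return pos + 4 + length
    raise ValueError(f"Unknown data type {ttype}")


def skip_fields(buf: bytes) -> int:
    """Walk field headers (1-byte type, 2-byte id, value) until STOP.

    Returns the number of fields skipped.
    """
    pos, count = 0, 0
    while True:
        ttype = buf[pos]
        pos += 1
        if ttype == T_STOP:
            return count
        pos += 2  # field id (i16), ignored in this sketch
        pos = skip_value(buf, pos, ttype)
        count += 1
```

A struct made only of known types is skipped cleanly, while a byte such as 57 in the type position raises immediately. The real response is then never decoded, which fits admin-cli only reporting context deadline exceeded while the detail sits in shell.log.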

acelyc111 commented 2 weeks ago

> > Hi, @ninsmiracle ! Is the Pegasus cluster deployed as a onebox in the docker container? Do the Pegasus shell tool and admin-cli run in the same docker container?
>
> When I deployed it as a onebox in my Docker container, the cluster ran normally. However, when I deploy it on real nodes, the cluster is running but cannot accept any RPC. I think the key point is meta.THREAD_POOL_META_SERVER2.02008e370001000c: rpc_host_port.cpp:62:from_address(): assertion expression: [utils::hostname_from_ip(__bswap_32 (addr.ip()), &hp._host)] invalid host_port 172.17.0.1.

@ninsmiracle You can check if this patch could solve the issue: https://github.com/apache/incubator-pegasus/pull/2044