apache / incubator-pegasus

Apache Pegasus - A horizontally scalable, strongly consistent and high-performance key-value store
https://pegasus.apache.org/
Apache License 2.0

Bug(duplication): some nodes coredump after duplication has been running for a long time #2014

Open ninsmiracle opened 1 month ago

ninsmiracle commented 1 month ago

Bug Report

Please answer these questions before submitting your issue. Thanks!

  1. What did you do?
     - Deployed a duplication master cluster and a backup cluster.
     - Started duplication.
     - Let it run for about 2~3 days.
     - Some nodes core dumped.

  2. What did you expect to see? Nodes keep running normally.

  3. What did you see instead? See the memory monitoring table (image omitted).

coredump detail:

Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/home/work/app/pegasus/c3srv-browser/replica/package/bin/pegasus_server config.'.
Program terminated with signal SIGABRT, Aborted.
#0  0x00007f01575401d7 in raise () from /lib64/libc.so.6
(gdb) bt
#0  0x00007f01575401d7 in raise () from /lib64/libc.so.6
#1  0x00007f01575418c8 in abort () from /lib64/libc.so.6
#2  0x00007f015c628f9e in dsn_coredump () at /home/work/temp/format_pegasus/pegasus/src/rdsn/src/runtime/service_api_c.cpp:93
#3  0x00007f015c422c83 in dsn::replication::log_file::log_file (this=0x73aa4a630, path=0x740561c98 "/home/work/ssd2/pegasus/c3srv-browser/replica/reps/72.173.pegasus/plog/log.92534.3105061495163", 
    handle=<optimized out>, index=<optimized out>, start_offset=3105061495163, is_read=<optimized out>) at /home/work/temp/format_pegasus/pegasus/src/rdsn/src/replica/log_file.cpp:166
#4  0x00007f015c4247ce in dsn::replication::log_file::open_read (path=0x740561c98 "/home/work/ssd2/pegasus/c3srv-browser/replica/reps/72.173.pegasus/plog/log.92534.3105061495163", err=...)
    at /home/work/temp/format_pegasus/pegasus/src/rdsn/src/replica/log_file.cpp:92
#5  0x00007f015c43ccfa in dsn::replication::log_utils::open_read (path=..., file=...) at /home/work/temp/format_pegasus/pegasus/src/rdsn/src/replica/mutation_log_utils.cpp:43
#6  0x00007f015c4ff7fa in dsn::replication::load_from_private_log::find_log_file_to_start (this=this@entry=0x384c74640)
    at /home/work/temp/format_pegasus/pegasus/src/rdsn/src/replica/duplication/load_from_private_log.cpp:123
#7  0x00007f015c500360 in dsn::replication::load_from_private_log::run (this=0x384c74640) at /home/work/temp/format_pegasus/pegasus/src/rdsn/src/replica/duplication/load_from_private_log.cpp:100
#8  0x00007f015c665f91 in dsn::task::exec_internal (this=this@entry=0x2b9bce1e0) at /home/work/temp/format_pegasus/pegasus/src/rdsn/src/runtime/task/task.cpp:176
#9  0x00007f015c67b642 in dsn::task_worker::loop (this=0x2a67c30) at /home/work/temp/format_pegasus/pegasus/src/rdsn/src/runtime/task/task_worker.cpp:224
#10 0x00007f015c67b7c0 in dsn::task_worker::run_internal (this=0x2a67c30) at /home/work/temp/format_pegasus/pegasus/src/rdsn/src/runtime/task/task_worker.cpp:204
#11 0x00007f015b2f8a3f in execute_native_thread_routine () from /home/work/app/pegasus/c3srv-browser/replica/package/bin/libdsn_utils.so
#12 0x00007f0159103dc5 in start_thread () from /lib64/libpthread.so.0
#13 0x00007f015760273d in clone () from /lib64/libc.so.6
(gdb)

stdout file (error log):

E2024-05-15 05:48:17.721 (1715723297721553663 62544) replica.rep_long9.040400031452989e: native_linux_aio_provider.cpp:49:open(): create file failed, err = No such file or directory

E2024-05-15 05:48:17.721 (1715723297721596680 62544) replica.rep_long9.040400031452989e: load_from_private_log.cpp:125:find_log_file_to_start(): [72.171@10.142.162.23:34801] ERR_FILE_OPERATION_FAILED: failed to open the log file (/home/work/ssd7/pegasus/c3srv-xxxxxx/replica/reps/72.171.pegasus/plog/log.91190.3060048707709)

F2024-05-15 06:03:20.656 (1715724200656901498 62545) replica.rep_long10.04040005181bcdaf: log_file.cpp:166:log_file(): assertion expression: false
F2024-05-15 06:03:20.656 (1715724200656954168 62545) replica.rep_long10.04040005181bcdaf: log_file.cpp:166:log_file(): fail to get file size of /home/work/ssd2/pegasus/c3srv-xxxxx/replica/reps/72.173.pegasus/plog/log.92534.3105061495163
  4. What version of Pegasus are you using? Pegasus v2.4
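
Note on the failure mode: the error log shows an open failing with "No such file or directory", and the backtrace shows the assertion at log_file.cpp:166 ("fail to get file size") being reached from find_log_file_to_start. One possible reading, which is an assumption and not confirmed by this report, is that a private log file disappears between the moment duplication picks it and the moment it is opened. The sketch below is not Pegasus code. `open_status` and `try_open_log` are hypothetical names used only to illustrate reporting a missing file to the caller instead of aborting the process.

```cpp
#include <cassert>
#include <cstdint>
#include <filesystem>
#include <fstream>
#include <string>
#include <system_error>

namespace fs = std::filesystem;

// Hypothetical result type for this sketch. Pegasus has its own error
// codes (for example ERR_FILE_OPERATION_FAILED in the log above).
enum class open_status { ok, not_found, failed };

// Hypothetical helper: opens `path` for reading and fetches its size.
// Every failure is returned to the caller; nothing here aborts on a
// file that is missing or cannot be stat'ed.
open_status try_open_log(const std::string &path, std::ifstream &in, std::uintmax_t &size)
{
    assert(!path.empty()); // programming error, unlike a missing file

    in.open(path, std::ios::binary);
    if (!in.is_open()) {
        std::error_code ec;
        // Tell "file is gone" apart from other open failures.
        return fs::exists(path, ec) ? open_status::failed : open_status::not_found;
    }

    std::error_code ec;
    size = fs::file_size(path, ec); // non-throwing overload
    if (ec) {
        in.close();
        return open_status::failed; // report the error, do not abort
    }
    return open_status::ok;
}
```

With a helper shaped like this, a caller could treat `not_found` as "skip this file and look for the next one" and retry later, while still logging `failed` as an error. Whether that is the right recovery for duplication depends on why the file is missing in the first place.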