apache / incubator-pegasus

Apache Pegasus - A horizontally scalable, strongly consistent and high-performance key-value store
https://pegasus.apache.org/
Apache License 2.0

Better log failure handling instead of assert false #287

Closed: hycdong closed this issue 2 years ago

hycdong commented 5 years ago

Trigger scenario

coredump

(gdb) bt
#0 0x0000003f852328a5 in raise () from /lib64/libc.so.6
#1 0x0000003f85234085 in abort () from /lib64/libc.so.6
#2 0x00007f93ada6125e in dsn_coredump () at /home/heyuchen/split/pegasus/rdsn/src/core/core/service_api_c.cpp:73
#3 0x00007f93ad93caee in dsn::replication::replica_stub::handle_log_failure (this=<optimized out>, err=...) at /home/heyuchen/split/pegasus/rdsn/src/dist/replication/lib/replica_stub.cpp:1962
#4 0x00007f93ad98eef5 in dsn::replication::replica::on_append_log_completed (this=0x7f920d1eac60, mu=..., err=..., size=<optimized out>)
at /home/heyuchen/split/pegasus/rdsn/src/dist/replication/lib/replica_2pc.cpp:526
#5 0x00007f93ada5f5b8 in operator() (__args#1=<optimized out>, __args#0=..., this=<optimized out>) at /home/heyuchen/toolchain/output/include/c++/4.8.2/functional:2464
#6 dsn::aio_task::exec (this=<optimized out>) at /home/heyuchen/split/pegasus/rdsn/include/dsn/tool-api/task.h:597
#7 0x00007f93ada5d1f9 in dsn::task::exec_internal (this=this@entry=0x7f8cb6f11a88) at /home/heyuchen/split/pegasus/rdsn/src/core/core/task.cpp:180
#8 0x00007f93adab1d9d in dsn::task_worker::loop (this=0x2305f00) at /home/heyuchen/split/pegasus/rdsn/src/core/core/task_worker.cpp:211
#9 0x00007f93adab1f69 in dsn::task_worker::run_internal (this=0x2305f00) at /home/heyuchen/split/pegasus/rdsn/src/core/core/task_worker.cpp:191
#10 0x00007f93ab431600 in std::(anonymous namespace)::execute_native_thread_routine (__p=<optimized out>) at /home/heyuchen/toolchain/objdir/../gcc-4.8.2/libstdc++-v3/src/c++11/thread.cc:84
#11 0x0000003f85607851 in start_thread () from /lib64/libpthread.so.0
#12 0x0000003f852e811d in clone () from /lib64/libc.so.6
(gdb)

Related logs

E2019-02-20 18:30:54.767 (1550658654767145919 3211) replica.replica7.04050007005c7e26: native_aio_provider.linux.cpp:218:aio_internal(): io_submit error, ret = -11
E2019-02-20 18:30:54.767 (1550658654767175695 31ec) replica.default2.040100010017b3a7: mutation_log.cpp:193:operator()(): write shared log failed, err = ERR_FILE_OPERATION_FAILED
E2019-02-20 18:30:54.767 (1550658654767210218 31ef) replica.default5.04050001008ef578: mutation_log.cpp:457:operator()(): write private log failed, err = ERR_FILE_OPERATION_FAILED
E2019-02-20 18:30:54.767 (1550658654767285730 31ee) replica.default4.04050014007090e7: mutation_log.cpp:457:operator()(): write private log failed, err = ERR_FILE_OPERATION_FAILED
E2019-02-20 18:30:54.767 (1550658654767310415 31eb) replica.default1.040500170074149b: mutation_log.cpp:457:operator()(): write private log failed, err = ERR_FILE_OPERATION_FAILED
E2019-02-20 18:30:54.767 (1550658654767357562 31f0) replica.default6.040500150068bf23: mutation_log.cpp:457:operator()(): write private log failed, err = ERR_FILE_OPERATION_FAILED
E2019-02-20 18:30:54.767 (1550658654767400994 321e) replica.replica20.0405001400709062: native_aio_provider.linux.cpp:218:aio_internal(): io_submit error, ret = -11
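In these logs io_submit returns -11, which is -EAGAIN on Linux. The io_submit(2) man page describes EAGAIN as insufficient resources to queue the iocbs, a condition that can clear on its own. In the backtrace, however, the resulting ERR_FILE_OPERATION_FAILED goes through on_append_log_completed into handle_log_failure and ends in dsn_coredump. The sketch below is a hypothetical illustration of softer handling and does not reflect the change in the linked PR. It assumes a submit callback that, like libaio's io_submit, returns a negative errno on failure. It retries only -EAGAIN with bounded exponential backoff and hands every other error back to the caller instead of aborting. The name submit_with_retry and its parameters are invented for this example.

```cpp
#include <cassert>
#include <cerrno>
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical helper (not part of rdsn): calls `submit` until it returns
// something other than -EAGAIN, or until `max_attempts` calls have been made.
// `submit` is assumed to return >= 0 on success and a negative errno on
// failure, the same convention libaio's io_submit uses.
int submit_with_retry(const std::function<int()> &submit,
                      int max_attempts,
                      std::chrono::milliseconds initial_backoff)
{
    assert(max_attempts > 0);
    std::chrono::milliseconds backoff = initial_backoff;
    int ret = -EAGAIN;
    for (int attempt = 0; attempt < max_attempts; ++attempt) {
        ret = submit();
        if (ret != -EAGAIN) {
            // Success, or a non-transient error the caller must handle.
            return ret;
        }
        // Transient resource shortage: wait before the next attempt,
        // doubling the delay each time. Skip the sleep after the last try.
        if (attempt + 1 < max_attempts) {
            std::this_thread::sleep_for(backoff);
            backoff *= 2;
        }
    }
    // Still -EAGAIN after exhausting the retry budget. The caller decides
    // what to do (e.g. fail the write), rather than asserting here.
    return ret;
}
```

The design choice is to classify the error before reacting. A persistent or non-transient failure still reaches the caller, which can then mark the affected replica as failed instead of crashing the whole process.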

Thoughts on a solution

acelyc111 commented 2 years ago

https://github.com/XiaoMi/rdsn/pull/818