Tencent / Tendis

Tendis is a high-performance distributed storage system fully compatible with the Redis protocol.
http://tendis.cn

Tendis Crash 2.6.0-rocksdb-v6.23.3 #256

Closed githubname1024 closed 6 months ago

githubname1024 commented 6 months ago

Description

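Tendis instances in one of our clusters are crashing. Several different instances are affected, and every crash appears to be triggered by an unlink operation.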

Context

E1204 17:52:50.212414 41686 main.cpp:123] Failure: Aborted at 1701683570 (unix time) try "date -d @1701683570" if you are using GNU date
E1204 17:52:50.214108 41686 main.cpp:123] Failure:PC: @ 0x0 (unknown)
E1204 17:52:50.214155 41686 main.cpp:123] Failure: SIGSEGV (@0x734c6d6f) received by PID 22382 (TID 0x7fe2d247f700) from PID 1934388591; stack trace:
E1204 17:52:50.214543 41686 main.cpp:123] Failure: @ 0xb94656 google::(anonymous namespace)::FailureSignalHandler()
E1204 17:52:50.215852 41686 main.cpp:123] Failure: @ 0x7fe9e7f54630 (unknown)
E1204 17:52:50.216778 41686 main.cpp:123] Failure: @ 0x71fac1 tendisplus::RocksTxn::getKV()
E1204 17:52:50.217926 41686 main.cpp:123] Failure: @ 0x71f97d tendisplus::RocksKVStore::getKV()
E1204 17:52:50.218617 41686 main.cpp:123] Failure: @ 0x4c80dd tendisplus::Command::delKey()
E1204 17:52:50.219727 41686 main.cpp:123] Failure: @ 0x5451e5 _ZZN10tendisplus13UnlinkCommand3runEPNS_7SessionEENKUlS2_OSt6vectorISsSaISsEEOSt4listISt10unique_ptrINS_7KeyLockESt14default_deleteIS9_EESaISC_EEE_clES2_S6SF
E1204 17:52:50.220088 41686 main.cpp:123] Failure: @ 0xba506f execute_native_thread_routine
E1204 17:52:50.221326 41686 main.cpp:123] Failure: @ 0x7fe9e7f4cea5 start_thread
E1204 17:52:50.222783 41686 main.cpp:123] Failure: @ 0x7fe9e776bb0d __clone
E1204 17:52:50.224059 41686 main.cpp:123] Failure: @ 0x0 (unknown)

E1206 20:13:05.370237 42889 main.cpp:123] Failure: Aborted at 1701864785 (unix time) try "date -d @1701864785" if you are using GNU date
E1206 20:13:05.371451 42889 main.cpp:123] Failure:PC: @ 0x0 (unknown)
E1206 20:13:05.371500 42889 main.cpp:123] Failure: SIGSEGV (@0x0) received by PID 33399 (TID 0x7fd6060e2700) from PID 0; stack trace:
E1206 20:13:05.371872 42889 main.cpp:123] Failure: @ 0xb94656 google::(anonymous namespace)::FailureSignalHandler()
E1206 20:13:05.372869 42889 main.cpp:123] Failure: @ 0x7fd6720d5630 (unknown)
E1206 20:13:05.373888 42889 main.cpp:123] Failure: @ 0x713044 tendisplus::RocksTxn::getSnapshot()
E1206 20:13:05.374840 42889 main.cpp:123] Failure: @ 0x71fac7 tendisplus::RocksTxn::getKV()
E1206 20:13:05.376049 42889 main.cpp:123] Failure: @ 0x71f97d tendisplus::RocksKVStore::getKV()
E1206 20:13:05.376744 42889 main.cpp:123] Failure: @ 0x4c80dd tendisplus::Command::delKey()
E1206 20:13:05.377862 42889 main.cpp:123] Failure: @ 0x5451e5 _ZZN10tendisplus13UnlinkCommand3runEPNS_7SessionEENKUlS2_OSt6vectorISsSaISsEEOSt4listISt10unique_ptrINS_7KeyLockESt14default_deleteIS9_EESaISC_EEE_clES2_S6SF
E1206 20:13:05.378228 42889 main.cpp:123] Failure: @ 0xba506f execute_native_thread_routine
E1206 20:13:05.379171 42889 main.cpp:123] Failure: @ 0x7fd6720cdea5 start_thread
E1206 20:13:05.380297 42889 main.cpp:123] Failure: @ 0x7fd6718ecb0d __clone
E1206 20:13:05.381263 42889 main.cpp:123] Failure: @ 0x0 (unknown)

E1207 09:48:18.388058 25699 main.cpp:123] Failure: Aborted at 1701913698 (unix time) try "date -d @1701913698" if you are using GNU date
E1207 09:48:18.389228 25699 main.cpp:123] Failure:PC: @ 0x0 (unknown)
E1207 09:48:18.389278 25699 main.cpp:123] Failure: SIGSEGV (@0x0) received by PID 41769 (TID 0x7fe75fe7f700) from PID 0; stack trace:
E1207 09:48:18.390134 25699 main.cpp:123] Failure: @ 0xb94656 google::(anonymous namespace)::FailureSignalHandler()
E1207 09:48:18.391121 25699 main.cpp:123] Failure: @ 0x7fed052a8630 (unknown)
E1207 09:48:18.392452 25699 main.cpp:123] Failure: @ 0x714eba tendisplus::RocksTxn::del()
E1207 09:48:18.393025 25699 main.cpp:123] Failure: @ 0x71be61 tendisplus::RocksTxn::delKV()
E1207 09:48:18.393832 25699 main.cpp:123] Failure: @ 0x7141cc tendisplus::RocksKVStore::delKV()
E1207 09:48:18.394538 25699 main.cpp:123] Failure: @ 0x4c6cda tendisplus::Command::partialDelSubKeys()
E1207 09:48:18.395579 25699 main.cpp:123] Failure: @ 0x4c7e87 tendisplus::Command::delKeyOptimismInLock()
E1207 09:48:18.396593 25699 main.cpp:123] Failure: @ 0x4c81ec tendisplus::Command::delKey()
E1207 09:48:18.398151 25699 main.cpp:123] Failure: @ 0x5451e5 _ZZN10tendisplus13UnlinkCommand3runEPNS_7SessionEENKUlS2_OSt6vectorISsSaISsEEOSt4listISt10unique_ptrINS_7KeyLockESt14default_deleteIS9_EESaISC_EEE_clES2_S6SF
E1207 09:48:18.398582 25699 main.cpp:123] Failure: @ 0xba506f execute_native_thread_routine
E1207 09:48:18.399503 25699 main.cpp:123] Failure: @ 0x7fed052a0ea5 start_thread
E1207 09:48:18.400653 25699 main.cpp:123] Failure: @ 0x7fed04abfb0d __clone
E1207 09:48:18.401593 25699 main.cpp:123] Failure: @ 0x0 (unknown)

Your Environment

rocks.cache_index_and_filter_blocks:1
rocks.max_open_files:-1
binlog-save-logs:no
binlog-send-bytes:16777216
binlog-enabled:yes
binlog-using-defaultCF:no
minBinlogKeepSec:3600
maxbinlogkeepnum:1
slaveBinlogKeepNum:1
binlogFileSizeMB:64
binlogFileSecs:1200
binlog-send-batch:256
executorthreadnum:16
netiothreadnum:3
rocks.compress_type:lz4
element-limit-for-single-delete:2048
element-limit-for-single-delete-zset:1024
executorworkpoolsize:8

Tendis crashed while running. Could you advise on how to analyze this, or point out where the problem lies? Thanks.

chenshi2023 commented 6 months ago

I am running into a similar problem and would appreciate a solution.

raffertyyu commented 6 months ago

Could you briefly describe the load and the commands in use? Also, has the rocks-transaction-mode setting been changed from its default?

githubname1024 commented 6 months ago

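@raffertyyu Hello. The load is not high. We use redis-shake to sync data from Redis to Tendis, which is why unlink commands are being issued. rocks-transaction-mode is left at its default value. The stack trace captured with gdb from the most recent crash is attached; please take a look. Thanks. bt.txt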

takenliu commented 6 months ago

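The cause is a bug in the implementation of the unlink command. The precondition is a complex data structure: if the number of fields exceeds 1024, a separate thread is started to perform the unlink, and at that point there is a fairly high chance of hitting a bug in which the transaction is null. This will be fixed in the next release.

In the current version, you can work around it by mapping the unlink command to the del command in the configuration: "mapping-command unlink del". The del command uses the deleteRange interface when the number of fields exceeds 1024, and Tendis serves requests with multiple threads, so switching to del will not stall the server.

The code in question: 8c35d0361d5ed62b691b929fee5f28cf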
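As a rough illustration of the class of bug described above, the following C++ sketch contrasts a worker thread that borrows a transaction pointer from its caller with one that creates its own. It is not the Tendis source: `Txn`, `Session`, `unlinkBorrowingTxn`, and `unlinkOwningTxn` are hypothetical names invented for this example, and the real cause should be read from the code the maintainer linked.

```cpp
#include <cassert>
#include <cstddef>
#include <future>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a storage transaction; NOT the Tendis RocksTxn API.
struct Txn {
    std::size_t deleted = 0;
    void del(const std::string& /*subkey*/) { ++deleted; }
};

// Hypothetical stand-in for a session that owns at most one transaction.
struct Session {
    std::unique_ptr<Txn> txn;
};

// Hazard pattern: the worker thread borrows a raw pointer owned elsewhere.
// If no transaction exists when the worker runs, the pointer is null and an
// unchecked dereference would raise SIGSEGV. The explicit check below turns
// that crash into a reported failure.
bool unlinkBorrowingTxn(Txn* borrowed, const std::vector<std::string>& subkeys) {
    auto worker = std::async(std::launch::async, [borrowed, &subkeys] {
        if (borrowed == nullptr) {
            return false;  // would have been a null dereference
        }
        for (const auto& key : subkeys) {
            borrowed->del(key);
        }
        return true;
    });
    return worker.get();  // wait for the worker before the borrowed data goes away
}

// Safer pattern: the worker creates and owns its transaction, so its lifetime
// cannot be cut short by the thread that scheduled the work.
std::size_t unlinkOwningTxn(std::vector<std::string> subkeys) {
    auto worker = std::async(std::launch::async, [keys = std::move(subkeys)] {
        Txn txn;
        for (const auto& key : keys) {
            txn.del(key);
        }
        assert(txn.deleted == keys.size());
        return txn.deleted;
    });
    return worker.get();
}
```

Calling `unlinkBorrowingTxn` with a null pointer returns `false` instead of crashing, while `unlinkOwningTxn` never depends on state held by the scheduling thread. Whether the upstream fix takes either of these shapes is not stated in this thread.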