dragonflydb / dragonfly

A modern replacement for Redis and Memcached
https://www.dragonflydb.io/

Dragonfly (rarely) crashes on connection termination during migration #3139

Closed BorysTheDev closed 3 weeks ago

BorysTheDev commented 1 month ago

https://github.com/dragonflydb/dragonfly/actions/runs/9383656283/job/25837782989

30001➜ 12:45:11.228849 21502 fibers.cc:15] Check failed: !IsJoinable()
30001➜ Check failure stack trace:
30001➜ 12:45:11.228855 21503 fibers.cc:15] Check failed: !IsJoinable() F20240605 12:45:11.228960 21504 fibers.cc:15] Check failed: !IsJoinable()
30001➜ SIGABRT received at time=1717591511 on cpu 0
30001➜ Check failure stack trace:
30001➜ 12:45:11.228855 21503 fibers.cc:15] Check failed: !IsJoinable() F20240605 12:45:11.228960 21504 fibers.cc:15] Check failed: !IsJoinable()
30001➜ Check failure stack trace:
30001➜ : 345] RAW: Signal 6 raised at PC=0xffffb69aad78 while already in AbslFailureSignalHandler()
30001➜ @ 0xffffb69aad78 (unknown) raise
30001➜ @ 0xaaaacddb06c4 480 absl::lts20240116::AbslFailureSignalHandler()
30001➜ @ 0xffffb71d18bc 4960 (unknown)
30001➜ @ 0xffffb6997aac 304 abort
30001➜ @ 0xaaaacdd62224 336 google::DumpStackTraceAndExit()
30001➜ @ 0xaaaacdd55f24 192 google::LogMessage::Fail()
30001➜ : 345] RAW: Signal 6 raised at PC=0xffffb69aad78 while already in AbslFailureSignalHandler()
30001➜ @ 0xaaaacdd5c844 16 google::LogMessage::SendToLog()
30001➜ @ 0xaaaacdd55928 208 google::LogMessage::Flush()
30001➜ @ 0xaaaacdd57264 80 google::LogMessageFatal::~LogMessageFatal()
30001➜ @ 0xaaaacdcdbce4 16 util::fb2::Fiber::~Fiber()
30001➜ @ 0xaaaacd8b3890 144 dfly::RestoreStreamer::~RestoreStreamer()
30001➜ @ 0xaaaacd7c0cd8 32 dfly::cluster::OutgoingMigration::SliceSlotMigration::~SliceSlotMigration()
30001➜ @ 0xaaaacd7c04a0 32 util::fb2::detail::WorkerFiberImpl<>::run()
30001➜ @ 0xaaaacd7c07b8 288 boost::context::detail::fiber_entry<>()
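
The trace ends in util::fb2::Fiber::~Fiber(), reached from dfly::RestoreStreamer::~RestoreStreamer(), with the check !IsJoinable() failing. That suggests a fiber object was destroyed while it was still joinable. std::thread has an analogous rule: destroying a joinable thread calls std::terminate(). Below is a minimal sketch of that failure class and one common guard, using std::thread as a stand-in for the fiber. The Streamer class and DestroyWhileRunning() are hypothetical names for illustration; this is not Dragonfly's code and not necessarily how the issue was fixed.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical stand-in for a class that owns a background worker, in the
// spirit of the destructor chain in the trace. Not Dragonfly code.
class Streamer {
 public:
  Streamer() = default;
  Streamer(const Streamer&) = delete;
  Streamer& operator=(const Streamer&) = delete;

  // Start the background worker. std::thread stands in for the fiber.
  void Start() {
    assert(!worker_.joinable());  // starting twice would orphan a worker
    cancelled_.store(false);
    worker_ = std::thread([this] {
      // Spin until cancellation is requested.
      while (!cancelled_.load()) {
        std::this_thread::yield();
      }
    });
  }

  // Ask the worker to stop and wait for it. Safe to call more than once.
  void Cancel() {
    cancelled_.store(true);
    if (worker_.joinable()) {
      worker_.join();
    }
  }

  bool IsJoinable() const { return worker_.joinable(); }

  // Without this Cancel() call, destroying a Streamer whose worker is still
  // joinable would reach std::thread::~thread() and call std::terminate().
  ~Streamer() { Cancel(); }

 private:
  std::atomic<bool> cancelled_{false};
  std::thread worker_;
};

// Error-path scenario: the owner goes out of scope without an explicit
// Cancel(), as could happen when a connection drops mid-transfer.
// Returns true if destruction completed without aborting.
bool DestroyWhileRunning() {
  {
    Streamer s;
    s.Start();
    // No explicit Cancel() here; the destructor has to handle it.
  }
  return true;
}
```

The design choice in this sketch is to make the destructor responsible for joining, so every exit path (including error paths taken when a connection is dropped) leaves the worker non-joinable before it is destroyed. Whether joining in the destructor is appropriate for a fiber-based codebase is a separate question; the alternative is to guarantee that every path calls the cancel/join step explicitly before destruction.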

chakaz commented 4 weeks ago

I think that this crash was not during shutdown, because the test logs show:

2024-06-05T12:53:41.2202706Z [2024-06-05 12:45:10.060 DEBUG] Start migration
2024-06-05T12:53:41.2207098Z [2024-06-05 12:45:10.060 DEBUG] Pushing config [{"slot_ranges": [{"start": 0, "end": 16383}], "master": {"id": "9aaf945f2fc4c333f9ce7ea4e4f5614e8c347366", "ip": "127.0.0.1", "port": 30001}, "replicas": [], "migrations": [{"slot_ranges": [{"start": 0, "end": 16383}], "node_id": "5bb7e6d57f0c70ee42af26ad5748046197a98940", "ip": "127.0.0.1", "port": 1111}]}, {"slot_ranges": [], "master": {"id": "5bb7e6d57f0c70ee42af26ad5748046197a98940", "ip": "127.0.0.1", "port": 30002}, "replicas": [], "migrations": []}]
2024-06-05T12:53:41.2211255Z [2024-06-05 12:45:10.213 DEBUG] drop connections
2024-06-05T12:53:41.2213148Z [2024-06-05 12:45:10.220 DEBUG] ['out 5bb7e6d57f0c70ee42af26ad5748046197a98940 SYNC keys:27199 errors: 0']
2024-06-05T12:53:41.2214619Z [2024-06-05 12:45:10.671 DEBUG] drop connections
2024-06-05T12:53:41.2216502Z [2024-06-05 12:45:10.672 DEBUG] ['out 5bb7e6d57f0c70ee42af26ad5748046197a98940 CONNECTING keys:27199 errors: Software caused connection abort']
2024-06-05T12:53:41.2218376Z [2024-06-05 12:45:10.722 DEBUG] drop connections
2024-06-05T12:53:41.2220222Z [2024-06-05 12:45:10.723 DEBUG] ['out 5bb7e6d57f0c70ee42af26ad5748046197a98940 CONNECTING keys:27199 errors: Software caused connection abort']
2024-06-05T12:53:41.2221990Z [2024-06-05 12:45:10.924 DEBUG] drop connections
2024-06-05T12:53:41.2223819Z [2024-06-05 12:45:10.924 DEBUG] ['out 5bb7e6d57f0c70ee42af26ad5748046197a98940 CONNECTING keys:27199 errors: Software caused connection abort']
2024-06-05T12:53:41.2225585Z [2024-06-05 12:45:11.229 DEBUG] drop connections

We see fewer than 10 prints, and the line "remove finished migrations" is not printed, so the test should not yet be shutting down.