dragonflydb / dragonfly

A modern replacement for Redis and Memcached
https://www.dragonflydb.io/

Crash when rename set #3107

Closed fernandomacho closed 4 months ago

fernandomacho commented 5 months ago

I have 3 nodes running Dragonfly: 1 master and 2 replicas. Access to the Dragonfly instances goes through HAProxy with 2 backends: 1 for writes (master) and 1 for reads (master/replica).

The kernel version is 5.15.0-73-generic on all servers.

The hardware is:
Master: 96 cores and 376 GB RAM
Replica 1: 48 cores and 256 GB RAM
Replica 2: 48 cores and 256 GB RAM

Dragonfly version: v1.18.1

When I rename a set key, the server crashes. I tried to reproduce this error on a standalone Dragonfly instance and was able to.

The syslog shows this error:

May 31 10:25:03 XXXXXX dragonfly[360266]: F20240531 10:25:03.987241 360279 compact_object.cc:868] Check failed: 0U == u_.r_obj.type() (0 vs. 2)
May 31 10:25:03 XXXXXX dragonfly[360266]: *** Check failure stack trace: ***
May 31 10:25:03 XXXXXX dragonfly[360266]: @ 0x5608ddbcf343 google::LogMessage::SendToLog()
May 31 10:25:03 XXXXXX dragonfly[360266]: @ 0x5608ddbc7b07 google::LogMessage::Flush()
May 31 10:25:03 XXXXXX dragonfly[360266]: @ 0x5608ddbc948f google::LogMessageFatal::~LogMessageFatal()
May 31 10:25:03 XXXXXX dragonfly[360266]: @ 0x5608dd4d4cd7 dfly::CompactObj::GetSlice()
May 31 10:25:03 XXXXXX dragonfly[360266]: @ 0x5608dd1ce3aa dfly::(anonymous namespace)::Renamer::UpdateDest()
May 31 10:25:03 XXXXXX dragonfly[360266]: @ 0x5608dd1ce9c0 _ZN4absl12lts_2024011619functional_internal12InvokeObjectIZN4dfly12_GLOBAL__N_17Renamer8FinalizeEPNS3_11TransactionEbEUlS7_PNS3_11EngineShardEE0_NS6_14RunnableResultEJS7_S9_EEET0_NS1_7VoidPtrEDpNS1_8ForwardTIT1_E4typeE
May 31 10:25:03 XXXXXX dragonfly[360266]: @ 0x5608dd4b24ca dfly::Transaction::RunCallback()
May 31 10:25:03 XXXXXX dragonfly[360266]: @ 0x5608dd4b4eeb dfly::Transaction::RunInShard()
May 31 10:25:03 XXXXXX dragonfly[360266]: @ 0x5608dd3bc6a2 dfly::EngineShard::PollExecution()
May 31 10:25:03 XXXXXX dragonfly[360266]: @ 0x5608dd4addc0 _ZNSt17_Function_handlerIFvvEZN4dfly11Transaction11DispatchHopEvEUlvE1_E9_M_invokeERKSt9_Any_data
May 31 10:25:03 XXXXXX dragonfly[360266]: @ 0x5608dd4edb2b dfly::TaskQueue::TaskLoop()
May 31 10:25:03 XXXXXX dragonfly[360266]: @ 0x5608dd4eddf0 _ZN5boost7context6detail11fiber_entryINS1_12fiber_recordINS0_5fiberEN4util3fb219FixedStackAllocatorEZNS6_6detail15WorkerFiberImplIZN4dfly9TaskQueue5StartESt17basic_string_viewIcSt11char_traitsIcEEEUlvE_JEEC4IS7_EESF_RKNS0_12preallocatedEOT_OSG_EUlOS4_E_EEEEvNS1_10transfer_tE
May 31 10:25:03 XXXXXX dragonfly[360266]: @ 0x5608dd9dd67f make_fcontext
May 31 10:25:03 XXXXXX dragonfly[360266]: *** SIGABRT received at time=1717143903 on cpu 3 ***
May 31 10:25:03 XXXXXX dragonfly[360266]: [symbolize_elf.inc : 1383] RAW: /usr/lib/x86_64-linux-gnu/libc-2.31.so (deleted): open failed: errno=2
May 31 10:25:03 XXXXXX dragonfly[360266]: PC: @ 0x7f652446d00b (unknown) (unknown)
May 31 10:25:04 XXXXXX systemd[1]: dragonfly.service: Main process exited, code=killed, status=6/ABRT
May 31 10:25:04 XXXXXX systemd[1]: dragonfly.service: Failed with result 'signal'.
May 31 10:25:04 XXXXXX systemd[1]: dragonfly.service: Scheduled restart job, restart counter is at 3.
May 31 10:25:04 XXXXXX systemd[1]: Stopped Modern and fast key-value store.
May 31 10:25:04 XXXXXX systemd[1]: Started Modern and fast key-value store.
May 31 10:25:04 XXXXXX dragonfly[1753712]: * Logs will be written to the first available of the following paths:
May 31 10:25:04 XXXXXX dragonfly[1753712]: /var/log/dragonfly/dragonfly.*
May 31 10:25:04 XXXXXX dragonfly[1753712]: * For the available flags type dragonfly [--help | --helpfull]
May 31 10:25:04 XXXXXX dragonfly[1753712]: * Documentation can be found at: https://www.dragonflydb.io/docs

I have attached the Dragonfly logs: crash_2025_05_31.zip

kostasrim commented 5 months ago

Hi @fernandomacho

Thank you for reporting this.

I tried a few scenarios (creating a set with SADD and then renaming it, or even renaming the command itself, and all of them worked), but I couldn't reproduce the crash. Could you give me the command sequence that led to it?
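For reference, the first scenario mentioned above (create a set with SADD, then RENAME it) could be scripted roughly as follows. This is only a sketch: the helper name, the key names and the member values are made up for illustration, and it is not the reporter's actual sequence. The client is assumed to expose sadd, rename and smembers the way a redis-py client does.

```python
def rename_set(client, src="myset", dst="myset_renamed"):
    """Create a small set under `src`, rename it to `dst`, return its members.

    `client` is any object with redis-py style sadd/rename/smembers methods.
    The key names and member values are placeholders, not from the report.
    """
    client.sadd(src, "a", "b", "c")   # SADD myset a b c
    client.rename(src, dst)           # RENAME myset myset_renamed
    return client.smembers(dst)       # SMEMBERS myset_renamed
```

With redis-py installed, this could be pointed at a test instance with something like rename_set(redis.Redis(host="localhost", port=6379)), where the host and port are placeholders. Running it against a master that has replicas attached is probably the more interesting case.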

romange commented 5 months ago

Based on the code, it should be really simple to reproduce with a replicated Dragonfly setup. The bug seems to be inside the if (es->journal()) { block in the Renamer::UpdateDest function; it looks like we never implemented the correct behavior for journaling non-string data types. @adiholden, maybe we can use the DUMP command, similarly to slot migration? Assigning to you for further triaging.
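To illustrate the failure mode, here is a toy Python model, not Dragonfly's actual code. It assumes the journaling branch reads the renamed value as a string, which is what the GetSlice() frame and the "(0 vs. 2)" check in the stack trace suggest (in Redis type codes, 0 is a string and 2 is a set). All class and function names below are hypothetical, and the per-type branch is only one possible shape of a fix; the DUMP-based approach mentioned above would serialize the value generically instead.

```python
OBJ_STRING = 0  # Redis object type code for strings
OBJ_SET = 2     # Redis object type code for sets, the "2" in "(0 vs. 2)"


class CompactObj:
    """Toy stand-in for a type-tagged value; not Dragonfly's real class."""

    def __init__(self, obj_type, payload):
        self.obj_type = obj_type
        self.payload = payload

    def get_slice(self):
        # Mirrors the failing check: only string values may be read as a slice.
        if self.obj_type != OBJ_STRING:
            raise TypeError(
                f"Check failed: 0 == type ({OBJ_STRING} vs. {self.obj_type})"
            )
        return self.payload


def journal_rename_naive(dest_key, value, journal):
    # Hypothetical buggy path: always journals the value as a string SET.
    journal.append(("SET", dest_key, value.get_slice()))


def journal_rename_by_type(dest_key, value, journal):
    # Hypothetical fix: choose a journal entry that matches the value's type.
    if value.obj_type == OBJ_STRING:
        journal.append(("SET", dest_key, value.get_slice()))
    elif value.obj_type == OBJ_SET:
        journal.append(("SADD", dest_key, *sorted(value.payload)))
    else:
        raise NotImplementedError(f"unsupported type {value.obj_type}")
```

In this model, the naive path raises as soon as the renamed value is a set, while the type-aware path records an SADD entry. It would also explain why plain SADD followed by RENAME works when the journaling branch is never taken.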