antirez / disque

Disque is a distributed message broker
BSD 3-Clause "New" or "Revised" License

Crash on network partitions #97

Closed by aphyr 9 years ago

aphyr commented 9 years ago

A network partition test (available in Jepsen at a1938c734460f9180ab68177aafc299fb6a09f36) seems to reliably segfault Disque:

=== DISQUE BUG REPORT START: Cut & paste starting from here ===
12277:P 02 Jul 16:24:41.683 #     Disque 0.0.1 crashed by signal: 11
12277:P 02 Jul 16:24:41.683 #     Failed assertion: <no assertion failed> (<no file>:0)
12277:P 02 Jul 16:24:41.683 # --- STACK TRACE
/opt/disque/src/disque-server *:7711(logStackTrace+0x75)[0x425925]
/opt/disque/src/disque-server *:7711(gotAckReceived+0xa2)[0x431642]
/lib/x86_64-linux-gnu/libpthread.so.0(+0xf8d0)[0x7fd655d888d0]
/opt/disque/src/disque-server *:7711(gotAckReceived+0xa2)[0x431642]
/opt/disque/src/disque-server *:7711(clusterProcessPacket+0x8c3)[0x428b93]
/opt/disque/src/disque-server *:7711(clusterReadHandler+0x83)[0x428f13]
/opt/disque/src/disque-server *:7711(aeProcessEvents+0x133)[0x410083]
/opt/disque/src/disque-server *:7711(aeMain+0x2b)[0x41039b]
/opt/disque/src/disque-server *:7711(main+0x302)[0x40f192]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5)[0x7fd6559f1b45]
/opt/disque/src/disque-server *:7711[0x40f2a9]
12277:P 02 Jul 16:24:41.683 # --- INFO OUTPUT
12277:P 02 Jul 16:24:41.683 # # Server
disque_version:0.0.1
disque_git_sha1:5df8e1d7
disque_git_dirty:0
disque_build_id:284a4b8aa08564b1
os:Linux 3.16.0-4-amd64 x86_64
arch_bits:64
multiplexing_api:epoll
gcc_version:4.9.2
process_id:12277
run_id:704c6ed80ac68eccd052fd628cae460ed8a6ff6b
tcp_port:7711
uptime_in_seconds:71
uptime_in_days:0
hz:10
config_file:/opt/disque/disque.conf

# Clients
connected_clients:1
client_longest_output_list:0
client_biggest_input_buf:0
blocked_clients:0

# Memory
used_memory:740848
used_memory_human:723.48K
used_memory_rss:3731456
used_memory_peak:777200
used_memory_peak_human:758.98K
mem_fragmentation_ratio:5.04
mem_allocator:jemalloc-3.6.0

# Jobs
registered_jobs:14

# Queues
registered_queues:1

# Persistence
loading:0
aof_enabled:1
aof_rewrite_in_progress:0
aof_rewrite_scheduled:0
aof_last_rewrite_time_sec:-1
aof_current_rewrite_time_sec:-1
aof_last_bgrewrite_status:ok
aof_last_write_status:ok
aof_current_size:21824
aof_base_size:0
aof_pending_rewrite:0
aof_buffer_length:0
aof_rewrite_buffer_length:0
aof_pending_bio_fsync:0
aof_delayed_fsync:0

# Stats
total_connections_received:2
total_commands_processed:93
instantaneous_ops_per_sec:1
total_net_input_bytes:7331
total_net_output_bytes:4443
instantaneous_input_kbps:0.09
instantaneous_output_kbps:0.05
rejected_connections:0
latest_fork_usec:0

# CPU
used_cpu_sys:0.72
used_cpu_user:0.37
used_cpu_sys_children:0.00
used_cpu_user_children:0.00

# Commandstats
cmdstat_cluster:calls=1,usec=12,usec_per_call=12.00
cmdstat_addjob:calls=37,usec=531,usec_per_call=14.35
cmdstat_getjob:calls=30,usec=424,usec_per_call=14.13
cmdstat_ackjob:calls=25,usec=241,usec_per_call=9.64
hash_init_value: 1436079788

12277:P 02 Jul 16:24:41.683 # --- CLIENT LIST OUTPUT
12277:P 02 Jul 16:24:41.683 # id=2 addr=192.168.122.1:41741 fd=18 name= age=67 idle=1 flags=N qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=ackjob

12277:P 02 Jul 16:24:41.683 # --- REGISTERS
12277:P 02 Jul 16:24:41.683 # 
RAX:0000000000000000 RBX:00007fd655014380
RCX:00007fd6550117a0 RDX:00000000000000e8
RDI:0000000000000000 RSI:00007fd65505d408
RBP:00007fd65505d3e0 RSP:00007fffdf310160
R8 :00007fd65505d408 R9 :00000000fffffff7
R10:000000298a2b10d0 R11:00007fd655b3ac70
R12:0000000000000001 R13:0000000000000000
R14:0000000000000003 R15:00007fd65505b850
RIP:0000000000431642 EFL:0000000000010246
CSGSFS:0000000000000033
12277:P 02 Jul 16:24:41.683 # (00007fffdf31016f) -> 00007fd65509e8a8
12277:P 02 Jul 16:24:41.683 # (00007fffdf31016e) -> 00007fd6550d05e0
12277:P 02 Jul 16:24:41.683 # (00007fffdf31016d) -> 0000000000416d31
12277:P 02 Jul 16:24:41.683 # (00007fffdf31016c) -> 0000000000000008
12277:P 02 Jul 16:24:41.683 # (00007fffdf31016b) -> 0000000000000008
12277:P 02 Jul 16:24:41.683 # (00007fffdf31016a) -> 0000000000000128
12277:P 02 Jul 16:24:41.683 # (00007fffdf310169) -> 0000000000418471
12277:P 02 Jul 16:24:41.683 # (00007fffdf310168) -> 00007fd65505f140
12277:P 02 Jul 16:24:41.683 # (00007fffdf310167) -> 0000000000000008
12277:P 02 Jul 16:24:41.683 # (00007fffdf310166) -> 0000000000000131
12277:P 02 Jul 16:24:41.683 # (00007fffdf310165) -> 0000000000428b93
12277:P 02 Jul 16:24:41.683 # (00007fffdf310164) -> 00007fd65505f148
12277:P 02 Jul 16:24:41.683 # (00007fffdf310163) -> 00007fd65505f158
12277:P 02 Jul 16:24:41.683 # (00007fffdf310162) -> 000000000000000a
12277:P 02 Jul 16:24:41.683 # (00007fffdf310161) -> 00007fd65505d3e0
12277:P 02 Jul 16:24:41.683 # (00007fffdf310160) -> 0000000000000001
12277:P 02 Jul 16:24:41.683 # --- FAST MEMORY TEST
12277:P 02 Jul 16:24:41.683 # Bio thread for job type #0 terminated
12277:P 02 Jul 16:24:41.684 # Bio thread for job type #1 terminated
12277:P 02 Jul 16:24:41.927 # Fast memory test PASSED, however your memory can still be broken. Please run a memory test for several hours if possible.
12277:P 02 Jul 16:24:41.927 # 
=== DISQUE BUG REPORT END. Make sure to include from START to END. ===

       Please report the crash by opening an issue on github:

           http://github.com/antirez/disque/issues

  Suspect RAM error? Use disque-server --test-memory to verify it.
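
For anyone triaging a report like the one above, the STACK TRACE section can be reduced to a list of symbols and addresses before feeding them to a symbolizer. The following is a minimal sketch, assuming frames always have the `binary[ listen-address](symbol+offset)[address]` shape seen in this report; the name `parse_frames` and the regular expression are illustrative and not part of Disque or Jepsen.

```python
import re

# Matches frames in the shape seen in this report, for example:
#   /opt/disque/src/disque-server *:7711(gotAckReceived+0xa2)[0x431642]
#   /lib/x86_64-linux-gnu/libpthread.so.0(+0xf8d0)[0x7fd655d888d0]
#   /opt/disque/src/disque-server *:7711[0x40f2a9]
# Assumption: other builds or platforms may format frames differently.
FRAME_RE = re.compile(
    r"^(?P<binary>[^\s(\[]+)"                 # path of the binary or library
    r"(?: [^\s(\[]+)?"                        # optional process title suffix such as "*:7711"
    r"(?:\((?P<symbol>[^+)]*)\+(?P<offset>0x[0-9a-fA-F]+)\))?"  # optional (symbol+offset)
    r"\[(?P<address>0x[0-9a-fA-F]+)\]$"       # absolute address
)


def parse_frames(report):
    """Return one dict per stack frame line found in the report text."""
    frames = []
    for line in report.splitlines():
        match = FRAME_RE.match(line.strip())
        if match is None:
            continue  # log lines, INFO fields, registers, and so on
        offset = match.group("offset")
        frames.append({
            "binary": match.group("binary"),
            "symbol": match.group("symbol") or None,  # empty when the symbol is unknown
            "offset": int(offset, 16) if offset else None,
            "address": int(match.group("address"), 16),
        })
    return frames


if __name__ == "__main__":
    sample = (
        "/opt/disque/src/disque-server *:7711(gotAckReceived+0xa2)[0x431642]\n"
        "/opt/disque/src/disque-server *:7711(clusterProcessPacket+0x8c3)[0x428b93]\n"
    )
    for frame in parse_frames(sample):
        print(frame["symbol"], hex(frame["address"]))
```

Lines that are not frames (timestamps, INFO fields, register dumps) simply fail to match and are skipped, so the whole report can be passed in unmodified.
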
aphyr commented 9 years ago

Spoke too soon; looks like you might have fixed this between 5df8e1d7838d7bea0bd9cf187922a1469d1bb252 and f00dd0704128707f7a5effccd5837d796f2c01e3 :)

sunheehnus commented 9 years ago

Hi @aphyr, the bug was already fixed in https://github.com/antirez/disque/commit/53ef7b7eb3440c8fdf02e989e7bc83c82a7262c5 :-)

antirez commented 9 years ago

Pretty amazing that this was reproducible with Jepsen. The bug literally tortured us for weeks! And indeed it involved a specific sequence of messages, in an order that is hard to get when the network behaves reliably.
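
To illustrate why an ordering-dependent bug can stay hidden on a well-behaved network, here is a toy sketch that replays every permutation of a short message sequence against a deliberately fragile handler. Everything in it is hypothetical: `ToyNode`, the `ADD`/`ACK`/`DEL` message kinds, and `failing_orders` are invented for illustration and do not reproduce Disque's cluster protocol, the actual bug, or its fix.

```python
from itertools import permutations


class ToyNode:
    """Hypothetical node state; not Disque's data model."""

    def __init__(self):
        self.jobs = {}

    def handle(self, message):
        kind, job_id = message
        if kind == "ADD":
            self.jobs[job_id] = {"acks": 0}
        elif kind == "ACK":
            # Fragile on purpose: assumes the job is still registered.
            self.jobs[job_id]["acks"] += 1
        elif kind == "DEL":
            self.jobs.pop(job_id, None)


def failing_orders(messages):
    """Return every delivery order that makes the toy handler raise KeyError."""
    failures = []
    for order in permutations(messages):
        node = ToyNode()  # fresh state for each delivery order
        try:
            for message in order:
                node.handle(message)
        except KeyError:
            failures.append(order)
    return failures


if __name__ == "__main__":
    messages = [("ADD", "j1"), ("ACK", "j1"), ("DEL", "j1")]
    for order in failing_orders(messages):
        print(" -> ".join(kind for kind, _ in order))
```

The in-order delivery never fails here, which is the point: only reordered or delayed deliveries, the kind a partition test provokes, reach the unguarded lookup.
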