antirez / disque

Disque is a distributed message broker
BSD 3-Clause "New" or "Revised" License

Crash when running two servers with replicate=1 and memory is full #148

Closed mathieulongtin closed 8 years ago

mathieulongtin commented 8 years ago

I got my first crash, and I can reliably recreate this:

  1. Start two disque-server instances and join them with cluster meet
  2. Connect to one of them (cluster meet...)
  3. Add jobs like this: addjob q1 PAYLOAD replicate=1. I used a 10KB payload.
  4. Keep adding jobs until memory is full; only one of the servers fills up
  5. When the server that is full runs out of memory, it crashes
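
The steps above can be sketched as a small script. This is a hedged illustration, not the reporter's actual test: it assumes the second node listens on port 7771 (only 7770 appears in the log), that the servers speak the Redis wire protocol (RESP), and that the job is added with the `ADDJOB <queue> <job> <ms-timeout> REPLICATE <count>` form from the Disque README, with a timeout of 0 chosen arbitrarily. The helper names `encode_command`, `send_command`, and `reproduce` are made up for this example.

```python
import socket


def encode_command(*args):
    """Encode a command as a RESP array of bulk strings."""
    out = [b"*%d\r\n" % len(args)]
    for arg in args:
        data = arg if isinstance(arg, bytes) else str(arg).encode()
        out.append(b"$%d\r\n%s\r\n" % (len(data), data))
    return b"".join(out)


def send_command(sock, *args):
    """Send one command and return the first chunk of the reply."""
    sock.sendall(encode_command(*args))
    return sock.recv(65536)


def reproduce(host="127.0.0.1", port_a=7770, port_b=7771,
              payload_size=10 * 1024):
    """Add 10KB jobs to node A until it replies with an error or goes away.

    port_b and the ADDJOB timeout of 0 are assumptions for illustration.
    Returns (jobs_added, last_reply); last_reply is b"" if the connection
    was closed or reset.
    """
    payload = b"x" * payload_size
    added = 0
    with socket.create_connection((host, port_a)) as sock:
        try:
            # Steps 1-2: join the two nodes from the node we are connected to.
            send_command(sock, "CLUSTER", "MEET", host, port_b)
            # Steps 3-4: keep adding jobs with a replication factor of 1.
            while True:
                reply = send_command(
                    sock, "ADDJOB", "q1", payload, 0, "REPLICATE", 1)
                if not reply or reply.startswith(b"-"):
                    return added, reply
                added += 1
        except ConnectionError:
            # Step 5: the server process died while we were talking to it.
            return added, b""
```

With both servers running and a small maxmemory configured, calling `reproduce()` should return once the first node either rejects a job or drops the connection.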

I first observed it with a 1GB maxmemory, then again with 100MB, so it's not the physical server's memory limit.

If I don't specify replicate, it works as expected: when the servers run out of memory, I get this error: NOREPL Timeout reached before replicating to the requested number of nodes
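
To tell the two outcomes apart in a test client, the raw reply can be classified. This is a minimal sketch assuming the server uses RESP framing, where an error reply starts with `-` followed by an error code such as NOREPL; the function name `classify_reply` is hypothetical.

```python
def classify_reply(reply):
    """Classify a raw reply from the server.

    Returns ("disconnected", None) when nothing came back (for example,
    the server crashed), ("error", code) for an error reply, and
    ("ok", None) otherwise.
    """
    if not reply:
        return ("disconnected", None)
    if reply.startswith(b"-"):
        # The error line looks like "-CODE human readable message\r\n".
        line = reply[1:].split(b"\r\n", 1)[0].decode()
        code, _, _message = line.partition(" ")
        return ("error", code)
    return ("ok", None)
```

An `("error", "NOREPL")` result corresponds to the expected behaviour described above, while `("disconnected", None)` corresponds to the crash.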

=== DISQUE BUG REPORT START: Cut & paste starting from here ===
28284:P 06 Jan 16:52:15.412 #     Disque 1.0-rc1 crashed by signal: 11
28284:P 06 Jan 16:52:15.412 #     Failed assertion: <no assertion failed> (<no file>:0)
28284:P 06 Jan 16:52:15.412 # --- STACK TRACE
./src/disque-server *:7770(logStackTrace+0x43)[0x42ab33]
./src/disque-server *:7770(dictAddRaw+0x14)[0x4135b4]
/lib64/libpthread.so.0[0x309d80f710]
./src/disque-server *:7770(dictAddRaw+0x14)[0x4135b4]
./src/disque-server *:7770(dictAdd+0x1e)[0x4137ae]
./src/disque-server *:7770(clusterProcessPacket+0x7ab)[0x42fbcb]
./src/disque-server *:7770(clusterReadHandler+0xdb)[0x43020b]
./src/disque-server *:7770(aeProcessEvents+0x13c)[0x4108fc]
./src/disque-server *:7770(aeMain+0x2b)[0x410bbb]
./src/disque-server *:7770(main+0x39d)[0x417d5d]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x309d41ed5d]
./src/disque-server *:7770[0x40ff19]
28284:P 06 Jan 16:52:15.414 # --- INFO OUTPUT
28284:P 06 Jan 16:52:15.414 # # Server
disque_version:1.0-rc1
disque_git_sha1:7dad5666
disque_git_dirty:0
disque_build_id:df2612766b0503d8
os:Linux 2.6.32-504.12.2.el6.x86_64 x86_64
arch_bits:64
multiplexing_api:epoll
gcc_version:4.4.7
process_id:28284
run_id:6114fb4470d99c4aea5098221d34471dde35caf3
tcp_port:7770
uptime_in_seconds:79
uptime_in_days:0
hz:10
executable:/.local/work/mlongtin/github/disque/./src/disque-server
config_file:/.local/work/mlongtin/github/disque/d7770/disque.conf

# Clients
connected_clients:1
client_longest_output_list:0
client_biggest_input_buf:0
blocked_clients:0

# Memory
used_memory:75004472
used_memory_human:71.53M
used_memory_rss:83181568
used_memory_peak:75004472
used_memory_peak_human:71.53M
mem_fragmentation_ratio:1.11
mem_allocator:jemalloc-4.0.3

# Jobs
registered_jobs:6992

# Queues
registered_queues:1

# Persistence
loading:0
aof_enabled:0
aof_state:off
aof_rewrite_in_progress:0
aof_rewrite_scheduled:0
aof_last_rewrite_time_sec:-1
aof_current_rewrite_time_sec:-1
aof_last_bgrewrite_status:ok
aof_last_write_status:ok

# Stats
total_connections_received:1
total_commands_processed:6994
instantaneous_ops_per_sec:573
total_net_input_bytes:70374510
total_net_output_bytes:301098
instantaneous_input_kbps:5633.30
instantaneous_output_kbps:24.07
rejected_connections:0
latest_fork_usec:0

# CPU
used_cpu_sys:2.75
used_cpu_user:1.36
used_cpu_sys_children:0.00
used_cpu_user_children:0.00

# Commandstats
cmdstat_hello:calls=2,usec=29,usec_per_call=14.50
cmdstat_addjob:calls=6992,usec=270174,usec_per_call=38.64
hash_init_value: 1451989612

28284:P 06 Jan 16:52:15.414 # --- CLIENT LIST OUTPUT
28284:P 06 Jan 16:52:15.414 # id=1 addr=127.0.0.1:48795 fd=11 name= age=68 idle=0 flags=N qbuf=0 qbuf-free=32768 obl=0 oll=0 omem=0 events=r cmd=addjob

28284:P 06 Jan 16:52:15.414 # --- REGISTERS
28284:P 06 Jan 16:52:15.414 #
RAX:00007f675c90d800 RBX:0000000000000000
RCX:0000000000000000 RDX:00007f6762463000
RDI:0000000000000000 RSI:00007f6762463000
RBP:0000000000000000 RSP:00007ffff70de2c0
R8 :00000000b32ca4f0 R9 :00007f675c7b30bd
R10:00000000ffffffdc R11:0000000000000009
R12:00007f6762463000 R13:00007f676241e3f0
R14:0000000000000000 R15:0000000000000005
RIP:00000000004135b4 EFL:0000000000010206
CSGSFS:0000000000000033
28284:P 06 Jan 16:52:15.414 # (00007ffff70de2cf) -> 0000000000000178
28284:P 06 Jan 16:52:15.414 # (00007ffff70de2ce) -> 00007f675c7b3005
28284:P 06 Jan 16:52:15.414 # (00007ffff70de2cd) -> 000000000042fbcb
28284:P 06 Jan 16:52:15.414 # (00007ffff70de2cc) -> 00007f675c7b3005
28284:P 06 Jan 16:52:15.414 # (00007ffff70de2cb) -> 00007f675c90d800
28284:P 06 Jan 16:52:15.414 # (00007ffff70de2ca) -> 00007f6762463000
28284:P 06 Jan 16:52:15.414 # (00007ffff70de2c9) -> 00000000004137ae
28284:P 06 Jan 16:52:15.414 # (00007ffff70de2c8) -> 0000000000000005
28284:P 06 Jan 16:52:15.414 # (00007ffff70de2c7) -> 0000000000000000
28284:P 06 Jan 16:52:15.414 # (00007ffff70de2c6) -> 00007f676241e3f0
28284:P 06 Jan 16:52:15.414 # (00007ffff70de2c5) -> 00007f6762463000
28284:P 06 Jan 16:52:15.414 # (00007ffff70de2c4) -> 00007f675c90d800
28284:P 06 Jan 16:52:15.414 # (00007ffff70de2c3) -> 0000000000000000
28284:P 06 Jan 16:52:15.414 # (00007ffff70de2c2) -> 0000000000000180
28284:P 06 Jan 16:52:15.414 # (00007ffff70de2c1) -> 00007f6762a00180
28284:P 06 Jan 16:52:15.414 # (00007ffff70de2c0) -> 000000000000000f
28284:P 06 Jan 16:52:15.414 # --- FAST MEMORY TEST
28284:P 06 Jan 16:52:15.416 # Bio thread for job type #0 terminated
28284:P 06 Jan 16:52:15.416 # Bio thread for job type #1 terminated
Testing 691000 90112
Testing 1e4a000 135168
Testing 309ce21000 4096
Testing 309d78f000 20480
Testing 309da19000 16384
Testing 7f675c600000 75497472
Testing 7f6760fff000 10485760
Testing 7f6761a00000 12582912
Testing 7f67626d0000 4096
Testing 7f6762a00000 2097152
Testing 7f6762c82000 16384
Testing 7f6762ca8000 8192
28284:P 06 Jan 16:52:16.845 # Fast memory test PASSED, however your memory can still be broken. Please run a memory test for several hours if possible.
28284:P 06 Jan 16:52:16.845 #
=== DISQUE BUG REPORT END. Make sure to include from START to END. ===

       Please report the crash by opening an issue on github:

           http://github.com/antirez/disque/issues

  Suspect RAM error? Use disque-server --test-memory to verify it.

Segmentation fault

antirez commented 8 years ago

Ok I can reproduce this very easily. Fixing.

antirez commented 8 years ago

p.s. Thanks for submitting!

antirez commented 8 years ago

Fixed, thank you.