antirez / disque

Disque is a distributed message broker
BSD 3-Clause "New" or "Revised" License
8.01k stars 537 forks source link

Segfault #113

Closed seppo0010 closed 8 years ago

seppo0010 commented 9 years ago

Got the following crash while creating a client library. The commands being executed related to JSCAN were:

"JSCAN" "0" "COUNT" "1000"
"JSCAN" "0" "COUNT" "1000" "QUEUE" "queue17"

No other JSCAN parameters were used. I did not run a full memory check yet.

19621:C 16 Aug 03:09:30.740 # Warning: no config file specified, using the default config. In order to specify a config file use ./src/disque-server /path/to/disque.conf
19621:P 16 Aug 03:09:30.742 # You requested maxclients of 10000 requiring at least 10032 max file descriptors.
19621:P 16 Aug 03:09:30.742 # Server can't set maximum open files to 10032 because of OS error: Operation not permitted.
19621:P 16 Aug 03:09:30.742 # Current maximum open files is 2048. maxclients has been reduced to 2016 to compensate for low ulimit. If you need higher maxclients increase 'ulimit -n'.
19621:P 16 Aug 03:09:30.743 * Node configuration loaded, I'm d1dbbca17a9759ccf377a24132a7f7dc380f2918
                                        Disque 0.0.1 (6dbfa4c1/0) 64 bit
          _ -                                                        
        .                               Port: 7711
        .    o    .                     PID: 19621
                 .                                                   
               -                              http://disque.io       

19621:P 16 Aug 03:09:30.743 # Server started, Disque version 0.0.1
19621:P 16 Aug 03:09:30.743 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
19621:P 16 Aug 03:09:30.743 * The server is now ready to accept connections on port 7711
19621:P 16 Aug 03:17:17.768 # 

=== DISQUE BUG REPORT START: Cut & paste starting from here ===
19621:P 16 Aug 03:17:17.768 #     Disque 0.0.1 crashed by signal: 11
19621:P 16 Aug 03:17:17.768 #     Failed assertion: <no assertion failed> (<no file>:0)
19621:P 16 Aug 03:17:17.768 # --- STACK TRACE
./src/disque-server *:7711(logStackTrace+0x33)[0x4297d3]
./src/disque-server *:7711(equalStringObjects+0x4)[0x4224d4]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x10340)[0x7f7ccb04a340]
./src/disque-server *:7711(equalStringObjects+0x4)[0x4224d4]
./src/disque-server *:7711(jscanCallback+0x20)[0x4320b0]
./src/disque-server *:7711(dictScan+0x198)[0x412f98]
./src/disque-server *:7711(jscanCommand+0x178)[0x434528]
./src/disque-server *:7711(call+0x50)[0x415440]
./src/disque-server *:7711(processCommand+0xf2)[0x416ea2]
./src/disque-server *:7711(processInputBuffer+0x97)[0x420627]
./src/disque-server *:7711(aeProcessEvents+0x250)[0x410610]
./src/disque-server *:7711(aeMain+0x2b)[0x41084b]
./src/disque-server *:7711(main+0x325)[0x40f615]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5)[0x7f7ccac96ec5]
./src/disque-server *:7711[0x40f739]
19621:P 16 Aug 03:17:17.770 # --- INFO OUTPUT
19621:P 16 Aug 03:17:17.770 # # Server
disque_version:0.0.1
disque_git_sha1:6dbfa4c1
disque_git_dirty:0
disque_build_id:63f6d787d079142c
os:Linux 3.8.11 x86_64
arch_bits:64
multiplexing_api:epoll
gcc_version:4.8.4
process_id:19621
run_id:0b271fe02f5f2fad3fef2380e8ab1c420029b2a7
tcp_port:7711
uptime_in_seconds:467
uptime_in_days:0
hz:10
config_file:

# Clients
connected_clients:4
client_longest_output_list:0
client_biggest_input_buf:0
blocked_clients:0

# Memory
used_memory:500888
used_memory_human:489.15K
used_memory_rss:1814528
used_memory_peak:500888
used_memory_peak_human:489.15K
mem_fragmentation_ratio:3.62
mem_allocator:jemalloc-3.6.0

# Jobs
registered_jobs:23

# Queues
registered_queues:11

# Persistence
loading:0
aof_enabled:0
aof_state:off
aof_rewrite_in_progress:0
aof_rewrite_scheduled:0
aof_last_rewrite_time_sec:-1
aof_current_rewrite_time_sec:-1
aof_last_bgrewrite_status:ok
aof_last_write_status:ok

# Stats
total_connections_received:25
total_commands_processed:62
instantaneous_ops_per_sec:0
total_net_input_bytes:3691
total_net_output_bytes:9925
instantaneous_input_kbps:0.00
instantaneous_output_kbps:0.00
rejected_connections:0
latest_fork_usec:0

# CPU
used_cpu_sys:8.53
used_cpu_user:6.64
used_cpu_sys_children:0.00
used_cpu_user_children:0.00

# Commandstats
cmdstat_info:calls=1,usec=65,usec_per_call=65.00
cmdstat_monitor:calls=1,usec=5,usec_per_call=5.00
cmdstat_hello:calls=1,usec=5,usec_per_call=5.00
cmdstat_addjob:calls=26,usec=484,usec_per_call=18.62
cmdstat_getjob:calls=6,usec=272,usec_per_call=45.33
cmdstat_ackjob:calls=3,usec=15,usec_per_call=5.00
cmdstat_fastack:calls=3,usec=14,usec_per_call=4.67
cmdstat_deljob:calls=1,usec=8,usec_per_call=8.00
cmdstat_jscan:calls=18,usec=207,usec_per_call=11.50
cmdstat_enqueue:calls=1,usec=12,usec_per_call=12.00
cmdstat_dequeue:calls=1,usec=7,usec_per_call=7.00
hash_init_value: 1440444137

19621:P 16 Aug 03:17:17.770 # --- CLIENT LIST OUTPUT
19621:P 16 Aug 03:17:17.770 # id=12 addr=127.0.0.1:42917 fd=9 name= age=170 idle=162 flags=N qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=jscan
id=22 addr=127.0.0.1:42979 fd=12 name= age=0 idle=0 flags=N qbuf=0 qbuf-free=32768 obl=0 oll=0 omem=0 events=r cmd=getjob
id=24 addr=127.0.0.1:42981 fd=11 name= age=0 idle=0 flags=N qbuf=0 qbuf-free=32768 obl=0 oll=0 omem=0 events=r cmd=jscan
id=25 addr=127.0.0.1:42982 fd=10 name= age=0 idle=0 flags=N qbuf=0 qbuf-free=32768 obl=0 oll=0 omem=0 events=r cmd=hello

19621:P 16 Aug 03:17:17.770 # --- CURRENT CLIENT INFO
19621:P 16 Aug 03:17:17.771 # client: id=24 addr=127.0.0.1:42981 fd=11 name= age=0 idle=0 flags=N qbuf=0 qbuf-free=32768 obl=0 oll=0 omem=0 events=r cmd=jscan
19621:P 16 Aug 03:17:17.771 # argv[0]: 'JSCAN'
19621:P 16 Aug 03:17:17.771 # argv[1]: '0'
19621:P 16 Aug 03:17:17.771 # argv[2]: 'COUNT'
19621:P 16 Aug 03:17:17.771 # argv[3]: '1000'
19621:P 16 Aug 03:17:17.771 # argv[4]: 'QUEUE'
19621:P 16 Aug 03:17:17.771 # argv[5]: 'queue16'
19621:P 16 Aug 03:17:17.771 # --- REGISTERS
19621:P 16 Aug 03:17:17.771 # 
RAX:0000000000000000 RBX:00007ffc5ececdc0
RCX:0000000000000000 RDX:0000000000000007
RDI:0000000000000000 RSI:00007f7cca414aa0
RBP:00007f7cca416680 RSP:00007ffc5ececcf0
R8 :aaaaaaaaaaaaaaaa R9 :000000000000001f
R10:000000000000000f R11:00007f7ccadfb870
R12:00007f7cca41c430 R13:00007f7cca410940
R14:00007f7cca49a000 R15:000000000000001f
RIP:00000000004224d4 EFL:0000000000010206
CSGSFS:0000000000000033
19621:P 16 Aug 03:17:17.771 # (00007ffc5ececcff) -> 00007f7cca49a000
19621:P 16 Aug 03:17:17.771 # (00007ffc5ececcfe) -> 0000000000000006
19621:P 16 Aug 03:17:17.771 # (00007ffc5ececcfd) -> 00007f7cca41c3d0
19621:P 16 Aug 03:17:17.771 # (00007ffc5ececcfc) -> 00007f7cca41c430
19621:P 16 Aug 03:17:17.771 # (00007ffc5ececcfb) -> 00000000000026fc
19621:P 16 Aug 03:17:17.771 # (00007ffc5ececcfa) -> 00007f7cca41c430
19621:P 16 Aug 03:17:17.771 # (00007ffc5ececcf9) -> 00007f7cca49a000
19621:P 16 Aug 03:17:17.771 # (00007ffc5ececcf8) -> 0000000000000006
19621:P 16 Aug 03:17:17.771 # (00007ffc5ececcf7) -> ffffffffffffffe0
19621:P 16 Aug 03:17:17.771 # (00007ffc5ececcf6) -> 0000000000000030
19621:P 16 Aug 03:17:17.771 # (00007ffc5ececcf5) -> 0000000000412f98
19621:P 16 Aug 03:17:17.771 # (00007ffc5ececcf4) -> 0000000000000005
19621:P 16 Aug 03:17:17.771 # (00007ffc5ececcf3) -> 00007ffc5ececdb0
19621:P 16 Aug 03:17:17.771 # (00007ffc5ececcf2) -> 0000000000432090
19621:P 16 Aug 03:17:17.771 # (00007ffc5ececcf1) -> 00000000004320b0
19621:P 16 Aug 03:17:17.771 # (00007ffc5ececcf0) -> ffffffffffffffff
19621:P 16 Aug 03:17:17.771 # --- FAST MEMORY TEST
19621:P 16 Aug 03:17:17.771 # Bio thread for job type #0 terminated
19621:P 16 Aug 03:17:17.771 # Bio thread for job type #1 terminated
Testing 685000 86016
Testing 794000 135168
Testing 7f7cc93ff000 8388608
Testing 7f7cc9c00000 16777216
Testing 7f7ccb035000 20480
Testing 7f7ccb254000 16384
Testing 7f7ccb769000 16384
Testing 7f7ccb77c000 16384
Testing 7f7ccb782000 4096
19621:P 16 Aug 03:17:18.173 # Fast memory test PASSED, however your memory can still be broken. Please run a memory test for several hours if possible.
19621:P 16 Aug 03:17:18.173 # 
=== DISQUE BUG REPORT END. Make sure to include from START to END. ===

       Please report the crash by opening an issue on github:

           http://github.com/antirez/disque/issues

  Suspect RAM error? Use disque-server --test-memory to verify it.

[1]    19621 segmentation fault (core dumped)  ./src/disque-server
seppo0010 commented 9 years ago

I was able to reproduce consistently. It fails after multiple ackjob are issued for the same job.

127.0.0.1:7711> addjob queue1 job1 0
DId1dbbca16b854c7ec98332a4019bcbf143791c3f05a0SQ
127.0.0.1:7711> ackjob DId1dbbca16b854c7ec98332a4019bcbf143791c3f05a0SQ
(integer) 1
127.0.0.1:7711> jscan 0 queue queue1
1) "0"
2) (empty list or set)
127.0.0.1:7711> ackjob DId1dbbca16b854c7ec98332a4019bcbf143791c3f05a0SQ
(integer) 0
127.0.0.1:7711> jscan 0 queue queue1
Could not connect to Disque at 127.0.0.1:7711: Connection refused
sunheehnus commented 9 years ago

Hi @seppo0010 , there is a path may cause this. :-) When there is no job of that ID(already ack and delete in your case), ackjobCommand create a dummy job, but the job is created with its queue as NULL. ack.c-L216-217 job.c-L178 Then job.c-L1416 use the NULL(job->queue) in equalStringObjects, so there is the segfault.

sunheehnus commented 9 years ago

Try to fix it in https://github.com/antirez/disque/pull/114

@seppo0010 Thanks to your clear statement, this is solved quickly :-)

Koed00 commented 9 years ago

Just confirming that I encountered the same error in my tests. It only happens when using the JSCAN command after sending a duplicate ACK.

antirez commented 9 years ago

Thanks, going to merge @sunheehnus fix between today and tomorrow. Tomorrow I'll work all day to Disque btw so I'll check all the other issues. Thanks.

antirez commented 9 years ago

p.s. would be great to get ACKs about @sunheehnus fix being enough to stop the problem in @Koed00 and @seppo0010 tests.

sunheehnus commented 9 years ago

Yes, I guess this situation is more likely to happen in a one-node cluster :-)

Koed00 commented 9 years ago

I have some time to run the pull request against my tests. Will report back.

antirez commented 9 years ago

Thanks!

Koed00 commented 9 years ago

Works great. Tests are passing again. Carry on.

antirez commented 9 years ago

:fireworks: