Closed — renatoc closed this issue 9 years ago
Thanks @renatoc, did you see this multiple times running your script? I hope it repeats itself from time to time so we can gain a bit more context.
@antirez not just once. :/
@renatoc thanks, no way to replicate so far, and the stack trace looks impossible, but running your script under Valgrind I'm seeing warnings... Fixing those; perhaps this resulted from one of the issues I fixed earlier in the context of the previous issues you opened (unlikely, btw). Please ping me if this happens again...
@antirez Oh thanks! My test scenario: I have two machines running tests independently, and when I did CLUSTER MEET the bugs started showing.
Oh that's kinda interesting... So you ran one test against, for example, Disque process A, one against Disque process B, and then when you did CLUSTER MEET bugs started to happen. This may be related or not since, in general, once you join the cluster all the cluster logic is enabled, so it opens a can of worms ;-) Thanks for the additional info.
yup, that's right. Anything that helps debug this, feel free to ask ;)
Thanks @renatoc!
The stack trace looks impossible. >.<
Can't find a direct code path that would change nodes_delivered's hash function to dictEncObjHash...
Thanks @sunheehnus, the best I can think of is that nodes_delivered was deallocated and a robj was allocated at the same or a similar address, so that we are accessing a different dict... I'll investigate more, but I wonder if @renatoc is seeing this again. Maybe this was due to some previous bug now fixed, but it is not obvious at all :-) Thanks.
@antirez
nodes_delivered was deallocated and a robj allocate at the same address or similar address
Agree with you. I will also investigate this; maybe the leak occurs during "cluster meet". I'll keep reviewing. :-)
Hi @antirez , any new info about this?
nodes_delivered was deallocated and a robj allocate at the same address or similar address
Just as you pointed out, some wrong deallocation may happen, so that memory still in use gets marked free and then reallocated.
I looked through:
git grep -nw "job\ "
git grep -nw "nodes_delivered"
...
and only found one hard-to-trigger error in the memory deallocation for jobs: dictGetSomeKeys() does not guarantee that it returns non-duplicated elements, which here may lead to wrong deallocation operations (freeing the same element twice).
@sunheehnus that's very interesting!!! :-) Thank you. I need to check if there is the same possible problem in Redis as well...
@sunheehnus Bug hopefully fixed (I have not tested the code yet; we need automatic tests for this in the unit tests). I replaced the dictGetSomeKeys() call with the usual getRandomKey() call. This is perhaps less efficient but safer in this case, and to de-dup would be kind of a nightmare... Your code review was GREAT! I really appreciate it :+1: There are good reasons to believe this bug was a memory corruption due to that, so I'm also closing it, hoping that @renatoc will report again if it triggers in the future.
Looks good to me. :+1: Thanks. :-)
Great, thank you!
Thanks a ton @renatoc... checking ASAP.
@antirez no prob. I am looking forward to seeing this awesome project become stable. ;)
Thanks! :-) I saw from your log a few things:
So please if you could, try the following:

1. `make clean; make noopt` in order to create an executable without optimizations enabled.
2. `gdb disque-server`, and when it starts, `run <arguments to disque>`; likely you just want `run disque.conf` or alike. More details: http://redis.io/topics/debugging (but it should not be needed).
3. `bt` to get a stack trace.

Other random ideas:
Thanks for your help!
Hey @antirez
This time I could not replicate it (not sure if that is because of the noopt mode). While trying to run the test in the same way, the following error happened: the process went to 100% CPU and blocked (maybe because optimizations are disabled?). I did the steps you suggested but there was no stack (I will attach a screenshot). Do you have any extra suggestions? Yup, my setup is saved very carefully ;)
Machine 1, disque 7711 log and error
Machine 1, disque 7712 log
Machine 2, disque 7711 log
Machine 2, disque 7712 log
Hope this helps; if you need anything more please ask :)
Hi @antirez @renatoc, the new error is triggered by this: we should check nodes_confirmed before we use it as a dict. Fixed in https://github.com/antirez/disque/pull/80 :-)
Hey @sunheehnus Thanks for helping with this! ;)
Today I tested (git hash c6346975) and got this error. If it is another error not related to this, let me know and I will create another issue.
Hi @antirez @renatoc , a new catch.
See job.c#L1158: when (async && extrepl) is TRUE we don't register the job, and so at job.c#L1215-1217 we free the job directly.
But it is possible that, when (async && extrepl) is TRUE, we still trigger updateJobAwakeTime(job,0) when job->delay is not 0. updateJobAwakeTime(job,0) will add the new job to the server.awakeme list.
After we release the new job directly, the freed job can still be indexed in server.awakeme. This may lead to this issue. :-)
Already fixed in https://github.com/antirez/disque/pull/83
Hey @antirez @renatoc, I think I have a new clue which is most likely the cause of this crash. :-)
Another node can `get` a copy, `requeue` the job, `ack` the job, and send `setack` to node1; node1 handles the `setack` msg with `acknowledgeJob`. `acknowledgeJob` will change `job->state` from `JOB_STATE_REPL` to `JOB_STATE_ACKED` and trigger the GC to release the job. After the release, the next `clientsCron` / `clientsCronHandleDelayedJobReplication` will cause the crash.
The commands involved:
getjob from 123
ADDJOB "123" "hello world!" 0 ttl 60
ackjob DIf21616924a815d517c10b87bd71742f9f36e23df0001SQ
=== DISQUE BUG REPORT START: Cut & paste starting from here ===
15958:P 28 Jun 22:33:30.351 # Disque 0.0.1 crashed by signal: 11
15958:P 28 Jun 22:33:30.351 # Failed assertion:
connected_clients:1 client_longest_output_list:0 client_biggest_input_buf:0 blocked_clients:1
used_memory:419520 used_memory_human:409.69K used_memory_rss:1937408 used_memory_peak:456144 used_memory_peak_human:445.45K mem_fragmentation_ratio:4.62 mem_allocator:jemalloc-3.6.0
registered_jobs:0
registered_queues:0
loading:0 aof_enabled:0 aof_rewrite_in_progress:0 aof_rewrite_scheduled:0 aof_last_rewrite_time_sec:-1 aof_current_rewrite_time_sec:-1 aof_last_bgrewrite_status:ok aof_last_write_status:ok
total_connections_received:3 total_commands_processed:3 instantaneous_ops_per_sec:0 total_net_input_bytes:197 total_net_output_bytes:0 instantaneous_input_kbps:0.00 instantaneous_output_kbps:0.00 rejected_connections:0 latest_fork_usec:0
used_cpu_sys:0.91 used_cpu_user:0.53 used_cpu_sys_children:0.00 used_cpu_user_children:0.00
cmdstat_addjob:calls=3,usec=72,usec_per_call=24.00 hash_init_value: 1436221972
15958:P 28 Jun 22:33:30.352 # --- CLIENT LIST OUTPUT 15958:P 28 Jun 22:33:30.352 # id=3 addr=127.0.0.1:49412 fd=11 name= age=536 idle=530 flags=b qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=addjob
15958:P 28 Jun 22:33:30.352 # --- REGISTERS 15958:P 28 Jun 22:33:30.352 # RAX:00007fed274495a0 RBX:0000000000000000 RCX:0000000000000009 RDX:00007fed2748a0c0 RDI:0000000000000000 RSI:00007fed2748a0c0 RBP:0000000000000000 RSP:00007fff8588c5b0 R8 :0000000000000001 R9 :1999999999999999 R10:0000000000000000 R11:0000000000000001 R12:00007fed2748a0c0 R13:0000000000000001 R14:0000000000000001 R15:0000000000000000 RIP:0000000000411f04 EFL:0000000000010202 CSGSFS:0000000000000033 15958:P 28 Jun 22:33:30.352 # (00007fff8588c5bf) -> 000000002741e2c8 15958:P 28 Jun 22:33:30.352 # (00007fff8588c5be) -> 0000000000000058 15958:P 28 Jun 22:33:30.352 # (00007fff8588c5bd) -> 000000000042ad76 15958:P 28 Jun 22:33:30.352 # (00007fff8588c5bc) -> 0000000000000000 15958:P 28 Jun 22:33:30.352 # (00007fff8588c5bb) -> 00007fed27414100 15958:P 28 Jun 22:33:30.352 # (00007fff8588c5ba) -> 0000000000000000 15958:P 28 Jun 22:33:30.352 # (00007fff8588c5b9) -> 000000000041208e 15958:P 28 Jun 22:33:30.352 # (00007fff8588c5b8) -> 0000000000000000 15958:P 28 Jun 22:33:30.352 # (00007fff8588c5b7) -> 0000000000000001 15958:P 28 Jun 22:33:30.352 # (00007fff8588c5b6) -> 0000000000000001 15958:P 28 Jun 22:33:30.352 # (00007fff8588c5b5) -> 00007fed2748a0c0 15958:P 28 Jun 22:33:30.352 # (00007fff8588c5b4) -> 00007fed27414100 15958:P 28 Jun 22:33:30.352 # (00007fff8588c5b3) -> 0000000000000000 15958:P 28 Jun 22:33:30.352 # (00007fff8588c5b2) -> 0000000000000000 15958:P 28 Jun 22:33:30.352 # (00007fff8588c5b1) -> 00007fed2748a0c0 15958:P 28 Jun 22:33:30.352 # (00007fff8588c5b0) -> 0000000000000000 15958:P 28 Jun 22:33:30.352 # --- FAST MEMORY TEST 15958:P 28 Jun 22:33:30.352 # Bio thread for job type #0 terminated 15958:P 28 Jun 22:33:30.352 # Bio thread for job type #1 terminated Testing 67a000 86016 Testing a12000 135168 Testing 7fed263ff000 8388608 Testing 7fed26c00000 12582912 Testing 7fed2787d000 4096 Testing 7fed28000000 4194304 Testing 7fed2892a000 20480 Testing 7fed28b48000 16384 Testing 7fed29044000 
16384 Testing 7fed29067000 12288 15958:P 28 Jun 22:33:30.545 # Fast memory test PASSED, however your memory can still be broken. Please run a memory test for several hours if possible. 15958:P 28 Jun 22:33:30.545 # === DISQUE BUG REPORT END. Make sure to include from START to END. ===
Please report the crash by opening an issue on github:
http://github.com/antirez/disque/issues
Suspect RAM error? Use disque-server --test-memory to verify it.
[2] 15958 segmentation fault (core dumped) ../disque-server --port 7000
I tried to fix it in https://github.com/antirez/disque/pull/90 :-)
Bingo! :balloon:
That was a great investigation :-) The timeline of events is very convincing; I believe this is indeed the case and this bug is, finally, fixed.
About the implementation of the fix in #90, I'm more biased towards sending the client the Job ID instead of an error. After all, it does not matter that the replication was not achieved as the client expected: the job was correctly delivered. So in order to simplify the semantics of Disque, and the implementation of clients, it looks like a good idea to just return the ID as if the command executed successfully. After all, this is the expected semantics.
What I mean is that even without this set of events, what could happen is:
So, it is a valid execution flow to get the reply of ADDJOB when the job was already delivered and acknowledged. This is why I would not add new semantics in the form of an error.
Thanks again!
Great, a happy and meaningful Sunday night I have spent. ;-) Your solution is more correct; I totally agree with you. I am so excited that we finally solved this. I am now in bed, away from my lab. Could you please fix this bug just as you said? I cannot wait to get this fixed. :)
Hi @antirez ,
it does not matter the replication was not achieved as the client expected: the job was correctly delivered. So in order to simplify the semantics of Disque, and the implementation of clients, it looks like a good idea to just return the ID as if the command successfully executed. After all this is expected semantic.
Makes sense and is more correct!
The new commit in https://github.com/antirez/disque/pull/90 fixed this (sending the blocking client the Job ID instead of an error). :-)
@renatoc hey, we are pretty confident the issue is fixed now; could you please confirm? Thanks!
@antirez Oh yeah, it's fixed :beer:
:beer:
Great! Closing finally :-)
=== DISQUE BUG REPORT START: Cut & paste starting from here ===
16525:P 21 May 11:30:15.598 # ------------------------------------------------
16525:P 21 May 11:30:15.598 # !!! Software Failure. Press left mouse button to continue
16525:P 21 May 11:30:15.598 # Guru Meditation: "Unknown encoding type" #object.c:319
16525:P 21 May 11:30:15.598 # (forcing SIGSEGV in order to print the stack trace)
16525:P 21 May 11:30:15.598 # ------------------------------------------------
16525:P 21 May 11:30:15.598 # Disque 0.0.1 crashed by signal: 11
16525:P 21 May 11:30:15.598 # Failed assertion: (:0)
16525:P 21 May 11:30:15.598 # --- STACK TRACE
/usr/local/bin/disque-server 0.0.0.0:7711(logStackTrace+0x3e)[0x4269be]
/usr/local/bin/disque-server 0.0.0.0:7711(_serverPanic+0x7f)[0x42625f]
/lib/x86_64-linux-gnu/libpthread.so.0(+0xfcb0)[0x7f2766663cb0]
/usr/local/bin/disque-server 0.0.0.0:7711(_serverPanic+0x7f)[0x42625f]
/usr/local/bin/disque-server 0.0.0.0:7711(getDecodedObject+0xbf)[0x41f86f]
/usr/local/bin/disque-server 0.0.0.0:7711(dictEncObjHash+0x2c)[0x412ddc]
/usr/local/bin/disque-server 0.0.0.0:7711(dictAddRaw+0x3b)[0x411f2b]
/usr/local/bin/disque-server 0.0.0.0:7711(dictAdd+0x1e)[0x41208e]
/usr/local/bin/disque-server 0.0.0.0:7711(clusterReplicateJob+0x76)[0x42ade6]
/usr/local/bin/disque-server 0.0.0.0:7711(clientsCronHandleDelayedJobReplication+0x47)[0x42ff47]
/usr/local/bin/disque-server 0.0.0.0:7711(clientsCron+0x8a)[0x413a9a]
/usr/local/bin/disque-server 0.0.0.0:7711(serverCron+0xe1)[0x415001]
/usr/local/bin/disque-server 0.0.0.0:7711(aeProcessEvents+0x20b)[0x4101cb]
/usr/local/bin/disque-server 0.0.0.0:7711(aeMain+0x2b)[0x4103db]
/usr/local/bin/disque-server 0.0.0.0:7711(main+0x302)[0x40f162]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xed)[0x7f27662b776d]
/usr/local/bin/disque-server 0.0.0.0:7711[0x40f27d]
16525:P 21 May 11:30:15.599 # --- INFO OUTPUT
16525:P 21 May 11:30:15.601 # # Server
disque_version:0.0.1
disque_git_sha1:9ce5cb67
disque_git_dirty:0
disque_build_id:adcb5342e3e5a0ed
os:Linux 3.2.0-69-virtual x86_64
arch_bits:64
multiplexing_api:epoll
gcc_version:4.6.3
process_id:16525
run_id:c625c8ab6ce9cacc9cb54e04cb06752fff8f15af
tcp_port:7711
uptime_in_seconds:341
uptime_in_days:0
hz:10
config_file:/etc/disque/7711/disque.conf
Clients
connected_clients:14639 client_longest_output_list:0 client_biggest_input_buf:0 blocked_clients:14639
Memory
used_memory:373456928 used_memory_human:356.16M used_memory_rss:173277184 used_memory_peak:380637792 used_memory_peak_human:363.00M mem_fragmentation_ratio:0.46 mem_allocator:jemalloc-3.6.0
Jobs
registered_jobs:54
Queues
registered_queues:1001
Persistence
loading:0 aof_enabled:0 aof_rewrite_in_progress:0 aof_rewrite_scheduled:0 aof_last_rewrite_time_sec:-1 aof_current_rewrite_time_sec:-1 aof_last_bgrewrite_status:ok aof_last_write_status:ok
Stats
total_connections_received:375779 total_commands_processed:928483 instantaneous_ops_per_sec:60 total_net_input_bytes:189129863 total_net_output_bytes:250909533 instantaneous_input_kbps:1.73 instantaneous_output_kbps:8.10 rejected_connections:0 latest_fork_usec:0
CPU
used_cpu_sys:49.50 used_cpu_user:62.80 used_cpu_sys_children:0.00 used_cpu_user_children:0.00
Commandstats
cmdstat_cluster:calls=2,usec=35,usec_per_call=17.50 cmdstat_hello:calls=375777,usec=1500012,usec_per_call=3.99 cmdstat_addjob:calls=184231,usec=1505720,usec_per_call=8.17 cmdstat_getjob:calls=191546,usec=796183,usec_per_call=4.16 cmdstat_ackjob:calls=176927,usec=929356,usec_per_call=5.25 hash_init_value: 1432287713
16525:P 21 May 11:30:15.601 # --- CLIENT LIST OUTPUT 16525:P 21 May 11:30:15.642 # id=375574 addr=127.0.0.1:13080 fd=12204 name= age=30 idle=30 flags=b qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=addjob id=353307 addr=127.0.0.1:55631 fd=5862 name= age=46 idle=46 flags=b qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=getjob id=353308 addr=127.0.0.1:55632 fd=5986 name= age=46 idle=46 flags=b qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=getjob id=353309 addr=127.0.0.1:55634 fd=6008 name= age=46 idle=46 flags=b qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=getjob id=353310 addr=127.0.0.1:55636 fd=6024 name= age=46 idle=46 flags=b qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=getjob id=353312 addr=127.0.0.1:55638 fd=8346 name= age=46 idle=46 flags=b qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=getjob id=353314 addr=127.0.0.1:55640 fd=9665 name= age=46 idle=46 flags=b qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=getjob id=353317 addr=127.0.0.1:55643 fd=10474 name= age=46 idle=46 flags=b qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=getjob id=353320 addr=127.0.0.1:55646 fd=12737 name= age=46 idle=46 flags=b qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=getjob id=353323 addr=127.0.0.1:55650 fd=13208 name= age=46 idle=46 flags=b qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=getjob id=353324 addr=127.0.0.1:55649 fd=13307 name= age=46 idle=46 flags=b qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=getjob id=353327 addr=127.0.0.1:55657 fd=13799 name= age=46 idle=46 flags=b qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=getjob id=353329 addr=127.0.0.1:55664 fd=738 name= age=46 idle=46 flags=b qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=getjob ... ... ... 
id=375560 addr=127.0.0.1:13059 fd=10645 name= age=30 idle=30 flags=b qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=addjob id=375561 addr=127.0.0.1:13060 fd=10886 name= age=30 idle=30 flags=b qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=addjob id=375562 addr=127.0.0.1:13061 fd=10889 name= age=30 idle=30 flags=b qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=addjob id=375563 addr=127.0.0.1:13063 fd=11165 name= age=30 idle=30 flags=b qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=addjob id=375564 addr=127.0.0.1:13067 fd=11182 name= age=30 idle=30 flags=b qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=addjob id=375565 addr=127.0.0.1:13070 fd=11335 name= age=30 idle=30 flags=b qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=addjob id=375566 addr=127.0.0.1:13071 fd=11586 name= age=30 idle=30 flags=b qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=addjob id=375567 addr=127.0.0.1:13072 fd=11620 name= age=30 idle=30 flags=b qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=addjob id=375568 addr=127.0.0.1:13073 fd=11822 name= age=30 idle=30 flags=b qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=addjob id=375569 addr=127.0.0.1:13074 fd=11847 name= age=30 idle=30 flags=b qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=addjob id=375570 addr=127.0.0.1:13076 fd=11853 name= age=30 idle=30 flags=b qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=addjob id=375571 addr=127.0.0.1:13077 fd=11945 name= age=30 idle=30 flags=b qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=addjob id=375572 addr=127.0.0.1:13078 fd=12093 name= age=30 idle=30 flags=b qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=addjob id=375573 addr=127.0.0.1:13079 fd=12186 name= age=30 idle=30 flags=b qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=addjob
16525:P 21 May 11:30:15.644 # --- REGISTERS 16525:P 21 May 11:30:15.645 # RAX:0000000000000000 RBX:00000000004621f8 RCX:00007f276664d778 RDX:0000000000000000 RDI:0000000000020000 RSI:00007f276664d778 RBP:00000000004621c7 RSP:00007fff8d35d700 R8 :00007f276664d770 R9 :0000000000000001 R10:0000000000000000 R11:0000000000000206 R12:000000000000013f R13:00007f274eac5980 R14:0000000000000002 R15:0000000000000000 RIP:000000000042625f EFL:0000000000010206 CSGSFS:000000000000e033 16525:P 21 May 11:30:15.645 # (00007fff8d35d70f) -> 000000000043cf2d 16525:P 21 May 11:30:15.645 # (00007fff8d35d70e) -> 00007f2765c000c0 16525:P 21 May 11:30:15.645 # (00007fff8d35d70d) -> 00007f274e000000 16525:P 21 May 11:30:15.645 # (00007fff8d35d70c) -> 000000000000000c 16525:P 21 May 11:30:15.645 # (00007fff8d35d70b) -> 0000000000412ddc 16525:P 21 May 11:30:15.645 # (00007fff8d35d70a) -> 00007f274eac5980 16525:P 21 May 11:30:15.645 # (00007fff8d35d709) -> ddf8e1c14afef100 16525:P 21 May 11:30:15.645 # (00007fff8d35d708) -> 00007f2759b07148 16525:P 21 May 11:30:15.645 # (00007fff8d35d707) -> 0000000000410401 16525:P 21 May 11:30:15.645 # (00007fff8d35d706) -> 0000000000000001 16525:P 21 May 11:30:15.645 # (00007fff8d35d705) -> ddf8e1c14afef100 16525:P 21 May 11:30:15.645 # (00007fff8d35d704) -> 00007f2765c000c0 16525:P 21 May 11:30:15.645 # (00007fff8d35d703) -> 000000000041f86f 16525:P 21 May 11:30:15.645 # (00007fff8d35d702) -> 00007f276585d0c0 16525:P 21 May 11:30:15.645 # (00007fff8d35d701) -> 00007f274eac5980 16525:P 21 May 11:30:15.645 # (00007fff8d35d700) -> 00007f274eac5980 16525:P 21 May 11:30:15.645 # --- FAST MEMORY TEST 16525:P 21 May 11:30:15.645 # Bio thread for job type #0 terminated 16525:P 21 May 11:30:15.645 # Bio thread for job type #1 terminated 16525:P 21 May 11:30:19.644 # Fast memory test PASSED, however your memory can still be broken. Please run a memory test for several hours if possible. 16525:P 21 May 11:30:19.644 # === DISQUE BUG REPORT END. 
Make sure to include from START to END. ===