Closed — renatoc closed this issue 9 years ago
Thanks @renatoc, did you see this multiple times running your script? I hope it repeats itself from time to time so we can gain a bit more context.
@antirez not just once. :/
@renatoc thanks, no way to replicate so far, and the stack trace looks impossible, but running your script under Valgrind I'm seeing warnings... Fixing those; perhaps this resulted from one of the issues I fixed earlier in the context of the previous issues you opened (unlikely, btw). Please ping me if this happens again...
@antirez Oh thanks! My test scenario: I have two machines running tests independently, and when I did CLUSTER MEET the bugs started showing.
Oh that's kinda interesting... So you ran one test against, for example, Disque process A, one against Disque process B, and then when you did CLUSTER MEET bugs started to happen. This may be related or not since, in general, once you join the cluster all the cluster logic is enabled, so it opens a can of worms ;-) Thanks for the additional info.
yup, that's right. Anything that helps debug this, feel free to ask ;)
Thanks @renatoc!
The stack trace looks impossible. >.<
Can't find a direct code path that would change nodes_delivered's hash function to dictEncObjHash...
Thanks @sunheehnus, the best I can think of is that nodes_delivered was deallocated and a robj was allocated at the same or a similar address, so that we are accessing a different dict... I'll investigate more, but I wonder if @renatoc is seeing this again. Maybe this was due to some previous bug now fixed, but it is not obvious at all :-) Thanks.
@antirez
nodes_delivered was deallocated and a robj allocate at the same address or similar address
Agree with you. I will also investigate this; maybe the leak occurs during "cluster meet". I'll keep reviewing. :-)
Hi @antirez , any new info about this?
nodes_delivered was deallocated and a robj allocate at the same address or similar address
Just as you pointed out, some wrong deallocation may happen, so that memory still in use gets marked free and then reallocated.
I looked through:
git grep -nw "job\ "
git grep -nw "nodes_delivered"
...
and only found one hard-to-trigger error in the memory deallocation for jobs: dictGetSomeKeys() does not guarantee that it returns non-duplicated elements, which here may lead to wrong deallocation operations (freeing the same element twice).
@sunheehnus that's very interesting!!! :-) Thank you. I need to check if there is the same possible problem in Redis as well...
@sunheehnus Bug hopefully fixed (I have not tested the code yet; we need automatic tests for this in the unit tests). I replaced the dictGetSomeKeys() call with the usual getRandomKey() call. This is perhaps less efficient but safer in this case, and to de-dup would be kind of a nightmare... Your code review was GREAT! I really appreciate it :+1: There are good reasons to believe this bug was a memory corruption due to that, so I'm also closing it, hoping that @renatoc will report again if it triggers in the future.
Looks good to me. :+1: Thanks. :-)
Great, thank you!
Thanks a ton @renatoc... checking ASAP.
@antirez no prob. I am looking forward to seeing this awesome project become stable. ;)
Thanks! :-) I saw from your log a few things:
So please if you could, try the following:

1. `make clean; make noopt` in order to create an executable without optimizations enabled.
2. `gdb disque-server`, and when it starts, `run <arguments to disque>`; likely you just want `run disque.conf` or alike. More details: http://redis.io/topics/debugging (but it should not be needed).
3. `bt` to get a stack trace.

Other random ideas:
Thanks for your help!
Hey @antirez
This time I could not replicate it (not sure if that is because of the noopt mode). While trying to run the test in the same way, the following error happened: the process went to 100% CPU and blocked (maybe because optimizations are disabled?). I did the steps you suggested but there was no stack (I will attach a screenshot). Do you have any extra suggestions? Yup, my setup is saved very carefully ;)
Machine 1, disque 7711 log and error
Machine 1, disque 7712 log
Machine 2, disque 7711 log
Machine 2, disque 7712 log
Hope this helps; if you need anything more please ask :)
Hi @antirez @renatoc, the new error is triggered by this: we should check nodes_confirmed before we use it as a dict. Fixed in https://github.com/antirez/disque/pull/80 :-)
Hey @sunheehnus Thanks for helping with this! ;)
Today I tested (git hash c6346975) and got this error. If it is another error not related to this, let me know and I will create another issue.
Hi @antirez @renatoc , a new catch.
See job.c#L1158: when (async && extrepl) is TRUE we don't register the job, and so at job.c#L1215-1217 we free the job directly.
But it is possible that, when (async && extrepl) is TRUE, we still trigger updateJobAwakeTime(job,0) when job->delay is not 0. updateJobAwakeTime(job,0) will add the new job to the server.awakeme list.
After we release the new job directly, the freed job can still be indexed in server.awakeme. This may lead to this issue. :-)
Already fixed in https://github.com/antirez/disque/pull/83
Hey @antirez @renatoc, I think I have a new clue which is most likely the cause of this crash. :-)
Another node can `get` a copy, `requeue` the job, `ack` the job, and send `setack` to node1; node1 handles the `setack` msg with `acknowledgeJob`. `acknowledgeJob` will change `job->state` from `JOB_STATE_REPL` to `JOB_STATE_ACKED` and trigger the GC to release the job. After the release, the next `clientsCron` / `clientsCronHandleDelayedJobReplication` will cause the crash.
The commands involved:
getjob from 123
ADDJOB "123" "hello world!" 0 ttl 60
ackjob DIf21616924a815d517c10b87bd71742f9f36e23df0001SQ
=== DISQUE BUG REPORT START: Cut & paste starting from here ===
15958:P 28 Jun 22:33:30.351 # Disque 0.0.1 crashed by signal: 11
15958:P 28 Jun 22:33:30.351 # Failed assertion:
connected_clients:1 client_longest_output_list:0 client_biggest_input_buf:0 blocked_clients:1
used_memory:419520 used_memory_human:409.69K used_memory_rss:1937408 used_memory_peak:456144 used_memory_peak_human:445.45K mem_fragmentation_ratio:4.62 mem_allocator:jemalloc-3.6.0
registered_jobs:0
registered_queues:0
loading:0 aof_enabled:0 aof_rewrite_in_progress:0 aof_rewrite_scheduled:0 aof_last_rewrite_time_sec:-1 aof_current_rewrite_time_sec:-1 aof_last_bgrewrite_status:ok aof_last_write_status:ok
total_connections_received:3 total_commands_processed:3 instantaneous_ops_per_sec:0 total_net_input_bytes:197 total_net_output_bytes:0 instantaneous_input_kbps:0.00 instantaneous_output_kbps:0.00 rejected_connections:0 latest_fork_usec:0
used_cpu_sys:0.91 used_cpu_user:0.53 used_cpu_sys_children:0.00 used_cpu_user_children:0.00
cmdstat_addjob:calls=3,usec=72,usec_per_call=24.00 hash_init_value: 1436221972
15958:P 28 Jun 22:33:30.352 # --- CLIENT LIST OUTPUT 15958:P 28 Jun 22:33:30.352 # id=3 addr=127.0.0.1:49412 fd=11 name= age=536 idle=530 flags=b qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=addjob
15958:P 28 Jun 22:33:30.352 # --- REGISTERS 15958:P 28 Jun 22:33:30.352 # RAX:00007fed274495a0 RBX:0000000000000000 RCX:0000000000000009 RDX:00007fed2748a0c0 RDI:0000000000000000 RSI:00007fed2748a0c0 RBP:0000000000000000 RSP:00007fff8588c5b0 R8 :0000000000000001 R9 :1999999999999999 R10:0000000000000000 R11:0000000000000001 R12:00007fed2748a0c0 R13:0000000000000001 R14:0000000000000001 R15:0000000000000000 RIP:0000000000411f04 EFL:0000000000010202 CSGSFS:0000000000000033 15958:P 28 Jun 22:33:30.352 # (00007fff8588c5bf) -> 000000002741e2c8 15958:P 28 Jun 22:33:30.352 # (00007fff8588c5be) -> 0000000000000058 15958:P 28 Jun 22:33:30.352 # (00007fff8588c5bd) -> 000000000042ad76 15958:P 28 Jun 22:33:30.352 # (00007fff8588c5bc) -> 0000000000000000 15958:P 28 Jun 22:33:30.352 # (00007fff8588c5bb) -> 00007fed27414100 15958:P 28 Jun 22:33:30.352 # (00007fff8588c5ba) -> 0000000000000000 15958:P 28 Jun 22:33:30.352 # (00007fff8588c5b9) -> 000000000041208e 15958:P 28 Jun 22:33:30.352 # (00007fff8588c5b8) -> 0000000000000000 15958:P 28 Jun 22:33:30.352 # (00007fff8588c5b7) -> 0000000000000001 15958:P 28 Jun 22:33:30.352 # (00007fff8588c5b6) -> 0000000000000001 15958:P 28 Jun 22:33:30.352 # (00007fff8588c5b5) -> 00007fed2748a0c0 15958:P 28 Jun 22:33:30.352 # (00007fff8588c5b4) -> 00007fed27414100 15958:P 28 Jun 22:33:30.352 # (00007fff8588c5b3) -> 0000000000000000 15958:P 28 Jun 22:33:30.352 # (00007fff8588c5b2) -> 0000000000000000 15958:P 28 Jun 22:33:30.352 # (00007fff8588c5b1) -> 00007fed2748a0c0 15958:P 28 Jun 22:33:30.352 # (00007fff8588c5b0) -> 0000000000000000 15958:P 28 Jun 22:33:30.352 # --- FAST MEMORY TEST 15958:P 28 Jun 22:33:30.352 # Bio thread for job type #0 terminated 15958:P 28 Jun 22:33:30.352 # Bio thread for job type #1 terminated Testing 67a000 86016 Testing a12000 135168 Testing 7fed263ff000 8388608 Testing 7fed26c00000 12582912 Testing 7fed2787d000 4096 Testing 7fed28000000 4194304 Testing 7fed2892a000 20480 Testing 7fed28b48000 16384 Testing 7fed29044000 
16384 Testing 7fed29067000 12288 15958:P 28 Jun 22:33:30.545 # Fast memory test PASSED, however your memory can still be broken. Please run a memory test for several hours if possible. 15958:P 28 Jun 22:33:30.545 # === DISQUE BUG REPORT END. Make sure to include from START to END. ===
Please report the crash by opening an issue on github:
http://github.com/antirez/disque/issues
Suspect RAM error? Use disque-server --test-memory to verify it.
[2] 15958 segmentation fault (core dumped) ../disque-server --port 7000
I tried to fix it in https://github.com/antirez/disque/pull/90 :-)
Bingo! :balloon:
That was a great investigation :-) The timeline of events is very convincing; I believe this is indeed the case and this bug is, finally, fixed.
About the implementation of the fix in #90, I'm more biased towards sending the client the Job ID instead of an error. After all, it does not matter that the replication was not achieved as the client expected: the job was correctly delivered. So in order to simplify the semantics of Disque, and the implementation of clients, it looks like a good idea to just return the ID as if the command executed successfully. After all, this is the expected semantics.
What I mean is that even without this set of events, what could happen is:
So, it is a valid execution flow to get the reply of ADDJOB when the job was already delivered and acknowledged. This is why I would not add new semantics in the form of an error.
Thanks again!
Great, a happy and meaningful Sunday night I have spent. ;-) Your solution is more correct; I totally agree with you. I am so excited that we finally solved this. I am now in bed, away from my lab. Could you please fix this bug just as you said? I cannot wait to get this fixed. :)
Hi @antirez ,
it does not matter the replication was not achieved as the client expected: the job was correctly delivered. So in order to simplify the semantics of Disque, and the implementation of clients, it looks like a good idea to just return the ID as if the command successfully executed. After all this is expected semantic.
Makes sense and is more correct!
The new commit in https://github.com/antirez/disque/pull/90 fixed this (sending the blocking client the Job ID instead of an error). :-)
@renatoc hey, we are pretty confident the issue is fixed now; could you please confirm? Thanks!
@antirez Oh yeah, it's fixed :beer:
:beer:
Great! Closing finally :-)
=== DISQUE BUG REPORT START: Cut & paste starting from here ===
16525:P 21 May 11:30:15.598 # ------------------------------------------------
16525:P 21 May 11:30:15.598 # !!! Software Failure. Press left mouse button to continue
16525:P 21 May 11:30:15.598 # Guru Meditation: "Unknown encoding type" #object.c:319
16525:P 21 May 11:30:15.598 # (forcing SIGSEGV in order to print the stack trace)
16525:P 21 May 11:30:15.598 # ------------------------------------------------
16525:P 21 May 11:30:15.598 # Disque 0.0.1 crashed by signal: 11
16525:P 21 May 11:30:15.598 # Failed assertion: (:0)
16525:P 21 May 11:30:15.598 # --- STACK TRACE
/usr/local/bin/disque-server 0.0.0.0:7711(logStackTrace+0x3e)[0x4269be]
/usr/local/bin/disque-server 0.0.0.0:7711(_serverPanic+0x7f)[0x42625f]
/lib/x86_64-linux-gnu/libpthread.so.0(+0xfcb0)[0x7f2766663cb0]
/usr/local/bin/disque-server 0.0.0.0:7711(_serverPanic+0x7f)[0x42625f]
/usr/local/bin/disque-server 0.0.0.0:7711(getDecodedObject+0xbf)[0x41f86f]
/usr/local/bin/disque-server 0.0.0.0:7711(dictEncObjHash+0x2c)[0x412ddc]
/usr/local/bin/disque-server 0.0.0.0:7711(dictAddRaw+0x3b)[0x411f2b]
/usr/local/bin/disque-server 0.0.0.0:7711(dictAdd+0x1e)[0x41208e]
/usr/local/bin/disque-server 0.0.0.0:7711(clusterReplicateJob+0x76)[0x42ade6]
/usr/local/bin/disque-server 0.0.0.0:7711(clientsCronHandleDelayedJobReplication+0x47)[0x42ff47]
/usr/local/bin/disque-server 0.0.0.0:7711(clientsCron+0x8a)[0x413a9a]
/usr/local/bin/disque-server 0.0.0.0:7711(serverCron+0xe1)[0x415001]
/usr/local/bin/disque-server 0.0.0.0:7711(aeProcessEvents+0x20b)[0x4101cb]
/usr/local/bin/disque-server 0.0.0.0:7711(aeMain+0x2b)[0x4103db]
/usr/local/bin/disque-server 0.0.0.0:7711(main+0x302)[0x40f162]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xed)[0x7f27662b776d]
/usr/local/bin/disque-server 0.0.0.0:7711[0x40f27d]
16525:P 21 May 11:30:15.599 # --- INFO OUTPUT
16525:P 21 May 11:30:15.601 # # Server
disque_version:0.0.1
disque_git_sha1:9ce5cb67
disque_git_dirty:0
disque_build_id:adcb5342e3e5a0ed
os:Linux 3.2.0-69-virtual x86_64
arch_bits:64
multiplexing_api:epoll
gcc_version:4.6.3
process_id:16525
run_id:c625c8ab6ce9cacc9cb54e04cb06752fff8f15af
tcp_port:7711
uptime_in_seconds:341
uptime_in_days:0
hz:10
config_file:/etc/disque/7711/disque.conf
Clients
connected_clients:14639 client_longest_output_list:0 client_biggest_input_buf:0 blocked_clients:14639
Memory
used_memory:373456928 used_memory_human:356.16M used_memory_rss:173277184 used_memory_peak:380637792 used_memory_peak_human:363.00M mem_fragmentation_ratio:0.46 mem_allocator:jemalloc-3.6.0
Jobs
registered_jobs:54
Queues
registered_queues:1001
Persistence
loading:0 aof_enabled:0 aof_rewrite_in_progress:0 aof_rewrite_scheduled:0 aof_last_rewrite_time_sec:-1 aof_current_rewrite_time_sec:-1 aof_last_bgrewrite_status:ok aof_last_write_status:ok
Stats
total_connections_received:375779 total_commands_processed:928483 instantaneous_ops_per_sec:60 total_net_input_bytes:189129863 total_net_output_bytes:250909533 instantaneous_input_kbps:1.73 instantaneous_output_kbps:8.10 rejected_connections:0 latest_fork_usec:0
CPU
used_cpu_sys:49.50 used_cpu_user:62.80 used_cpu_sys_children:0.00 used_cpu_user_children:0.00
Commandstats
cmdstat_cluster:calls=2,usec=35,usec_per_call=17.50 cmdstat_hello:calls=375777,usec=1500012,usec_per_call=3.99 cmdstat_addjob:calls=184231,usec=1505720,usec_per_call=8.17 cmdstat_getjob:calls=191546,usec=796183,usec_per_call=4.16 cmdstat_ackjob:calls=176927,usec=929356,usec_per_call=5.25 hash_init_value: 1432287713
16525:P 21 May 11:30:15.601 # --- CLIENT LIST OUTPUT 16525:P 21 May 11:30:15.642 # id=375574 addr=127.0.0.1:13080 fd=12204 name= age=30 idle=30 flags=b qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=addjob id=353307 addr=127.0.0.1:55631 fd=5862 name= age=46 idle=46 flags=b qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=getjob id=353308 addr=127.0.0.1:55632 fd=5986 name= age=46 idle=46 flags=b qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=getjob id=353309 addr=127.0.0.1:55634 fd=6008 name= age=46 idle=46 flags=b qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=getjob id=353310 addr=127.0.0.1:55636 fd=6024 name= age=46 idle=46 flags=b qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=getjob id=353312 addr=127.0.0.1:55638 fd=8346 name= age=46 idle=46 flags=b qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=getjob id=353314 addr=127.0.0.1:55640 fd=9665 name= age=46 idle=46 flags=b qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=getjob id=353317 addr=127.0.0.1:55643 fd=10474 name= age=46 idle=46 flags=b qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=getjob id=353320 addr=127.0.0.1:55646 fd=12737 name= age=46 idle=46 flags=b qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=getjob id=353323 addr=127.0.0.1:55650 fd=13208 name= age=46 idle=46 flags=b qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=getjob id=353324 addr=127.0.0.1:55649 fd=13307 name= age=46 idle=46 flags=b qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=getjob id=353327 addr=127.0.0.1:55657 fd=13799 name= age=46 idle=46 flags=b qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=getjob id=353329 addr=127.0.0.1:55664 fd=738 name= age=46 idle=46 flags=b qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=getjob ... ... ... 
id=375560 addr=127.0.0.1:13059 fd=10645 name= age=30 idle=30 flags=b qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=addjob id=375561 addr=127.0.0.1:13060 fd=10886 name= age=30 idle=30 flags=b qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=addjob id=375562 addr=127.0.0.1:13061 fd=10889 name= age=30 idle=30 flags=b qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=addjob id=375563 addr=127.0.0.1:13063 fd=11165 name= age=30 idle=30 flags=b qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=addjob id=375564 addr=127.0.0.1:13067 fd=11182 name= age=30 idle=30 flags=b qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=addjob id=375565 addr=127.0.0.1:13070 fd=11335 name= age=30 idle=30 flags=b qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=addjob id=375566 addr=127.0.0.1:13071 fd=11586 name= age=30 idle=30 flags=b qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=addjob id=375567 addr=127.0.0.1:13072 fd=11620 name= age=30 idle=30 flags=b qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=addjob id=375568 addr=127.0.0.1:13073 fd=11822 name= age=30 idle=30 flags=b qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=addjob id=375569 addr=127.0.0.1:13074 fd=11847 name= age=30 idle=30 flags=b qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=addjob id=375570 addr=127.0.0.1:13076 fd=11853 name= age=30 idle=30 flags=b qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=addjob id=375571 addr=127.0.0.1:13077 fd=11945 name= age=30 idle=30 flags=b qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=addjob id=375572 addr=127.0.0.1:13078 fd=12093 name= age=30 idle=30 flags=b qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=addjob id=375573 addr=127.0.0.1:13079 fd=12186 name= age=30 idle=30 flags=b qbuf=0 qbuf-free=0 obl=0 oll=0 omem=0 events=r cmd=addjob
16525:P 21 May 11:30:15.644 # --- REGISTERS 16525:P 21 May 11:30:15.645 # RAX:0000000000000000 RBX:00000000004621f8 RCX:00007f276664d778 RDX:0000000000000000 RDI:0000000000020000 RSI:00007f276664d778 RBP:00000000004621c7 RSP:00007fff8d35d700 R8 :00007f276664d770 R9 :0000000000000001 R10:0000000000000000 R11:0000000000000206 R12:000000000000013f R13:00007f274eac5980 R14:0000000000000002 R15:0000000000000000 RIP:000000000042625f EFL:0000000000010206 CSGSFS:000000000000e033 16525:P 21 May 11:30:15.645 # (00007fff8d35d70f) -> 000000000043cf2d 16525:P 21 May 11:30:15.645 # (00007fff8d35d70e) -> 00007f2765c000c0 16525:P 21 May 11:30:15.645 # (00007fff8d35d70d) -> 00007f274e000000 16525:P 21 May 11:30:15.645 # (00007fff8d35d70c) -> 000000000000000c 16525:P 21 May 11:30:15.645 # (00007fff8d35d70b) -> 0000000000412ddc 16525:P 21 May 11:30:15.645 # (00007fff8d35d70a) -> 00007f274eac5980 16525:P 21 May 11:30:15.645 # (00007fff8d35d709) -> ddf8e1c14afef100 16525:P 21 May 11:30:15.645 # (00007fff8d35d708) -> 00007f2759b07148 16525:P 21 May 11:30:15.645 # (00007fff8d35d707) -> 0000000000410401 16525:P 21 May 11:30:15.645 # (00007fff8d35d706) -> 0000000000000001 16525:P 21 May 11:30:15.645 # (00007fff8d35d705) -> ddf8e1c14afef100 16525:P 21 May 11:30:15.645 # (00007fff8d35d704) -> 00007f2765c000c0 16525:P 21 May 11:30:15.645 # (00007fff8d35d703) -> 000000000041f86f 16525:P 21 May 11:30:15.645 # (00007fff8d35d702) -> 00007f276585d0c0 16525:P 21 May 11:30:15.645 # (00007fff8d35d701) -> 00007f274eac5980 16525:P 21 May 11:30:15.645 # (00007fff8d35d700) -> 00007f274eac5980 16525:P 21 May 11:30:15.645 # --- FAST MEMORY TEST 16525:P 21 May 11:30:15.645 # Bio thread for job type #0 terminated 16525:P 21 May 11:30:15.645 # Bio thread for job type #1 terminated 16525:P 21 May 11:30:19.644 # Fast memory test PASSED, however your memory can still be broken. Please run a memory test for several hours if possible. 16525:P 21 May 11:30:19.644 # === DISQUE BUG REPORT END. 
Make sure to include from START to END. ===