Netflix / dynomite

A generic dynamo implementation for different k-v storage engines
Apache License 2.0
4.19k stars 530 forks source link

Dynomite segfaults on Memcached "get" command #550

Open ahmed-sigmalux opened 6 years ago

ahmed-sigmalux commented 6 years ago

I'm unable to read from Memcached via Dynomite.

EDIT: See message below.

ahmed-sigmalux commented 6 years ago

A simple get command causes dynomite to crash:

$ telnet localhost 21211
Trying ::1...
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
get greeting
Connection closed by foreign host.

Dynomite in debug output shows:

[2018-03-30 04:22:33.975] event_add_out:150 adding conn <CONN_SERVER 0x561f86b84ff0 4 to '127.0.0.1:11211:1'> to active
[2018-03-30 04:22:33.975] msg_tmo_insert:269 insert req 2 into tmo rbt with expiry of 5000 msec
[2018-03-30 04:22:33.975] req_server_enqueue_imsgq:963 conn 0x561f86b84ff0 enqueue inq 2:0
[2018-03-30 04:22:33.975] local_req_forward:605 <CONN_CLIENT 0x561f86b973b0 8 from '127.0.0.1:36342'> local forward <REQ 0x561f86b9b9c0 2:0::1 REQ_MC_GET, len:15> to <CONN_SERVER 0x561f86b84ff0 4 to '127.0.0.1:11211:1'> len 15 key ''
[2018-03-30 04:22:33.975] core_core:508 event FF00 on <CONN_SERVER 0x561f86b84ff0 4 to '127.0.0.1:11211:1'>
[2018-03-30 04:22:33.975] req_send_next:908 send next req 2 len 15 type 1 on s 4
[2018-03-30 04:22:33.975] msg_send_chain:1089 About to dump out the content of msg
[2018-03-30 04:22:33.975] msg_dump:680 msg dump id 2 request 1 len 15 type 1 done 0 error 0 (err 0)
[2018-03-30 04:22:33.975] mbuf_dump:156 mbuf 0x561f86ba3c90 with 4 bytes of data
00000000  67 65 74 20                                        |get |
[2018-03-30 04:22:33.975] mbuf_dump:156 mbuf 0x561f86b9fc70 with 11 bytes of data
00000000  67 72 65 65 74 69 6e 67  20 0d 0a                  |greeting ..|
[2018-03-30 04:22:33.975] msg_dump:685 =================================================
[2018-03-30 04:22:33.975] conn_sendv_data:447 sendv on sd 4 15 of 15 in 2 buffers
[2018-03-30 04:22:33.975] req_send_done:925 send done req 2 len 15 type 1 on s 4
[2018-03-30 04:22:33.975] req_server_dequeue_imsgq:977 conn 0x561f86b84ff0 dequeue inq 2:0
[2018-03-30 04:22:33.975] req_server_enqueue_omsgq:993 conn 0x561f86b84ff0 enqueue outq 2:0
[2018-03-30 04:22:33.975] event_del_out:181 removing conn <CONN_SERVER 0x561f86b84ff0 4 to '127.0.0.1:11211:1'> from active
[2018-03-30 04:22:33.976] core_core:508 event 00FF on <CONN_SERVER 0x561f86b84ff0 4 to '127.0.0.1:11211:1'>
[2018-03-30 04:22:33.976] _msg_get:324 alloc_msg_count: 4 caller: rsp_get <CONN_SERVER 0x561f86b84ff0 4 to '127.0.0.1:11211:1'>
[2018-03-30 04:22:33.976] _dn_alloc:259 malloc(720) at 0x561f86ba4050 @ dyn_message.c:326
[2018-03-30 04:22:33.976] _dn_alloc:259 malloc(32) at 0x561f86ba4330 @ ../dyn_array.c:34
[2018-03-30 04:22:33.976] _dn_alloc:259 malloc(32) at 0x561f86ba4360 @ ../dyn_array.c:39
[2018-03-30 04:22:33.976] msg_get:467 get msg 0x561f86ba4050 id 4 request 0 owner sd 4
[2018-03-30 04:22:33.976] _dn_alloc:259 malloc(16400) at 0x561f86ba4390 @ dyn_mbuf.c:60
[2018-03-30 04:22:33.976] mbuf_get:121 get mbuf 0x561f86ba8360
[2018-03-30 04:22:33.976] mbuf_insert:223 insert mbuf 0x561f86ba8360 len 0
[2018-03-30 04:22:33.976] conn_recv_data:396 <CONN_SERVER 0x561f86b84ff0 4 to '127.0.0.1:11211:1'> recv 39 of 16320
[2018-03-30 04:22:33.976] memcache_parse_rsp:1221 parsed rsp 4 res 0 type 19 state 0 rpos 39 of 39
00000000  56 41 4c 55 45 20 67 72  65 65 74 69 6e 67 20 31   |VALUE greeting 1|
00000010  20 31 31 0d 0a 48 65 6c  6c 6f 20 77 6f 72 6c 64   | 11..Hello world|
00000020  0d 0a 45 4e 44 0d 0a                               |..END..|
[2018-03-30 04:22:33.976] server_ok:323 reset server '127.0.0.1:11211:1' failure count from 0 to 0
[2018-03-30 04:22:33.976] msg_tmo_delete:289 delete req 2 from tmo rbt
[2018-03-30 04:22:33.976] req_server_dequeue_omsgq:1009 conn 0x561f86b84ff0 dequeue outq 2:0
[2018-03-30 04:22:33.976] server_rsp_forward:834 <CONN_CLIENT 0x561f86b973b0 8 from '127.0.0.1:36342'> <REQ 0x561f86b9b9c0 2:0::1 REQ_MC_GET, len:15> RECEIVED <RSP 0x561f86ba4050 4:0 RSP_MC_END len:39>
[2018-03-30 04:22:33.976] msg_local_one_rsp_handler:1018 -2034648624 SELECTED -2034614176
[2018-03-30 04:22:33.976] dn_stacktrace:326 [0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x110c0) [0x7febc12540c0]
??:0
[2018-03-30 04:22:33.978] dn_stacktrace:326 [1] dynomite(memcache_post_coalesce+0x1f) [0x561f86814a4f]
??:0
[2018-03-30 04:22:33.980] dn_stacktrace:326 [2] dynomite(req_done+0x1c0) [0x561f8681e270]
??:0
[2018-03-30 04:22:33.982] dn_stacktrace:326 [3] dynomite(+0x622db) [0x561f868422db]
??:0
[2018-03-30 04:22:33.983] dn_stacktrace:326 [4] dynomite(msg_recv+0x5e6) [0x561f86820ad6]
??:0
[2018-03-30 04:22:33.985] dn_stacktrace:326 [5] dynomite(core_core+0x11f) [0x561f8683227f]
??:0
[2018-03-30 04:22:33.986] dn_stacktrace:326 [6] dynomite(event_wait+0x1a1) [0x561f8681d1f1]
??:0
[2018-03-30 04:22:33.988] dn_stacktrace:326 [7] dynomite(core_loop+0x109) [0x561f86836789]
??:0
[2018-03-30 04:22:33.989] dn_stacktrace:326 [8] dynomite(main+0x708) [0x561f867f2578]
??:0
[2018-03-30 04:22:33.991] dn_stacktrace:326 [9] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf1) [0x7febc0bc02e1]
??:0
[2018-03-30 04:22:33.993] dn_stacktrace:326 [10] dynomite(_start+0x2a) [0x561f867f29ea]
??:0
[2018-03-30 04:22:33.994] signal_handler:132 signal 11 (SIGSEGV) received, core dumping
Segmentation fault
ipapapa commented 6 years ago

At this point, we do not have the capacity to work on the Memcached side. We have main mainly focused on the Redis wire protocol. If you would like to assist us there are a few related issues it should not be hard to identify were the protocol fails. See related issues:

https://github.com/Netflix/dynomite/issues/504 https://github.com/Netflix/dynomite/issues/220

We also have a project open for folks who want to assist: https://github.com/Netflix/dynomite/projects

KowalczykBartek commented 6 years ago

Hi Dynomite team,

As I mentioned on gitter chanell (sorry for delay) I was going thorough Dynomite source code trying to find root of problem reported here, I am pretty sure (of course, I am not C dev, I can be wrong) that problem is not parsing of Memcache protocol itself, but problem is in post_coalesce logic, e.g here

ASSERT(!response->is_request);
ASSERT(request->is_request && (request->frag_owner == request));

When I reach Redis post_coalesce, also, similar problem occurs :

[2018-05-31 23:57:50.638] dn_assert:342 assert '0' failed @ (dyn_redis.c, 2973)
[2018-05-31 23:57:50.638] dn_stacktrace:326 [0] 2   dynomite                            0x0000000103da4810 redis_fragment_argx + 1200
sh: addr2line: command not found
[2018-05-31 23:57:50.644] dn_stacktrace:326 [1] 3   dynomite                            0x0000000103d6b9db req_recv_done + 555
sh: addr2line: command not found
[2018-05-31 23:57:50.649] dn_stacktrace:326 [2] 4   dynomite                            0x0000000103d818e4 msg_recv + 1284
sh: addr2line: command not found
[2018-05-31 23:57:50.653] dn_stacktrace:326 [3] 5   dynomite                            0x0000000103d744f6 core_core + 326
sh: addr2line: command not found
[2018-05-31 23:57:50.657] dn_stacktrace:326 [4] 6   dynomite                            0x0000000103da6160 event_wait + 448
sh: addr2line: command not found
[2018-05-31 23:57:50.662] dn_stacktrace:326 [5] 7   dynomite                            0x0000000103d74ee9 core_loop + 505
sh: addr2line: command not found
[2018-05-31 23:57:50.666] dn_stacktrace:326 [6] 8   dynomite                            0x0000000103d94fa8 main + 1464
sh: addr2line: command not found
[2018-05-31 23:57:50.670] dn_stacktrace:326 [7] 9   libdyld.dylib                       0x00007fff5744e115 start + 1
sh: addr2line: command not found
[2018-05-31 23:57:50.673] dn_stacktrace:326 [8] 10  ???                                 0x0000000000000003 0x0 + 3
shailesh33 commented 6 years ago

@KowalczykBartek I think this has to do with the way request is fragmented in dyn_memcache.c Let me see if I can dig deeper here today or tomorrow., is it straight forward to reproduce

shailesh33 commented 6 years ago

Can you try commenting the assert and see if it helps? ( i doubt but might be worth?)

KowalczykBartek commented 6 years ago

@shailesh33 sorry for update almost after one month :D I was not able to reproduce my Redis fail. But be back to memcached, this is state in GDB

Thread 1 "dynomite" hit Breakpoint 1, memcache_post_coalesce (request=0x555555802000) at dyn_memcache.c:1541
1541        struct msg *response = request->peer;
(gdb) p request->peer
$1 = (struct msg *) 0x0

because peer is uninitializedpoints to NULL) it causes SIGSEGV

btw, I started to play with GDB and Dynomite, and I found first problem ( there was used property peer instead of selected_rsp) - of course I can be wrong (I didn't deep into source code enough, I just spotted that Redis implementation is using selected_rsp as well, and that problem that I described above is not implementation depended). I made first change, https://github.com/KowalczykBartek/dynomite/pull/1/files and looks that it fixed one problem, but of course there is another one,

[2018-07-15 01:30:56.882] server_connected:309 <CONN_SERVER 0x7f8ff7d00be0 11 to '127.0.0.1:22122:1'> connected 
[2018-07-15 01:31:01.376] proxy_accept:214 <CONN_PROXY 0x7f8ff7d01700 12 listening on '127.0.0.1:8102'> accepted <CONN_CLIENT 0x7f8ff7e00070 14 from '127.0.0.1:51936'>
[2018-07-15 01:31:08.799] dn_stacktrace:326 [0] 2   libsystem_platform.dylib            0x00007fff5f43ef5a _sigtramp + 26
sh: addr2line: command not found
[2018-07-15 01:31:08.803] dn_stacktrace:326 [1] 3   ???                                 0x00000001003c2046 0x0 + 4298907718
sh: addr2line: command not found
[2018-07-15 01:31:08.806] dn_stacktrace:326 [2] 4   dynomite                            0x000000010007e802 dnode_peer_pool_server + 258
sh: addr2line: command not found
[2018-07-15 01:31:08.809] dn_stacktrace:326 [3] 5   dynomite                            0x000000010006f514 remote_req_forward + 132
sh: addr2line: command not found
[2018-07-15 01:31:08.812] dn_stacktrace:326 [4] 6   dynomite                            0x00000001000703a5 req_forward + 1797
sh: addr2line: command not found
[2018-07-15 01:31:08.815] dn_stacktrace:326 [5] 7   dynomite                            0x000000010006fc4e req_recv_done + 990
sh: addr2line: command not found
[2018-07-15 01:31:08.817] dn_stacktrace:326 [6] 8   dynomite                            0x0000000100084a48 msg_recv + 1320
sh: addr2line: command not found
[2018-07-15 01:31:08.820] dn_stacktrace:326 [7] 9   dynomite                            0x0000000100077f16 core_core + 326
sh: addr2line: command not found
[2018-07-15 01:31:08.823] dn_stacktrace:326 [8] 10  dynomite                            0x00000001000a81ee event_wait + 462
sh: addr2line: command not found
[2018-07-15 01:31:08.826] dn_stacktrace:326 [9] 11  dynomite                            0x000000010007890b core_loop + 507
sh: addr2line: command not found
[2018-07-15 01:31:08.829] dn_stacktrace:326 [10] 12  dynomite                            0x0000000100098d18 main + 1448
sh: addr2line: command not found
[2018-07-15 01:31:08.831] dn_stacktrace:326 [11] 13  libdyld.dylib                       0x00007fff5f130015 start + 1
sh: addr2line: command not found
[2018-07-15 01:31:08.834] dn_stacktrace:326 [12] 14  ???                                 0x0000000000000003 0x0 + 3
sh: addr2line: command not found
[2018-07-15 01:31:08.837] signal_handler:132 signal 11 (SIGSEGV) received, core dumping
Segmentation fault: 11 (core dumped)
C02QF0Z2G8WM:dynomite i316755$ git status

but happens after ~2/3 correct reads. I will try to debug it futher or maybe someone will be interested in playing with this bug as well.

Amritanshu02 commented 3 years ago

Hi! I would like to work on this issue.Can you suggest me where to start ?