Netflix / dynomite

A generic dynamo implementation for different k-v storage engines

dynomite crashing when a peer node goes down #678

Closed mattsdevop closed 5 years ago

mattsdevop commented 5 years ago

I am experiencing an issue in any version other than dynomite-v0.5.9-5_MuslCompatiblity. OS: centos-release-7-6.1810.2.el7.centos.x86_64

This is what I was seeing in the log when building without --enable-debug=yes on v0.6.14, v0.6.9, and v0.7:

[2019-07-01 15:56:52.940] dn_stacktrace:286 [0] /lib64/libpthread.so.0(+0xf5d0) [0x7fdd294ed5d0]
[2019-07-01 15:56:52.946] dn_stacktrace:286 [1] src/dynomite() [0x41e63b]
[2019-07-01 15:56:52.951] dn_stacktrace:286 [2] src/dynomite() [0x41e7ad]
[2019-07-01 15:56:52.957] dn_stacktrace:286 [3] src/dynomite(core_loop+0xdd) [0x413a8d]
[2019-07-01 15:56:52.962] dn_stacktrace:286 [4] src/dynomite(main+0x6a0) [0x40cd10]
[2019-07-01 15:56:52.966] dn_stacktrace:286 [5] /lib64/libc.so.6(__libc_start_main+0xf5) [0x7fdd28e313d5]
[2019-07-01 15:56:52.970] dn_stacktrace:286 [6] src/dynomite() [0x40d109]
[2019-07-01 15:56:52.975] signal_handler:123 signal 11 (SIGSEGV) received, core dumping
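For what it's worth, raw addresses in a trace like this can be mapped back to functions and source lines with addr2line from binutils, as long as the binary still has symbols (the addresses below are copied from the log above):

# resolve frame addresses from the stack trace against the same binary
# -f prints the enclosing function name; -C (demangle) is harmless for plain C
addr2line -e src/dynomite -f -C 0x41e63b 0x413a8d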

I rebuilt with CFLAGS="-ggdb3 -O0" ./configure --enable-debug=full and used gdb to retrieve the following:

[2019-07-01 16:35:13.795] proxy_accept:203 <CONN_PROXY 0x6a6690 15 listening on '0.0.0.0:8102'> accepted <CONN_CLIENT 0x74c400 649 from '192.168.1.248:33942'>
[2019-07-01 16:35:13.837] conn_connect:278 <CONN_LOCAL_PEER_SERVER 0x74c730 650 to '192.168.1.89:8101:rack2:dc1:0'> connecting.....
[2019-07-01 16:35:13.837] core_close:414 close <CONN_LOCAL_PEER_SERVER 0x74c730 650 to '192.168.1.89:8101:rack2:dc1:0'> on event FFFFFF eof 0 done 0 rb 0 sb 0: Connection refused
[2019-07-01 16:35:13.837] dnode_peer_close:420 <CONN_LOCAL_PEER_SERVER 0x74c730 650 to '192.168.1.89:8101:rack2:dc1:0'> Closing, Dropped 0 outqueue & 0 inqueue requests
[2019-07-01 16:35:13.837] event_del_conn:211 epoll ctl on e 13 sd 650 failed: No such file or directory
[2019-07-01 16:35:13.837] dnode_peer_unref:55 Marking <NODE 0x6a6e90 192.168.117.89 dc1 rack2 secured:0> as down
[2019-07-01 16:35:13.887] dn_assert:301 assert 'req->is_request' failed @ (dyn_server.c, 727)
[2019-07-01 16:35:13.891] dn_stacktrace:286 [0] /root/dynomite/src/dynomite() [0x42e03d]
Detaching after fork from child process 24681.
/root/dynomite/src/dyn_server.c:729
[2019-07-01 16:35:13.897] dn_stacktrace:286 [1] /root/dynomite/src/dynomite() [0x42e4b8]
Detaching after fork from child process 24683.
/root/dynomite/src/dyn_server.c:803
[2019-07-01 16:35:13.903] dn_stacktrace:286 [2] /root/dynomite/src/dynomite() [0x426513]
Detaching after fork from child process 24685.
/root/dynomite/src/dyn_message.c:785
[2019-07-01 16:35:13.908] dn_stacktrace:286 [3] /root/dynomite/src/dynomite() [0x426759]
Detaching after fork from child process 24687.
/root/dynomite/src/dyn_message.c:845
[2019-07-01 16:35:13.913] dn_stacktrace:286 [4] /root/dynomite/src/dynomite() [0x426d6d]
Detaching after fork from child process 24689.
/root/dynomite/src/dyn_message.c:1009
[2019-07-01 16:35:13.918] dn_stacktrace:286 [5] /root/dynomite/src/dynomite(msg_recv+0x99) [0x426e5b]
Detaching after fork from child process 24691.
/root/dynomite/src/dyn_message.c:1040
[2019-07-01 16:35:13.923] dn_stacktrace:286 [6] /root/dynomite/src/dynomite() [0x416184]
Detaching after fork from child process 24693.
/root/dynomite/src/dyn_core.c:384
[2019-07-01 16:35:13.928] dn_stacktrace:286 [7] /root/dynomite/src/dynomite(core_core+0x10a) [0x4167d1]
Detaching after fork from child process 24695.
/root/dynomite/src/dyn_core.c:517
[2019-07-01 16:35:13.933] dn_stacktrace:286 [8] /root/dynomite/src/dynomite(event_wait+0x187) [0x453bd0]
Detaching after fork from child process 24697.
/root/dynomite/src/event/dyn_epoll.c:234
[2019-07-01 16:35:13.938] dn_stacktrace:286 [9] /root/dynomite/src/dynomite(core_loop+0x7c) [0x417048]
Detaching after fork from child process 24699.
/root/dynomite/src/dyn_core.c:646 (discriminator 3)
[2019-07-01 16:35:13.943] dn_stacktrace:286 [10] /root/dynomite/src/dynomite() [0x43e5e7]
Detaching after fork from child process 24701.
/root/dynomite/src/dynomite.c:564
[2019-07-01 16:35:13.948] dn_stacktrace:286 [11] /root/dynomite/src/dynomite(main+0x11c) [0x43e74f]
Detaching after fork from child process 24703.
/root/dynomite/src/dynomite.c:622
[2019-07-01 16:35:13.953] dn_stacktrace:286 [12] /lib64/libc.so.6(__libc_start_main+0xf5) [0x7ffff6c253d5]
Detaching after fork from child process 24705.
??:0
[2019-07-01 16:35:13.957] dn_stacktrace:286 [13] /root/dynomite/src/dynomite() [0x4097b9]
Detaching after fork from child process 24707.
??:?

Program received signal SIGABRT, Aborted.
0x00007ffff6c39207 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:55
55        return INLINE_SYSCALL (tgkill, 3, pid, selftid, sig);
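In case anyone wants to reproduce the session, it was roughly the following (the config path is my local one and -v 11 is just a high log level; both are my choices, not requirements):

# run dynomite under gdb and wait for the peer kill
gdb --args src/dynomite -c /apps/dynomite/conf/dynomite.yml -v 11
(gdb) run
# ... kill the peer node; dynomite hits the assert and aborts ...
(gdb) bt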

Config file:

dyn_o_mite:
  env: network
  datacenter: dc1
  rack: rack1
  dyn_listen: 0.0.0.0:8101
  dyn_seed_provider: simple_provider
  dyn_seeds:
  - 192.168.1.89:8101:rack2:dc1:0
  - 192.168.1.91:8101:rack3:dc1:0
  listen: 0.0.0.0:8102
  servers:
  - 127.0.0.1:6379:1
  tokens: 0
  secure_server_option: none
  pem_key_file: /apps/dynomite/conf/dynomite.pem
  recon_key_file: /apps/dynomite/conf/recon_key.pem
  recon_iv_file: /apps/dynomite/conf/recon_iv.pem
  data_store: 0
  stats_listen: 0.0.0.0:22221
  auto_eject_hosts: true
  server_retry_timeout: 30000
  server_failure_limit: 3
  read_consistency: dc_one
  write_consistency: dc_one
  max_msgs: 900000
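For reference, this is how I sanity-check a node running this config. The ports come straight from the YAML above (data_store: 0 means the Redis protocol, so redis-cli works on the client port; the stats /info path is from memory, so treat it as an assumption):

redis-cli -h 127.0.0.1 -p 8102 SET k1 v1    # client-facing 'listen' port
redis-cli -h 127.0.0.1 -p 8102 GET k1
curl -s http://127.0.0.1:22221/info         # 'stats_listen' HTTP endpoint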

To recreate: build v0.6.14, v0.6.9, or 0.7 on 3 separate VMs or servers, each running dynomite with the given configuration file. Once dynomite is running on all 3 servers, point your application at it to generate some load. With traffic/load on the dynomite instance, manually kill one of the other dynomite services or reboot the server it is running on. This causes the main dynomite process to segfault and crash, and it is consistently reproducible in my environment. These are all local machines connected over a network. Version v0.5.9 stays running when another node fails, but it does log this message when the node goes down:

event_del_conn:211 epoll ctl on e 13 sd 663 failed: No such file or directory

This seems to happen when there is traffic/load on the dynomite node. I have tried increasing the hardware resources, but that doesn't change the behavior: it still runs fine on v0.5.9 and not on the other versions. The exact sequence I use is sketched below. Any chance that I will be able to run a newer version?
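Concretely, the load/kill sequence looks like this (redis-benchmark and pkill are just my choices for load and failure injection; the peer IP is from the dyn_seeds list above, and the log path depends on how you start dynomite):

# on the node under test: sustained load through the client port
redis-benchmark -h 127.0.0.1 -p 8102 -t set,get -n 1000000 -c 50 &

# from another shell: take a peer down
ssh 192.168.1.89 'pkill -9 dynomite'

# the crash shows up on the node under test almost immediately
tail -f /var/log/dynomite.log | grep -E 'SIGSEGV|assert'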

smukil commented 5 years ago

@mattsdevop Let me try to repro this and get back.

smukil commented 5 years ago

@mattsdevop I've finally figured out the bug and fixed it. Will merge after review shortly. Thanks for raising the issue.

https://github.com/Netflix/dynomite/pull/683

smukil commented 5 years ago

@mattsdevop v0.6.15 has the fix: https://github.com/Netflix/dynomite/tree/v0.6.15
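If you're building from source, checking out that tag and doing the usual autotools build should be enough (a sketch; see the repo README for the full steps and dependencies):

git clone https://github.com/Netflix/dynomite.git
cd dynomite
git checkout v0.6.15
autoreconf -fvi && ./configure && make
# the binary lands at src/dynomite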

Closing this.