elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

ES 1.4.2 random node disconnect #9212

Closed dragosrosculete closed 8 years ago

dragosrosculete commented 9 years ago

Hey,

I have been having trouble for a while. I am getting random node disconnects and I cannot explain why. There is no increase in traffic (search or index) when this happens; it feels completely random to me. I first thought it could be the AWS cloud plugin, so I removed it, switched to unicast, and pointed directly to my nodes' IPs, but that didn't seem to be the problem. I changed the instance type (now m3.2xlarge), added more instances, and made many modifications to the ES yml config, and still nothing. I changed Oracle Java from 1.7 to 1.8 and switched from the CMS collector to G1GC, and still nothing.

I am out of ideas... How can I get more info on what is going on?

Here are the logs I can see from the master node and the data node: http://pastebin.com/GhKfRkaa

clintongormley commented 8 years ago

Nothing further on this ticket. Closing

razvanphp commented 8 years ago

I have the same problem, and I have run out of ideas. The riding socket is out of the question, since we use the latest kernel (3.16.7-ckt20-1+deb8u1). ES version 1.7.4. Debian Jessie. Java build 1.8.0_66-internal-b17.

Here is the debug log:

[2016-01-18 21:38:27,599][TRACE][transport.tracer         ] [graylog-es-1-vm] [121786447][cluster:monitor/stats[n]] sent to [[graylog2-server][tzklxLueQ8OApiHUMdK0og][glog-o-master2][inet[10.107.61.96/10.107.61.96:9300]]{client=true, data=false, master=false}] (timeout: [10m])
[2016-01-18 21:38:27,619][TRACE][transport.tracer         ] [graylog-es-1-vm] [121786447][cluster:monitor/stats[n]] received response from [[graylog2-server][tzklxLueQ8OApiHUMdK0og][glog-o-master2][inet[10.107.61.96/10.107.61.96:9300]]{client=true, data=false, master=false}]
[2016-01-18 21:38:39,979][TRACE][transport.tracer         ] [graylog-es-1-vm] [121788793][cluster:monitor/stats[n]] sent to [[graylog2-server][tzklxLueQ8OApiHUMdK0og][glog-o-master2][inet[10.107.61.96/10.107.61.96:9300]]{client=true, data=false, master=false}] (timeout: [10m])
[2016-01-18 21:38:39,999][TRACE][transport.tracer         ] [graylog-es-1-vm] [121788793][cluster:monitor/stats[n]] received response from [[graylog2-server][tzklxLueQ8OApiHUMdK0og][glog-o-master2][inet[10.107.61.96/10.107.61.96:9300]]{client=true, data=false, master=false}]
[2016-01-18 21:38:50,991][DEBUG][transport.netty          ] [graylog-es-1-vm] disconnecting from [[graylog2-server][tzklxLueQ8OApiHUMdK0og][glog-o-master2][inet[10.107.61.96/10.107.61.96:9300]]{client=true, data=false, master=false}], channel closed event
[2016-01-18 21:38:50,991][TRACE][transport.netty          ] [graylog-es-1-vm] disconnected from [[graylog2-server][tzklxLueQ8OApiHUMdK0og][glog-o-master2][inet[10.107.61.96/10.107.61.96:9300]]{client=true, data=false, master=false}], channel closed event
[2016-01-18 21:38:50,998][INFO ][cluster.service          ] [graylog-es-1-vm] removed {[graylog2-server][tzklxLueQ8OApiHUMdK0og][glog-o-master2][inet[10.107.61.96/10.107.61.96:9300]]{client=true, data=false, master=false},}, reason: zen-disco-node_failed([graylog2-server][tzklxLueQ8OApiHUMdK0og][glog-o-master2][inet[10.107.61.96/10.107.61.96:9300]]{client=true, data=false, master=false}), reason transport disconnected
[2016-01-18 21:38:55,044][DEBUG][transport.netty          ] [graylog-es-1-vm] connected to node [[graylog2-server][tzklxLueQ8OApiHUMdK0og][glog-o-master2][inet[/10.107.61.96:9300]]{client=true, data=false, master=false}]
[2016-01-18 21:38:55,044][TRACE][transport.tracer         ] [graylog-es-1-vm] [121790018][internal:discovery/zen/join/validate] sent to [[graylog2-server][tzklxLueQ8OApiHUMdK0og][glog-o-master2][inet[/10.107.61.96:9300]]{client=true, data=false, master=false}] (timeout: [null])
[2016-01-18 21:38:55,061][TRACE][transport.tracer         ] [graylog-es-1-vm] [121790018][internal:discovery/zen/join/validate] received response from [[graylog2-server][tzklxLueQ8OApiHUMdK0og][glog-o-master2][inet[/10.107.61.96:9300]]{client=true, data=false, master=false}]
[2016-01-18 21:38:55,061][INFO ][cluster.service          ] [graylog-es-1-vm] added {[graylog2-server][tzklxLueQ8OApiHUMdK0og][glog-o-master2][inet[/10.107.61.96:9300]]{client=true, data=false, master=false},}, reason: zen-disco-receive(join from node[[graylog2-server][tzklxLueQ8OApiHUMdK0og][glog-o-master2][inet[/10.107.61.96:9300]]{client=true, data=false, master=false}])
[2016-01-18 21:38:55,093][TRACE][transport.tracer         ] [graylog-es-1-vm] [121790022][internal:discovery/zen/publish] sent to [[graylog2-server][tzklxLueQ8OApiHUMdK0og][glog-o-master2][inet[/10.107.61.96:9300]]{client=true, data=false, master=false}] (timeout: [null])
[2016-01-18 21:38:55,261][TRACE][transport.tracer         ] [graylog-es-1-vm] [121790022][internal:discovery/zen/publish] received response from [[graylog2-server][tzklxLueQ8OApiHUMdK0og][glog-o-master2][inet[/10.107.61.96:9300]]{client=true, data=false, master=false}]
[2016-01-18 21:39:04,854][TRACE][transport.tracer         ] [graylog-es-1-vm] [121791213][cluster:monitor/stats[n]] sent to [[graylog2-server][tzklxLueQ8OApiHUMdK0og][glog-o-master2][inet[10.107.61.96/10.107.61.96:9300]]{client=true, data=false, master=false}] (timeout: [10m])
[2016-01-18 21:39:04,874][TRACE][transport.tracer         ] [graylog-es-1-vm] [121791213][cluster:monitor/stats[n]] received response from [[graylog2-server][tzklxLueQ8OApiHUMdK0og][glog-o-master2][inet[10.107.61.96/10.107.61.96:9300]]{client=true, data=false, master=false}]
[2016-01-18 21:39:17,381][TRACE][transport.tracer         ] [graylog-es-1-vm] [121793559][cluster:monitor/stats[n]] sent to [[graylog2-server][tzklxLueQ8OApiHUMdK0og][glog-o-master2][inet[10.107.61.96/10.107.61.96:9300]]{client=true, data=false, master=false}] (timeout: [10m])
[2016-01-18 21:39:17,401][TRACE][transport.tracer         ] [graylog-es-1-vm] [121793559][cluster:monitor/stats[n]] received response from [[graylog2-server][tzklxLueQ8OApiHUMdK0og][glog-o-master2][inet[10.107.61.96/10.107.61.96:9300]]{client=true, data=false, master=false}]
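
To see how long each drop lasts, the `cluster.service` lines above can be paired up. What follows is only a minimal Python sketch, not something from the original report: the helper name `rejoin_gaps` and the regular expression are assumptions based solely on the log format visible in this excerpt, and the sample lines are shortened copies (node details replaced by `{...}` and `(...)`) of the two `cluster.service` entries.

```python
import re
from datetime import datetime

# Assumed layout, taken from the excerpt: "[timestamp][LEVEL][logger] [node] message"
LINE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\[(?P<level>[A-Z ]+)\]\[(?P<logger>[^\]]+)\]"
    r"\s+\[(?P<node>[^\]]+)\]\s+(?P<msg>.*)$"
)
TS_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def rejoin_gaps(lines):
    """Pair each 'removed' event with the next 'added' event from cluster.service.

    Returns a list of (removed_at, added_at, gap_in_seconds) tuples.
    """
    gaps = []
    removed_at = None
    for line in lines:
        m = LINE_RE.match(line)
        if not m or m.group("logger").strip() != "cluster.service":
            continue
        ts = datetime.strptime(m.group("ts"), TS_FORMAT)
        msg = m.group("msg")
        if msg.startswith("removed "):
            removed_at = ts
        elif msg.startswith("added ") and removed_at is not None:
            gaps.append((removed_at, ts, (ts - removed_at).total_seconds()))
            removed_at = None
    return gaps


# Shortened copies of the two cluster.service lines from the excerpt.
SAMPLE = [
    "[2016-01-18 21:38:50,998][INFO ][cluster.service          ] [graylog-es-1-vm] "
    "removed {...}, reason: zen-disco-node_failed(...), reason transport disconnected",
    "[2016-01-18 21:38:55,061][INFO ][cluster.service          ] [graylog-es-1-vm] "
    "added {...}, reason: zen-disco-receive(...)",
]

if __name__ == "__main__":
    for removed, added, seconds in rejoin_gaps(SAMPLE):
        print(f"removed {removed.time()} -> added {added.time()}: {seconds:.3f}s")
```

On the excerpt above, the node is gone for roughly four seconds between the `removed` and `added` events. Running the same pairing over a full day of logs would show whether the drops follow a regular pattern.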

It is true that we have a rather big cluster, but only the master disconnects, not the nodes. They communicate through a VPN tunnel. Maybe somebody has another idea on how to improve this?

Is it normal that the nodes are queried for stats so often, i.e. every few seconds?
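
The polling frequency can be read straight from the trace lines rather than estimated by eye. The sketch below is an illustration only: the helper name `send_intervals` and the regular expression are assumptions derived from the `transport.tracer` format in the debug log above, and the sample lines are shortened copies (target node replaced by `[...]`) of the four `sent to` entries.

```python
import re
from datetime import datetime

# Assumed layout, taken from the excerpt:
# "[timestamp]...[request id][action] sent to [...]"
SENT_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\].*?\[(?P<id>\d+)\]\[(?P<action>.+?)\] sent to "
)
TS_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def send_intervals(lines, action="cluster:monitor/stats[n]"):
    """Return the seconds elapsed between consecutive 'sent to' lines for one action."""
    times = []
    for line in lines:
        m = SENT_RE.match(line)
        if m and m.group("action") == action:
            times.append(datetime.strptime(m.group("ts"), TS_FORMAT))
    return [(later - earlier).total_seconds()
            for earlier, later in zip(times, times[1:])]


# Shortened copies of the four stats requests from the excerpt.
PREFIX = "[TRACE][transport.tracer         ] [graylog-es-1-vm] "
SAMPLE = [
    "[2016-01-18 21:38:27,599]" + PREFIX
    + "[121786447][cluster:monitor/stats[n]] sent to [...] (timeout: [10m])",
    "[2016-01-18 21:38:39,979]" + PREFIX
    + "[121788793][cluster:monitor/stats[n]] sent to [...] (timeout: [10m])",
    "[2016-01-18 21:39:04,854]" + PREFIX
    + "[121791213][cluster:monitor/stats[n]] sent to [...] (timeout: [10m])",
    "[2016-01-18 21:39:17,381]" + PREFIX
    + "[121793559][cluster:monitor/stats[n]] sent to [...] (timeout: [10m])",
]

if __name__ == "__main__":
    for seconds in send_intervals(SAMPLE):
        print(f"{seconds:.3f}s between stats requests")
```

In this excerpt the stats requests are about 12 seconds apart, with one longer gap of about 25 seconds that spans the disconnect.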

Thank you!