elastic / elasticsearch-cloud-aws

AWS Cloud Plugin for Elasticsearch
https://github.com/elastic/elasticsearch/tree/master/plugins/discovery-ec2

(Minor?) Small "race" in EC2 discovery of master nodes #237

Open ankon opened 8 years ago

ankon commented 8 years ago

I just ran into this. From what I can see it is not fatal in any way, but to me it smells like there might be an issue either in ES itself or in the discovery logic:

Node1:

[2015-08-24 16:35:06,991][INFO ][node                     ] [i-e9264844] version[1.7.1], pid[2426], build[b88f43f/2015-07-29T09:54:16Z]
[2015-08-24 16:35:06,992][INFO ][node                     ] [i-e9264844] initializing ...
[2015-08-24 16:35:13,188][INFO ][plugins                  ] [i-e9264844] loaded [lang-mvel, cloud-aws, mapper-attachments, mongodb-river], sites [head, river-mongodb]
[2015-08-24 16:35:13,329][INFO ][env                      ] [i-e9264844] using [1] data paths, mounts [[/opt/collaborne/data (/dev/xvdd)]], net usable_space [31.9gb], net total_space [31.9gb], types [xfs]
[2015-08-24 16:35:20,532][WARN ][script                   ] [i-e9264844] deprecated setting [script.disable_dynamic] is set, replace with fine-grained scripting settings (e.g. script.inline, script.indexed, script.file)
[2015-08-24 16:35:20,748][INFO ][node                     ] [i-e9264844] initialized
[2015-08-24 16:35:20,748][INFO ][node                     ] [i-e9264844] starting ...
[2015-08-24 16:35:20,964][INFO ][transport                ] [i-e9264844] bound_address {inet[/0:0:0:0:0:0:0:0:9300]}, publish_address {inet[/10.0.20.233:9300]}
[2015-08-24 16:35:21,137][INFO ][discovery                ] [i-e9264844] elasticsearch/1UatDfVlSfyyE9R7M7xAbQ
[2015-08-24 16:35:27,074][INFO ][cluster.service          ] [i-e9264844] new_master [i-e9264844][1UatDfVlSfyyE9R7M7xAbQ][ip-10-0-20-233][inet[/10.0.20.233:9300]]{aws_availability_zone=eu-west-1a}, reason: zen-disco-join (elected_as_master)
[2015-08-24 16:35:27,113][INFO ][http                     ] [i-e9264844] bound_address {inet[/0:0:0:0:0:0:0:0:9200]}, publish_address {inet[/10.0.20.233:9200]}
[2015-08-24 16:35:27,114][INFO ][node                     ] [i-e9264844] started
[2015-08-24 16:35:27,125][INFO ][gateway                  ] [i-e9264844] recovered [0] indices into cluster_state
[2015-08-24 16:35:28,413][INFO ][repositories             ] [i-e9264844] put repository [collaborne-data]

Node2:

[2015-08-24 16:35:05,897][INFO ][node                     ] [i-591ff6f5] version[1.7.1], pid[2423], build[b88f43f/2015-07-29T09:54:16Z]
[2015-08-24 16:35:05,898][INFO ][node                     ] [i-591ff6f5] initializing ...
[2015-08-24 16:35:12,344][INFO ][plugins                  ] [i-591ff6f5] loaded [lang-mvel, cloud-aws, mapper-attachments, mongodb-river], sites [head, river-mongodb]
[2015-08-24 16:35:12,472][INFO ][env                      ] [i-591ff6f5] using [1] data paths, mounts [[/opt/collaborne/data (/dev/xvdd)]], net usable_space [31.9gb], net total_space [31.9gb], types [xfs]
[2015-08-24 16:35:19,667][WARN ][script                   ] [i-591ff6f5] deprecated setting [script.disable_dynamic] is set, replace with fine-grained scripting settings (e.g. script.inline, script.indexed, script.file)
[2015-08-24 16:35:19,903][INFO ][node                     ] [i-591ff6f5] initialized
[2015-08-24 16:35:19,904][INFO ][node                     ] [i-591ff6f5] starting ...
[2015-08-24 16:35:20,244][INFO ][transport                ] [i-591ff6f5] bound_address {inet[/0:0:0:0:0:0:0:0:9300]}, publish_address {inet[/10.0.21.123:9300]}
[2015-08-24 16:35:20,475][INFO ][discovery                ] [i-591ff6f5] elasticsearch/_Vuq_rgbTG2GcNtVsmVM_w
[2015-08-24 16:35:26,977][INFO ][discovery.ec2            ] [i-591ff6f5] failed to send join request to master [[i-e9264844][1UatDfVlSfyyE9R7M7xAbQ][ip-10-0-20-233][inet[/10.0.20.233:9300]]{aws_availability_zone=eu-west-1a}], reason [RemoteTransportException[[i-e9264844][inet[/10.0.20.233:9300]][internal:discovery/zen/join]]; nested: ElasticsearchIllegalStateException[Node [[i-e9264844][1UatDfVlSfyyE9R7M7xAbQ][ip-10-0-20-233][inet[/10.0.20.233:9300]]{aws_availability_zone=eu-west-1a}] not master for join request from [[i-591ff6f5][_Vuq_rgbTG2GcNtVsmVM_w][ip-10-0-21-123][inet[/10.0.21.123:9300]]{aws_availability_zone=eu-west-1b}]]; ], tried [3] times
[2015-08-24 16:35:31,848][INFO ][cluster.service          ] [i-591ff6f5] detected_master [i-e9264844][1UatDfVlSfyyE9R7M7xAbQ][ip-10-0-20-233][inet[/10.0.20.233:9300]]{aws_availability_zone=eu-west-1a}, added {[i-e9264844][1UatDfVlSfyyE9R7M7xAbQ][ip-10-0-20-233][inet[/10.0.20.233:9300]]{aws_availability_zone=eu-west-1a},}, reason: zen-disco-receive(from master [[i-e9264844][1UatDfVlSfyyE9R7M7xAbQ][ip-10-0-20-233][inet[/10.0.20.233:9300]]{aws_availability_zone=eu-west-1a}])
[2015-08-24 16:35:32,220][INFO ][http                     ] [i-591ff6f5] bound_address {inet[/0:0:0:0:0:0:0:0:9200]}, publish_address {inet[/10.0.21.123:9200]}
[2015-08-24 16:35:32,223][INFO ][node                     ] [i-591ff6f5] started

Note the first "failed to send join request" in Node2's logs: the timing is awfully close to when the master actually says it's a master. The times should be comparable, since both nodes run ntp.
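
To put a number on "awfully close", here is a minimal Python sketch that computes the gap between the two timestamps in the logs above. The helper names (`parse_ts`, `gap_ms`) are made up for illustration, the log lines are cut down to their prefixes, and it assumes the two clocks really are in sync.

```python
from datetime import datetime, timedelta

# Timestamp layout used by the Elasticsearch 1.x log lines above.
FMT = "%Y-%m-%d %H:%M:%S,%f"

def parse_ts(line):
    """Return the timestamp from the first [...] group of a log line."""
    stamp = line[1:line.index("]")]
    return datetime.strptime(stamp, FMT)

def gap_ms(earlier, later):
    """Whole milliseconds between two log lines (later minus earlier)."""
    return (parse_ts(later) - parse_ts(earlier)) // timedelta(milliseconds=1)

# Prefixes of the two relevant lines; the rest of each line is omitted here.
JOIN_FAILED = "[2015-08-24 16:35:26,977][INFO ][discovery.ec2            ] [i-591ff6f5] failed to send join request to master"
ELECTED = "[2015-08-24 16:35:27,074][INFO ][cluster.service          ] [i-e9264844] new_master"

print(gap_ms(JOIN_FAILED, ELECTED))
```

For these two lines the gap comes out at 97 ms, with Node2's failed join logged just before Node1 reports being elected.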

This is with elasticsearch-cloud-aws 2.7.0.

whybangbang commented 8 years ago

Can you show your configuration? Maybe we have the same problem.

ankon commented 8 years ago

My configuration is fairly basic:

cloud:
        aws:
                protocol: https
                region: eu-west-1
        node:
                auto_attributes: true

discovery:
        type: ec2
        ec2:
                host_type: private_ip

Note though that the issue in this ticket isn't a problem as such: Elasticsearch worked out the cluster configuration after all. It's mainly that something in either the cloud-aws or the core Elasticsearch code might be doing operations in the wrong order.

whybangbang commented 8 years ago

Have you solved it? We have a problem and I don't know how to solve it. Can you help me? We have to deploy a plugin to the production environment, but when we stop a node, the log shows an error: "TransportService is closed stopped can't send request". When we restart the node, it takes a long time, as much as 1 minute, until the cluster finds the node. We don't know what to do. We use the AWS plugin with Elasticsearch 1.5.2.

The config is:

discovery.type: "ec2"
discovery.ec2.host_type: "private_ip"
discovery.ec2.ping_timeout: "30s"

If discovery.ec2.ping_timeout is short, the cluster can't form.