elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Zen multicast discovery - pings received but ignored #8953

Closed yaronr closed 9 years ago

yaronr commented 9 years ago

The following happens intermittently. Restarts 'solve' the problem. I verified that the nodes can ping each other.

[2014-12-15 13:04:39,080][INFO ][node ] [elasticsearch-1-aws-east-1] version[1.4.0], pid[10], build[bc94bd8/2014-11-05T14:26:12Z]
[2014-12-15 13:04:39,082][INFO ][node ] [elasticsearch-1-aws-east-1] initializing ...
[2014-12-15 13:04:39,137][INFO ][plugins ] [elasticsearch-1-aws-east-1] loaded [marvel], sites [marvel, bigdesk, head, kopf]
[2014-12-15 13:04:43,939][DEBUG][discovery.zen.elect ] [elasticsearch-1-aws-east-1] using minimum_master_nodes [-1]
[2014-12-15 13:04:43,943][DEBUG][discovery.zen.ping.multicast] [elasticsearch-1-aws-east-1] using group [224.2.2.4], with port [54328], ttl [10], and address [ethwe:ipv4]
[2014-12-15 13:04:43,950][DEBUG][discovery.zen.ping.unicast] [elasticsearch-1-aws-east-1] using initial hosts [], with concurrent_connects [10]
[2014-12-15 13:04:43,953][DEBUG][discovery.zen ] [elasticsearch-1-aws-east-1] using ping.timeout [3s], join.timeout [1m], master_election.filter_client [true], master_election.filter_data [false]
[2014-12-15 13:04:43,956][DEBUG][discovery.zen.fd ] [elasticsearch-1-aws-east-1] [master] uses ping_interval [1s], ping_timeout [30s], ping_retries [3]
[2014-12-15 13:04:43,961][DEBUG][discovery.zen.fd ] [elasticsearch-1-aws-east-1] [node ] uses ping_interval [1s], ping_timeout [30s], ping_retries [3]
[2014-12-15 13:04:45,242][INFO ][node ] [elasticsearch-1-aws-east-1] initialized
[2014-12-15 13:04:45,244][INFO ][node ] [elasticsearch-1-aws-east-1] starting ...
[2014-12-15 13:04:45,547][INFO ][transport ] [elasticsearch-1-aws-east-1] bound_address {inet[/0:0:0:0:0:0:0:0:9300]}, publish_address {inet[/10.100.0.7:9300]}
[2014-12-15 13:04:45,580][INFO ][discovery ] [elasticsearch-1-aws-east-1] multicloud/psYxXuGZQq6fBlZ4nWB6Sw
[2014-12-15 13:04:45,584][TRACE][discovery.zen ] [elasticsearch-1-aws-east-1] starting to ping
[2014-12-15 13:04:45,601][TRACE][discovery.zen.ping.multicast] [elasticsearch-1-aws-east-1] [1] sending ping request
[2014-12-15 13:04:45,611][TRACE][discovery.zen.ping.unicast] [elasticsearch-1-aws-east-1] [1] connecting to [elasticsearch-1-aws-east-1][psYxXuGZQq6fBlZ4nWB6Sw][elasticsearch.aws-east-1.weave.local][inet[/10.100.0.7:9300]]{cluster.name=aws-east-1}
[2014-12-15 13:04:45,664][TRACE][discovery.zen.ping.unicast] [elasticsearch-1-aws-east-1] [1] connected to [elasticsearch-1-aws-east-1][psYxXuGZQq6fBlZ4nWB6Sw][elasticsearch.aws-east-1.weave.local][inet[/10.100.0.7:9300]]{cluster.name=aws-east-1}
[2014-12-15 13:04:45,666][TRACE][discovery.zen.ping.unicast] [elasticsearch-1-aws-east-1] [1] sending to [elasticsearch-1-aws-east-1][psYxXuGZQq6fBlZ4nWB6Sw][elasticsearch.aws-east-1.weave.local][inet[/10.100.0.7:9300]]{cluster.name=aws-east-1}
[2014-12-15 13:04:45,721][TRACE][discovery.zen.ping.unicast] [elasticsearch-1-aws-east-1] [1] received response from [elasticsearch-1-aws-east-1][psYxXuGZQq6fBlZ4nWB6Sw][elasticsearch.aws-east-1.weave.local][inet[/10.100.0.7:9300]]{cluster.name=aws-east-1}: [ping_response{node [[elasticsearch-1-aws-east-1][psYxXuGZQq6fBlZ4nWB6Sw][elasticsearch.aws-east-1.weave.local][inet[/10.100.0.7:9300]]{cluster.name=aws-east-1}], id[1], master [null], hasJoinedOnce [false], cluster_name[multicloud]}, ping_response{node [[elasticsearch-1-aws-east-1][psYxXuGZQq6fBlZ4nWB6Sw][elasticsearch.aws-east-1.weave.local][inet[/10.100.0.7:9300]]{cluster.name=aws-east-1}], id[2], master [null], hasJoinedOnce [false], cluster_name[multicloud]}]
[2014-12-15 13:04:47,107][TRACE][discovery.zen.ping.multicast] [elasticsearch-1-aws-east-1] [1] sending ping request
[2014-12-15 13:04:47,111][TRACE][discovery.zen.ping.unicast] [elasticsearch-1-aws-east-1] [1] sending to [elasticsearch-1-aws-east-1][psYxXuGZQq6fBlZ4nWB6Sw][elasticsearch.aws-east-1.weave.local][inet[/10.100.0.7:9300]]{cluster.name=aws-east-1}
[2014-12-15 13:04:47,118][TRACE][discovery.zen.ping.unicast] [elasticsearch-1-aws-east-1] [1] received response from [elasticsearch-1-aws-east-1][psYxXuGZQq6fBlZ4nWB6Sw][elasticsearch.aws-east-1.weave.local][inet[/10.100.0.7:9300]]{cluster.name=aws-east-1}: [ping_response{node [[elasticsearch-1-aws-east-1][psYxXuGZQq6fBlZ4nWB6Sw][elasticsearch.aws-east-1.weave.local][inet[/10.100.0.7:9300]]{cluster.name=aws-east-1}], id[1], master [null], hasJoinedOnce [false], cluster_name[multicloud]}, ping_response{node [[elasticsearch-1-aws-east-1][psYxXuGZQq6fBlZ4nWB6Sw][elasticsearch.aws-east-1.weave.local][inet[/10.100.0.7:9300]]{cluster.name=aws-east-1}], id[3], master [null], hasJoinedOnce [false], cluster_name[multicloud]}, ping_response{node [[elasticsearch-1-aws-east-1][psYxXuGZQq6fBlZ4nWB6Sw][elasticsearch.aws-east-1.weave.local][inet[/10.100.0.7:9300]]{cluster.name=aws-east-1}], id[4], master [null], hasJoinedOnce [false], cluster_name[multicloud]}]
[2014-12-15 13:04:48,612][TRACE][discovery.zen.ping.multicast] [elasticsearch-1-aws-east-1] [1] sending last pings
[2014-12-15 13:04:48,615][TRACE][discovery.zen.ping.multicast] [elasticsearch-1-aws-east-1] [1] sending ping request
[2014-12-15 13:04:48,624][TRACE][discovery.zen.ping.unicast] [elasticsearch-1-aws-east-1] [1] sending to [elasticsearch-1-aws-east-1][psYxXuGZQq6fBlZ4nWB6Sw][elasticsearch.aws-east-1.weave.local][inet[/10.100.0.7:9300]]{cluster.name=aws-east-1}
[2014-12-15 13:04:48,629][TRACE][discovery.zen.ping.unicast] [elasticsearch-1-aws-east-1] [1] received response from [elasticsearch-1-aws-east-1][psYxXuGZQq6fBlZ4nWB6Sw][elasticsearch.aws-east-1.weave.local][inet[/10.100.0.7:9300]]{cluster.name=aws-east-1}: [ping_response{node [[elasticsearch-1-aws-east-1][psYxXuGZQq6fBlZ4nWB6Sw][elasticsearch.aws-east-1.weave.local][inet[/10.100.0.7:9300]]{cluster.name=aws-east-1}], id[1], master [null], hasJoinedOnce [false], cluster_name[multicloud]}, ping_response{node [[elasticsearch-1-aws-east-1][psYxXuGZQq6fBlZ4nWB6Sw][elasticsearch.aws-east-1.weave.local][inet[/10.100.0.7:9300]]{cluster.name=aws-east-1}], id[3], master [null], hasJoinedOnce [false], cluster_name[multicloud]}, ping_response{node [[elasticsearch-1-aws-east-1][psYxXuGZQq6fBlZ4nWB6Sw][elasticsearch.aws-east-1.weave.local][inet[/10.100.0.7:9300]]{cluster.name=aws-east-1}], id[5], master [null], hasJoinedOnce [false], cluster_name[multicloud]}, ping_response{node [[elasticsearch-1-aws-east-1][psYxXuGZQq6fBlZ4nWB6Sw][elasticsearch.aws-east-1.weave.local][inet[/10.100.0.7:9300]]{cluster.name=aws-east-1}], id[6], master [null], hasJoinedOnce [false], cluster_name[multicloud]}]
[2014-12-15 13:04:49,373][TRACE][discovery.zen ] [elasticsearch-1-aws-east-1] full ping responses: {none}
[2014-12-15 13:04:49,373][DEBUG][discovery.zen ] [elasticsearch-1-aws-east-1] filtered ping responses: (filter_client[true], filter_data[false]) {none}
[2014-12-15 13:04:49,387][INFO ][cluster.service ] [elasticsearch-1-aws-east-1] new_master [elasticsearch-1-aws-east-1][psYxXuGZQq6fBlZ4nWB6Sw][elasticsearch.aws-east-1.weave.local][inet[/10.100.0.7:9300]]{cluster.name=aws-east-1}, reason: zen-disco-join (elected_as_master)
[2014-12-15 13:04:49,403][TRACE][discovery.zen ] [elasticsearch-1-aws-east-1] cluster joins counter set to [1](elected as master)
[2014-12-15 13:04:49,450][INFO ][http ] [elasticsearch-1-aws-east-1] bound_address {inet[/0:0:0:0:0:0:0:0:9200]}, publish_address {inet[/10.100.0.7:9200]}
[2014-12-15 13:04:49,455][INFO ][node ] [elasticsearch-1-aws-east-1] started
[2014-12-15 13:04:49,483][INFO ][gateway ] [elasticsearch-1-aws-east-1] recovered [0] indices into cluster_state
[2014-12-15 13:04:57,190][INFO ][cluster.metadata ] [elasticsearch-1-aws-east-1] [.marvel-2014.12.15] creating index, cause [auto(bulk api)], shards [1]/[1], mappings [indices_stats, cluster_stats, node_stats, shard_event, node_event, index_event, index_stats, default, cluster_state, cluster_event, routing_event]
[2014-12-15 13:04:58,696][INFO ][cluster.metadata ] [elasticsearch-1-aws-east-1] [.marvel-2014.12.15] update_mapping node_stats
[2014-12-15 13:04:59,221][INFO ][cluster.metadata ] [elasticsearch-1-aws-east-1] [.marvel-2014.12.15] update_mapping shard_event
[2014-12-15 13:04:59,387][INFO ][cluster.metadata ] [elasticsearch-1-aws-east-1] [.marvel-2014.12.15] update_mapping index_event
[2014-12-15 13:04:59,430][INFO ][cluster.metadata ] [elasticsearch-1-aws-east-1] [.marvel-2014.12.15] update_mapping node_event
[2014-12-15 13:04:59,484][INFO ][cluster.metadata ] [elasticsearch-1-aws-east-1] [.marvel-2014.12.15] update_mapping cluster_event
[2014-12-15 13:04:59,552][INFO ][cluster.metadata ] [elasticsearch-1-aws-east-1] [.marvel-2014.12.15] update_mapping routing_event
[2014-12-15 13:04:59,613][INFO ][cluster.metadata ] [elasticsearch-1-aws-east-1] [.marvel-2014.12.15] update_mapping cluster_state
[2014-12-15 13:04:59,901][INFO ][cluster.metadata ] [elasticsearch-1-aws-east-1] [.marvel-2014.12.15] update_mapping indices_stats
[2014-12-15 13:05:00,175][INFO ][cluster.metadata ] [elasticsearch-1-aws-east-1] [.marvel-2014.12.15] update_mapping index_stats
[2014-12-15 13:05:00,295][INFO ][cluster.metadata ] [elasticsearch-1-aws-east-1] [.marvel-2014.12.15] update_mapping cluster_stats
[2014-12-15 13:05:00,753][TRACE][discovery.zen.ping.multicast] [elasticsearch-1-aws-east-1] [1] received ping_response{node [[elasticsearch-2-aws-east-1][Onsbf07bRSah1d--bSJhsg][elasticsearch.aws-east-1.weave.local][inet[/10.100.0.10:9300]]{cluster.name=aws-east-1}], id[8], master [[elasticsearch-2-azure-1][DgXZvGCXTUCM6YtU2l0JjQ][elasticsearch.azure-1.weave.local][inet[/10.100.0.12:9300]]{cluster.name=azure-1}], hasJoinedOnce [true], cluster_name[multicloud]}
[2014-12-15 13:05:00,756][WARN ][discovery.zen.ping.multicast] [elasticsearch-1-aws-east-1] received ping response ping_response{node [[elasticsearch-2-aws-east-1][Onsbf07bRSah1d--bSJhsg][elasticsearch.aws-east-1.weave.local][inet[/10.100.0.10:9300]]{cluster.name=aws-east-1}], id[8], master [[elasticsearch-2-azure-1][DgXZvGCXTUCM6YtU2l0JjQ][elasticsearch.azure-1.weave.local][inet[/10.100.0.12:9300]]{cluster.name=azure-1}], hasJoinedOnce [true], cluster_name[multicloud]} with no matching id [1]
[2014-12-15 13:05:00,753][TRACE][discovery.zen.ping.multicast] [elasticsearch-1-aws-east-1] [1] received ping_response{node [[elasticsearch-2-aws-east-1][Onsbf07bRSah1d--bSJhsg][elasticsearch.aws-east-1.weave.local][inet[/10.100.0.10:9300]]{cluster.name=aws-east-1}], id[9], master [[elasticsearch-2-azure-1][DgXZvGCXTUCM6YtU2l0JjQ][elasticsearch.azure-1.weave.local][inet[/10.100.0.12:9300]]{cluster.name=azure-1}], hasJoinedOnce [true], cluster_name[multicloud]}
[2014-12-15 13:05:00,759][WARN ][discovery.zen.ping.multicast] [elasticsearch-1-aws-east-1] received ping response ping_response{node [[elasticsearch-2-aws-east-1][Onsbf07bRSah1d--bSJhsg][elasticsearch.aws-east-1.weave.local][inet[/10.100.0.10:9300]]{cluster.name=aws-east-1}], id[9], master [[elasticsearch-2-azure-1][DgXZvGCXTUCM6YtU2l0JjQ][elasticsearch.azure-1.weave.local][inet[/10.100.0.12:9300]]{cluster.name=azure-1}], hasJoinedOnce [true], cluster_name[multicloud]} with no matching id [1]
[2014-12-15 13:05:00,761][TRACE][discovery.zen.ping.multicast] [elasticsearch-1-aws-east-1] [1] received ping_response{node [[elasticsearch-2-aws-east-1][Onsbf07bRSah1d--bSJhsg][elasticsearch.aws-east-1.weave.local][inet[/10.100.0.10:9300]]{cluster.name=aws-east-1}], id[7], master [[elasticsearch-2-azure-1][DgXZvGCXTUCM6YtU2l0JjQ][elasticsearch.azure-1.weave.local][inet[/10.100.0.12:9300]]{cluster.name=azure-1}], hasJoinedOnce [true], cluster_name[multicloud]}
[2014-12-15 13:05:00,762][WARN ][discovery.zen.ping.multicast] [elasticsearch-1-aws-east-1] received ping response ping_response{node [[elasticsearch-2-aws-east-1][Onsbf07bRSah1d--bSJhsg][elasticsearch.aws-east-1.weave.local][inet[/10.100.0.10:9300]]{cluster.name=aws-east-1}], id[7], master [[elasticsearch-2-azure-1][DgXZvGCXTUCM6YtU2l0JjQ][elasticsearch.azure-1.weave.local][inet[/10.100.0.12:9300]]{cluster.name=azure-1}], hasJoinedOnce [true], cluster_name[multicloud]} with no matching id [1]
[2014-12-15 13:05:01,230][TRACE][discovery.zen.ping.multicast] [elasticsearch-1-aws-east-1] [1] received ping_response{node [[elasticsearch-1-azure-1][ZG0tJyHbRUqutFWFxDNiuw][elasticsearch.azure-1.weave.local][inet[/10.100.0.13:9300]]{cluster.name=azure-1}], id[31], master [[elasticsearch-2-azure-1][DgXZvGCXTUCM6YtU2l0JjQ][elasticsearch.azure-1.weave.local][inet[/10.100.0.12:9300]]{cluster.name=azure-1}], hasJoinedOnce [true], cluster_name[multicloud]}
[2014-12-15 13:05:01,233][WARN ][discovery.zen.ping.multicast] [elasticsearch-1-aws-east-1] received ping response ping_response{node [[elasticsearch-1-azure-1][ZG0tJyHbRUqutFWFxDNiuw][elasticsearch.azure-1.weave.local][inet[/10.100.0.13:9300]]{cluster.name=azure-1}], id[31], master [[elasticsearch-2-azure-1][DgXZvGCXTUCM6YtU2l0JjQ][elasticsearch.azure-1.weave.local][inet[/10.100.0.12:9300]]{cluster.name=azure-1}], hasJoinedOnce [true], cluster_name[multicloud]} with no matching id [1]
[2014-12-15 13:05:01,235][TRACE][discovery.zen.ping.multicast] [elasticsearch-1-aws-east-1] [1] received ping_response{node [[elasticsearch-1-azure-1][ZG0tJyHbRUqutFWFxDNiuw][elasticsearch.azure-1.weave.local][inet[/10.100.0.13:9300]]{cluster.name=azure-1}], id[32], master [[elasticsearch-2-azure-1][DgXZvGCXTUCM6YtU2l0JjQ][elasticsearch.azure-1.weave.local][inet[/10.100.0.12:9300]]{cluster.name=azure-1}], hasJoinedOnce [true], cluster_name[multicloud]}
[2014-12-15 13:05:01,237][WARN ][discovery.zen.ping.multicast] [elasticsearch-1-aws-east-1] received ping response ping_response{node [[elasticsearch-1-azure-1][ZG0tJyHbRUqutFWFxDNiuw][elasticsearch.azure-1.weave.local][inet[/10.100.0.13:9300]]{cluster.name=azure-1}], id[32], master [[elasticsearch-2-azure-1][DgXZvGCXTUCM6YtU2l0JjQ][elasticsearch.azure-1.weave.local][inet[/10.100.0.12:9300]]{cluster.name=azure-1}], hasJoinedOnce [true], cluster_name[multicloud]} with no matching id [1]
[2014-12-15 13:05:01,241][TRACE][discovery.zen.ping.multicast] [elasticsearch-1-aws-east-1] [1] received ping_response{node [[elasticsearch-1-azure-1][ZG0tJyHbRUqutFWFxDNiuw][elasticsearch.azure-1.weave.local][inet[/10.100.0.13:9300]]{cluster.name=azure-1}], id[33], master [[elasticsearch-2-azure-1][DgXZvGCXTUCM6YtU2l0JjQ][elasticsearch.azure-1.weave.local][inet[/10.100.0.12:9300]]{cluster.name=azure-1}], hasJoinedOnce [true], cluster_name[multicloud]}
[2014-12-15 13:05:01,242][WARN ][discovery.zen.ping.multicast] [elasticsearch-1-aws-east-1] received ping response ping_response{node [[elasticsearch-1-azure-1][ZG0tJyHbRUqutFWFxDNiuw][elasticsearch.azure-1.weave.local][inet[/10.100.0.13:9300]]{cluster.name=azure-1}], id[33], master [[elasticsearch-2-azure-1][DgXZvGCXTUCM6YtU2l0JjQ][elasticsearch.azure-1.weave.local][inet[/10.100.0.12:9300]]{cluster.name=azure-1}], hasJoinedOnce [true], cluster_name[multicloud]} with no matching id [1]
[2014-12-15 13:05:01,669][TRACE][discovery.zen.ping.multicast] [elasticsearch-1-aws-east-1] [1] received ping_response{node [[elasticsearch-2-azure-1][DgXZvGCXTUCM6YtU2l0JjQ][elasticsearch.azure-1.weave.local][inet[/10.100.0.12:9300]]{cluster.name=azure-1}], id[28], master [[elasticsearch-2-azure-1][DgXZvGCXTUCM6YtU2l0JjQ][elasticsearch.azure-1.weave.local][inet[/10.100.0.12:9300]]{cluster.name=azure-1}], hasJoinedOnce [true], cluster_name[multicloud]}
[2014-12-15 13:05:01,673][WARN ][discovery.zen.ping.multicast] [elasticsearch-1-aws-east-1] received ping response ping_response{node [[elasticsearch-2-azure-1][DgXZvGCXTUCM6YtU2l0JjQ][elasticsearch.azure-1.weave.local][inet[/10.100.0.12:9300]]{cluster.name=azure-1}], id[28], master [[elasticsearch-2-azure-1][DgXZvGCXTUCM6YtU2l0JjQ][elasticsearch.azure-1.weave.local][inet[/10.100.0.12:9300]]{cluster.name=azure-1}], hasJoinedOnce [true], cluster_name[multicloud]} with no matching id [1]
[2014-12-15 13:05:01,683][TRACE][discovery.zen.ping.multicast] [elasticsearch-1-aws-east-1] [1] received ping_response{node [[elasticsearch-2-azure-1][DgXZvGCXTUCM6YtU2l0JjQ][elasticsearch.azure-1.weave.local][inet[/10.100.0.12:9300]]{cluster.name=azure-1}], id[29], master [[elasticsearch-2-azure-1][DgXZvGCXTUCM6YtU2l0JjQ][elasticsearch.azure-1.weave.local][inet[/10.100.0.12:9300]]{cluster.name=azure-1}], hasJoinedOnce [true], cluster_name[multicloud]}
[2014-12-15 13:05:01,686][WARN ][discovery.zen.ping.multicast] [elasticsearch-1-aws-east-1] received ping response ping_response{node [[elasticsearch-2-azure-1][DgXZvGCXTUCM6YtU2l0JjQ][elasticsearch.azure-1.weave.local][inet[/10.100.0.12:9300]]{cluster.name=azure-1}], id[29], master [[elasticsearch-2-azure-1][DgXZvGCXTUCM6YtU2l0JjQ][elasticsearch.azure-1.weave.local][inet[/10.100.0.12:9300]]{cluster.name=azure-1}], hasJoinedOnce [true], cluster_name[multicloud]} with no matching id [1]
[2014-12-15 13:05:01,692][TRACE][discovery.zen.ping.multicast] [elasticsearch-1-aws-east-1] [1] received ping_response{node [[elasticsearch-2-azure-1][DgXZvGCXTUCM6YtU2l0JjQ][elasticsearch.azure-1.weave.local][inet[/10.100.0.12:9300]]{cluster.name=azure-1}], id[30], master [[elasticsearch-2-azure-1][DgXZvGCXTUCM6YtU2l0JjQ][elasticsearch.azure-1.weave.local][inet[/10.100.0.12:9300]]{cluster.name=azure-1}], hasJoinedOnce [true], cluster_name[multicloud]}
[2014-12-15 13:05:01,695][WARN ][discovery.zen.ping.multicast] [elasticsearch-1-aws-east-1] received ping response ping_response{node [[elasticsearch-2-azure-1][DgXZvGCXTUCM6YtU2l0JjQ][elasticsearch.azure-1.weave.local][inet[/10.100.0.12:9300]]{cluster.name=azure-1}], id[30], master [[elasticsearch-2-azure-1][DgXZvGCXTUCM6YtU2l0JjQ][elasticsearch.azure-1.weave.local][inet[/10.100.0.12:9300]]{cluster.name=azure-1}], hasJoinedOnce [true], cluster_name[multicloud]} with no matching id [1]

bleskes commented 9 years ago

The message means that the ping responses come in too late. By default we wait up to 3s for multicast ping responses to come back; anything that arrives after that will indeed be ignored. This is the nature of a ping - you use it to discover what's out there, so you can't wait forever. As a side note, try using unicast in production environments - it's more reliable.
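For reference, the knobs involved are ordinary elasticsearch.yml settings (they can also be passed as --key=value startup flags, as in this thread). A minimal sketch, assuming ES 1.x setting names; the host addresses are just the ones visible in the log above and stand in for whatever your real nodes are:

```yaml
# sketch only - values are illustrative
discovery.zen.ping.timeout: 3s                 # how long a discovery ping round waits (3s is the default)

# for production, prefer unicast: disable multicast and list a few known nodes
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["10.100.0.7:9300", "10.100.0.10:9300"]
```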

I'm not sure what you're doing exactly, as you seem to be connecting EC2-based nodes with Azure nodes? This is not a good idea, because ES needs a good, low-latency network between nodes. Depending on what you are trying to achieve, there might be better ways to do it.

yaronr commented 9 years ago

Hey @bleskes, thanks for the quick response. I've increased the ping timeout with --discovery.zen.ping_timeout=15; let's see how that goes.

But implementation-wise, isn't discovery a continuous process? IMO, any time another node of the same cluster is discovered (for whatever reason), it should be processed the same way.

Regarding your suggestions: I am working on a multi-cloud cluster. I'm running ElasticSearch on top of Docker, on top of CoreOS, over an SDN. It's currently used for log/metric aggregation and experimentation. I am setting --node.cluster.name=${CLUSTER_NAME} and --cluster.routing.allocation.awareness.attributes=cluster.name, which seems to affect shard allocation (routing) exactly as I would expect.
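In elasticsearch.yml terms, those flags amount to roughly the following per-node sketch (illustrative only - the attribute value differs per cloud, and this is not my exact config):

```yaml
# per-node sketch mirroring the flags above
node.cluster.name: aws-east-1                                  # custom node attribute; shows up as {cluster.name=aws-east-1} in the logs
cluster.routing.allocation.awareness.attributes: cluster.name  # spread shard copies across the values of that attribute
```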

I have to say that I am VERY impressed with the resilience of ES; it is truly an example of a 'just works' implementation, hiding some very complex issues and handling them very well. The things I have done to it should have caused data loss and more - which never happened. "No data was harmed during the making of this experiment" :). I know that's very hard to do, so kudos.

I have the following questions:

1) Why is it not a good idea to have high latency between sub-clusters (as per my described config)?

2) I read the 'why not use multicast' comment on your site, and although I understand and agree - in my case 'everything is ephemeral': hosts can come and go, services definitely do, and their IPs may change at any time. I could employ some clever discovery mechanism, but multicast works great (except for the ping timeout issue).

Cheers, Yaron

bleskes commented 9 years ago

But implementation-wise, isn't discovery a continuous process? IMO, any time another node of the same cluster is discovered (for whatever reason), it should be processed the same way.

It works a bit differently. Every time a node needs to find the cluster or a new master, it pings in search of one. Once it finds the cluster or other nodes, they elect a master (if there isn't one already) and then all nodes join that master. From that point on, no discovery pinging is needed any more (we do have fault detection pings, but those are different).

FYI, setting the ping timeout to 15s will slow down cluster formation and master (re)election.
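To make the distinction concrete, here is a rough sketch of the two families of settings (ES 1.x names; the fault-detection values are the defaults visible in the log above, and the 15s is the value from this thread):

```yaml
# discovery pings: only used while looking for a cluster / electing a master
discovery.zen.ping.timeout: 15s        # raised from the 3s default; elections now wait up to 15s

# fault detection pings: run continuously once the cluster has formed
discovery.zen.fd.ping_interval: 1s
discovery.zen.fd.ping_timeout: 30s
discovery.zen.fd.ping_retries: 3
```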

I know that's very hard to do, so kudos.

Thx :)

1) Why is it not a good idea to have high-latency between sub-clusters (as per my described config) ?

In our experience the network between data centers (and certainly between providers) is not as stable as the local one. Although a properly configured cluster should be resilient to this, it doesn't end up giving you a good experience, as requests time out or the cluster becomes unavailable until it re-forms (you saw a taste of this in your pinging process).

I would recommend that you look at other alternatives, like setting up a cluster per cloud provider and using something like tribe nodes to do cross-cluster querying.
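As a sketch of that direction (illustrative only - the cluster names are placeholders, and each tribe.<name>.* block takes the usual client-node settings, such as its own discovery config):

```yaml
# tribe node: a single client node that joins both clusters and fans searches out to them
tribe:
  aws:
    cluster.name: logs-aws       # placeholder name for the per-provider cluster on AWS
  azure:
    cluster.name: logs-azure     # placeholder name for the per-provider cluster on Azure
```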

2) I read the 'why not use multicast' comment on your site...hosts can come and go, services definitely

With multicast there are fewer guarantees about delivery, and at the moment less gossiping and knowledge sharing on the ES side as well. It is really meant for development environments where nodes can easily find each other. Regarding the 'hosts can come and go' remark - we have plugins for all the common cloud providers that use the provider APIs to tell you which nodes are out there and what to ping. Note that not all the nodes need to be listed, just a few (typically the dedicated master nodes in big deployments), and not all of them need to be available all the time either.
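For example, a unicast setup only needs a handful of stable addresses. A sketch (the hostnames are placeholders; the commented line assumes the cloud-aws plugin is installed):

```yaml
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["master-1:9300", "master-2:9300", "master-3:9300"]  # a few known nodes is enough

# alternatively, let a cloud plugin supply the node list, e.g. with cloud-aws:
# discovery.type: ec2
```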

clintongormley commented 9 years ago

Sounds like this has been resolved, so closing. Feel free to reopen

ajhalani commented 9 years ago

Since upgrading to v1.3.7 (a few months ago) from v1.2.2, our master re-election time has gone from a few milliseconds to a 3-second pause plus ~0.5 seconds.

During this time, we sometimes see ES requests fail on machines other than the master, with the error {"error":"ClusterBlockException[blocked by: [SERVICE_UNAVAILABLE/2/no master];]","status":503}

We use unicast, not multicast. I think it's related to this ticket, but if not I will open a new issue. Thanks!

bleskes commented 9 years ago

@ajhalani this shouldn't be the behavior in 1.3.7, but it is in 1.4.0, where we added a pinging round of 3s by default - see http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-discovery-zen.html#master-election for more info. Can you double-check your ES version?
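If that extra round is what hurts you, its length should follow the regular discovery ping timeout. A hedged sketch (lowering it trades faster re-election against electing on less complete information, and minimum_master_nodes should stay set either way):

```yaml
discovery.zen.ping.timeout: 3s          # length of a discovery ping round; 3s is the default
discovery.zen.minimum_master_nodes: 2   # e.g. for 3 master-eligible nodes
```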

ajhalani commented 9 years ago

You are right, it's v1.4.1. I would love some guidance: am I misreading this as an issue, or, if it really is one, should I open a new ticket?

bleskes commented 9 years ago

@ajhalani cool, thx. If you have a timeout of <~3s on the requests, that would explain what you see. I suggest you increase it a bit.

ajhalani commented 9 years ago

Well, the request (_bulk index) didn't time out; it failed immediately. We don't set any timeout and are using the default settings.

I feel the issue here is that the new version takes at least 3 seconds by default to choose a new master (ping_interval * ping_retries), during which there is no master and write requests fail. In our installation pre v1.4 it used to be much faster. So it seems like an issue to me; am I missing something?

bleskes commented 9 years ago

@ajhalani write requests shouldn't fail, but wait for the 3s to complete (read requests are not blocked). The default timeout is 1m; did you change it? I also suggest reading the page I referred to - it explains why we added the 3s gossiping round.

ajhalani commented 9 years ago

We didn't change the timeout. The indexing request didn't wait for the timeout, so it looks like a timing issue. Timeline:

[2015-02-25 04:03:31,926][DEBUG][transport.netty ] [ny1.node] disconnecting from [[nj2.node][RGXiGOUIQYSkaGhPX1dDDg][nj2][inet[/10.126.159.169:9301]]{datacenter=nj, master=true}] due to explicit disconnect call
[2015-02-25 04:03:31,927][DEBUG][cluster.service ] [ny1.node] processing [zen-disco-master_failed ([nj2.node][RGXiGOUIQYSkaGhPX1dDDg][nj2][inet[/10.126.159.169:9301]]{datacenter=nj, master=true})]: done applying updated cluster_state (version: 30470)

25FEB2015_04:03:33.989 - index request is sent and fails in 1-2 ms (no timeout) with HTTP status code 503, response: {"error":"ClusterBlockException[blocked by: [SERVICE_UNAVAILABLE/2/no master];]","status":503}

[2015-02-25 04:03:34,951][DEBUG][cluster.service ] [ny1.node] failing [zen-disco-receive(join from node[[nj1.node][aTvj8H3bQhuWoNfXikr78Q][nj1][inet[/10.126.151.178:9301]]{datacenter=nj, master=true}])]: local node is no longer master

[2015-02-25 04:03:35,594][DEBUG][cluster.service ] [ny1.node] cluster state updated, version [30471], source [zen-disco-join (elected_as_master)]
[2015-02-25 04:03:35,594][INFO ][cluster.service ] [ny1.node] new_master [ny1.node][PoLRwDtNTw2gEgV9Uz3cuA][ny1][inet[/10.126.23.55:9301]]{datacenter=ny, master=true}, reason: zen-disco-join (elected_as_master)

I will create a separate thread for this; I feel like I'm hijacking a closed thread.