elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Long time to elect new master after existing leader becomes unavailable #42983

Closed svagner closed 5 years ago

svagner commented 5 years ago

Elasticsearch version: 7.0.1 (e4efcb5)

Plugins installed: []

JVM version: build 1.8.0_191-b12

OS version: Linux masterv7-1003 3.10.0-957.12.2.el7.x86_64 #1 SMP Tue May 14 21:24:32 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

Description of the problem including expected versus actual behavior: It takes a long time (> 1m) to elect a new master when the existing master becomes unavailable. There is also a weird record in the logs: sourceNode={masterv7-1003}{DJqWBwA4T9ebbdDBtIjwfQ}{Rrs73W8nRKOcgeaZeadbdA}{10.197.180.144}{10.197.180.144:9300}{ml.machine_memory=134835855360, xpack.installed=true, ml.max_open_jobs=20}, targetNode={masterv7-1003}{DJqWBwA4T9ebbdDBtIjwfQ}{Rrs73W8nRKOcgeaZeadbdA} Looks like we are trying to connect to ourselves...

Steps to reproduce:

We have 5 master-eligible nodes:

10.198.26.193   3 27 0 0.01 0.03 0.05 mi - masterv7-1005
10.197.180.144  5 50 3 0.50 0.17 0.09 mi - masterv7-1003
10.197.26.30    4 27 0 0.00 0.02 0.05 mi - masterv7-1002
10.197.217.168  8 27 0 0.03 0.03 0.05 mi - masterv7-1001
10.197.179.107  7 27 0 0.00 0.03 0.06 mi * masterv7-1004

All nodes are located in the same local network (rtt 0.133/0.098 ms).
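(The listing above is in the format of the cat nodes API; a minimal sketch to reproduce it, assuming the HTTP interface is reachable on 127.0.0.1:9200:)

# Default columns: ip, heap.percent, ram.percent, cpu, load_1m/5m/15m,
# node.role, master (the elected master is marked with *), name.
curl -s 'http://127.0.0.1:9200/_cat/nodes?v'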

Run the following command on masterv7-1004 (it triggers an immediate kernel crash via sysrq):

echo c > /proc/sysrq-trigger

The time for the cluster to elect a new master and become available again was > 1 min.

logs:

2019-06-07T13:00:33.657+02:00 masterv7-1003 elasticsearch[14510] failed to connect to node {masterv7-1004}{90fry9MjQjuXhkCFF8OkaA}{sKxWCTTIQfyhcDyd8dJwUg}{10.197.179.107}{10.197.179.107:9300}{ml.machine_memory=134836666368, ml.max_open_jobs=20, xpack.installed=true} (tried [1] times)
2019-06-07T13:00:33.660+02:00 masterv7-1003 elasticsearch[14510] cluster state applier task [becoming candidate: onLeaderFailure] took [30s] which is above the warn threshold of 30s
2019-06-07T13:00:36.136+02:00 masterv7-1003 elasticsearch[14510] master node changed {previous [], current [{masterv7-1003}{DJqWBwA4T9ebbdDBtIjwfQ}{Rrs73W8nRKOcgeaZeadbdA}{10.197.180.144}{10.197.180.144:9300}{ml.machine_memory=134835855360, xpack.installed=true, ml.max_open_jobs=20}]}, term: 31, version: 841, reason: Publication{term=31, version=841}
2019-06-07T13:01:06.055+02:00 masterv7-1003 elasticsearch[14510] failed to join {masterv7-1003}{DJqWBwA4T9ebbdDBtIjwfQ}{Rrs73W8nRKOcgeaZeadbdA}{10.197.180.144}{10.197.180.144:9300}{ml.machine_memory=134835855360, xpack.installed=true, ml.max_open_jobs=20} with JoinRequest{sourceNode={masterv7-1003}{DJqWBwA4T9ebbdDBtIjwfQ}{Rrs73W8nRKOcgeaZeadbdA}{10.197.180.144}{10.197.180.144:9300}{ml.machine_memory=134835855360, xpack.installed=true, ml.max_open_jobs=20}, optionalJoin=Optional[Join{term=31, lastAcceptedTerm=29, lastAcceptedVersion=840, sourceNode={masterv7-1003}{DJqWBwA4T9ebbdDBtIjwfQ}{Rrs73W8nRKOcgeaZeadbdA}{10.197.180.144}{10.197.180.144:9300}{ml.machine_memory=134835855360, xpack.installed=true, ml.max_open_jobs=20}, targetNode={masterv7-1003}{DJqWBwA4T9ebbdDBtIjwfQ}{Rrs73W8nRKOcgeaZeadbdA}{10.197.180.144}{10.197.180.144:9300}{ml.machine_memory=134835855360,  xpack.installed=true, ml.max_open_jobs=20}}]}
2019-06-07T13:01:06.055+02:00 masterv7-1003 elasticsearch[14510] failed to join {masterv7-1003}{DJqWBwA4T9ebbdDBtIjwfQ}{Rrs73W8nRKOcgeaZeadbdA}{10.197.180.144}{10.197.180.144:9300}{ml.machine_memory=134835855360, xpack.installed=true, ml.max_open_jobs=20} with JoinRequest{sourceNode={masterv7-1003}{DJqWBwA4T9ebbdDBtIjwfQ}{Rrs73W8nRKOcgeaZeadbdA}{10.197.180.144}{10.197.180.144:9300}{ml.machine_memory=134835855360, xpack.installed=true, ml.max_open_jobs=20}, optionalJoin=Optional[Join{term=31, lastAcceptedTerm=29, lastAcceptedVersion=840, sourceNode={masterv7-1003}{DJqWBwA4T9ebbdDBtIjwfQ}{Rrs73W8nRKOcgeaZeadbdA}{10.197.180.144}{10.197.180.144:9300}{ml.machine_memory=134835855360, xpack.installed=true, ml.max_open_jobs=20}, targetNode={masterv7-1003}{DJqWBwA4T9ebbdDBtIjwfQ}{Rrs73W8nRKOcgeaZeadbdA}{10.197.180.144}{10.197.180.144:9300}{ml.machine_memory=134835855360, xpack.installed=true, ml.max_open_jobs=20}}]}
...
2019-06-07T13:01:33.723+02:00 masterv7-1003 elasticsearch[14510] cluster state applier task [Publication{term=31, version=841}] took [57.5s] which is above the warn threshold of 30s
2019-06-07T13:01:33.731+02:00 masterv7-1003 elasticsearch[14510] Received response for a request that has timed out, sent [87590ms] ago, timed out [27668ms] ago, action [internal:cluster/coordination/join], node [{masterv7-1003}{DJqWBwA4T9ebbdDBtIjwfQ}{Rrs73W8nRKOcgeaZeadbdA}{10.197.180.144}{10.197.180.144:9300}{ml.machine_memory=134835855360, xpack.installed=true, ml.max_open_jobs=20}], id [303]
2019-06-07T13:01:33.732+02:00 masterv7-1003 elasticsearch[14510] cluster state update task [elected-as-master ([3] nodes joined)[{masterv7-1005}{Cz1NUYmESjW3NTz_tYE8ag}{rKALbvKfSPOko0P5Ktlelw}{10.198.26.193}{10.198.26.193:9300}{ml.machine_memory=134836662272, ml.max_open_jobs=20, xpack.installed=true} elect leader, {masterv7-1003}{DJqWBwA4T9ebbdDBtIjwfQ}{Rrs73W8nRKOcgeaZeadbdA}{10.197.180.144}{10.197.180.144:9300}{ml.machine_memory=134835855360, xpack.installed=true, ml.max_open_jobs=20} elect leader, {masterv7-1001}{WQKwhB2aQkSl-SquZhfVdA}{kn3ADfzAQQ6CoPcw90eoMg}{10.197.217.168}{10.197.217.168:9300}{ml.machine_memory=134835867648, ml.max_open_jobs=20, xpack.installed=true} elect leader, _BECOME_MASTER_TASK_, _FINISH_ELECTION_]] took [1.4m] which is above the warn threshold of 30s
2019-06-07T13:01:33.771+02:00 masterv7-1003 elasticsearch[14510] node-left[{masterv7-1004}{90fry9MjQjuXhkCFF8OkaA}{sKxWCTTIQfyhcDyd8dJwUg}{10.197.179.107}{10.197.179.107:9300}{ml.machine_memory=134836666368, ml.max_open_jobs=20, xpack.installed=true} disconnected], term: 31, version: 842, reason: removed {{masterv7-1004}{90fry9MjQjuXhkCFF8OkaA}{sKxWCTTIQfyhcDyd8dJwUg}{10.197.179.107}{10.197.179.107:9300}{ml.machine_memory=134836666368, ml.max_open_jobs=20, xpack.installed=true},}
2019-06-07T13:02:03.777+02:00 masterv7-1003 elasticsearch[14510] removed {{masterv7-1004}{90fry9MjQjuXhkCFF8OkaA}{sKxWCTTIQfyhcDyd8dJwUg}{10.197.179.107}{10.197.179.107:9300}{ml.machine_memory=134836666368, ml.max_open_jobs=20, xpack.installed=true},}, term: 31, version: 842, reason: Publication{term=31, version=842}
2019-06-07T13:02:03.784+02:00 masterv7-1003 elasticsearch[14510] cluster state update task [node-left[{masterv7-1004}{90fry9MjQjuXhkCFF8OkaA}{sKxWCTTIQfyhcDyd8dJwUg}{10.197.179.107}{10.197.179.107:9300}{ml.machine_memory=134836666368, ml.max_open_jobs=20, xpack.installed=true} disconnected]] took [30s] which is above the warn threshold of 30s
elasticmachine commented 5 years ago

Pinging @elastic/es-distributed

DaveCTurner commented 5 years ago

This sounds like the issue fixed by #39629. Are you running your nodes in containers? Can you set

logger.org.elasticsearch.cluster.service: TRACE

and reproduce the problem, then share the full logs including stack traces?
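(For reference, this logger setting can also be applied at runtime through the cluster settings API; a minimal sketch, assuming the HTTP interface is reachable on 127.0.0.1:9200:)

# Raise org.elasticsearch.cluster.service to TRACE without restarting the node.
curl -X PUT 'http://127.0.0.1:9200/_cluster/settings' -H 'Content-Type: application/json' -d '
{
  "transient": {
    "logger.org.elasticsearch.cluster.service": "TRACE"
  }
}'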

Looks like we are trying to connect to ourselves...

That message looks normal to me.

DaveCTurner commented 5 years ago

Also, what is the response from cat /proc/sys/net/ipv4/tcp_retries2?

svagner commented 5 years ago

Hi! It is running directly on the host in the default namespace (not in a container).

net.ipv4.tcp_retries2 = 5

But I'm not sure this is related to retransmissions (I didn't notice any abnormal retransmission rate). Here is the TCP info for connections between two of the masters:

     cubic wscale:8,8 rto:201 rtt:0.218/0.02 ato:40 mss:1448 rcvmss:1448 advmss:1448 cwnd:16 bytes_acked:15563815 bytes_received:16633592 segs_out:28836 segs_in:28172 send 850.2Mbps lastsnd:40785421 lastrcv:40785420 lastack:27003 pacing_rate 1695.5Mbps rcv_rtt:2.5 rcv_space:29200
     cubic wscale:8,8 rto:201 rtt:0.212/0.017 ato:40 mss:1448 rcvmss:1448 advmss:1448 cwnd:16 bytes_acked:15692475 bytes_received:16783234 segs_out:28952 segs_in:28287 send 874.3Mbps lastsnd:40725644 lastrcv:40725643 lastack:27387 pacing_rate 1745.4Mbps rcv_rtt:1.875 rcv_space:29200
     cubic wscale:8,8 rto:201 rtt:0.217/0.025 ato:40 mss:1448 rcvmss:1448 advmss:1448 cwnd:16 bytes_acked:16873298 bytes_received:18031235 segs_out:29888 segs_in:29167 send 854.1Mbps lastsnd:10894 lastrcv:10889 lastack:10890 pacing_rate 1708.2Mbps rcv_rtt:2.125 rcv_space:29200
     cubic wscale:8,8 rto:203 rtt:2.775/1.396 ato:40 mss:1448 rcvmss:608 advmss:1448 cwnd:10 bytes_acked:388185 bytes_received:74851 segs_out:17134 segs_in:16893 send 41.7Mbps lastsnd:37115932 lastrcv:37115930 lastack:27388 pacing_rate 83.5Mbps rcv_rtt:5 rcv_space:29200
     cubic wscale:8,8 rto:201 rtt:0.23/0.021 ato:40 mss:1448 rcvmss:1448 advmss:1448 cwnd:16 bytes_acked:17047770 bytes_received:18223325 segs_out:29979 segs_in:29253 send 805.8Mbps lastsnd:45092 lastrcv:45090 lastack:15035 pacing_rate 1609.9Mbps rcv_rtt:2.125 rcv_space:29200
     cubic wscale:8,8 rto:200 rtt:0.09/0.045 mss:1448 rcvmss:536 advmss:1448 cwnd:10 bytes_acked:1 segs_out:16594 segs_in:16593 send 1287.1Mbps lastsnd:249555262 lastrcv:249951356 lastack:11644 pacing_rate 2574.2Mbps rcv_space:29200
     cubic wscale:8,8 rto:201 rtt:0.229/0.012 ato:40 mss:1448 rcvmss:1448 advmss:1448 cwnd:16 bytes_acked:16928763 bytes_received:18083291 segs_out:29929 segs_in:29200 send 809.4Mbps lastsnd:45051 lastrcv:45048 lastack:14972 pacing_rate 1616.1Mbps rcv_rtt:2.375 rcv_space:29200
     cubic wscale:8,8 rto:200 rtt:0.062/0.031 mss:1448 rcvmss:536 advmss:1448 cwnd:10 bytes_acked:1 segs_out:16594 segs_in:16593 send 1868.4Mbps lastsnd:249555262 lastrcv:249951356 lastack:11644 pacing_rate 3736.8Mbps rcv_space:29200
     cubic wscale:8,8 rto:201 rtt:0.338/0.033 ato:40 mss:1448 rcvmss:536 advmss:1448 cwnd:10 bytes_acked:90762673 bytes_received:5236308 segs_out:498700 segs_in:249372 send 342.7Mbps lastsnd:741 lastrcv:741 lastack:741 pacing_rate 683.7Mbps rcv_rtt:329129 rcv_space:29463
     cubic wscale:8,8 rto:201 rtt:0.068/0.027 ato:40 mss:1448 rcvmss:536 advmss:1448 cwnd:10 bytes_acked:59 bytes_received:25 segs_out:16596 segs_in:16595 send 1703.5Mbps lastsnd:249555262 lastrcv:249555206 lastack:11645 pacing_rate 3407.1Mbps rcv_space:29200
     cubic wscale:8,8 rto:201 rtt:0.216/0.02 ato:40 mss:1448 rcvmss:1448 advmss:1448 cwnd:16 bytes_acked:15550449 bytes_received:16619443 segs_out:28814 segs_in:28151 send 858.1Mbps lastsnd:40725638 lastrcv:40725635 lastack:27391 pacing_rate 1714.2Mbps rcv_rtt:2 rcv_space:29200
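(A hedged aside: per-socket statistics in this shape match what iproute2's ss prints; a sketch for collecting them for connections to one peer, assuming ss is available:)

# Show TCP internals (rto, rtt, cwnd, byte counters) for sockets to 10.197.179.107.
ss -ti dst 10.197.179.107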

I'll be able to reproduce the issue with trace logging tomorrow. Thanks a lot for your help!

svagner commented 5 years ago

For more information, here is the config for the master nodes (/etc/elasticsearch/elasticsearch.yml):

cluster.name: search7
path.data: /var/lib/elasticsearch
path.logs: /var/log/elasticsearch
cluster.initial_master_nodes: ["masterv7-1001", "masterv7-1002", "masterv7-1003", "masterv7-1004", "masterv7-1005"]
node.name: masterv7-1003
node.attr.rack: Z0000004
node.attr.availability_zone: west
node.data: false
node.master: true
node.max_local_storage_nodes: 1
action.destructive_requires_name: true
bootstrap.memory_lock: true
bootstrap.system_call_filter: false
cluster.routing.allocation.awareness.attributes: rack
cluster.routing.allocation.allow_rebalance: indices_all_active
cluster.routing.allocation.node_initial_primaries_recoveries: 10
indices.recovery.max_bytes_per_sec: 100mb
cluster.routing.allocation.node_concurrent_incoming_recoveries: 15
cluster.routing.allocation.node_concurrent_outgoing_recoveries: 15
discovery.seed_providers: file
discovery.zen.master_election.ignore_non_master_pings: true
http.host: 127.0.0.1
transport.host: _site:ipv4_
indices.fielddata.cache.size: 70%
cluster.routing.allocation.disk.threshold_enabled: true
cluster.routing.allocation.disk.watermark.high: 90%
cluster.routing.allocation.enable: all
cluster.routing.allocation.cluster_concurrent_rebalance: 15
cluster.info.update.interval: 5m
thread_pool.write.queue_size: 1000
thread_pool.search.queue_size: 10000
cluster.fault_detection.leader_check.timeout: 2s
cluster.fault_detection.leader_check.retry_count: 3
search.max_buckets: 50000

/etc/elasticsearch/unicast_hosts.txt:

masterv7-1001
masterv7-1002
masterv7-1003
masterv7-1004
masterv7-1005

All DNS queries go to a local DNS cache.

DaveCTurner commented 5 years ago

cluster.fault_detection.leader_check.timeout: 2s
cluster.fault_detection.leader_check.retry_count: 3

I recommend against changing these settings from their defaults. With your config, you will see an unnecessary master election if the master spends even a few seconds in GC.
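For comparison, the 7.x defaults (worth double-checking against your exact version) are:

# Defaults: the leader must fail 3 consecutive checks of up to 10s each,
# so followers tolerate roughly 30s of unresponsiveness before starting an election.
cluster.fault_detection.leader_check.interval: 1s
cluster.fault_detection.leader_check.timeout: 10s
cluster.fault_detection.leader_check.retry_count: 3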

net.ipv4.tcp_retries2 = 5

Good. This means the connection to the master should be detected as failed around 5-6 seconds after executing echo c > /proc/sysrq-trigger, triggering an immediate master election. IMO this is a better way to detect total master failure than cluster.fault_detection.leader_check because it is insensitive to application-level pauses like GC.
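(Rough arithmetic behind the 5-6 second figure, assuming an initial RTO of ~200ms as in the ss output above and the usual exponential doubling per retransmission:)

# tcp_retries2=5 allows 5 retransmissions: 0.2 + 0.4 + 0.8 + 1.6 + 3.2 ≈ 6.2s
# before the kernel aborts the connection.
awk 'BEGIN { rto = 0.2; for (i = 0; i < 5; i++) { total += rto; rto *= 2 }; print total "s" }'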

svagner commented 5 years ago

Hi! Here are the logs from nodes 1001, 1002, 1004, and 1005 (1003 was the master and is the node that was crashed), captured with the following transient logger settings in effect:

$ es_cluster_settings 127.0.0.1:9200 '.transient.logger'
{
  "org": {
    "elasticsearch": {
      "cluster": {
        "service": "TRACE"
      }
    }
  }
}
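(Presumably equivalent to reading the cluster settings directly; a sketch with curl and jq:)

# Read back the transient logger settings applied for the reproduction.
curl -s 'http://127.0.0.1:9200/_cluster/settings' | jq '.transient.logger'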

https://gist.github.com/svagner/f803434e4e4251a98001f1ccc48342e4

DaveCTurner commented 5 years ago

Thanks. Your logs are ambiguous because they include neither stack traces nor the name of the component emitting each message. They are at least consistent with the problem solved by #39629; for instance, here's a 30-second wait before a "failed to connect" message.

2019-06-11T10:14:34.103+02:00 masterv7-1005 elasticsearch[24579] failed to join {masterv7-1001}{WQKwhB2aQkSl-SquZhfVdA}{CnPB2JStTWOyYD3lNJVyAw}{10.197.217.168}{10.197.217.168:9300}{availability_zone=west, ml.machine_memory=134835867648, rack=Z0000001, ml.max_open_jobs=20, xpack.installed=true} with JoinRequest{sourceNode={masterv7-1005}{Cz1NUYmESjW3NTz_tYE8ag}{NgpE8uGPTGaE1UjsBldrgg}{10.198.26.193}{10.198.26.193:9300}{availability_zone=west, ml.machine_memory=134836662272, rack=Z0000002, xpack.installed=true, ml.max_open_jobs=20}, optionalJoin=Optional[Join{term=33, lastAcceptedTerm=31, lastAcceptedVersion=1002, sourceNode={masterv7-1005}{Cz1NUYmESjW3NTz_tYE8ag}{NgpE8uGPTGaE1UjsBldrgg}{10.198.26.193}{10.198.26.193:9300}{availability_zone=west, ml.machine_memory=134836662272, rack=Z0000002, xpack.installed=true, ml.max_open_jobs=20}, targetNode={masterv7-1001}{WQKwhB2aQkSl-SquZhfVdA}{CnPB2JStTWOyYD3lNJVyAw}{10.197.217.168}{10.197.217.168:9300}{availability_zone=west, ml.machine_memory=134835867648, rack=Z0000001, ml.max_open_jobs=20, xpack.installed=true}}]}
2019-06-11T10:15:03.815+02:00 masterv7-1005 elasticsearch[24579] failed to connect to node {masterv7-1003}{DJqWBwA4T9ebbdDBtIjwfQ}{Rrs73W8nRKOcgeaZeadbdA}{10.197.180.144}{10.197.180.144:9300}{availability_zone=west, ml.machine_memory=134835855360, rack=Z0000003, ml.max_open_jobs=20, xpack.installed=true} (tried [1] times)

Closing this as a duplicate of #29025, but if the problem persists after upgrading to 7.2.0 we can continue to investigate here.