elastic / elasticsearch

Free and Open, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Elastic Search Cluster becomes unresponsive #6409

Closed nimahajan closed 9 years ago

nimahajan commented 10 years ago

Hi,

I have been facing this issue for a couple of days. Here are the details:

I have a cluster of 7 nodes: 3 master-eligible nodes that hold no data and 4 data-only nodes. I have set the minimum master nodes property to 2. I use KOPF/ElasticHQ/Head for monitoring. The cluster is set up for testing, so there is hardly any querying going on and it remains idle most of the time. After a few hours of inactivity, the plugins cannot show any details except the green status (no other details like shards etc. are shown), and if I make any search query, it does not work.

e.g. http://ironman:9200/_search?q=chrome

I have to restart the master's Elasticsearch service to make it responsive again. I checked with $ tail -f /var/log/elasticsearch/avengers.log but I don't see any failures/exceptions. I had Elasticsearch 1.1 and updated to 1.2 as well, but I am still having this issue.

Please let me know if you need more details.

Thanks, Nilesh.

clintongormley commented 10 years ago

Hi @nimahajan

Are you still experiencing this problem? What version? There must be something going on in the logs. Without more info it's difficult to help.
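
For example, output captured while the cluster is in that state would help, something along these lines (host and port here are placeholders):

    # overall cluster health
    curl -s 'http://localhost:9200/_cluster/health?pretty'
    # cluster-state update tasks queued on the master
    curl -s 'http://localhost:9200/_cat/pending_tasks?v'
    # what the nodes are actually busy doing
    curl -s 'http://localhost:9200/_nodes/hot_threads'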

psych0d0g commented 10 years ago

I have the same issue of a cluster becoming unresponsive after a random amount of time. I'm running 4 nodes (all of them master-eligible and data nodes) in a datacenter-aware setup with 5 shards per index and 1 replica in the other datacenter. My logfiles are quiet as well when this happens; ES just stops logging altogether at the same time the cluster becomes unresponsive. The last log line I get is from yesterday:

[2014-08-19 16:48:48,390][INFO ][cluster.service ] [csliveeubs-elasticsearch01.server.lan] removed {[csliveeubap-logparser01.server.lan][81xwRvieT4ytrMO2PdJXZg][csliveeubap-logparser01.server.lan][inet[/...:9300]]{client=true, data=false},}, added {[csliveeubap-logparser01.server.lan][I1t_-leETwqdSH_Xt-0Ymg][csliveeubap-logparser01.server.lan][inet[/...:9300]]{client=true, data=false},}, reason: zen-disco-receive(from master [[csliveeubap-elasticsearch02.server.lan][TgTzuqrkRYay6fOyDIP66w][csliveeubap-elasticsearch02.server.lan][inet[/...:9300]]{datacenter_location=2, master=true}])

I'm running ES 1.3.1 on Debian Wheezy x64 on the following hardware setup, without the use of any plugins:

2x Intel(R) Xeon(R) CPU E5-2430 v2 @ 2.50GHz, 24GB DDR3 ECC RAM, 2.7T RAID 5 with 3x 600GB 10k SAS HDDs

clintongormley commented 10 years ago

@psych0d0g that is your client shutting down and being replaced with another client. Also, we strongly advise against using a single cluster across two data centres. It will produce instability and could well be responsible for what you are seeing.

What is in the logs of your other nodes?

psych0d0g commented 10 years ago

Exactly the same for all data-node logs. Our datacenters are linked with a dedicated fibre connection; the ping time between them is about 0.9ms. Additionally, here is my ES configuration:

    cluster.name: cslive-elasticsearch
    node.name: "csliveeubs-elasticsearch01.server.lan"
    node.datacenter_location: 1
    node.master: true
    node.data: true
    path.data: /mnt/storage
    cluster.routing.allocation.awareness.attributes: datacenter_location
    cluster.routing.allocation.cluster_concurrent_rebalance: 12
    cluster.routing.allocation.node_concurrent_recoveries: 15
    indices.recovery.concurrent_streams: 12
    cluster.routing.allocation.node_initial_primaries_recoveries: 5

The node in the other datacenter is in a different VLAN:

    discovery.zen.ping.unicast.hosts: ...
    discovery.zen.minimum_master_nodes: 2
    index.number_of_replicas: 1
    index.recovery.initial_shards: 2
    indices.memory.index_buffer_size: 50%
    index.translog.flush_threshold_ops: 50000
    replication: async
    index.compound_on_flush: false
    index.compound_format: false
    indices.recovery.concurrent_streams: 12
    indices.recovery.max_bytes_per_sec: 512mb
    indices.store.throttle.max_bytes_per_sec: 512mb
    action.disable_delete_all_indices: true

Search thread pool

    threadpool.search.type: fixed
    threadpool.search.size: 20
    threadpool.search.queue_size: 100

Index thread pool

    threadpool.index.type: fixed
    threadpool.index.size: 600
    threadpool.index.queue_size: 2000
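
For reference, the settings the running cluster has actually picked up can be cross-checked over the REST API, for example (host is a placeholder):

    # cluster-level settings currently applied (persistent and transient)
    curl -s 'http://localhost:9200/_cluster/settings?pretty'
    # per-index settings for all indices
    curl -s 'http://localhost:9200/_all/_settings?pretty'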

clintongormley commented 10 years ago

Some notes on your settings:

index.translog.flush_threshold_ops: 50000

In v1.3.x we set flush_threshold_ops to unbounded. You should do the same.

replication: async

This stops Elasticsearch from exerting backpressure if it is struggling. Use sync replication.

index.compound_on_flush: false
index.compound_format: false

Why are you disabling the compound format? This should not be necessary, especially if you have an unbounded flush_threshold_ops.

threadpool.index.size: 600

You're going to overwhelm Elasticsearch with so many indexing threads. Use the default settings instead.
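
A quick way to check whether those indexing threads are actually the bottleneck is to watch the thread pool queues and rejection counts while indexing, for example (host is a placeholder):

    # active, queued and rejected tasks for the bulk, index and search pools on each node
    curl -s 'http://localhost:9200/_cat/thread_pool?v'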

And back to your original issue - you say the cluster becomes unresponsive, but what do you mean by that? Do no nodes respond at all, even to a simple GET / request?
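
Something along these lines, run with a short client-side timeout against each node, would show whether any of them answer at all (node names are placeholders):

    # -m gives curl a hard timeout so a hung node fails fast instead of blocking
    for node in node1.example.lan node2.example.lan; do
        curl -m 5 -s "http://$node:9200/" || echo "$node did not respond"
    done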

clintongormley commented 10 years ago

There is nothing else at all in any of your other logs?

psych0d0g commented 10 years ago

The logs aren't telling me anything for the time when the cluster becomes "unresponsive". When I'm using curl, e.g.:

    curl -XPUT localhost:9200/_cluster/settings -d '{
        "persistent" : {
            "cluster.routing.allocation.disable_allocation" : true
        }
    }'

the cluster responds with a 503 timeout:

{"error":"RemoteTransportException[[csliveeubs-elasticsearch01.server.lan][inet[/...:9300]][cluster/settings/update]]; nested: ProcessClusterEventTimeoutException[failed to process cluster event (cluster_update_settings) within 30s]; ","status":503}

clintongormley commented 10 years ago

How big is your cluster state? What's the output of:

curl 'localhost:9200/_cat/pending_tasks?v'

psych0d0g commented 10 years ago

At this moment it's freshly restarted to bring it back to a working state ASAP, because our queues are filling up pretty fast and we need to get them empty again. I will have a look when this happens again.

psych0d0g commented 10 years ago

Okay, here are the pending tasks; zen-disco-receive piled up like hell:

https://gist.github.com/psych0d0g/8e4e0d140b9d95946df4

clintongormley commented 10 years ago

@psych0d0g have you tried using the elasticsearch-http connector in logstash?

psych0d0g commented 10 years ago

Not yet, since my thought was that it's not as performant as the node protocol. Additionally, with the node protocol I don't have to take care of distributing the load across the cluster myself.

clintongormley commented 10 years ago

So when a client node joins the cluster, it has to go through "discovery" which can take a little time. The idea is that the client node will be part of the cluster for an extended period. It looks like, for whatever reason, you are either creating lots and lots of connections or they are very short lived.

You probably want to ask about your logstash config on the logstash forums, but I'd definitely try out the http connector, which is the recommended connector these days.

psych0d0g commented 10 years ago

Mhm, I just did, and it IS slow: with the node protocol I managed to index about 20k log lines/s, with the HTTP API I'm down to 12k/s.

psych0d0g commented 10 years ago

And the problem might be a combination of both: I use 24 workers for the ES output, and I restart the logstashes constantly when introducing new grok filters. That might be the root cause indeed.

clintongormley commented 10 years ago

Perhaps the logstash guys can give you some tuning advice. Btw, with the work that is happening in the improve_zen branch on Elasticsearch, discovery will become much quicker than it is today.

psych0d0g commented 10 years ago

Okay, now the issue is back again. Even the /_cat/ request seems to time out right now, so I can't check whether my pending tasks increased. At least I get a bunch of stacktraces in my logfiles now: https://gist.github.com/psych0d0g/73a52ca35405236c9406

psych0d0g commented 10 years ago

Here are some more, from yesterday, at around the time the last log lines were indexed:

    [2014-08-26 20:17:37,892][DEBUG][action.admin.cluster.node.stats] [csliveeubs-elasticsearch01.server.lan] failed to execute on node [MM0GwoWYTSqBWD3VeFiaiw]
    org.elasticsearch.transport.RemoteTransportException: [csliveeubap-elasticsearch01.server.lan][inet[/10.76.30.11:9300]][cluster/nodes/stats/n]
    Caused by: org.elasticsearch.index.engine.EngineException: [logstash-live-2014.08.26][4] failed to acquire searcher, source doc_stats
        at org.elasticsearch.index.engine.internal.InternalEngine.acquireSearcher(InternalEngine.java:713)
        at org.elasticsearch.index.shard.service.InternalIndexShard.acquireSearcher(InternalIndexShard.java:653)
        at org.elasticsearch.index.shard.service.InternalIndexShard.acquireSearcher(InternalIndexShard.java:647)
        at org.elasticsearch.index.shard.service.InternalIndexShard.docStats(InternalIndexShard.java:498)
        at org.elasticsearch.action.admin.indices.stats.CommonStats.(CommonStats.java:127)
        at org.elasticsearch.action.admin.indices.stats.ShardStats.(ShardStats.java:49)
        at org.elasticsearch.indices.InternalIndicesService.stats(InternalIndicesService.java:209)
        at org.elasticsearch.node.service.NodeService.stats(NodeService.java:156)
        at org.elasticsearch.action.admin.cluster.node.stats.TransportNodesStatsAction.nodeOperation(TransportNodesStatsAction.java:95)
        at org.elasticsearch.action.admin.cluster.node.stats.TransportNodesStatsAction.nodeOperation(TransportNodesStatsAction.java:43)
        at org.elasticsearch.action.support.nodes.TransportNodesOperationAction$NodeTransportHandler.messageReceived(TransportNodesOperationAction.java:277)
        at org.elasticsearch.action.support.nodes.TransportNodesOperationAction$NodeTransportHandler.messageReceived(TransportNodesOperationAction.java:268)
        at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:275)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)
    Caused by: org.apache.lucene.store.AlreadyClosedException: this ReferenceManager is closed
        at org.apache.lucene.search.ReferenceManager.acquire(ReferenceManager.java:98)
        at org.elasticsearch.index.engine.internal.InternalEngine.acquireSearcher(InternalEngine.java:700)
        ... 15 more
    [2014-08-26 20:19:26,109][WARN ][cluster.action.shard ] [csliveeubs-elasticsearch01.server.lan] [logstash-live-2014.08.26][4] received shard failed for [logstash-live-2014.08.26][4], node[MM0GwoWYTSqBWD3VeFiaiw], [R], s[STARTED], indexUUID [V5aS0Y7ZRveaReA_huF-KA], reason [engine failure, message [merge exception][MergeException[java.lang.OutOfMemoryError: Java heap space]; nested: OutOfMemoryError[Java heap space]; ]]

ES has 12GB of heap space; how the heck can it eat up all that precious RAM and demand even more for its heap?

clintongormley commented 10 years ago

@psych0d0g I'm guessing that you are sorting or aggregating on some fields which are taking up a lot of memory? Have a look at the output of:

GET '/_nodes/stats/indices/fielddata?human&fields=*'

psych0d0g commented 10 years ago

{"cluster_name":"cslive-elasticsearch","nodes":{"vCAZom1MQF2XZGNqqtqtew":{"timestamp":1409733531784,"name":"csliveeubs-elasticsearch02.server.lan","transport_address":"inet[/10.79.192.12:9300]","host":"csliveeubs-elasticsearch02.server.lan","ip":["inet[/10.79.192.12:9300]","NONE"],"attributes":{"datacenter_location":"bs","master":"true"},"indices":{"fielddata":{"memory_size":"103.7mb","memory_size_in_bytes":108769547,"evictions":0,"fields":{"tags":{"memory_size":"74.6kb","memory_size_in_bytes":76485},"@timestamp":{"memory_size":"103.5mb","memory_size_in_bytes":108570376},"_uid":{"memory_size":"119.8kb","memory_size_in_bytes":122686}}}}},"eMjtqKXER4ilutu9UPwwmw":{"timestamp":1409733531779,"name":"csliveeubap-elasticsearch02.server.lan","transport_address":"inet[/10.76.30.12:9300]","host":"csliveeubap-elasticsearch02.server.lan","ip":["inet[/10.76.30.12:9300]","NONE"],"attributes":{"datacenter_location":"bap","master":"true"},"indices":{"fielddata":{"memory_size":"194.5kb","memory_size_in_bytes":199171,"evictions":0,"fields":{"_uid":{"memory_size":"119.8kb","memory_size_in_bytes":122686},"tags":{"memory_size":"74.6kb","memory_size_in_bytes":76485}}}}},"g2nl3aZJSeqhRZMaUdpbsA":{"timestamp":1409733531783,"name":"csqaeubs-logparser02.server.lan","transport_address":"inet[/10.88.7.52:9300]","host":"csqaeubs-logparser02.server.lan","ip":["inet[/10.88.7.52:9300]","NONE"],"attributes":{"client":"true","data":"false"},"indices":{"fielddata":{"memory_size":"0b","memory_size_in_bytes":0,"evictions":0}}},"2NuW6V_LQLieGaRLlFXipQ":{"timestamp":1409733532930,"name":"csliveeubap-elasticsearch01.server.lan","transport_address":"inet[/10.76.30.11:9300]","host":"csliveeubap-elasticsearch01.server.lan","ip":["inet[/10.76.30.11:9300]","NONE"],"attributes":{"datacenter_location":"bap","master":"true"},"indices":{"fielddata":{"memory_size":"96.7mb","memory_size_in_bytes":101478287,"evictions":0,"fields":{"tags":{"memory_size":"149.3kb","memory_size_in_bytes":152936},"@timestamp":{"memory_size":"96.5mb","memory_size_in_bytes":101220616},"_uid":{"memory_size":"102.2kb","memory_size_in_bytes":104735}}}}},"fQIrgXboSWyZ6IgZ19_BPw":{"timestamp":1409733531783,"name":"csqaeubs-logparser01.server.lan","transport_address":"inet[/10.88.7.51:9300]","host":"csqaeubs-logparser01.server.lan","ip":["inet[/10.88.7.51:9300]","NONE"],"attributes":{"client":"true","data":"false"},"indices":{"fielddata":{"memory_size":"0b","memory_size_in_bytes":0,"evictions":0}}},"U8TV4j2ETYCuf4Y8HW4zYQ":{"timestamp":1409733531780,"name":"csqaeubap-logparser01.server.lan","transport_address":"inet[/10.88.32.51:9300]","host":"csqaeubap-logparser01.server.lan","ip":["inet[/10.88.32.51:9300]","NONE"],"attributes":{"client":"true","data":"false"},"indices":{"fielddata":{"memory_size":"0b","memory_size_in_bytes":0,"evictions":0}}},"XiaE-n4_Rp6ZvZ8mj74D1w":{"timestamp":1409733531780,"name":"csqaeubap-logparser02.server.lan","transport_address":"inet[/10.88.32.52:9300]","host":"csqaeubap-logparser02.server.lan","ip":["inet[/10.88.32.52:9300]","NONE"],"attributes":{"client":"true","data":"false"},"indices":{"fielddata":{"memory_size":"0b","memory_size_in_bytes":0,"evictions":0}}},"krlSSP8JSneA6Lqu4xdR-A":{"timestamp":1409733531783,"name":"csliveeubs-elasticsearch01.server.lan","transport_address":"inet[/10.79.192.11:9300]","host":"csliveeubs-elasticsearch01.server.lan","ip":["inet[/10.79.192.11:9300]","NONE"],"attributes":{"datacenter_location":"bs","master":"true"},"indices":{"fielddata":{"memory_size":"235.8mb","memory_size_in_bytes":247358263,"evictions":
0,"fields":{"tags":{"memory_size":"149.3kb","memory_size_in_bytes":152936},"@timestamp":{"memory_size":"235.6mb","memory_size_in_bytes":247100592},"_uid":{"memory_size":"102.2kb","memory_size_in_bytes":104735}}}}}}}

clintongormley commented 10 years ago

OK, so it doesn't look like it's your fielddata, as that is not using much memory (or did you restart nodes in the interim?).

How big are your bulk batches? Are you still using so many indexing threads? Please add the full output of these two commands (in a separate gist):

GET /_nodes
GET /_nodes/stats

psych0d0g commented 10 years ago

It took the cluster several minutes to respond with the requested data.

GET /_nodes: https://gist.github.com/psych0d0g/9a19380a5d5241002f55

GET /_nodes/stats: https://gist.github.com/psych0d0g/524c8279e4bc5de161c4

another GET '/_nodes/stats/indices/fielddata?human&fields=*': https://gist.github.com/psych0d0g/1119212c262ef2026ba7

My batch size is 1000. My indexing thread settings have been removed from the config, so I'm back to the defaults. I'm restarting the cluster daily at the moment to get it back to performance.

clintongormley commented 10 years ago

Hi @psych0d0g

The only things that show up as abnormal are (1) slow garbage collections and (2) the amount of RAM being used for your segments. See indices.segments.memory_in_bytes in the stats output.
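
Both can be pulled from the node stats in the same way as the fielddata check above, for example (host is a placeholder):

    # per-node segment memory, see indices.segments.memory_in_bytes
    curl -s 'http://localhost:9200/_nodes/stats/indices/segments?human&pretty'
    # heap usage and GC collection counts/times per node
    curl -s 'http://localhost:9200/_nodes/stats/jvm?human&pretty'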

I suggest disabling bloom filters on old indices, and disabling norms (or making fields not_analyzed) on any string field which is not used for full text search.
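
For the bloom filters, a sketch of the kind of settings update this refers to on 1.3 (the index name below is just an example of an older daily index, and I am assuming the dynamic index.codec.bloom.load setting is what is meant here):

    # stop loading bloom filters on an index that is no longer being written to
    # (index name is just an example of an older daily index)
    curl -XPUT 'http://localhost:9200/logstash-live-2014.08.20/_settings' -d '{
        "index.codec.bloom.load": false
    }'

Norms live in the field mappings, so for daily logstash indices they are usually disabled via an index template that applies to newly created indices.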

See:

clintongormley commented 9 years ago

I'm assuming that this advice worked, as we haven't heard any more, so I'll close this ticket.