Closed. nimahajan closed this issue 9 years ago.
Hi @nimahajan
Are you still experiencing this problem? What version? There must be something going on in the logs. Without more info it's difficult to help.
I have the same issue of a cluster becoming unresponsive after a random amount of time. I'm running 4 nodes (all of them master-eligible data nodes) in a datacenter-aware setup with 5 shards per index and 1 replica in the other datacenter. My log files are quiet as well when this happens; ES just stops logging entirely at the same time the cluster becomes unresponsive. The last log line I get is from yesterday:
[2014-08-19 16:48:48,390][INFO ][cluster.service ] [csliveeubs-elasticsearch01.server.lan] removed {[csliveeubap-logparser01.server.lan][81xwRvieT4ytrMO2PdJXZg][csliveeubap-logparser01.server.lan][inet[/...:9300]]{client=true, data=false},}, added {[csliveeubap-logparser01.server.lan][I1t_-leETwqdSH_Xt-0Ymg][csliveeubap-logparser01.server.lan][inet[/...:9300]]{client=true, data=false},}, reason: zen-disco-receive(from master [[csliveeubap-elasticsearch02.server.lan][TgTzuqrkRYay6fOyDIP66w][csliveeubap-elasticsearch02.server.lan][inet[/...:9300]]{datacenter_location=2, master=true}])
I'm running ES 1.3.1 on Debian Wheezy x64 on the following hardware setup, without any plugins:
2x Intel(R) Xeon(R) CPU E5-2430 v2 @ 2.50GHz, 24GB DDR3 ECC RAM, 2.7TB RAID 5 with 3x 600GB 10k SAS HDDs
@psych0d0g that is your client shutting down and being replaced with another client. Also, we strongly advise against using a single cluster across two data centres. It will produce instability and could well be responsible for what you are seeing.
What is in the logs of your other nodes?
Exactly the same for all data-node logs. Our datacenters are linked with a dedicated connection via fibre; the ping time between them is about 0.9ms. Additionally, here is my ES configuration:
cluster.name: cslive-elasticsearch
node.name: "csliveeubs-elasticsearch01.server.lan"
node.datacenter_location: 1
node.master: true
node.data: true
path.data: /mnt/storage
cluster.routing.allocation.awareness.attributes: datacenter_location
cluster.routing.allocation.cluster_concurrent_rebalance: 12
cluster.routing.allocation.node_concurrent_recoveries: 15
indices.recovery.concurrent_streams: 12
cluster.routing.allocation.node_initial_primaries_recoveries: 5
discovery.zen.ping.unicast.hosts: ...
discovery.zen.minimum_master_nodes: 2
index.number_of_replicas: 1
index.recovery.initial_shards: 2
indices.memory.index_buffer_size: 50%
index.translog.flush_threshold_ops: 50000
replication: async
index.compound_on_flush: false
index.compound_format: false
indices.recovery.concurrent_streams: 12
indices.recovery.max_bytes_per_sec: 512mb
indices.store.throttle.max_bytes_per_sec: 512mb
action.disable_delete_all_indices: true
threadpool.search.type: fixed
threadpool.search.size: 20
threadpool.search.queue_size: 100
threadpool.index.type: fixed
threadpool.index.size: 600
threadpool.index.queue_size: 2000
Some notes on your settings:
index.translog.flush_threshold_ops: 50000
In v1.3.x we set flush_threshold_ops to unbounded. You should do the same.
replication: async
This stops Elasticsearch from exerting backpressure if it is struggling. Use sync replication.
index.compound_on_flush: false
index.compound_format: false
Why are you disabling the compound format? This should not be necessary, especially if you have an unbounded flush_threshold_ops.
threadpool.index.size: 600
You're going to overwhelm Elasticsearch with so many indexing threads. Use the default settings instead.
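Pulling those notes together, here is a sketch of the suggested elasticsearch.yml revisions (assuming ES 1.3.x; settings shown as deletions simply fall back to the built-in defaults, so verify each against the docs for your version):

```yaml
# Remove the explicit ops cap; in 1.3.x flush_threshold_ops defaults to
# unbounded, and flushes are triggered by translog size/age instead:
# index.translog.flush_threshold_ops: 50000    # <- delete

# Synchronous replication lets Elasticsearch exert backpressure on indexing:
replication: sync

# Leave the compound format at its defaults:
# index.compound_on_flush: false               # <- delete
# index.compound_format: false                 # <- delete

# Drop the oversized fixed index threadpool and use the defaults:
# threadpool.index.type: fixed                 # <- delete
# threadpool.index.size: 600                   # <- delete
# threadpool.index.queue_size: 2000            # <- delete
```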
And back to your original issue: you say the cluster becomes unresponsive, but what do you mean by that? Do no nodes respond at all, even to a simple GET / request?
There is nothing else at all in any of your other logs?
The logs aren't telling me anything for the time when the cluster becomes "unresponsive". For example, when I use curl:
curl -XPUT localhost:9200/_cluster/settings -d '{ "persistent" : { "cluster.routing.allocation.disable_allocation" : true } }'
The cluster responds with a 503 timeout:
{"error":"RemoteTransportException[[csliveeubs-elasticsearch01.server.lan][inet[/...:9300]][cluster/settings/update]]; nested: ProcessClusterEventTimeoutException[failed to process cluster event (cluster_update_settings) within 30s]; ","status":503}
How big is your cluster state? What's the output of:
curl 'localhost:9200/_cat/pending_tasks?v'
At this moment it's freshly restarted to bring it back to a working state ASAP, because our queues fill up pretty fast and we need to empty them again. I will have a look when this happens again.
Okay, here are the pending tasks; zen-disco-receive events piled up like hell:
@psych0d0g have you tried using the elasticsearch-http connector in logstash?
Not yet, since my thought was that it's not as performant as the node protocol. Additionally, with the node protocol I don't have to care about distributing the load across the cluster myself.
So when a client node joins the cluster, it has to go through "discovery" which can take a little time. The idea is that the client node will be part of the cluster for an extended period. It looks like, for whatever reason, you are either creating lots and lots of connections or they are very short lived.
You probably want to ask about your logstash config on the logstash forums, but I'd definitely try out the http connector, which is the recommended connector these days.
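For reference, a minimal sketch of what the HTTP output looks like in a Logstash 1.x config. The option names are from the elasticsearch_http output of that era and the host and flush_size values are placeholders, so check them against the docs for your Logstash version:

```
output {
  elasticsearch_http {
    host       => "csliveeubs-elasticsearch01.server.lan"  # placeholder
    port       => 9200
    flush_size => 1000   # bulk batch size (placeholder)
  }
}
```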
Mhm, I just did, and it IS slow: with the node protocol I managed to index about 20k log lines/s; with the HTTP API I'm down to 12k/s.
And the problem might be a combination of both: I use 24 workers for the ES output, and I restart the Logstash instances constantly when introducing new grok filters. That might indeed be the root cause.
Perhaps the logstash guys can give you some tuning advice. Btw, with the work that is happening in the improve_zen branch on Elasticsearch, discovery will become much quicker than it is today.
Okay, now the issue is back again. Even the /_cat/ request seems to time out right now, so I can't check whether my pending tasks increased. At least I get a bunch of stack traces in my log files now: https://gist.github.com/psych0d0g/73a52ca35405236c9406
Here are some more, from yesterday at around the time the last log lines were indexed:
[2014-08-26 20:17:37,892][DEBUG][action.admin.cluster.node.stats] [csliveeubs-elasticsearch01.server.lan] failed to execute on node [MM0GwoWYTSqBWD3VeFiaiw]
org.elasticsearch.transport.RemoteTransportException: [csliveeubap-elasticsearch01.server.lan][inet[/10.76.30.11:9300]][cluster/nodes/stats/n]
Caused by: org.elasticsearch.index.engine.EngineException: [logstash-live-2014.08.26][4] failed to acquire searcher, source doc_stats
at org.elasticsearch.index.engine.internal.InternalEngine.acquireSearcher(InternalEngine.java:713)
at org.elasticsearch.index.shard.service.InternalIndexShard.acquireSearcher(InternalIndexShard.java:653)
at org.elasticsearch.index.shard.service.InternalIndexShard.acquireSearcher(InternalIndexShard.java:647)
at org.elasticsearch.index.shard.service.InternalIndexShard.docStats(InternalIndexShard.java:498)
at org.elasticsearch.action.admin.indices.stats.CommonStats.
ES has 12GB of heap space; how the heck can it eat up all that precious RAM and demand even more for its heap?
@psych0d0g I'm guessing that you are sorting or aggregating on some fields which are taking up a lot of memory. Have a look at the output of:
GET '/_nodes/stats/indices/fielddata?human&fields=*'
{"cluster_name":"cslive-elasticsearch","nodes":{
"vCAZom1MQF2XZGNqqtqtew":{"timestamp":1409733531784,"name":"csliveeubs-elasticsearch02.server.lan","transport_address":"inet[/10.79.192.12:9300]","host":"csliveeubs-elasticsearch02.server.lan","ip":["inet[/10.79.192.12:9300]","NONE"],"attributes":{"datacenter_location":"bs","master":"true"},"indices":{"fielddata":{"memory_size":"103.7mb","memory_size_in_bytes":108769547,"evictions":0,"fields":{"tags":{"memory_size":"74.6kb","memory_size_in_bytes":76485},"@timestamp":{"memory_size":"103.5mb","memory_size_in_bytes":108570376},"_uid":{"memory_size":"119.8kb","memory_size_in_bytes":122686}}}}},
"eMjtqKXER4ilutu9UPwwmw":{"timestamp":1409733531779,"name":"csliveeubap-elasticsearch02.server.lan","transport_address":"inet[/10.76.30.12:9300]","host":"csliveeubap-elasticsearch02.server.lan","ip":["inet[/10.76.30.12:9300]","NONE"],"attributes":{"datacenter_location":"bap","master":"true"},"indices":{"fielddata":{"memory_size":"194.5kb","memory_size_in_bytes":199171,"evictions":0,"fields":{"_uid":{"memory_size":"119.8kb","memory_size_in_bytes":122686},"tags":{"memory_size":"74.6kb","memory_size_in_bytes":76485}}}}},
"g2nl3aZJSeqhRZMaUdpbsA":{"timestamp":1409733531783,"name":"csqaeubs-logparser02.server.lan","transport_address":"inet[/10.88.7.52:9300]","host":"csqaeubs-logparser02.server.lan","ip":["inet[/10.88.7.52:9300]","NONE"],"attributes":{"client":"true","data":"false"},"indices":{"fielddata":{"memory_size":"0b","memory_size_in_bytes":0,"evictions":0}}},
"2NuW6V_LQLieGaRLlFXipQ":{"timestamp":1409733532930,"name":"csliveeubap-elasticsearch01.server.lan","transport_address":"inet[/10.76.30.11:9300]","host":"csliveeubap-elasticsearch01.server.lan","ip":["inet[/10.76.30.11:9300]","NONE"],"attributes":{"datacenter_location":"bap","master":"true"},"indices":{"fielddata":{"memory_size":"96.7mb","memory_size_in_bytes":101478287,"evictions":0,"fields":{"tags":{"memory_size":"149.3kb","memory_size_in_bytes":152936},"@timestamp":{"memory_size":"96.5mb","memory_size_in_bytes":101220616},"_uid":{"memory_size":"102.2kb","memory_size_in_bytes":104735}}}}},
"fQIrgXboSWyZ6IgZ19_BPw":{"timestamp":1409733531783,"name":"csqaeubs-logparser01.server.lan","transport_address":"inet[/10.88.7.51:9300]","host":"csqaeubs-logparser01.server.lan","ip":["inet[/10.88.7.51:9300]","NONE"],"attributes":{"client":"true","data":"false"},"indices":{"fielddata":{"memory_size":"0b","memory_size_in_bytes":0,"evictions":0}}},
"U8TV4j2ETYCuf4Y8HW4zYQ":{"timestamp":1409733531780,"name":"csqaeubap-logparser01.server.lan","transport_address":"inet[/10.88.32.51:9300]","host":"csqaeubap-logparser01.server.lan","ip":["inet[/10.88.32.51:9300]","NONE"],"attributes":{"client":"true","data":"false"},"indices":{"fielddata":{"memory_size":"0b","memory_size_in_bytes":0,"evictions":0}}},
"XiaE-n4_Rp6ZvZ8mj74D1w":{"timestamp":1409733531780,"name":"csqaeubap-logparser02.server.lan","transport_address":"inet[/10.88.32.52:9300]","host":"csqaeubap-logparser02.server.lan","ip":["inet[/10.88.32.52:9300]","NONE"],"attributes":{"client":"true","data":"false"},"indices":{"fielddata":{"memory_size":"0b","memory_size_in_bytes":0,"evictions":0}}},
"krlSSP8JSneA6Lqu4xdR-A":{"timestamp":1409733531783,"name":"csliveeubs-elasticsearch01.server.lan","transport_address":"inet[/10.79.192.11:9300]","host":"csliveeubs-elasticsearch01.server.lan","ip":["inet[/10.79.192.11:9300]","NONE"],"attributes":{"datacenter_location":"bs","master":"true"},"indices":{"fielddata":{"memory_size":"235.8mb","memory_size_in_bytes":247358263,"evictions":0,"fields":{"tags":{"memory_size":"149.3kb","memory_size_in_bytes":152936},"@timestamp":{"memory_size":"235.6mb","memory_size_in_bytes":247100592},"_uid":{"memory_size":"102.2kb","memory_size_in_bytes":104735}}}}}}}
OK, so it doesn't look like your fielddata, as that is not using much memory (or did you restart nodes in the interim?).
How big are your bulk batches? Are you still using so many indexing threads? Please add the full output of these two commands (in a separate gist):
GET /_nodes
GET /_nodes/stats
It took the cluster several minutes to respond with the requested data.
GET /_nodes: https://gist.github.com/psych0d0g/9a19380a5d5241002f55
GET /_nodes/stats: https://gist.github.com/psych0d0g/524c8279e4bc5de161c4
another GET '/_nodes/stats/indices/fielddata?human&fields=*': https://gist.github.com/psych0d0g/1119212c262ef2026ba7
My batches are 1000 documents. My indexing-thread settings have been removed from the config, so I'm back to the defaults. I'm restarting the cluster daily at the moment to get it back to performance.
Hi @psych0d0g
The only things that show up as abnormal are (1) slow garbage collections and (2) the amount of RAM being used for your segments. See indices.segments.memory_in_bytes in the stats output.
I suggest disabling bloom filters on old indices, and disabling norms (or making fields not_analyzed) on any string field which is not used for full text search.
See:
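As an illustration, both changes can be applied through the REST API. Treat the index name, template name, and field below as placeholders, and note that the bloom setting shown is the ES 1.x index.codec.bloom.load option; this is a sketch to adapt, not something verified against this cluster:

```shell
# Stop loading bloom filters for an old index that is no longer written to
# (ES 1.x dynamic setting; index name is a placeholder):
curl -XPUT 'localhost:9200/logstash-live-2014.08.25/_settings' -d '{
  "index.codec.bloom.load": false
}'

# Make a non-full-text string field not_analyzed (which also drops norms)
# for newly created logstash indices, via an index template:
curl -XPUT 'localhost:9200/_template/no_norms' -d '{
  "template": "logstash-*",
  "mappings": {
    "_default_": {
      "properties": {
        "some_field": { "type": "string", "index": "not_analyzed" }
      }
    }
  }
}'
```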
I'm assuming that this advice worked, as we haven't heard any more, so I'll close this ticket.
Hi,
I have been facing this issue for couple of days. Here are the details -
I have a cluster of 7 nodes: 3 master-eligible nodes with no data, and 4 data-only nodes. I have set the minimum master nodes property to 2. I use KOPF/ElasticHQ/Head for monitoring. I have set up the cluster for testing, so there is hardly any querying going on and it remains idle most of the time. After a few hours of inactivity, the plugins cannot show any details except for the green status (no other details like shards etc. are shown). If I make any search query, it does not work.
e.g. http://ironman:9200/_search?q=chrome
I have to restart the master's Elasticsearch service to make it responsive again. I checked with tail -f /var/log/elasticsearch/avengers.log, but I don't see any failures/exceptions. I had Elasticsearch 1.1 and updated to 1.2 as well, but I am still having this issue.
Please let me know if you need more details.
Thanks, Nilesh.